Welcome to MMEngine’s documentation!¶
You can switch between Chinese and English documents in the lower-left corner of the layout.
Introduction¶
Coming soon. Please refer to the Chinese documentation.
Installation¶
Prerequisites¶
Python 3.6+
PyTorch 1.6+
CUDA 9.2+
GCC 5.4+
Prepare the Environment¶
Create a conda environment and activate it:
conda create -n open-mmlab python=3.7 -y
conda activate open-mmlab
Install PyTorch
Before installing MMEngine, please make sure that PyTorch has been successfully installed in the environment. You can refer to the PyTorch official installation documentation. Verify the installation with the following command:
python -c 'import torch;print(torch.__version__)'
Install MMEngine¶
Install with mim¶
mim is a package management tool for OpenMMLab projects, which can be used to install OpenMMLab projects easily.
pip install -U openmim
mim install mmengine
Install with pip¶
pip install mmengine
Use docker images¶
Build the image
docker build -t mmengine https://github.com/open-mmlab/mmengine.git#main:docker/release
More information can be found in mmengine/docker.
Run the image
docker run --gpus all --shm-size=8g -it mmengine
Build from source¶
# if cloning speed is too slow, you can switch the source to https://gitee.com/open-mmlab/mmengine.git
git clone https://github.com/open-mmlab/mmengine.git
cd mmengine
pip install -e . -v
Verify the Installation¶
To verify whether MMEngine and its dependencies are installed successfully, we can run the following command:
python -c 'import mmengine;print(mmengine.__version__)'
15 minutes to get started with MMEngine¶
In this tutorial, we’ll take training a ResNet-50 model on the CIFAR-10 dataset as an example. We will build a complete and configurable pipeline for both training and validation in only 80 lines of code with MMEngine.
The whole process includes the following steps:
Build a Model¶
First, we need to build a model. In MMEngine, the model should inherit from `BaseModel`. Aside from parameters representing inputs from the dataset, its `forward` method needs to accept an extra argument called `mode`:
For training, the value of `mode` is “loss”, and the `forward` method should return a `dict` containing the key “loss”.
For validation, the value of `mode` is “predict”, and the `forward` method should return results containing both predictions and labels.
import torch.nn.functional as F
import torchvision
from mmengine.model import BaseModel
class MMResNet50(BaseModel):
def __init__(self):
super().__init__()
self.resnet = torchvision.models.resnet50()
def forward(self, imgs, labels, mode):
x = self.resnet(imgs)
if mode == 'loss':
return {'loss': F.cross_entropy(x, labels)}
elif mode == 'predict':
return x, labels
Build a Dataset and DataLoader¶
Next, we need to create Dataset and DataLoader for training and validation. For basic training and validation, we can simply use built-in datasets supported in TorchVision.
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
norm_cfg = dict(mean=[0.491, 0.482, 0.447], std=[0.202, 0.199, 0.201])
train_dataloader = DataLoader(batch_size=32,
shuffle=True,
dataset=torchvision.datasets.CIFAR10(
'data/cifar10',
train=True,
download=True,
transform=transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(**norm_cfg)
])))
val_dataloader = DataLoader(batch_size=32,
shuffle=False,
dataset=torchvision.datasets.CIFAR10(
'data/cifar10',
train=False,
download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(**norm_cfg)
])))
Build an Evaluation Metric¶
To validate and test the model, we need to define a metric called accuracy to evaluate the model. This metric needs to inherit from `BaseMetric` and implement the `process` and `compute_metrics` methods. The `process` method accepts the output of the dataset and the model outputs when `mode="predict"`; the data at this point is a batch of data. After processing this batch, we save the information to the `self.results` property.
`compute_metrics` accepts a `results` parameter, which contains all the information saved by `process` (in a distributed environment, `results` is the information collected by `process` across all processes). It uses this information to calculate and return a `dict` that holds the results of the evaluation metrics.
from mmengine.evaluator import BaseMetric
class Accuracy(BaseMetric):
def process(self, data_batch, data_samples):
score, gt = data_samples
# save the intermediate result of a batch to `self.results`
self.results.append({
'batch_size': len(gt),
'correct': (score.argmax(dim=1) == gt).sum().cpu(),
})
def compute_metrics(self, results):
total_correct = sum(item['correct'] for item in results)
total_size = sum(item['batch_size'] for item in results)
# return the dict containing the eval results
# the key is the name of the metric
return dict(accuracy=100 * total_correct / total_size)
Build a Runner and Run the Task¶
Now we can build a Runner with the previously defined `Model`, `DataLoader`, and `Metrics`, plus some other configs, as shown below:
from torch.optim import SGD
from mmengine.runner import Runner
runner = Runner(
# the model used for training and validation.
# Needs to meet specific interface requirements
model=MMResNet50(),
# working directory which saves training logs and weight files
work_dir='./work_dir',
# train dataloader needs to meet the PyTorch data loader protocol
train_dataloader=train_dataloader,
# optimizer wrapper for optimization with additional features like
# AMP, gradient accumulation, etc.
optim_wrapper=dict(optimizer=dict(type=SGD, lr=0.001, momentum=0.9)),
# training configs for specifying training epochs, validation intervals, etc.
train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
# validation dataloader also needs to meet the PyTorch data loader protocol
val_dataloader=val_dataloader,
# validation configs for specifying additional parameters required for validation
val_cfg=dict(),
# validation evaluator. The default one is used here
val_evaluator=dict(type=Accuracy),
)
runner.train()
Finally, let’s put all the code above together into a complete script that uses the MMEngine executor for training and validation:
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.optim import SGD
from torch.utils.data import DataLoader
from mmengine.evaluator import BaseMetric
from mmengine.model import BaseModel
from mmengine.runner import Runner
class MMResNet50(BaseModel):
def __init__(self):
super().__init__()
self.resnet = torchvision.models.resnet50()
def forward(self, imgs, labels, mode):
x = self.resnet(imgs)
if mode == 'loss':
return {'loss': F.cross_entropy(x, labels)}
elif mode == 'predict':
return x, labels
class Accuracy(BaseMetric):
def process(self, data_batch, data_samples):
score, gt = data_samples
self.results.append({
'batch_size': len(gt),
'correct': (score.argmax(dim=1) == gt).sum().cpu(),
})
def compute_metrics(self, results):
total_correct = sum(item['correct'] for item in results)
total_size = sum(item['batch_size'] for item in results)
return dict(accuracy=100 * total_correct / total_size)
norm_cfg = dict(mean=[0.491, 0.482, 0.447], std=[0.202, 0.199, 0.201])
train_dataloader = DataLoader(batch_size=32,
shuffle=True,
dataset=torchvision.datasets.CIFAR10(
'data/cifar10',
train=True,
download=True,
transform=transforms.Compose([
transforms.RandomCrop(32, padding=4),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
transforms.Normalize(**norm_cfg)
])))
val_dataloader = DataLoader(batch_size=32,
shuffle=False,
dataset=torchvision.datasets.CIFAR10(
'data/cifar10',
train=False,
download=True,
transform=transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(**norm_cfg)
])))
runner = Runner(
model=MMResNet50(),
work_dir='./work_dir',
train_dataloader=train_dataloader,
optim_wrapper=dict(optimizer=dict(type=SGD, lr=0.001, momentum=0.9)),
train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
val_dataloader=val_dataloader,
val_cfg=dict(),
val_evaluator=dict(type=Accuracy),
)
runner.train()
The training log will look similar to this:
2022/08/22 15:51:53 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
CUDA available: True
numpy_random_seed: 1513128759
GPU 0: NVIDIA GeForce GTX 1660 SUPER
CUDA_HOME: /usr/local/cuda
...
2022/08/22 15:51:54 - mmengine - INFO - Checkpoints will be saved to /home/mazerun/work_dir by HardDiskBackend.
2022/08/22 15:51:56 - mmengine - INFO - Epoch(train) [1][10/1563] lr: 1.0000e-03 eta: 0:18:23 time: 0.1414 data_time: 0.0077 memory: 392 loss: 5.3465
2022/08/22 15:51:56 - mmengine - INFO - Epoch(train) [1][20/1563] lr: 1.0000e-03 eta: 0:11:29 time: 0.0354 data_time: 0.0077 memory: 392 loss: 2.7734
2022/08/22 15:51:56 - mmengine - INFO - Epoch(train) [1][30/1563] lr: 1.0000e-03 eta: 0:09:10 time: 0.0352 data_time: 0.0076 memory: 392 loss: 2.7789
2022/08/22 15:51:57 - mmengine - INFO - Epoch(train) [1][40/1563] lr: 1.0000e-03 eta: 0:08:00 time: 0.0353 data_time: 0.0073 memory: 392 loss: 2.5725
2022/08/22 15:51:57 - mmengine - INFO - Epoch(train) [1][50/1563] lr: 1.0000e-03 eta: 0:07:17 time: 0.0347 data_time: 0.0073 memory: 392 loss: 2.7382
2022/08/22 15:51:57 - mmengine - INFO - Epoch(train) [1][60/1563] lr: 1.0000e-03 eta: 0:06:49 time: 0.0347 data_time: 0.0072 memory: 392 loss: 2.5956
2022/08/22 15:51:58 - mmengine - INFO - Epoch(train) [1][70/1563] lr: 1.0000e-03 eta: 0:06:28 time: 0.0348 data_time: 0.0072 memory: 392 loss: 2.7351
...
2022/08/22 15:52:50 - mmengine - INFO - Saving checkpoint at 1 epochs
2022/08/22 15:52:51 - mmengine - INFO - Epoch(val) [1][10/313] eta: 0:00:03 time: 0.0122 data_time: 0.0047 memory: 392
2022/08/22 15:52:51 - mmengine - INFO - Epoch(val) [1][20/313] eta: 0:00:03 time: 0.0122 data_time: 0.0047 memory: 308
2022/08/22 15:52:51 - mmengine - INFO - Epoch(val) [1][30/313] eta: 0:00:03 time: 0.0123 data_time: 0.0047 memory: 308
...
2022/08/22 15:52:54 - mmengine - INFO - Epoch(val) [1][313/313] accuracy: 35.7000
The corresponding implementation of PyTorch and MMEngine:
In addition to these basic components, you can also use the executor to easily combine and configure various training techniques, such as enabling mixed-precision training and gradient accumulation (see OptimWrapper), configuring the learning rate decay curve (see Parameter Scheduler), etc.
Resume Training¶
Resuming training means continuing training from the state saved from some previous training, where the state includes the model’s weights, the state of the optimizer and the state of parameter scheduler.
Automatically resume training¶
Users can set the `resume` parameter of Runner to enable automatic training resumption. When `resume` is set to `True`, the Runner will try to resume from the latest checkpoint in `work_dir` automatically. If there is a latest checkpoint in `work_dir` (e.g. the training was interrupted during the last run), training will be resumed from that checkpoint; otherwise (e.g. the last run did not get the chance to save a checkpoint, or a new training task is started) training will start from scratch. Here is an example of how to enable automatic training resumption.
runner = Runner(
model=ResNet18(),
work_dir='./work_dir',
train_dataloader=train_dataloader_cfg,
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
train_cfg=dict(by_epoch=True, max_epochs=3),
resume=True,
)
runner.train()
Specify the checkpoint path¶
If you want to specify the path to resume training from, you need to set `load_from` in addition to `resume=True`. Note that if only `load_from` is set without `resume=True`, only the weights in the checkpoint will be loaded and training will restart from scratch, instead of continuing from the previous state.
runner = Runner(
model=ResNet18(),
work_dir='./work_dir',
train_dataloader=train_dataloader_cfg,
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
train_cfg=dict(by_epoch=True, max_epochs=3),
load_from='./work_dir/epoch_2.pth',
resume=True,
)
runner.train()
Speed up Training¶
Distributed Training¶
MMEngine supports training models on a CPU, a single GPU, multiple GPUs on a single machine, and multiple machines. When multiple GPUs are available, we can use the following commands to enable multi-GPU or multi-machine training and shorten the training time of the model.
Multiple GPUs on a single machine
Assuming the current machine has 8 GPUs, you can enable multi-GPU training with the following command:
python -m torch.distributed.launch --nproc_per_node=8 examples/train.py --launcher pytorch
If you need to specify GPU indices, you can set the `CUDA_VISIBLE_DEVICES` environment variable, e.g. to use the 0th and 3rd GPUs:
CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/train.py --launcher pytorch
Multiple machines
Assuming there are 2 machines connected via Ethernet, you can simply run the following commands.
On the first machine:
python -m torch.distributed.launch \
    --nnodes 2 \
    --node_rank 0 \
    --master_addr 127.0.0.1 \
    --master_port 29500 \
    --nproc_per_node=8 \
    examples/train.py --launcher pytorch
On the second machine:
python -m torch.distributed.launch \
    --nnodes 2 \
    --node_rank 1 \
    --master_addr 127.0.0.1 \
    --master_port 29500 \
    --nproc_per_node=8 \
    examples/train.py --launcher pytorch
If you are running MMEngine on a Slurm cluster, simply run the following command to enable training on 2 machines with 16 GPUs.
srun -p mm_dev \
    --job-name=test \
    --gres=gpu:8 \
    --ntasks=16 \
    --ntasks-per-node=8 \
    --cpus-per-task=5 \
    --kill-on-bad-exit=1 \
    python examples/train.py --launcher="slurm"
Mixed Precision Training¶
Nvidia introduced the Tensor Core unit into the Volta and Turing architectures to support FP32 and FP16 mixed precision computing. With automatic mixed precision training enabled, some operators operate at FP16 and the rest operate at FP32, which reduces training time and storage requirements without changing the model or degrading its training precision, thus supporting training with larger batch sizes, larger models, and larger input sizes.
PyTorch has officially supported AMP since version 1.6. If you are interested in the implementation of automatic mixed precision, you can refer to Mixed Precision Training.
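For reference, here is a minimal sketch of what native PyTorch mixed precision training looks like (a toy model on a CUDA device; the model, data, and hyperparameters are illustrative). AmpOptimWrapper essentially manages the autocast context and the gradient scaler for you:

import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

model = nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = GradScaler()

for _ in range(10):
    inputs = torch.randn(8, 16, device='cuda')
    optimizer.zero_grad()
    with autocast():                       # forward pass runs in mixed precision
        loss = model(inputs).pow(2).mean()
    scaler.scale(loss).backward()          # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                 # unscale gradients, then update parameters
    scaler.update()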
MMEngine provides the wrapper AmpOptimWrapper for automatic mixed precision training. Just set `type='AmpOptimWrapper'` in `optim_wrapper` to enable it; no other code changes are needed.
runner = Runner(
model=ResNet18(),
work_dir='./work_dir',
train_dataloader=train_dataloader_cfg,
optim_wrapper=dict(type='AmpOptimWrapper', optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
train_cfg=dict(by_epoch=True, max_epochs=3),
)
runner.train()
Save Memory on GPU¶
Memory capacity is critical in deep learning training and inference and determines whether the model can run successfully. Common memory saving approaches include:
Gradient Accumulation
Gradient accumulation is a mechanism that accumulates gradients over a configured number of steps instead of updating the parameters at every step; after those steps, the network parameters are updated once and the gradients are cleared. With this delayed parameter update, the result is similar to using a large batch size, while the memory used by activations is saved. Note, however, that if the model contains a batch normalization layer, gradient accumulation may affect its performance.
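The idea can be illustrated with a few lines of plain PyTorch (a toy model; the numbers are illustrative). With accumulative_counts=4 and a per-iteration batch size of 8, the effective batch size is 32:

import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accumulative_counts = 4  # update parameters once every 4 forward/backward passes

for step in range(100):
    inputs = torch.randn(8, 16)
    loss = model(inputs).pow(2).mean() / accumulative_counts  # average over accumulated steps
    loss.backward()                                           # gradients add up across iterations
    if (step + 1) % accumulative_counts == 0:
        optimizer.step()       # effective batch size is 8 * 4 = 32
        optimizer.zero_grad()  # clear the accumulated gradients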
Gradient Checkpointing
Gradient checkpointing is a time-for-space trade-off that reduces the number of saved activations; the unstored activations must then be recomputed when calculating the gradients. The corresponding functionality is implemented in the `torch.utils.checkpoint` package. Briefly, in the forward phase the function passed to `checkpoint` runs in `torch.no_grad` mode and only the input and the output of that function are saved; the intermediate activations are recomputed during the backward phase.
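A minimal hedged sketch of how torch.utils.checkpoint is used in plain PyTorch (the toy module and sizes are illustrative):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
)
head = nn.Linear(64, 10)

x = torch.randn(32, 64, requires_grad=True)
# Intermediate activations inside `blocks` are not stored in the forward pass;
# they are recomputed from `x` during the backward pass.
feat = checkpoint(blocks, x)
loss = head(feat).sum()
loss.backward()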
Large Model Training Techniques
Recent research has shown that training large models helps improve performance, but training models at such a scale requires huge resources, and it is hard to store the entire model in the memory of a single graphics card. Therefore large model training techniques were introduced, typically DeepSpeed ZeRO and the Fully Sharded Data Parallel (FSDP) technique introduced in FairScale. These techniques allow slicing the parameters, gradients, and optimizer states across the parallel processes, while still maintaining the simplicity of data parallelism.
MMEngine currently supports gradient accumulation and FSDP large model training; their usage is described below.
Gradient Accumulation¶
The configuration can be written in this way:
optim_wrapper_cfg = dict(
type='OptimWrapper',
optimizer=dict(type='SGD', lr=0.001, momentum=0.9),
# update parameters once every four iterations
accumulative_counts=4)
The full example working with `Runner` is as follows.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from mmengine.runner import Runner
from mmengine.model import BaseModel
train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)
class ToyModel(BaseModel):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, img, label, mode):
feat = self.linear(img)
loss1 = (feat - label).pow(2)
loss2 = (feat - label).abs()
return dict(loss1=loss1, loss2=loss2)
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01),
accumulative_counts=4)
)
runner.train()
Large Model Training¶
FSDP has been officially supported since PyTorch 1.11. The config can be written in this way:
# located in cfg file
model_wrapper_cfg=dict(type='MMFullyShardedDataParallel', cpu_offload=True)
The full example working with `Runner` is as follows.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from mmengine.runner import Runner
from mmengine.model import BaseModel
train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)
class ToyModel(BaseModel):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, img, label, mode):
feat = self.linear(img)
loss1 = (feat - label).pow(2)
loss2 = (feat - label).abs()
return dict(loss1=loss1, loss2=loss2)
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
cfg=dict(model_wrapper_cfg=dict(type='MMFullyShardedDataParallel', cpu_offload=True))
)
runner.train()
Please note that FSDP works only in distributed training environments.
Train a GAN¶
Coming soon. Please refer to the Chinese documentation.
Runner¶
Welcome to the tutorial of runner, the core of MMEngine’s user interface!
The runner, as an “integrator” in MMEngine, covers all aspects of the framework and shoulders the responsibility of organizing and scheduling nearly all modules. Therefore, the code logic in it has to take into account various situations, making it relatively hard to understand. But don’t worry! In this tutorial, we will leave out some messy details and have a quick overview of commonly used APIs, functionalities, and examples. Hopefully, this should provide you with a clear and easy-to-understand user interface. After reading through this tutorial, you will be able to:
Master the common usage and configuration of the runner
Learn the best practice - writing config files - of the runner
Know about the basic dataflow and execution order
Feel by yourself the advantages of using runner (perhaps)
Example codes of the runner¶
To build your training pipeline with a runner, there are typically two ways to get started:
Refer to runner’s API documentation for argument-by-argument configuration
Make your custom modifications based on some existing configurations, such as Getting started in 15 minutes and downstream repositories like MMDet
Both approaches have pros and cons. With the former, beginners may get lost in the vast number of configurable arguments. With the latter, beginners may find it hard to get a good reference, since neither an over-simplified nor an over-detailed reference is helpful to them.
We argue that the key to learning the runner is to use it as a memo. You should remember its most commonly used arguments and only focus on the less used ones when needed, since default values usually work fine. In the following, we will provide a beginner-friendly example to illustrate the most commonly used arguments of the runner, along with advanced guidelines for those less used.
A beginner-friendly example¶
Hint
In this tutorial, we hope you can focus more on overall architecture instead of implementation details. This “top-down” way of thinking is exactly what we advocate. Don’t worry, you will definitely have plenty of opportunities and guidance afterward to focus on modules you want to improve.
Before running the actual example below, you should first run this piece of code to prepare the model, dataset, and metric. However, these implementations are not important for this tutorial, and you can simply skim through them.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from mmengine.model import BaseModel
from mmengine.evaluator import BaseMetric
from mmengine.registry import MODELS, DATASETS, METRICS
@MODELS.register_module()
class MyAwesomeModel(BaseModel):
def __init__(self, layers=4, activation='relu') -> None:
super().__init__()
if activation == 'relu':
act_type = nn.ReLU
elif activation == 'silu':
act_type = nn.SiLU
elif activation == 'none':
act_type = nn.Identity
else:
raise NotImplementedError
sequence = [nn.Linear(2, 64), act_type()]
for _ in range(layers-1):
sequence.extend([nn.Linear(64, 64), act_type()])
self.mlp = nn.Sequential(*sequence)
self.classifier = nn.Linear(64, 2)
def forward(self, data, labels, mode):
x = self.mlp(data)
x = self.classifier(x)
if mode == 'tensor':
return x
elif mode == 'predict':
return F.softmax(x, dim=1), labels
elif mode == 'loss':
return {'loss': F.cross_entropy(x, labels)}
@DATASETS.register_module()
class MyDataset(Dataset):
def __init__(self, is_train, size):
self.is_train = is_train
if self.is_train:
torch.manual_seed(0)
self.labels = torch.randint(0, 2, (size,))
else:
torch.manual_seed(3407)
self.labels = torch.randint(0, 2, (size,))
r = 3 * (self.labels+1) + torch.randn(self.labels.shape)
theta = torch.rand(self.labels.shape) * 2 * torch.pi
self.data = torch.vstack([r*torch.cos(theta), r*torch.sin(theta)]).T
def __getitem__(self, index):
return self.data[index], self.labels[index]
def __len__(self):
return len(self.data)
@METRICS.register_module()
class Accuracy(BaseMetric):
def __init__(self):
super().__init__()
def process(self, data_batch, data_samples):
score, gt = data_samples
self.results.append({
'batch_size': len(gt),
'correct': (score.argmax(dim=1) == gt).sum().cpu(),
})
def compute_metrics(self, results):
total_correct = sum(r['correct'] for r in results)
total_size = sum(r['batch_size'] for r in results)
return dict(accuracy=100*total_correct/total_size)
Click to show a long example. Be well prepared
from torch.utils.data import DataLoader, default_collate
from torch.optim import Adam
from mmengine.runner import Runner
runner = Runner(
# your model
model=MyAwesomeModel(
layers=2,
activation='relu'),
# work directory for saving checkpoints and logs
work_dir='exp/my_awesome_model',
# training data
train_dataloader=DataLoader(
dataset=MyDataset(
is_train=True,
size=10000),
shuffle=True,
collate_fn=default_collate,
batch_size=64,
pin_memory=True,
num_workers=2),
# training configurations
train_cfg=dict(
by_epoch=True, # display in epoch number instead of iterations
max_epochs=10,
val_begin=2, # start validation from the 2nd epoch
val_interval=1), # do validation every 1 epoch
# OptimWrapper, a new concept in MMEngine providing richer optimization options.
# The default value works fine for most cases. You may check our documentation
# for more details, e.g. 'AmpOptimWrapper' for enabling mixed precision
# training.
optim_wrapper=dict(
optimizer=dict(
type=Adam,
lr=0.001)),
# ParamScheduler to adjust learning rates or momentums during training
param_scheduler=dict(
type='MultiStepLR',
by_epoch=True,
milestones=[4, 8],
gamma=0.1),
# validation data
val_dataloader=DataLoader(
dataset=MyDataset(
is_train=False,
size=1000),
shuffle=False,
collate_fn=default_collate,
batch_size=1000,
pin_memory=True,
num_workers=2),
# validation configurations, usually leave it an empty dict
val_cfg=dict(),
# evaluation metrics and evaluator
val_evaluator=dict(type=Accuracy),
# the following are advanced configurations; leave them at their defaults when not needed
# hooks are an advanced usage; leave them at their defaults when not needed
default_hooks=dict(
# the most commonly used hook for modifying checkpoint saving interval
checkpoint=dict(type='CheckpointHook', interval=1)),
# `launcher` and `env_cfg` are responsible for the distributed environment
launcher='none',
env_cfg=dict(
cudnn_benchmark=False, # whether to enable cudnn benchmark
backend='nccl', # distributed communication backend
mp_cfg=dict(mp_start_method='fork')), # multiprocessing configs
log_level='INFO',
# load model weights from the given path. None means no loading.
load_from=None,
# resume training from the given path
resume=False,
)
# start training your model
runner.train()
Explanations on example codes¶
Really a long piece of code, isn’t it? However, if you read through the above example, you may have already understood the training process in general even without knowing any implementation details, thanks to the compactness and readability of runner code (probably). This is what MMEngine expects: a structured, modular, and standardized training process that allows for more reliable reproductions and clearer comparisons.
The above example may lead you to the following confusion:
There are too many arguments!
Don’t worry. As we mentioned before, use the runner as a memo. The runner covers all aspects just to ensure you won’t miss something important. You don’t actually need to configure everything. The simple example in 15 minutes still works fine, and it can be simplified even further by removing `val_evaluator`, `val_dataloader`, and `val_cfg` without breaking anything. All configurable arguments are driven by your demands; those not in your focus usually work fine by default.
Why are some arguments passed as dicts?
Well, this is related to MMEngine’s style. In MMEngine, we provide 2 different styles of runner construction: a) manual construction and b) construction via registry. If you are confused, the following example will give a good illustration:
from mmengine.model import BaseModel
from mmengine.runner import Runner
from mmengine.registry import MODELS # root registry for your custom model
@MODELS.register_module() # decorator for registration
class MyAwesomeModel(BaseModel): # your custom model
def __init__(self, layers=18, activation='silu'):
...
# An example of construction via registry
runner = Runner(
model=dict(
type='MyAwesomeModel',
layers=50,
activation='relu'),
...
)
# An example of manual construction
model = MyAwesomeModel(layers=18, activation='relu')
runner = Runner(
model=model,
...
)
Similar to the above example, most arguments in the runner accept both types of inputs. They are conceptually equivalent. The difference is that with construction via registry, the module (passed in as a `dict`) will be built inside the runner when actually needed, while with manual construction the module is built before being passed to the runner. The following figure illustrates the core idea of the registry: it maintains the mapping between a module’s build method and its registry name. If you want to learn more about the full usage of the registry, you are recommended to read the Registry tutorial.
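To make the registry idea concrete, the dict-style argument is resolved roughly like this (a sketch; it assumes MyAwesomeModel has been registered as in the snippet above):

from mmengine.registry import MODELS

# Equivalent of what the runner does internally when it receives a dict:
cfg = dict(type='MyAwesomeModel', layers=50, activation='relu')
model = MODELS.build(cfg)  # look up 'MyAwesomeModel' and call it with the remaining keys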
You might still be confused after the explanation. Why should we let the Runner build modules from dicts? What are the benefits? If you have such questions, then we are proud to answer: “Absolutely - no benefits!” In fact, module construction via registry only works to its best advantage when combined with a configuration file. It is still far from the best practice to write as the above example. We provide it here just to make sure you can read and get used to this writing style, which may facilitate your understanding of the actual best practice we will soon talk about - the configuration file. Stay tuned!
If you as a beginner do not immediately understand, it doesn’t matter too much, because manual construction is still a good choice, especially for small-scale development and trial-and-error due to its being IDE friendly. However, you are still expected to read and get used to the writing style via registry, so that you can avoid being unnecessarily confused and puzzled in subsequent tutorials.
Where can I find the possible configuration options for the xxx argument?
You will find extensive instructions and examples in the tutorials of the corresponding modules. You can also find all possible arguments in Runner’s API documentation. If neither of the above resolves your query, you are always encouraged to start a topic in our discussion forum; it also helps us improve the documentation.
I come from repositories like MMDet/MMCls... Why does this example differ from what I've been exposed to?
Downstream repositories in OpenMMLab have widely adopted the writing style of config files. In the following chapter, we will show the usage of config files, the best practice of the runner in MMEngine, based on the above example with a slight variation.
Best practice of the Runner - config files¶
MMEngine provides a powerful config file system that supports Python syntax. You can almost seamlessly (which we will illustrate below) convert from the previous sample code to a config file. Here is an example:
# Save the following codes in example_config.py
# Almost copied from the above example, with some commas removed
model = dict(type='MyAwesomeModel',
layers=2,
activation='relu')
work_dir = 'exp/my_awesome_model'
train_dataloader = dict(
dataset=dict(type='MyDataset',
is_train=True,
size=10000),
sampler=dict(
type='DefaultSampler',
shuffle=True),
collate_fn=dict(type='default_collate'),
batch_size=64,
pin_memory=True,
num_workers=2)
train_cfg = dict(
by_epoch=True,
max_epochs=10,
val_begin=2,
val_interval=1)
optim_wrapper = dict(
optimizer=dict(
type='Adam',
lr=0.001))
param_scheduler = dict(
type='MultiStepLR',
by_epoch=True,
milestones=[4, 8],
gamma=0.1)
val_dataloader = dict(
dataset=dict(type='MyDataset',
is_train=False,
size=1000),
sampler=dict(
type='DefaultSampler',
shuffle=False),
collate_fn=dict(type='default_collate'),
batch_size=1000,
pin_memory=True,
num_workers=2)
val_cfg = dict()
val_evaluator = dict(type='Accuracy')
default_hooks = dict(
checkpoint=dict(type='CheckpointHook', interval=1))
launcher = 'none'
env_cfg = dict(
cudnn_benchmark=False,
backend='nccl',
mp_cfg=dict(mp_start_method='fork'))
log_level = 'INFO'
load_from = None
resume = False
Given the above config file, we can simply load the configuration and run the training pipeline in a few lines of code as follows:
from mmengine.config import Config
from mmengine.runner import Runner
config = Config.fromfile('example_config.py')
runner = Runner.from_cfg(config)
runner.train()
Note
Although it supports Python syntax, a valid config file requires that all variables be Python built-in types such as `str`, `dict`, and `int`. Therefore, the config system relies heavily on the registry mechanism to build other types such as `nn.Module` from built-in types.
Note
When using config files, you typically don’t need to manually register every module. For instance, all optimizers in `torch.optim`, including `Adam` and `SGD`, have already been registered in `mmengine.optim`. The rule of thumb is: try to use modules provided by PyTorch directly, and only register them manually if an error occurs.
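If you do hit an error of the kind "X is not in the registry", manual registration is a one-liner. A hedged sketch, using a third-party optimizer class as a stand-in (the lion_pytorch import is only an example of an external package, not something MMEngine or PyTorch ships):

from mmengine.registry import OPTIMIZERS
from lion_pytorch import Lion  # hypothetical third-party optimizer package

# After this one-time registration, configs can refer to it as
# optimizer=dict(type='Lion', lr=1e-4, weight_decay=1e-2)
OPTIMIZERS.register_module(module=Lion)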
Note
When using config files, the implementations of your custom modules may be stored in separate files and thus not be registered properly, which will lead to errors in the build process. You may find solutions in the Registry tutorial by searching for `custom_imports`.
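A common fix is to declare a custom_imports field in the config so the file that defines (and registers) your module is imported before building; the module path below is just an illustration:

# in example_config.py
custom_imports = dict(
    imports=['my_package.my_dataset'],  # the module that defines and registers MyDataset
    allow_failed_imports=False)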
Writing config files for the runner has been widely adopted in downstream repositories of OpenMMLab projects. It has become a de facto convention and best practice. Config files offer far more features than illustrated above; you can refer to the Config tutorial for more advanced features, including inheriting and overriding keywords.
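For example, inheritance lets a new experiment reuse example_config.py and only override what changes (a sketch; the file name and the overridden values are illustrative):

# Save as example_config_finetune.py
_base_ = ['./example_config.py']  # inherit every field from the base config

# Only write the fields you want to change; the rest comes from the base file
model = dict(type='MyAwesomeModel', layers=4, activation='silu')
train_cfg = dict(by_epoch=True, max_epochs=20, val_begin=2, val_interval=1)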
Basic dataflow¶
Hint
In this chapter, we’ll dive deeper into the runner to illustrate dataflow and data format convention between modules managed by the runner. It may be relatively abstract and dry if you haven’t built a training pipeline with MMEngine. Therefore, you are free to skip for now and read it in conjunction with practice in the future when in need.
Now let’s dive slightly deeper into the runner, and illustrate the dataflow and data format convention under the hood (or, under the engine)!
The diagram above illustrates the basic dataflow of the runner, where the dashed-border, gray-filled shapes represent different data formats, while solid boxes represent modules/methods. Due to the great flexibility and extensibility of MMEngine, you can always inherit some key base classes and override their methods, so the above diagram doesn’t always hold. It only holds when you are not customizing your own `Runner` or `TrainLoop`, and you are not overriding the `train_step`, `val_step` or `test_step` methods in your custom model. Actually, this is the common case for most tasks like detection and segmentation, as referred to in the Model tutorial.
Can you state the exact type of each data item shown in the diagram?
Unfortunately, this is not possible. Although we did heavy type annotations in MMEngine, Python is still a highly dynamic programming language, and deep learning as a data-centric system needs to be flexible enough to deal with a wide range of complex data sources. You always have full freedom to decide when you need to (and sometimes must) break type conventions. Therefore, when you are customizing your module (e.g. `val_evaluator`), you need to make sure its input is compatible with the upstream output (e.g. from `model`) and its output can be parsed by downstream modules. MMEngine puts the flexibility of handling data in the hands of the user, and thus also requires the user to ensure compatibility of the dataflow, which, in fact, is not that difficult once you get started.
The uniformity of data formats has always been a problem in deep learning. We are trying to improve it in MMEngine in our own way. If you are interested, you can refer to BaseDataset and BaseDataElement - but please note that they are mainly geared towards advanced users.
What's the data format convention between dataloader, model and evaluator?
For the basic dataflow shown in the diagram above, the data transfer between the above three modules can be represented by the following pseudo-code:
# training
for data_batch in train_dataloader:
data_batch = data_preprocessor(data_batch)
if isinstance(data_batch, dict):
losses = model.forward(**data_batch, mode='loss')
elif isinstance(data_batch, (list, tuple)):
losses = model.forward(*data_batch, mode='loss')
else:
raise TypeError()
# validation
for data_batch in val_dataloader:
data_batch = data_preprocessor(data_batch)
if isinstance(data_batch, dict):
outputs = model.forward(**data_batch, mode='predict')
elif isinstance(data_batch, (list, tuple)):
outputs = model.forward(*data_batch, mode='predict')
else:
raise TypeError()
evaluator.process(data_samples=outputs, data_batch=data_batch)
metrics = evaluator.evaluate(len(val_dataloader.dataset))
The key points of the above pseudo-code are:
Outputs of the data_preprocessor are passed to the model after unpacking.
The `data_samples` argument of the evaluator receives the prediction results of the model, while the `data_batch` argument receives the raw data coming from the dataloader.
What is data_preprocessor? Can I do image pre-processing such as crop and resize in it?
Though drawn separately in the diagram, data_preprocessor is a part of the model and can thus be found in the DataPreprocessor chapter of the Model tutorial.
In most cases, data_preprocessor needs no special attention or manual configuration. The default data_preprocessor only does data transfer between host and GPU devices. However, if your model's input format is incompatible with the dataloader's output, you can also customize your own data_preprocessor for data formatting.
Image pre-processing such as cropping and resizing is better done in the data transforms module, but batch-related data transforms (e.g. batch-resize) can be implemented here, as shown in the sketch below.
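As a rough illustration, a custom data_preprocessor can be a small subclass of BaseDataPreprocessor (a sketch; the FlipBatchPreprocessor name and the 'imgs' key are illustrative and depend on what your dataloader actually yields):

import torch
from mmengine.model import BaseDataPreprocessor
from mmengine.registry import MODELS

@MODELS.register_module()
class FlipBatchPreprocessor(BaseDataPreprocessor):
    """Moves data to the target device, then applies a batch-level transform."""

    def forward(self, data, training=False):
        data = super().forward(data, training)  # device transfer is handled by the base class
        if training and isinstance(data, dict) and 'imgs' in data:
            # a batch-related transform: flip the whole batch at once
            data['imgs'] = torch.flip(data['imgs'], dims=[-1])
        return data

# in the model config: data_preprocessor=dict(type='FlipBatchPreprocessor')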
Why does the model produce 3 different outputs? What is the meaning of "loss", "predict" and "tensor"?
As described in get started in 15 minutes, you need to implement 3 data paths in your custom model’s `forward` function to suit the different pipelines for training, validation and testing. This is further discussed in the Model tutorial.
I can see that the red line is for the training process and the blue line for validation/testing, but what is the green line?
Currently, model outputs in "tensor" mode are not officially used in the runner. The "tensor" mode can output some intermediate results and thus facilitate the debugging process.
What if I override methods such as train_step? Will the diagram totally fail?
The behavior of the default `train_step`, `val_step` and `test_step` covers the dataflow from the data_preprocessor to the model outputs and optim_wrapper. The rest of the diagram will not be affected.
Why use the runner? (Optional reading)¶
Hint
Contents in this chapter will not teach you how to use the runner and MMEngine. If you are being pushed by your employer/advisor/DDL to work out a result in a few hours, it may not help you and you can feel free to skip it. However, we highly recommend taking time to read through this chapter, since it will help you better understand the aim and style of MMEngine.
Relax, time for some philosophy
Congratulations on reading through the runner tutorial, a long, long, but hopefully interesting one! Please believe that all of these - this tutorial, the runner, MMEngine - are intended to make things easier for you.
The runner is the “manager” of all modules in MMEngine. In the runner, all the distinct modules - whether visible ones like model and dataset, or obscure ones like logging, distributed environment and random seed - are getting organized and scheduled. The runner deals with the complex relationship between different modules and provides you with a clear, easy-to-understand and configurable interface. The benefits of this design are:
You can modify or add your code without spoiling the whole codebase. For example, you may start with single-GPU training, and you can always add a few lines of configuration code to enable multi-GPU or even multi-node training.
You can continuously benefit from new features without worrying about backward compatibility. Mixed precision training, visualization, state of the art distributed training methods, various device backends… We will continue to absorb the best suggestions and cutting-edge technologies from the community while ensuring backward compatibility, and provide them to you in a clear interface.
You can focus on your own awesome ideas without being bothered by other annoying and irrelevant details. The default values will handle most cases.
So, MMEngine and the runner will truly make things easier for you. With only a little effort on migration, your code and experiments will evolve with MMEngine. With a little more effort, the config file system allows you to manage your data, model, and experiments more efficiently. Convenience and reliability are the aims we strive for.
The blue one, or the red one - are you prepared to use MMEngine?
Suggestions on next steps¶
If you want to:
Write your own model structure
Refer to Model tutorial
Use your own datasets
Refer to Dataset and DataLoader tutorial
Change evaluation metrics
Refer to Evaluation tutorial
Do something related to optimizers or mixed-precision training
Refer to OptimWrapper tutorial
Schedule learning rates or other parameters during training
Refer to Parameter Scheduler tutorial
Something not mentioned above
The “Common Usage” section on the left contains more example code
The “Advanced tutorials” section on the left contains more content for experienced developers to make more flexible extensions to the training pipeline
Hook provides some flexible modifications without spoiling your code
If none of the above solves your problem, you are always welcome to start a topic in our discussion forum!
Dataset and DataLoader¶
Hint
If you have never been exposed to PyTorch’s Dataset and DataLoader classes, you are recommended to read through PyTorch official tutorial to get familiar with some basic concepts.
Datasets and DataLoaders are necessary components in MMEngine’s training pipeline. They are conceptually derived from and consistent with PyTorch. Typically, a dataset defines the quantity, parsing, and pre-processing of the data, while a dataloader iteratively loads data according to settings such as `batch_size`, `shuffle`, `num_workers`, etc. Datasets are wrapped by dataloaders and together they constitute the data source.
In this tutorial, we will step through their usage in MMEngine runner from the outside (dataloader) to the inside (dataset) and give some practical examples. After reading through this tutorial, you will be able to:
Master the configuration of dataloaders in MMEngine
Learn to use existing datasets (e.g. those from `torchvision`) from config files
Know about building and using your own dataset
Details on dataloader¶
Dataloaders can be configured in MMEngine’s `Runner` with 3 arguments:
`train_dataloader`: Used in `Runner.train()` to provide training data for models
`val_dataloader`: Used in `Runner.val()` or in `Runner.train()` at regular intervals for model evaluation
`test_dataloader`: Used in `Runner.test()` for the final test
MMEngine fully supports PyTorch’s native `DataLoader` objects. Therefore, you can simply pass your valid, already-built dataloaders to the runner, as shown in getting started in 15 minutes. Meanwhile, thanks to the Registry Mechanism of MMEngine, those arguments also accept `dict`s as inputs, as illustrated in the following example (referred to as example 1). The keys in the dictionary correspond to the arguments of the DataLoader’s __init__ function.
runner = Runner(
train_dataloader=dict(
batch_size=32,
sampler=dict(
type='DefaultSampler',
shuffle=True),
dataset=torchvision.datasets.CIFAR10(...),
collate_fn=dict(type='default_collate')
)
)
When passed to the runner in the form of a dict, the dataloader will be lazily built in the runner when actually needed.
Note
For more configurable arguments of the `DataLoader`, please refer to the PyTorch API documentation.
Note
If you are interested in the details of the building procedure, you may refer to build_dataloader
You may notice that example 1 differs from the one in getting started in 15 minutes in some arguments. Indeed, due to some obscure conventions in MMEngine, you can’t seamlessly switch it to a dict by simply replacing `DataLoader` with `dict`. We will discuss the differences between our convention and PyTorch’s in the following sections, in case you run into trouble when using config files.
sampler and shuffle¶
One obvious difference is that we add a `sampler` argument to the dict. This is because we require `sampler` to be explicitly specified when using a dict as a dataloader. Meanwhile, `shuffle` is removed from the `DataLoader` arguments, because it conflicts with `sampler` in PyTorch, as noted in the PyTorch DataLoader API documentation.
Note
In fact, `shuffle` is just a convenience notation in the PyTorch implementation. If `shuffle` is set to `True`, the dataloader will automatically switch to `RandomSampler`.
With a `sampler` argument, the code in example 1 is nearly equivalent to the code block below.
from mmengine.dataset import DefaultSampler
dataset = torchvision.datasets.CIFAR10(...)
sampler = DefaultSampler(dataset, shuffle=True)
runner = Runner(
train_dataloader=DataLoader(
batch_size=32,
sampler=sampler,
dataset=dataset,
collate_fn=default_collate
)
)
Warning
The equivalence of the above code holds only if: 1) you are training with a single process, and 2) no `randomness` argument is passed to the runner. This is because the `sampler` should be built after the distributed environment is set up to be correct. The runner guarantees the correct order and proper random seeds by applying lazy initialization techniques, which is only possible for dict inputs. When building a sampler manually, extra work is required and it is highly error-prone. Therefore, the code block above is just for illustration and definitely not recommended. We strongly suggest passing `sampler` as a `dict` to avoid potential problems.
DefaultSampler¶
The above example may make you wonder what `DefaultSampler` is, why use it, and whether there are other options. In fact, `DefaultSampler` is a built-in sampler in MMEngine which eliminates the gap between distributed and non-distributed training and thus enables seamless conversion between them. If you have experience using `DistributedDataParallel` in PyTorch, you may remember having to change the `sampler` argument to make it correct. In MMEngine, however, you don’t need to bother with it thanks to `DefaultSampler`.
`DefaultSampler` accepts the following arguments:
`shuffle`: Set to `True` to load data from the dataset in random order
`seed`: Random seed used to shuffle the dataset. Typically it doesn’t require manual configuration here because the runner will handle it with the `randomness` configuration
`round_up`: When set to `True`, this behaves the same as setting `drop_last=False` in PyTorch’s `DataLoader`. You should take care of it when migrating from PyTorch.
Note
For more details about `DefaultSampler`, please refer to its API docs.
`DefaultSampler` handles most cases. We ensure that error-prone details such as random seeds are handled properly when you use it in a runner, which keeps you out of trouble with distributed training. Apart from `DefaultSampler`, you may also be interested in InfiniteSampler for iteration-based training pipelines. If you have more advanced demands, you may want to refer to the code of these two built-in samplers to implement your own and register it to the `DATA_SAMPLERS` registry.
from torch.utils.data import Sampler
from mmengine.registry import DATA_SAMPLERS

@DATA_SAMPLERS.register_module()
class MySampler(Sampler):
    pass
runner = Runner(
train_dataloader=dict(
sampler=dict(type='MySampler'),
...
)
)
The obscure collate_fn¶
Among the arguments of PyTorch’s `DataLoader`, `collate_fn` is often overlooked by users, but in MMEngine you must pay special attention to it. When you pass the dataloader argument as a dict, MMEngine will use the built-in pseudo_collate by default, which differs significantly from PyTorch’s default_collate. Therefore, when migrating from PyTorch, you have to explicitly specify `collate_fn` in config files to keep the behavior consistent.
Note
MMEngine uses `pseudo_collate` as the default value mainly due to historical compatibility reasons. You don’t have to look deeply into it; just be aware of it to avoid potential errors.
MMEngine provides 2 built-in `collate_fn`s:
`pseudo_collate`: The default value in MMEngine. It won’t concatenate data along the batch index. Detailed explanations can be found in the pseudo_collate API doc (a short comparison is sketched below).
`default_collate`: It behaves almost identically to PyTorch’s `default_collate`. It converts data into `Tensor`s and concatenates them along the batch index. More details and slight differences from PyTorch can be found in the default_collate API doc.
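A tiny comparison of the two built-in functions (a hedged sketch; the exact container types returned by pseudo_collate depend on the input structure):

import torch
from mmengine.dataset import default_collate, pseudo_collate

batch = [(torch.ones(2), 0), (torch.zeros(2), 1)]  # two (data, label) samples

# default_collate behaves like PyTorch: fields are stacked into batched tensors
data, labels = default_collate(batch)
print(data.shape, labels.shape)  # torch.Size([2, 2]) torch.Size([2])

# pseudo_collate only regroups the samples; tensors are NOT stacked
data, labels = pseudo_collate(batch)
print(len(data), data[0].shape)  # 2 torch.Size([2])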
If you want to use a custom `collate_fn`, you can register it to the `COLLATE_FUNCTIONS` registry.
from typing import Any, Sequence
from mmengine.registry import COLLATE_FUNCTIONS

@COLLATE_FUNCTIONS.register_module()
def my_collate_func(data_batch: Sequence) -> Any:
    pass
runner = Runner(
train_dataloader=dict(
...
collate_fn=dict(type='my_collate_func')
)
)
Details on dataset¶
Typically, a dataset defines the quantity, parsing, and pre-processing of the data. It is encapsulated in a dataloader, allowing the latter to load data in batches. Since we fully support the PyTorch `DataLoader`, the dataset is also compatible. Meanwhile, thanks to the registry mechanism, when a dataloader is given as a dict, its `dataset` argument can also be given as a dict, which enables lazy initialization in the runner. This mechanism makes writing config files possible.
Use torchvision datasets¶
`torchvision` provides various open datasets. They can be directly used in MMEngine as shown in getting started in 15 minutes, where a `CIFAR10` dataset is used together with torchvision’s built-in data transforms.
However, if you want to use the dataset in config files, registration is needed. What’s more, if you also require data transforms in torchvision, some more registrations are required. The following example illustrates how to do it.
import torchvision
import torchvision.transforms as tvt
from mmengine.registry import DATASETS, TRANSFORMS
from mmengine.dataset.base_dataset import Compose
# register CIFAR10 dataset in torchvision
# data transforms should also be built here
@DATASETS.register_module(name='Cifar10', force=False)
def build_torchvision_cifar10(transform=None, **kwargs):
if isinstance(transform, dict):
transform = [transform]
if isinstance(transform, (list, tuple)):
transform = Compose(transform)
return torchvision.datasets.CIFAR10(**kwargs, transform=transform)
# register data transforms in torchvision
TRANSFORMS.register_module('RandomCrop', module=tvt.RandomCrop)
TRANSFORMS.register_module('RandomHorizontalFlip', module=tvt.RandomHorizontalFlip)
TRANSFORMS.register_module('ToTensor', module=tvt.ToTensor)
TRANSFORMS.register_module('Normalize', module=tvt.Normalize)
# specify in runner
runner = Runner(
train_dataloader=dict(
batch_size=32,
sampler=dict(
type='DefaultSampler',
shuffle=True),
dataset=dict(type='Cifar10',
root='data/cifar10',
train=True,
download=True,
transform=[
dict(type='RandomCrop', size=32, padding=4),
dict(type='RandomHorizontalFlip'),
dict(type='ToTensor'),
dict(type='Normalize', **norm_cfg)])
)
)
Note
The above example makes extensive use of the registry mechanism and borrows the Compose module from MMEngine. If you are eager to use torchvision datasets in your config files, you can refer to it and make some slight modifications. However, we recommend borrowing datasets from downstream repos such as MMDet, MMCls, etc., which may give you a better experience.
Customize your dataset¶
You are free to customize your own datasets, as you would with PyTorch. You can also copy existing datasets from your previous PyTorch projects. If you want to learn how to customize your dataset, please refer to the PyTorch official tutorials.
Use MMEngine BaseDataset¶
Apart from using the PyTorch native `Dataset` class directly, you can also use MMEngine’s built-in class `BaseDataset` to customize your own, as described in the BaseDataset tutorial. It establishes some conventions on the format of annotation files, which makes the data interface more unified and multi-task training more convenient. Meanwhile, `BaseDataset` can easily cooperate with the built-in data transforms in MMEngine, which saves you from writing them from scratch.
Currently, `BaseDataset` is widely used in downstream repos of OpenMMLab 2.0 projects.
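A minimal hedged sketch of the idea (here load_data_list is overridden to generate a toy data list in code; real projects would follow the annotation-file conventions described in the BaseDataset tutorial, and the field names below are purely illustrative):

from mmengine.dataset import BaseDataset

class ToyDataset(BaseDataset):
    # Skip the annotation-file convention and produce the data list in code instead.
    def load_data_list(self):
        return [dict(img_path=f'img_{i}.png', gt_label=i % 2) for i in range(10)]

dataset = ToyDataset(pipeline=[])   # an empty pipeline returns the raw info dict
print(len(dataset))                 # 10
print(dataset.get_data_info(0))     # {'img_path': 'img_0.png', 'gt_label': 0, ...}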
Model¶
Runner and model¶
As mentioned in basic dataflow, the dataflow between DataLoader, model and evaluator follows some rules. Don’t remember clearly? Let’s review it:
# Training process
for data_batch in train_dataloader:
data_batch = model.data_preprocessor(data_batch, training=True)
if isinstance(data_batch, dict):
losses = model(**data_batch, mode='loss')
elif isinstance(data_batch, (list, tuple)):
losses = model(*data_batch, mode='loss')
else:
raise TypeError()
# Validation process
for data_batch in val_dataloader:
data_batch = model.data_preprocessor(data_batch, training=False)
if isinstance(data_batch, dict):
outputs = model(**data_batch, mode='predict')
elif isinstance(data_batch, (list, tuple)):
outputs = model(*data_batch, mode='predict')
else:
raise TypeError()
evaluator.process(data_samples=outputs, data_batch=data_batch)
metrics = evaluator.evaluate(len(val_dataloader.dataset))
In the runner tutorial, we briefly mentioned the relationship between DataLoader, model and evaluator, and introduced the concept of `data_preprocessor`. You may already have some understanding of the model. However, while the Runner is running, the situation is far more complex than the above pseudo-code.
In order to focus your attention on the algorithm itself, and ignore the complex relationships between the model, DataLoader and evaluator, we designed BaseModel. In most cases, the only thing you need to do is make your model inherit from `BaseModel` and implement `forward` as required to perform the training, testing, and validation processes.
Before continuing with the model tutorial, let’s pose two questions that we hope you will find the answers to after reading it:
When do we update the parameters of the model, and how can we update them with a custom optimization process?
Why is the concept of data_preprocessor necessary? What functions can it perform?
Interface introduction¶
Usually, we define a model to implement the body of the algorithm. In MMEngine, the model is managed by the Runner and needs to implement some interfaces, such as `train_step`, `val_step`, and `test_step`. For high-level tasks like detection, classification, and segmentation, these interfaces commonly implement a standard workflow. For example, `train_step` calculates the loss and updates the parameters of the model, and `val_step`/`test_step` calculate the metrics and return the predictions. Therefore, MMEngine abstracts BaseModel to implement this common workflow.
Benefiting from `BaseModel`, we only need to make the model inherit from `BaseModel` and implement the `forward` function to perform the training, testing, and validation processes.
Note
BaseModel inherits from BaseModule, which can be used to initialize the model parameters dynamically.
forward: The arguments of `forward` need to match the data given by the DataLoader. If the DataLoader samples a tuple `data`, `forward` needs to accept the unpacked `*data`. If the DataLoader returns a dict `data`, `forward` needs to accept the unpacked key-value pairs `**data`. `forward` also accepts a `mode` parameter, which is used to control the running branch:
`mode='loss'`: the `loss` mode is enabled during training, and `forward` returns a differentiable loss `dict`. Each key-value pair in the loss `dict` will be used to log the training status and optimize the parameters of the model. This branch will be called by `train_step`.
`mode='predict'`: the `predict` mode is enabled during validation/testing, and `forward` returns predictions, which match the arguments of process. Repositories of OpenMMLab have stricter rules: the predictions must be a list and each element of it must be a BaseDataElement. This branch will be called by `val_step`.
`mode='tensor'`: in both the `tensor` and `predict` modes, `forward` returns predictions. The difference is that in `tensor` mode, `forward` returns a `tensor` or a container of `tensor`s which have not been processed by post-processing methods, such as non-maximum suppression (NMS). You can customize your own post-processing after getting the result of `tensor` mode.
train_step: Gets the loss `dict` by calling `forward` in `loss` mode. `BaseModel` implements a standard optimization process as follows:
def train_step(self, data, optim_wrapper):
# See details in the next section
data = self.data_preprocessor(data, training=True)
# `loss` mode, returns a loss dict. Actually train_step accepts
# both tuple and dict inputs, and unpacks them with * or **
loss = self(**data, mode='loss')
# Parse the loss dict and return the parsed losses for optimization
# and log_vars for logging
parsed_losses, log_vars = self.parse_losses(loss)
optim_wrapper.update_params(parsed_losses)  # update the parameters
return log_vars
val_step: Gets the predictions by calling `forward` in `predict` mode.
def val_step(self, data, optim_wrapper):
    data = self.data_preprocessor(data, training=False)
    outputs = self(**data, mode='predict')
    return outputs
test_step: In BaseModel there is no difference between val_step and test_step, but they can be customized in subclasses; for example, you could compute the validation loss in val_step.
Having understood the interfaces of BaseModel, we can now write a more complete pseudo-code:
# training
for data_batch in train_dataloader:
loss_dict = model.train_step(data_batch)
# validation
for data_batch in val_dataloader:
preds = model.test_step(data_batch)
evaluator.process(data_samples=preds, data_batch=data_batch)
metrics = evaluator.evaluate(len(val_dataloader.dataset))
Great! Ignoring Hook, the pseudo-code above almost implements the main logic of the loop. Let's go back to 15 minutes to get started with MMEngine, and we can now truly understand what MMResNet50 has done:
import torch.nn.functional as F
import torchvision
from mmengine.model import BaseModel
class MMResNet50(BaseModel):
def __init__(self):
super().__init__()
self.resnet = torchvision.models.resnet50()
def forward(self, imgs, labels, mode):
x = self.resnet(imgs)
if mode == 'loss':
return {'loss': F.cross_entropy(x, labels)}
elif mode == 'predict':
return x, labels
# train_step, val_step and test_step have been implemented in BaseModel.
# We list the equivalent code here for better understanding
def train_step(self, data, optim_wrapper):
data = self.data_preprocessor(data)
loss = self(*data, mode='loss')
parsed_losses, log_vars = self.parse_losses(loss)
optim_wrapper.update_params(parsed_losses)
return log_vars
def val_step(self, data, optim_wrapper):
data = self.data_preprocessor(data)
outputs = self(*data, mode='predict')
return outputs
def test_step(self, data, optim_wrapper):
data = self.data_preprocessor(data)
outputs = self(*data, mode='predict')
return outputs
Now you may have a deeper understanding of the dataflow and be able to answer the first of the two questions raised at the beginning of this tutorial.
BaseModel.train_step implements the standard optimization process; if we want to customize the optimization, we can override it in a subclass. However, note that train_step must still return a loss dict.
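As a hedged sketch only (not part of BaseModel itself), a subclass that needs a non-standard optimization procedure could override train_step along these lines, reusing the fine-grained OptimWrapper interfaces introduced later in this document:
class CustomOptimResNet50(MMResNet50):  # hypothetical subclass
    def train_step(self, data, optim_wrapper):
        data = self.data_preprocessor(data, training=True)
        losses = self(*data, mode='loss')
        parsed_losses, log_vars = self.parse_losses(losses)
        # customized optimization: explicit backward/step/zero_grad
        # instead of the one-shot update_params
        optim_wrapper.backward(parsed_losses)
        optim_wrapper.step()
        optim_wrapper.zero_grad()
        return log_vars  # train_step must still return a loss dict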
DataPreprocessor¶
If your computer is equipped with a GPU (or other hardware that can accelerate training, such as an MPS or IPU device), you will see that the program in the 15 minutes tutorial runs on the GPU. But when does MMEngine move the data and the model from the CPU to the GPU?
In fact, the Runner moves the model to the specified device during construction, while the data is moved to the specified device by the self.data_preprocessor(data) call mentioned in the code snippet of the previous section. The moved data is then passed to the model.
This makes sense, but doesn't it seem a little odd? At this point you may be wondering:
MMResNet50 does not define data_preprocessor, so why can it still access data_preprocessor and move data to the GPU?
Why does BaseModel not move the data with data = data.to(device), but instead relies on a DataPreprocessor to move it?
The answer to the first question is that MMResNet50 inherits from BaseModel, and super().__init__ builds a default data_preprocessor for it. The equivalent implementation of the default one looks like this:
class BaseDataPreprocessor(nn.Module):
def forward(self, data, training=True): # ignore the training parameter here
# suppose data given by CIFAR10 is a tuple. Actually
# BaseDataPreprocessor could move various type of data
# to target device.
return tuple(_data.cuda() for _data in data)
BaseDataPreprocessor will move the data to the specified device.
Before answering the second question, let's think about a few more questions:
Where should we perform normalization, in the transform or in the Model?
It sounds reasonable to put it in the transform to take advantage of the DataLoader's multi-process acceleration, or in the model so that it runs on the GPU. However, while we are debating whether normalization is faster on the CPU or on the GPU, moving the data from the CPU to the GPU takes far longer than the normalization itself.
In fact, less computationally intensive operations like normalization take much less time than data transfer, so data transfer has a higher priority for optimization. If we could move the data to the target device while it is still uint8 and before it is normalized (normalized float data is 4 times larger than uint8 data), we would reduce the required bandwidth and greatly improve the efficiency of data transfer. This "lagged" normalization is one of the main reasons why we designed the DataPreprocessor: the data preprocessor moves the data first and then normalizes it.
How do we implement data augmentations like MixUp and Mosaic?
It might seem that MixUp and Mosaic are just special data transformations that should be implemented in the transform. However, these two transformations fuse multiple images into one, which is hard to express in a transform: the current transform paradigm applies enhancements to a single image, and the dataset is not accessible from within a transform, so additional images cannot be read there. If we instead implement Mosaic or MixUp on the batch_data sampled from the DataLoader, everything becomes easy: we can access multiple images at the same time and perform the image fusion directly.
class MixUpDataPreprocessor(nn.Module):
    def __init__(self, num_class, alpha):
        super().__init__()
        self.alpha = alpha
        self.num_class = num_class

    def forward(self, data, training=True):
        data = tuple(_data.cuda() for _data in data)
        # Only perform MixUp in training mode
        if not training:
            return data
        img, label = data
        label = F.one_hot(label, self.num_class)  # label to one-hot
        batch_size = len(label)
        index = torch.randperm(batch_size)  # index of the images to fuse
        lam = np.random.beta(self.alpha, self.alpha)  # fusion factor
        # MixUp
        img = lam * img + (1 - lam) * img[index, :]
        label = lam * label + (1 - lam) * label[index, :]
        # Since the returned label is one-hot encoded, the `forward` of the
        # model should also be adjusted.
        return img, label
Therefore, besides data transfer and normalization, another major function of data_preprocessor is batch augmentation (BatchAugmentation). The modularity of the data preprocessor also lets us freely combine algorithms with data augmentations.
What should we do if the data sampled from the DataLoader does not match the model input? Should we modify the DataLoader or the model interface?
The answer is that neither is appropriate. The ideal solution is to do the adaptation without breaking the existing interface between the model and the DataLoader. The DataPreprocessor can also handle this: you can customize your DataPreprocessor to convert the incoming data to the target format, as in the sketch below.
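For illustration, here is a hedged sketch of such an adapter; the dict keys 'image' and 'gt' are hypothetical names for what the DataLoader might return, while the model's forward expects imgs and labels:
from mmengine.model import BaseDataPreprocessor

class DictToKwargsPreprocessor(BaseDataPreprocessor):  # hypothetical adapter
    def forward(self, data, training=False):
        # move every tensor in the container to the target device first
        data = self.cast_data(data)
        # then repack the dict into the keyword arguments expected by the model
        return dict(imgs=data['image'], labels=data['gt'])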
By now, you should understand the rationale of the data preprocessor and be able to confidently answer the two questions posed at the beginning of the tutorial. But you may still wonder what the optim_wrapper passed to train_step is, and how the predictions returned by test_step and val_step relate to the evaluator. You will find more details in the evaluation tutorial and the optimizer wrapper tutorial.
Evaluation¶
Coming soon. Please refer to the Chinese documentation.
OptimWrapper¶
In the previous tutorials on the runner and the model, we have already mentioned the concept of OptimWrapper, but we have not yet explained why we need it and what advantages OptimWrapper has over PyTorch's native optimizer. In this tutorial, we will help you understand those advantages and demonstrate how to use the wrapper.
As its name suggests, OptimWrapper
is a high-level abstraction of PyTorch’s native optimizer, which provides a unified set of interfaces while adding more functionality. OptimWrapper
supports different training strategies, including mixed precision training, gradient accumulation, and gradient clipping. We can choose the appropriate training strategy according to our needs. OptimWrapper
also defines a standard process for parameter updating based on which users can switch between different training strategies for the same set of code.
OptimWrapper vs Optimizer¶
Now we use both the native optimizer of PyTorch and the OptimWrapper in MMEngine to perform single-precision training, mixed-precision training, and gradient accumulation to show the difference in implementations.
Model training¶
1.1 Single-precision training with SGD in PyTorch
import torch
from torch.optim import SGD
import torch.nn as nn
import torch.nn.functional as F
inputs = [torch.zeros(10, 1, 1)] * 10
targets = [torch.ones(10, 1, 1)] * 10
model = nn.Linear(1, 1)
optimizer = SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()
for input, target in zip(inputs, targets):
output = model(input)
loss = F.l1_loss(output, target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
1.2 Single-precision training with OptimWrapper in MMEngine
from mmengine.optim import OptimWrapper
optim_wrapper = OptimWrapper(optimizer=optimizer)
for input, target in zip(inputs, targets):
output = model(input)
loss = F.l1_loss(output, target)
optim_wrapper.update_params(loss)
OptimWrapper.update_params implements the standard process of gradient computation, parameter updating, and gradient zeroing, so it can be used to update the model parameters directly.
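Conceptually, update_params bundles the three steps we wrote by hand in the PyTorch loop; ignoring gradient accumulation and clipping, a rough sketch of what it does is:
# simplified sketch, not the actual implementation: the real update_params
# also handles gradient accumulation and gradient clipping
def update_params(self, loss):
    self.backward(loss)   # compute gradients (loss.backward())
    self.step()           # update parameters (optimizer.step())
    self.zero_grad()      # clear gradients (optimizer.zero_grad())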
2.1 Mixed-precision training with SGD in PyTorch
from torch.cuda.amp import autocast
model = model.cuda()
inputs = [torch.zeros(10, 1, 1, 1)] * 10
targets = [torch.ones(10, 1, 1, 1)] * 10
for input, target in zip(inputs, targets):
with autocast():
output = model(input.cuda())
loss = F.l1_loss(output, target.cuda())
loss.backward()
optimizer.step()
optimizer.zero_grad()
2.2 Mixed-precision training with OptimWrapper in MMEngine
from mmengine.optim import AmpOptimWrapper
optim_wrapper = AmpOptimWrapper(optimizer=optimizer)
for input, target in zip(inputs, targets):
with optim_wrapper.optim_context(model):
output = model(input.cuda())
loss = F.l1_loss(output, target.cuda())
optim_wrapper.update_params(loss)
To enable mixed-precision training, users need to use AmpOptimWrapper.optim_context, which, like autocast, enables the context for mixed-precision training. In addition, AmpOptimWrapper.optim_context can speed up gradient accumulation during distributed training, as shown in the next example.
3.1 Mixed-precision training and gradient accumulation with SGD in PyTorch
for idx, (input, target) in enumerate(zip(inputs, targets)):
with autocast():
output = model(input.cuda())
loss = F.l1_loss(output, target.cuda())
loss.backward()
if idx % 2 == 0:
optimizer.step()
optimizer.zero_grad()
3.2 Mixed-precision training and gradient accumulation with OptimWrapper in MMEngine
optim_wrapper = AmpOptimWrapper(optimizer=optimizer, accumulative_counts=2)
for input, target in zip(inputs, targets):
with optim_wrapper.optim_context(model):
output = model(input.cuda())
loss = F.l1_loss(output, target.cuda())
optim_wrapper.update_params(loss)
We only need to configure the accumulative_counts parameter and call the update_params interface to achieve gradient accumulation. Besides, in the distributed training scenario, if gradient accumulation is configured with the optim_context context enabled, unnecessary gradient synchronization can be avoided during the accumulation steps.
The OptimWrapper also provides a more fine-grained interface for users to customize their own parameter update logic.
backward: Accepts a loss tensor and computes the gradients of the parameters.
step: Same as optimizer.step, updates the parameters.
zero_grad: Same as optimizer.zero_grad, zeroes the gradients of the parameters.
We can use the above interfaces to implement the same parameter-updating logic as with the PyTorch optimizer.
for idx, (input, target) in enumerate(zip(inputs, targets)):
optimizer.zero_grad()
with optim_wrapper.optim_context(model):
output = model(input.cuda())
loss = F.l1_loss(output, target.cuda())
optim_wrapper.backward(loss)
if idx % 2 == 0:
optim_wrapper.step()
optim_wrapper.zero_grad()
We can also configure a gradient clipping strategy for the OptimWrapper.
# based on torch.nn.utils.clip_grad_norm_ method
optim_wrapper = AmpOptimWrapper(
optimizer=optimizer, clip_grad=dict(max_norm=1))
# based on torch.nn.utils.clip_grad_value_ method
optim_wrapper = AmpOptimWrapper(
optimizer=optimizer, clip_grad=dict(clip_value=0.2))
Get learning rate/momentum¶
The OptimWrapper provides the get_lr
and get_momentum
for the convenience of getting the learning rate and momentum of the first parameter group in the optimizer.
import torch.nn as nn
from torch.optim import SGD
from mmengine.optim import OptimWrapper
model = nn.Linear(1, 1)
optimizer = SGD(model.parameters(), lr=0.01)
optim_wrapper = OptimWrapper(optimizer)
print(optimizer.param_groups[0]['lr']) # 0.01
print(optimizer.param_groups[0]['momentum']) # 0
print(optim_wrapper.get_lr()) # {'lr': [0.01]}
print(optim_wrapper.get_momentum()) # {'momentum': [0]}
0.01
0
{'lr': [0.01]}
{'momentum': [0]}
Export/load state dicts¶
Similar to the optimizer, the OptimWrapper provides the state_dict
and load_state_dict
interfaces for exporting and loading the optimizer states. For the AmpOptimWrapper
, it can export mixed-precision training parameters as well.
import torch.nn as nn
from torch.optim import SGD
from mmengine.optim import OptimWrapper, AmpOptimWrapper
model = nn.Linear(1, 1)
optimizer = SGD(model.parameters(), lr=0.01)
optim_wrapper = OptimWrapper(optimizer=optimizer)
amp_optim_wrapper = AmpOptimWrapper(optimizer=optimizer)
# export state dicts
optim_state_dict = optim_wrapper.state_dict()
amp_optim_state_dict = amp_optim_wrapper.state_dict()
print(optim_state_dict)
print(amp_optim_state_dict)
optim_wrapper_new = OptimWrapper(optimizer=optimizer)
amp_optim_wrapper_new = AmpOptimWrapper(optimizer=optimizer)
# load state dicts
amp_optim_wrapper_new.load_state_dict(amp_optim_state_dict)
optim_wrapper_new.load_state_dict(optim_state_dict)
{'state': {}, 'param_groups': [{'lr': 0.01, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'params': [0, 1]}]}
{'state': {}, 'param_groups': [{'lr': 0.01, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'params': [0, 1]}], 'loss_scaler': {'scale': 65536.0, 'growth_factor': 2.0, 'backoff_factor': 0.5, 'growth_interval': 2000, '_growth_tracker': 0}}
Use multiple optimizers¶
Considering that algorithms like GANs usually need to use multiple optimizers to train the generator and the discriminator, MMEngine provides a container class called OptimWrapperDict
to manage them. OptimWrapperDict
stores the sub-OptimWrapper in the form of dict
, and can be accessed and traversed just like a dict
.
Unlike a regular OptimWrapper, OptimWrapperDict does not provide methods such as update_params, optim_context, backward, step, etc. Therefore, it cannot be used directly to train models. We suggest implementing the parameter-updating logic by accessing the sub-OptimWrappers in the OptimWrapperDict directly.
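For example, a GAN-style train_step might access the sub-OptimWrappers by key; the gen/disc keys match the container built below, while the two loss helpers are placeholders:
# hedged sketch: disc_loss and gen_loss are hypothetical helper methods
def train_step(self, data, optim_wrapper):
    data = self.data_preprocessor(data, training=True)
    # update the discriminator with its own wrapper
    disc_loss = self.disc_loss(data)
    optim_wrapper['disc'].update_params(disc_loss)
    # then update the generator with its own wrapper
    gen_loss = self.gen_loss(data)
    optim_wrapper['gen'].update_params(gen_loss)
    return dict(disc_loss=disc_loss, gen_loss=gen_loss)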
Users may wonder why not just use a plain dict to manage multiple optimizers, since OptimWrapperDict does not have training capabilities. Actually, the core function of OptimWrapperDict is to support exporting and loading the state dictionaries of all sub-OptimWrappers, as well as getting their learning rates and momentums. Without OptimWrapperDict, MMEngine would need a lot of if-else logic inside OptimWrapper to collect the states of the OptimWrappers.
from torch.optim import SGD
import torch.nn as nn
from mmengine.optim import OptimWrapper, OptimWrapperDict
gen = nn.Linear(1, 1)
disc = nn.Linear(1, 1)
optimizer_gen = SGD(gen.parameters(), lr=0.01)
optimizer_disc = SGD(disc.parameters(), lr=0.01)
optim_wrapper_gen = OptimWrapper(optimizer=optimizer_gen)
optim_wrapper_disc = OptimWrapper(optimizer=optimizer_disc)
optim_dict = OptimWrapperDict(gen=optim_wrapper_gen, disc=optim_wrapper_disc)
print(optim_dict.get_lr()) # {'gen.lr': [0.01], 'disc.lr': [0.01]}
print(optim_dict.get_momentum()) # {'gen.momentum': [0], 'disc.momentum': [0]}
{'gen.lr': [0.01], 'disc.lr': [0.01]}
{'gen.momentum': [0], 'disc.momentum': [0]}
As shown in the above example, OptimWrapperDict
exports learning rates and momentums for all OptimWrappers easily, and OptimWrapperDict
can export and load all the state dicts in a similar way.
Configure the OptimWrapper in Runner¶
We first need to configure the optimizer
for the OptimWrapper. MMEngine automatically adds all optimizers in PyTorch to the OPTIMIZERS
registry, and users can specify the optimizers they need in the form of a dict
. All supported optimizers in PyTorch are listed here.
Now let's take setting up an SGD OptimWrapper as an example.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)
Here we have set up an OptimWrapper with an SGD optimizer and the specified learning rate and momentum. Since OptimWrapper targets standard single-precision training, we can also omit the type field in the configuration:
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(optimizer=optimizer)
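Since every PyTorch optimizer is registered, switching to a different optimizer such as AdamW only requires changing the dict:
optimizer = dict(type='AdamW', lr=0.001, weight_decay=0.01)
optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)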
To enable mixed-precision training and gradient accumulation, we change type
to AmpOptimWrapper
and specify the accumulative_counts
parameter.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(type='AmpOptimWrapper', optimizer=optimizer, accumulative_counts=2)
Note
If you are new to reading the MMEngine tutorial and are not familiar with concepts such as configs and registries, it is recommended to skip the following advanced tutorials for now and read other documents first. Of course, if you already have a good understanding of this prerequisite knowledge, we highly recommend reading the advanced part which covers:
How to customize the learning rate, decay coefficient, and other hyperparameters of the model's parameters in the OptimWrapper configuration.
How to customize the construction policy of the optimizer.
Apart from the pre-requisite knowledge of the configs and the registries, it is recommended to have a thorough understanding of the native construction of PyTorch optimizer before starting the advanced tutorials.
Advanced usages¶
PyTorch's optimizer allows different hyperparameters to be set for different groups of parameters in the model, e.g. using different learning rates for the backbone and the head of a classification model.
from torch.optim import SGD
import torch.nn as nn
model = nn.ModuleDict(dict(backbone=nn.Linear(1, 1), head=nn.Linear(1, 1)))
optimizer = SGD([{'params': model.backbone.parameters()},
{'params': model.head.parameters(), 'lr': 1e-3}],
lr=0.01,
momentum=0.9)
In the above example, we set a learning rate of 0.01 for the backbone, while another learning rate of 1e-3 for the head. Users can pass a list of dictionaries containing the different parts of the model’s parameters and their corresponding hyperparameters to the optimizer, allowing for fine-grained adjustment of the model optimization.
In MMEngine, the optimizer wrapper constructor allows users to set hyperparameters in different parts of the model directly by setting the paramwise_cfg
in the configuration file rather than by modifying the code of building the optimizer.
Set different hyperparameters for different types of parameters¶
The default optimizer wrapper constructor in MMEngine supports setting different hyperparameters for different types of parameters in the model. For example, we can set norm_decay_mult=0 in paramwise_cfg to set the weight decay to 0 for the weights and biases of normalization layers, implementing the trick of not decaying normalization layers mentioned in Bag of Tricks.
Here, we set the weight decay coefficient in all normalization layers (head.bn
) in ToyModel
to 0 as follows.
from mmengine.optim import build_optim_wrapper
from collections import OrderedDict
class ToyModel(nn.Module):
def __init__(self):
super().__init__()
self.backbone = nn.ModuleDict(
dict(layer0=nn.Linear(1, 1), layer1=nn.Linear(1, 1)))
self.head = nn.Sequential(
OrderedDict(
linear=nn.Linear(1, 1),
bn=nn.BatchNorm1d(1)))
optim_wrapper = dict(
optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
paramwise_cfg=dict(norm_decay_mult=0))
optimizer = build_optim_wrapper(ToyModel(), optim_wrapper)
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:weight_decay=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:weight_decay=0.0
In addition to configuring the weight decay, paramwise_cfg
of MMEngine’s default optimizer wrapper constructor supports the following hyperparameters as well.
lr_mult: Learning rate coefficient for all parameters.
decay_mult: Decay coefficient for all parameters.
bias_lr_mult: Learning rate coefficient of the bias (excluding the bias of normalization layers and the offset of deformable convolutions).
bias_decay_mult: Weight decay coefficient of the bias (excluding the bias of normalization layers and the offset of deformable convolutions).
norm_decay_mult: Weight decay coefficient for the weights and bias of normalization layers.
flat_decay_mult: Weight decay coefficient of one-dimensional parameters.
dwconv_decay_mult: Decay coefficient of depth-wise convolutions.
bypass_duplicate: Whether to skip duplicate parameters, defaults to False.
dcn_offset_lr_mult: Learning rate coefficient of the offset of deformable convolutions.
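As an illustration, several of these keys can be combined in a single paramwise_cfg (the values below are arbitrary):
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001),
    paramwise_cfg=dict(
        norm_decay_mult=0,   # no weight decay for normalization layers
        bias_lr_mult=2,      # double the learning rate of bias parameters
        flat_decay_mult=0))  # no weight decay for one-dimensional parameters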
Set different hyperparameters for different model modules¶
In addition, as shown in the PyTorch code above, in MMEngine we can also set different hyperparameters for any module in the model by setting custom_keys
in paramwise_cfg
.
If we want to set the learning rate and the decay coefficient of backbone.layer0 to 0, keep the learning rate of the rest of the backbone at 0.01, and set the learning rate of the head module to 0.001, we can do it in this way:
optim_wrapper = dict(
optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
paramwise_cfg=dict(
custom_keys={
'backbone.layer0': dict(lr_mult=0, decay_mult=0),
'backbone': dict(lr_mult=1),
'head': dict(lr_mult=0.1)
}))
optimizer = build_optim_wrapper(ToyModel(), optim_wrapper)
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:lr=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:weight_decay=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:lr_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:decay_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:lr=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:weight_decay=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:lr_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:decay_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.weight:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.weight:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.weight:lr_mult=1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:lr_mult=1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.weight:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.weight:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.weight:lr_mult=0.1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:lr_mult=0.1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:lr_mult=0.1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:lr_mult=0.1
The parameter names of the above model can be printed as follows:
for name, val in ToyModel().named_parameters():
print(name)
backbone.layer0.weight
backbone.layer0.bias
backbone.layer1.weight
backbone.layer1.bias
head.linear.weight
head.linear.bias
head.bn.weight
head.bn.bias
Each field in custom_keys is defined as follows.
'backbone': dict(lr_mult=1): Set the learning rate coefficient of parameters whose names are prefixed with backbone to 1, i.e. keep their learning rate equal to the base learning rate 0.01.
'backbone.layer0': dict(lr_mult=0, decay_mult=0): Set the learning rate coefficient and the decay coefficient of parameters prefixed with backbone.layer0 to 0, so their learning rate and weight decay become 0. This configuration has a higher priority than the first one.
'head': dict(lr_mult=0.1): Set the learning rate coefficient of parameters whose names are prefixed with head to 0.1, i.e. their learning rate becomes 0.001.
Customize optimizer construction policies¶
Like other modules in MMEngine, the optimizer wrapper constructor is also managed by the registry. We can customize the hyperparameter policies by implementing custom optimizer wrapper constructors.
For example, we can implement an optimizer wrapper constructor called LayerDecayOptimWrapperConstructor
that automatically sets decreasing learning rates for layers at different depths of the model.
from mmengine.optim import DefaultOptimWrapperConstructor
from mmengine.registry import OPTIM_WRAPPER_CONSTRUCTORS
from mmengine.logging import print_log
@OPTIM_WRAPPER_CONSTRUCTORS.register_module(force=True)
class LayerDecayOptimWrapperConstructor(DefaultOptimWrapperConstructor):
def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
super().__init__(optim_wrapper_cfg, paramwise_cfg)
self.decay_factor = paramwise_cfg.get('decay_factor', 0.5)
def add_params(self, params, module, prefix='', lr=None):
if lr is None:
lr = self.base_lr
for name, param in module.named_parameters(recurse=False):
param_group = dict()
param_group['params'] = [param]
param_group['lr'] = lr
params.append(param_group)
full_name = f'{prefix}.{name}' if prefix else name
print_log(f'{full_name} : lr={lr}', logger='current')
for name, module in module.named_children():
child_prefix = f'{prefix}.{name}' if prefix else name
self.add_params(
params, module, child_prefix, lr=lr * self.decay_factor)
class ToyModel(nn.Module):
def __init__(self) -> None:
super().__init__()
self.layer = nn.ModuleDict(dict(linear=nn.Linear(1, 1)))
self.linear = nn.Linear(1, 1)
model = ToyModel()
optim_wrapper = dict(
optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
paramwise_cfg=dict(decay_factor=0.5),
constructor='LayerDecayOptimWrapperConstructor')
optimizer = build_optim_wrapper(model, optim_wrapper)
08/23 22:20:26 - mmengine - INFO - layer.linear.weight : lr=0.0025
08/23 22:20:26 - mmengine - INFO - layer.linear.bias : lr=0.0025
08/23 22:20:26 - mmengine - INFO - linear.weight : lr=0.005
08/23 22:20:26 - mmengine - INFO - linear.bias : lr=0.005
When add_params
is called for the first time, the params
argument is an empty list
and the module
is the ToyModel
instance. Please refer to the Optimizer Wrapper Constructor Documentation for detailed explanations on overloading.
Similarly, if we want to construct multiple optimizers, we also need to implement a custom constructor.
@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
class MultipleOptimWrapperConstructor:
...
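A hedged sketch of how the body above might be filled in, assuming the config maps submodule names (e.g. gen and disc) to their optimizer configs and the model exposes submodules with those names:
from mmengine.optim import OptimWrapper, OptimWrapperDict
from mmengine.registry import OPTIMIZERS

class MultipleOptimWrapperConstructor:  # registered as shown above
    def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
        # assumed layout: dict(gen=dict(optimizer=...), disc=dict(optimizer=...))
        self.optim_wrapper_cfg = optim_wrapper_cfg

    def __call__(self, model):
        wrappers = dict()
        for name, wrapper_cfg in self.optim_wrapper_cfg.items():
            optimizer_cfg = dict(wrapper_cfg['optimizer'])
            # bind the submodule's parameters to its own optimizer
            optimizer_cfg['params'] = getattr(model, name).parameters()
            optimizer = OPTIMIZERS.build(optimizer_cfg)
            wrappers[name] = OptimWrapper(optimizer=optimizer)
        return OptimWrapperDict(**wrappers)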
Adjust hyperparameters during training¶
The hyperparameters in the optimizer can only be set to a fixed value at construction time, so the optimizer wrapper alone cannot adjust parameters such as the learning rate during training. In MMEngine, we have implemented a parameter scheduler that allows hyperparameters to be tuned during training. For its usage, please refer to the Parameter Scheduler documentation.
Parameter Scheduler¶
During neural network training, optimization hyperparameters (e.g. learning rate) are usually adjusted along with the training process. One of the simplest and most common learning rate adjustment strategies is multi-step learning rate decay, which reduces the learning rate to a fraction at regular intervals. PyTorch provides LRScheduler to implement various learning rate adjustment strategies. In MMEngine, we have extended it and implemented a more general ParamScheduler. It can adjust optimization hyperparameters such as learning rate and momentum. It also supports the combination of multiple schedulers to create more complex scheduling strategies.
Usage¶
We first introduce how to use PyTorch’s torch.optim.lr_scheduler
to adjust learning rate.
How to use PyTorch's builtin learning rate scheduler?
Here is an example adapted from the PyTorch official documentation:
Initialize an ExponentialLR object, and call the step
method after each training epoch.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR
model = torch.nn.Linear(1, 1)
dataset = [torch.randn((1, 1, 1)) for _ in range(20)]
optimizer = SGD(model.parameters(), lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.9)
for epoch in range(10):
for data in dataset:
optimizer.zero_grad()
output = model(data)
loss = 1 - output
loss.backward()
optimizer.step()
scheduler.step()
mmengine.optim.scheduler
supports most of PyTorch’s learning rate schedulers such as ExponentialLR
, LinearLR
, StepLR
, MultiStepLR
, etc. Please refer to parameter scheduler API documentation for all of the supported schedulers.
MMEngine also supports adjusting momentum with parameter schedulers. To use a momentum scheduler, replace LR in the class name with Momentum, such as ExponentialMomentum or LinearMomentum. Furthermore, we implement the general parameter scheduler ParamScheduler, which can adjust any specified hyperparameter in the optimizer, such as weight_decay. This feature makes it easier to apply complex hyperparameter tuning strategies.
Different from the above example, with MMEngine you usually do not need to implement the training loop and call optimizer.step() manually. The runner automatically manages the training progress and controls the execution of the parameter scheduler through ParamSchedulerHook.
Use a single LRScheduler¶
If only one scheduler needs to be used for the entire training process, there is no difference with PyTorch’s learning rate scheduler.
# build the scheduler manually
from torch.optim import SGD
from mmengine.runner import Runner
from mmengine.optim.scheduler import MultiStepLR
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
param_scheduler = MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)
runner = Runner(
model=model,
optim_wrapper=dict(
optimizer=optimizer),
param_scheduler=param_scheduler,
...
)
If using the runner with the registry and config file, we can specify the scheduler by setting the param_scheduler
field in the config. The runner will automatically build a parameter scheduler based on this field:
# build the scheduler with config file
param_scheduler = dict(type='MultiStepLR', by_epoch=True, milestones=[8, 11], gamma=0.1)
Note that the parameter by_epoch
is added here, which controls the frequency of learning rate adjustment. When set to True, it means adjusting by epoch. When set to False, it means adjusting by iteration. The default value is True.
In the above example, the learning rate is adjusted by epoch, so the unit of the parameters is the epoch. For example, [8, 11] in milestones means that the learning rate will be multiplied by 0.1 at the end of the 8th and the 11th epoch.
When the frequency is modified, the meaning of the count-related settings of the scheduler will be changed accordingly. When by_epoch=True
, the numbers in milestones indicate at which epoch the learning rate decay is performed, and when by_epoch=False
it indicates at which iteration the learning rate decay is performed.
Here is an example of adjusting by iterations: at the end of the 600th and 800th iterations, the learning rate will be multiplied by 0.1.
param_scheduler = dict(type='MultiStepLR', by_epoch=False, milestones=[600, 800], gamma=0.1)
If users want to use the iteration-based frequency while filling the scheduler config settings by epoch, MMEngine’s scheduler also provides an automatic conversion method. Users can call the build_iter_from_epoch
method and provide the number of iterations for each training epoch to construct a scheduler object updated by iterations:
epoch_length = len(train_dataloader)
param_scheduler = MultiStepLR.build_iter_from_epoch(optimizer, milestones=[8, 11], gamma=0.1, epoch_length=epoch_length)
If using config to build a scheduler, just add convert_to_iter_based=True
to the field. The runner will automatically call build_iter_from_epoch
to convert the epoch-based config to an iteration-based scheduler object:
param_scheduler = dict(type='MultiStepLR', by_epoch=True, milestones=[8, 11], gamma=0.1, convert_to_iter_based=True)
Below is a Cosine Annealing learning rate scheduler that is updated by epoch, where the learning rate is only modified after each epoch:
param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, T_max=12)
After automatic conversion, the learning rate is updated by iteration, so it changes more smoothly.
param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, T_max=12, convert_to_iter_based=True)
Combine multiple LRSchedulers (e.g. learning rate warm-up)¶
In the training process of some algorithms, the learning rate is not adjusted according to a certain scheduling strategy from beginning to end. The most common example is learning rate warm-up.
For example, in the first few iterations, a linear strategy is used to increase the learning rate from a small value to normal, and then another strategy is applied.
MMEngine supports combining multiple schedulers together. Just modify the param_scheduler
field in the config file to a list of scheduler config, and the ParamSchedulerHook can automatically process the scheduler list. The following example implements learning rate warm-up.
param_scheduler = [
# Linear learning rate warm-up scheduler
dict(type='LinearLR',
start_factor=0.001,
by_epoch=False, # Updated by iterations
begin=0,
end=50), # Warm up for the first 50 iterations
# The main LRScheduler
dict(type='MultiStepLR',
by_epoch=True, # Updated by epochs
milestones=[8, 11],
gamma=0.1)
]
Note that the begin
and end
parameters are added here. These two parameters specify the valid interval of the scheduler. The valid interval usually only needs to be set when multiple schedulers are combined, and can be ignored when using a single scheduler. When the begin
and end
parameters are specified, it means that the scheduler only takes effect in the [begin, end) interval, and the unit is determined by the by_epoch
parameter.
In the above example, the by_epoch
of LinearLR
in the warm-up phase is False, which means that the scheduler only takes effect in the first 50 iterations. After more than 50 iterations, the scheduler will no longer take effect, and the second scheduler, which is MultiStepLR
, will control the learning rate. When combining different schedulers, the by_epoch
parameter does not have to be the same for each scheduler.
Here is another example:
param_scheduler = [
# Use a linear warm-up at [0, 100) iterations
dict(type='LinearLR',
start_factor=0.001,
by_epoch=False,
begin=0,
end=100),
# Use a cosine learning rate at [100, 900) iterations
dict(type='CosineAnnealingLR',
T_max=800,
by_epoch=False,
begin=100,
end=900)
]
The above example uses a linear learning rate warm-up for the first 100 iterations, and then uses a cosine annealing learning rate scheduler with a period of 800 from the 100th to the 900th iteration.
Users can combine any number of schedulers. If the valid intervals of two schedulers are not connected, leaving an interval that is not covered, the learning rate remains unchanged within that interval. If the valid intervals of two schedulers overlap, the adjustments are applied in the order of the scheduler configs (similar to ChainedScheduler).
We recommend using different learning rate scheduling strategies in different stages of training and avoiding overlapping valid intervals. If you really need to stack two overlapping schedulers, be careful: we recommend using the learning rate visualization tool to visualize the resulting learning rate and make sure the adjustment is as expected.
How to adjust other hyperparameters¶
Momentum¶
Like learning rate, momentum is a schedulable hyperparameter in the optimizer’s parameter group. The momentum scheduler is used in exactly the same way as the learning rate scheduler. Just add the momentum scheduler config to the list in the param_scheduler
field.
Example:
param_scheduler = [
# the lr scheduler
dict(type='LinearLR', ...),
# the momentum scheduler
dict(type='LinearMomentum',
start_factor=0.001,
by_epoch=False,
begin=0,
end=1000)
]
Generic parameter scheduler¶
MMEngine also provides a set of generic parameter schedulers for scheduling other hyperparameters in the param_groups
of the optimizer. Change LR
in the class name of the learning rate scheduler to Param
, such as LinearParamScheduler
. Users can schedule the specific hyperparameters by setting the param_name
variable of the scheduler.
Here is an example:
param_scheduler = [
dict(type='LinearParamScheduler',
param_name='lr', # adjust the 'lr' in `optimizer.param_groups`
start_factor=0.001,
by_epoch=False,
begin=0,
end=1000)
]
By setting the param_name
to 'lr'
, this parameter scheduler is equivalent to LinearLRScheduler
.
In addition to learning rate and momentum, users can also schedule other parameters in optimizer.param_groups
. The schedulable parameters depend on the optimizer used. For example, when using the SGD optimizer with weight_decay
, the weight_decay
can be adjusted as follows:
param_scheduler = [
dict(type='LinearParamScheduler',
param_name='weight_decay', # adjust 'weight_decay' in `optimizer.param_groups`
start_factor=0.001,
by_epoch=False,
begin=0,
end=1000)
]
Hook¶
Hook programming is a programming pattern in which a mount point is set in one or more locations of a program. When the program runs to a mount point, all methods registered to it at runtime are automatically called. Hook programming can increase the flexibility and extensibility of the program, since users can register custom methods to the mount point to be called without modifying the code in the program.
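As a plain-Python illustration of the pattern (unrelated to MMEngine's actual Hook class), a program can expose a mount point and call every function registered to it:
# minimal illustration of hook programming, not MMEngine code
hooks = []

def register_hook(fn):
    hooks.append(fn)

def run_program():
    print('doing some work')
    # mount point: every registered hook is called when execution reaches here
    for fn in hooks:
        fn()

register_hook(lambda: print('custom hook called'))
run_program()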
Built-in Hooks¶
MMEngine encapsulates many utilities as built-in hooks. These hooks are divided into two categories, namely default hooks and custom hooks. The former refers to those registered with the Runner by default, while the latter refers to those registered by the user on demand.
Each hook has a corresponding priority. At each mount point, hooks with higher priority are called earlier by the Runner
. When sharing the same priority, the hooks are called in their registration order. The priority list is as follows.
HIGHEST (0)
VERY_HIGH (10)
HIGH (30)
ABOVE_NORMAL (40)
NORMAL (50)
BELOW_NORMAL (60)
LOW (70)
VERY_LOW (90)
LOWEST (100)
default hooks

Name | Function | Priority
---|---|---
RuntimeInfoHook | Update runtime information into message hub | VERY_HIGH (10)
IterTimerHook | Update the time spent during iteration into message hub | NORMAL (50)
DistSamplerSeedHook | Ensure distributed Sampler shuffle is active | NORMAL (50)
LoggerHook | Collect logs from different components of Runner and write them to terminal, JSON file, tensorboard, wandb, etc. | BELOW_NORMAL (60)
ParamSchedulerHook | Update some hyper-parameters of the optimizer | LOW (70)
CheckpointHook | Save checkpoints periodically | VERY_LOW (90)
custom hooks

Name | Function | Priority
---|---|---
EMAHook | Apply Exponential Moving Average (EMA) on the model during training | NORMAL (50)
EmptyCacheHook | Release all unoccupied cached GPU memory during the training process | NORMAL (50)
SyncBuffersHook | Synchronize model buffers at the end of each epoch | NORMAL (50)
Note
It is not recommended to modify the priority of the default hooks, as hooks with lower priority may depend on hooks with higher priority. For example, CheckpointHook
needs to have a lower priority than ParamSchedulerHook so that the saved optimizer state is correct. Also, the priority of custom hooks defaults to NORMAL (50)
.
The two types of hooks are set differently in the Runner, with the configuration of default hooks being passed to the default_hooks
parameter of the Runner and the configuration of custom hooks being passed to the custom_hooks
parameter, as follows.
from mmengine.runner import Runner
default_hooks = dict(
runtime_info=dict(type='RuntimeInfoHook'),
timer=dict(type='IterTimerHook'),
sampler_seed=dict(type='DistSamplerSeedHook'),
logger=dict(type='LoggerHook'),
param_scheduler=dict(type='ParamSchedulerHook'),
checkpoint=dict(type='CheckpointHook', interval=1),
)
custom_hooks = [dict(type='EmptyCacheHook')]
runner = Runner(default_hooks=default_hooks, custom_hooks=custom_hooks, ...)
runner.train()
CheckpointHook¶
CheckpointHook saves the checkpoints at a given interval. In the case of distributed training, only the master process will save the checkpoints. The main features of CheckpointHook are as follows.
Save checkpoints by interval, and support saving them by epoch or iteration
Save the most recent checkpoints
Save the best checkpoints
Specify the path to save the checkpoints
For more features, please read the CheckpointHook API documentation.
The four features mentioned above are described below.
Save checkpoints by interval, and support saving them by epoch or iteration
Suppose we train a total of 20 epochs and want to save the checkpoints every 5 epochs, the following configuration will help us achieve this requirement.
# the default value of by_epoch is True
default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, by_epoch=True))
If you want to save checkpoints by iteration, you can set by_epoch to False and interval=5 to save them every 5 iterations.
default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, by_epoch=False))
Save the most recent checkpoints
If you only want to keep a certain number of checkpoints, you can set the max_keep_ckpts parameter. When the number of checkpoints saved exceeds max_keep_ckpts, the previous checkpoints will be deleted.
default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, max_keep_ckpts=2))
The above config shows that if a total of 20 epochs are trained, the model will be saved at epochs 5, 10, 15, and 20, but the checkpoint epoch_5.pth will be deleted at epoch 15, and the checkpoint epoch_10.pth will be deleted at epoch 20, so that only epoch_15.pth and epoch_20.pth are kept.
Save the best checkpoints
If you want to save the best checkpoint on the validation set during training, you can set the save_best parameter. If set to 'auto', the current checkpoint is judged to be the best based on the first evaluation metric of the validation set (the evaluation metrics returned by the evaluator are an ordered dictionary).
default_hooks = dict(checkpoint=dict(type='CheckpointHook', save_best='auto'))
You can also directly set save_best to the name of an evaluation metric. For example, in a classification task, you can specify save_best='top-1', and the current checkpoint will be judged as the best based on the value of 'top-1'.
In addition to the save_best parameter, other parameters related to saving the best checkpoint are rule, greater_keys and less_keys, which indicate whether a larger value is better. For example, if you specify save_best='top-1', you can set rule='greater' to indicate that the larger the value, the better the checkpoint.
Specify the path to save the checkpoints
The checkpoints are saved in work_dir by default, but the path can be changed by setting out_dir.
default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, out_dir='/path/of/directory'))
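The options above can also be combined; for example, a configuration that keeps only the two most recent checkpoints while tracking the best value of a hypothetical 'accuracy' metric could look like this:
default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,
        max_keep_ckpts=2,
        save_best='accuracy',  # hypothetical metric name returned by the evaluator
        rule='greater'))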
LoggerHook¶
LoggerHook collects logs from different components of Runner and writes them to the terminal, JSON files, TensorBoard, wandb, etc.
If we want to output (or save) the logs every 20 iterations, we can set the interval
parameter and configure it as follows.
default_hooks = dict(logger=dict(type='LoggerHook', interval=20))
If you are interested in how MMEngine manages logging, you can refer to logging.
ParamSchedulerHook¶
ParamSchedulerHook iterates through all optimizer parameter schedulers of the Runner and calls their step
method to update the optimizer parameters in order. See Parameter Schedulers for more details about parameter schedulers.
ParamSchedulerHook
is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.
IterTimerHook¶
IterTimerHook is used to record the time taken to load data and iterate once.
IterTimerHook
is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.
DistSamplerSeedHook¶
DistSamplerSeedHook calls the step
method of the Sampler during distributed training to ensure that the shuffle operation takes effect.
DistSamplerSeedHook
is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.
RuntimeInfoHook¶
RuntimeInfoHook will update the current runtime information (e.g. epoch, iter, max_epochs, max_iters, lr, metrics, etc.) to the message hub at different mount points in the Runner so that other modules without access to the Runner can obtain this information.
RuntimeInfoHook
is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.
EMAHook¶
EMAHook performs an exponential moving average operation on the model during training, with the aim of improving the robustness of the model. Note that the model generated by exponential moving average is only used for validation and testing, and does not affect training.
custom_hooks = [dict(type='EMAHook')]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train()
EMAHook
uses ExponentialMovingAverage by default, with optional values of StochasticWeightAverage and MomentumAnnealingEMA. Other averaging strategies can be used by setting ema_type
.
custom_hooks = [dict(type='EMAHook', ema_type='StochasticWeightAverage')]
See EMAHook API Reference for more usage.
EmptyCacheHook¶
EmptyCacheHook calls torch.cuda.empty_cache()
to release all unoccupied cached GPU memory. The timing of releasing memory can be controlled by setting parameters like before_epoch
, after_iter
, and after_epoch
, meaning before the start of each epoch, after each iteration, and after each epoch respectively.
# The release operation is performed at the end of each epoch
custom_hooks = [dict(type='EmptyCacheHook', after_epoch=True)]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train()
SyncBuffersHook¶
SyncBuffersHook synchronizes the buffer of the model at the end of each epoch during distributed training, e.g. running_mean
and running_var
of the BN layer.
custom_hooks = [dict(type='SyncBuffersHook')]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train()
Customize Your Hooks¶
If the built-in hooks provided by MMEngine do not cover your demands, you are encouraged to customize your own hooks by simply inheriting the base hook class and overriding the corresponding mount point methods.
For example, if you want to check whether the loss value is valid, i.e. not infinite, during training, you can simply override the after_train_iter
method as below. The check will be performed after each training iteration.
import torch
from mmengine.registry import HOOKS
from mmengine.hooks import Hook
@HOOKS.register_module()
class CheckInvalidLossHook(Hook):
"""Check invalid loss hook.
This hook will regularly check whether the loss is valid
during training.
Args:
interval (int): Checking interval (every k iterations).
Defaults to 50.
"""
def __init__(self, interval=50):
self.interval = interval
def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
"""All subclasses should override this method, if they need any
operations after each training iteration.
Args:
runner (Runner): The runner of the training process.
batch_idx (int): The index of the current batch in the train loop.
data_batch (dict or tuple or list, optional): Data from dataloader.
outputs (dict, optional): Outputs from model.
"""
if self.every_n_train_iters(runner, self.interval):
assert torch.isfinite(outputs['loss']),\
runner.logger.info('loss becomes infinite or NaN!')
We simply pass the hook config to the custom_hooks
parameter of the Runner, which will register the hooks when the Runner is initialized.
from mmengine.runner import Runner
custom_hooks = [
    dict(type='CheckInvalidLossHook', interval=50)
]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train() # start training
Then the loss value is checked after each training iteration.
Note that the priority of a custom hook is NORMAL (50) by default. If you want to change the priority of the hook, you can set the priority key in the config.
custom_hooks = [
    dict(type='CheckInvalidLossHook', interval=50, priority='ABOVE_NORMAL')
]
You can also set priority when defining classes.
@HOOKS.register_module()
class CheckInvalidLossHook(Hook):
priority = 'ABOVE_NORMAL'
Registry¶
OpenMMLab supports a rich collection of algorithms and datasets, therefore, many modules with similar functionality are implemented. For example, the implementations of ResNet
and SE-ResNet
are based on the classes ResNet
and SEResNet
, respectively, which have similar functions and interfaces and belong to the model components of the algorithm library. To manage these functionally similar modules, MMEngine implements the registry. Most of the algorithm libraries in OpenMMLab use registry
to manage their modules, including MMDetection, MMDetection3D, MMClassification and MMEditing, etc.
What is a registry¶
The registry in MMEngine can be considered as a union of a mapping table and a build function of modules. The mapping table maintains a mapping from strings to classes or functions, allowing the user to find the corresponding class or function with its name/notation. For example, the mapping from the string "ResNet"
to the ResNet
class. The module build function defines how to find the corresponding class or function based on a string and how to instantiate the class or call the function. For example, finding nn.BatchNorm2d
and instantiating the BatchNorm2d
module by the string "bn"
, or finding the build_batchnorm2d
function by the string "build_batchnorm2d"
and then returning the result. The registries in MMEngine use the build_from_cfg function by default to find and instantiate the class or function corresponding to the string.
The classes or functions managed by a registry usually have similar interfaces and functionality, so the registry can be treated as an abstraction of those classes or functions. For example, the registry MODELS
can be treated as an abstraction of all models, which manages classes such as ResNet
, SEResNet
and RegNetX
and constructors such as build_ResNet
, build_SEResNet
and build_RegNetX
.
Getting started¶
There are three steps required to use the registry to manage modules in the codebase.
Create a registry.
Create a build method for instantiating the class (optional because in most cases you can just use the default method).
Add the module to the registry
Suppose we want to implement a series of activation modules and want to be able to switch to different modules by just modifying the configuration without modifying the code.
Let’s create a registry first.
from mmengine import Registry
# `scope` represents the domain of the registry. If not set, the default value is the package name.
# e.g. in mmdetection, the scope is mmdet
# `locations` indicates the location where the modules in this registry are defined.
# The Registry will automatically import the modules when building them according to these predefined locations.
ACTIVATION = Registry('activation', scope='mmengine', locations=['mmengine.models.activations'])
The module mmengine.models.activations
specified by locations
corresponds to the mmengine/models/activations.py
file. When building modules with registry, the ACTIVATION registry will automatically import implemented modules from this file. Therefore, we can implement different activation layers in the mmengine/models/activations.py
file, such as Sigmoid
, ReLU
, and Softmax
.
import torch.nn as nn
# use the register_module
@ACTIVATION.register_module()
class Sigmoid(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
print('call Sigmoid.forward')
return x
@ACTIVATION.register_module()
class ReLU(nn.Module):
def __init__(self, inplace=False):
super().__init__()
def forward(self, x):
print('call ReLU.forward')
return x
@ACTIVATION.register_module()
class Softmax(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
print('call Softmax.forward')
return x
The key to using the registry module is to register the implemented modules into the ACTIVATION
registry. With the @ACTIVATION.register_module()
decorator added before the implemented module, the mapping between strings and classes or functions can be built and maintained by ACTIVATION
. We can achieve the same functionality with ACTIVATION.register_module(module=ReLU)
as well.
By registering, we can create a mapping between strings and classes or functions via ACTIVATION
.
print(ACTIVATION.module_dict)
# {
# 'Sigmoid': __main__.Sigmoid,
# 'ReLU': __main__.ReLU,
# 'Softmax': __main__.Softmax
# }
Note
The key to triggering the registry mechanism is to make sure the module is imported. There are three ways to register a module into the registry:
Implement the module in the locations. The registry will automatically import modules in the predefined locations. This eases the usage of algorithm libraries so that users can directly use REGISTRY.build(cfg).
Import the file manually. This is common when developers implement a new module inside or outside the algorithm library.
Use the custom_imports field in the config. Please refer to Importing custom Python modules for more details.
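For the third way, the config file can list the modules to import; a minimal sketch with a hypothetical module path:
# in the config file; 'mmalpha.models.my_activations' is a hypothetical module path
custom_imports = dict(
    imports=['mmalpha.models.my_activations'],
    allow_failed_imports=False)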
Once the implemented module is successfully registered, we can use the activation module in the configuration file.
import torch
input = torch.randn(2)
act_cfg = dict(type='Sigmoid')
activation = ACTIVATION.build(act_cfg)
output = activation(input)
# call Sigmoid.forward
print(output)
We can switch to ReLU
by just changing this configuration.
act_cfg = dict(type='ReLU', inplace=True)
activation = ACTIVATION.build(act_cfg)
output = activation(input)
# call ReLU.forward
print(output)
If we want to check the type of input parameters (or any other operations) before creating an instance, we can implement a build method and pass it to the registry to implement a custom build process.
Create a build_activation
function.
def build_activation(cfg, registry, *args, **kwargs):
cfg_ = cfg.copy()
act_type = cfg_.pop('type')
print(f'build activation: {act_type}')
act_cls = registry.get(act_type)
act = act_cls(*args, **kwargs, **cfg_)
return act
Pass the build_activation to build_func.
ACTIVATION = Registry('activation', build_func=build_activation, scope='mmengine', locations=['mmengine.models.activations'])
@ACTIVATION.register_module()
class Tanh(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x):
print('call Tanh.forward')
return x
act_cfg = dict(type='Tanh')
activation = ACTIVATION.build(act_cfg)
output = activation(input)
# build activation: Tanh
# call Tanh.forward
print(output)
Note
In the above example, we demonstrate how to customize the method of building an instance of a class using the build_func
.
This is similar to the default build_from_cfg
method. In most cases, using the default method will be fine.
MMEngine’s registry can register classes as well as functions.
FUNCTION = Registry('function', scope='mmengine')
@FUNCTION.register_module()
def print_args(**kwargs):
print(kwargs)
func_cfg = dict(type='print_args', a=1, b=2)
func_res = FUNCTION.build(func_cfg)
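Since print_args is a function, FUNCTION.build(func_cfg) simply calls it with the remaining fields of the config: the snippet above prints {'a': 1, 'b': 2}, and func_res holds the function's return value (None here).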
Advanced usage¶
The registry in MMEngine supports hierarchical registration, which enables cross-project calls, meaning that modules from one project can be used in another project. Though there are other ways to implement this, the registry provides a much easier solution.
To easily make cross-library calls, MMEngine provides twenty root registries, including:
RUNNERS: the registry for Runner.
RUNNER_CONSTRUCTORS: the constructors for Runner.
LOOPS: manages training, validation and testing processes, such as
EpochBasedTrainLoop
.HOOKS: the hooks, such as
CheckpointHook
, andParamSchedulerHook
.DATASETS: the datasets.
DATA_SAMPLERS:
Sampler
ofDataLoader
, used to sample the data.TRANSFORMS: various data preprocessing methods, such as
Resize
, andReshape
.MODELS: various modules of the model.
MODEL_WRAPPERS: model wrappers for parallelizing distributed data, such as
MMDistributedDataParallel
.WEIGHT_INITIALIZERS: the tools for weight initialization.
OPTIMIZERS: registers all
Optimizers
and customOptimizers
in PyTorch.OPTIM_WRAPPER: the wrapper for Optimizer-related operations such as
OptimWrapper
, andAmpOptimWrapper
.OPTIM_WRAPPER_CONSTRUCTORS: the constructors for optimizer wrappers.
PARAM_SCHEDULERS: various parameter schedulers, such as
MultiStepLR
.METRICS: the evaluation metrics for computing model accuracy, such as
Accuracy
.EVALUATOR: one or more evaluation metrics used to calculate the model accuracy.
TASK_UTILS: the task-intensive components, such as
AnchorGenerator
, andBboxCoder
.VISUALIZERS: the management drawing module that draws prediction boxes on images, such as
DetVisualizer
.VISBACKENDS: the backend for storing training logs, such as
LocalVisBackend
, andTensorboardVisBackend
.LOG_PROCESSORS: controls the log statistics window and statistics methods, by default we use
LogProcessor
. You may customizeLogProcessor
if you have special needs.
Use the module of the parent node¶
Let’s define a RReLU
module in MMEngine
and register it to the MODELS
root registry.
import torch.nn as nn
from mmengine import Registry, MODELS
@MODELS.register_module()
class RReLU(nn.Module):
def __init__(self, lower=0.125, upper=0.333, inplace=False):
super().__init__()
def forward(self, x):
print('call RReLU.forward')
return x
Now suppose there is a project called MMAlpha
, which also defines a MODELS
and sets its parent node to the MODELS
of MMEngine
, which creates a hierarchical structure.
from mmengine import Registry, MODELS as MMENGINE_MODELS
MODELS = Registry('model', parent=MMENGINE_MODELS, scope='mmalpha', locations=['mmalpha.models'])
The following figure shows the hierarchy of MMEngine
and MMAlpha
.

The count_registered_modules function can be used to print the modules that have been registered to MMEngine and their hierarchy.
from mmengine.registry import count_registered_modules
count_registered_modules()
We define a customized LogSoftmax
module in MMAlpha
and register it to the MODELS
in MMAlpha
.
@MODELS.register_module()
class LogSoftmax(nn.Module):
def __init__(self, dim=None):
super().__init__()
def forward(self, x):
print('call LogSoftmax.forward')
return x
Here we use the LogSoftmax
in the configuration of MMAlpha
.
model = MODELS.build(cfg=dict(type='LogSoftmax'))
We can also use the modules of the parent node MMEngine
here in the MMAlpha
.
model = MODELS.build(cfg=dict(type='RReLU', lower=0.2))
# scope is optional
model = MODELS.build(cfg=dict(type='mmengine.RReLU'))
If no prefix is added, the build
method will first find out if the module exists in the current node and return it if there is one. Otherwise, it will continue to look up the parent nodes or even the ancestor node until it finds the module. If the same module exists in both the current node and the parent nodes, we need to specify the scope
prefix to indicate that we want to use the module of the parent nodes.
import torch
input = torch.randn(2)
output = model(input)
# call RReLU.forward
print(output)
Use the module of a sibling node¶
In addition to using the module of the parent nodes, users can also call the module of a sibling node.
Suppose there is another project called MMBeta
, which, like MMAlpha
, defines MODELS
and set its parent node to MMEngine
.
from mmengine import Registry, MODELS as MMENGINE_MODELS
MODELS = Registry('model', parent=MMENGINE_MODELS, scope='mmbeta')
The following figure shows the registry structure of MMAlpha
and MMBeta
.

Now we call the modules of MMAlpha
in MMBeta
.
model = MODELS.build(cfg=dict(type='mmalpha.LogSoftmax'))
output = model(input)
# call LogSoftmax.forward
print(output)
Calling a module of a sibling node requires the scope
prefix to be specified in type
, so the above configuration requires the prefix mmalpha
.
However, if you need to call several modules of a sibling node, each with a prefix, this requires a lot of modification. Therefore, MMEngine
introduces the DefaultScope, with which Registry
can easily support temporary switching of the current node to the specified node.
If you need to switch the current node to the specified node temporarily, just set _scope_
to the scope of the specified node in cfg
.
model = MODELS.build(cfg=dict(type='LogSoftmax', _scope_='mmalpha'))
output = model(input)
# call LogSoftmax.forward
print(output)
Config¶
MMEngine implements an abstract configuration class (Config
) to provide a unified configuration access interface for users. Config
supports different types of configuration files, including python
, json
and yaml
, and you can choose the type according to your preference. Config
overrides some magic methods, which could help you access the data stored in Config
just like getting values from dict
, or getting attributes from instances. Besides, Config
also provides an inheritance mechanism, which could help you better organize and manage the configuration files.
Before starting the tutorial, let’s download the configuration files needed in the tutorial (it is recommended to run the following commands in a temporary directory so that these files can easily be deleted later):
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/config_sgd.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/cross_repo.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/custom_imports.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/demo_train.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/example.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/learn_read_config.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/my_module.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/optimizer_cfg.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/predefined_var.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/refer_base_var.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50_delete_key.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50_lr0.01.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50_runtime.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/runtime_cfg.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/modify_base_var.py
Read the configuration file¶
Config
provides a uniform interface Config.fromfile()
to read and parse configuration files.
A valid configuration file should define a set of key-value pairs, and here are a few examples:
Python:
test_int = 1
test_list = [1, 2, 3]
test_dict = dict(key1='value1', key2=0.1)
Json:
{
"test_int": 1,
"test_list": [1, 2, 3],
"test_dict": {"key1": "value1", "key2": 0.1}
}
YAML:
test_int: 1
test_list: [1, 2, 3]
test_dict:
key1: "value1"
key2: 0.1
For the above three formats, assume the file names are config.py, config.json, and config.yml. Loading any of these files with Config.fromfile('config.xxx') will return the same result, which contains the three variables test_int, test_list, and test_dict.
Let’s take config.py
as an example:
from mmengine.config import Config
cfg = Config.fromfile('learn_read_config.py')
print(cfg)
Config (path: learn_read_config.py): {'test_int': 1, 'test_list': [1, 2, 3], 'test_dict': {'key1': 'value1', 'key2': 0.1}}
How to use Config
¶
After loading the configuration file, we can access the data stored in Config
instance just like getting/setting values from dict
, or getting/setting attributes from instances.
print(cfg.test_int)
print(cfg.test_list)
print(cfg.test_dict)
cfg.test_int = 2
print(cfg['test_int'])
print(cfg['test_list'])
print(cfg['test_dict'])
cfg['test_list'][1] = 3
print(cfg['test_list'])
1
[1, 2, 3]
{'key1': 'value1', 'key2': 0.1}
2
[1, 2, 3]
{'key1': 'value1', 'key2': 0.1}
[1, 3, 3]
Note
The dict
object parsed by Config
will be converted to ConfigDict
, and then we can access the value of the dict
in the same way as accessing the attributes of an instance.
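A minimal sketch of this behavior (the nested dict below is made up for illustration):
from mmengine.config import Config
cfg = Config(dict(model=dict(type='ResNet', depth=50)))
# Nested dicts become ConfigDict, so attribute access and key access are interchangeable.
print(cfg.model.depth)        # 50
print(cfg['model']['type'])   # 'ResNet'
cfg.model.depth = 101
print(cfg['model']['depth'])  # 101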
We can use the Config
in combination with the Registry to easily build registered instances.
Here is an example of defining optimizers in a configuration file.
config_sgd.py
:
optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)
Suppose we have defined a registry OPTIMIZERS
, which includes various optimizers. Then we can build the optimizer as below:
from mmengine import Config, optim
from mmengine.registry import OPTIMIZERS
import torch.nn as nn
cfg = Config.fromfile('config_sgd.py')
model = nn.Conv2d(1, 1, 1)
cfg.optimizer.params = model.parameters()
optimizer = OPTIMIZERS.build(cfg.optimizer)
print(optimizer)
SGD (
Parameter Group 0
dampening: 0
foreach: None
lr: 0.1
maximize: False
momentum: 0.9
nesterov: False
weight_decay: 0.0001
)
Inheritance between configuration files¶
Sometimes, the difference between two different configuration files is so small that only one field may be changed. Therefore, it’s unwise to copy and paste everything only to modify one line, which makes it hard for us to locate the specific difference after a long time.
In other cases, multiple configuration files may share the same batch of fields, and we would have to copy and paste them into each file, which makes these fields hard to maintain in the long run.
We address these issues with the inheritance mechanism, detailed below.
Overview of inheritance mechanism¶
Here is an example to illustrate the inheritance mechanism.
optimizer_cfg.py
:
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
resnet50.py
:
_base_ = ['optimizer_cfg.py']
model = dict(type='ResNet', depth=50)
Although we don’t define optimizer
in resnet50.py
, since we wrote _base_ = ['optimizer_cfg.py']
, it will inherit the fields defined in optimizer_cfg.py
.
cfg = Config.fromfile('resnet50.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0001}
_base_
is a reserved field for the configuration file. It specifies the inherited base files for the current file. Inheriting multiple files will get all the fields at the same time, but it requires that there are no repeated fields defined in all base files.
runtime_cfg.py
:
gpu_ids = [0, 1]
resnet50_runtime.py
:
_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)
In this case, reading the resnet50_runtime.py
will give you 3 fields model
, optimizer
, and gpu_ids
.
cfg = Config.fromfile('resnet50_runtime.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0001}
In this way, we can split up the configuration files, define some general configuration files, and inherit them in specific configuration files. This avoids defining a lot of duplicated content across multiple configuration files.
Modify the inherited fields¶
Sometimes, we want to modify some of the fields in the inherited files. For example, we want to modify the learning rate from 0.02 to 0.01 after inheriting optimizer_cfg.py
.
In this case, you can simply redefine the fields in the new configuration file. Note that since the optimizer field is a dictionary, we only need to redefine the modified fields. This rule also applies to adding fields.
resnet50_lr0.01.py
:
_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)
optimizer = dict(lr=0.01)
After reading this configuration file, you can get the desired result.
cfg = Config.fromfile('resnet50_lr0.01.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0001}
Non-dictionary fields, such as integers, strings, and lists, are completely overwritten when redefined. For example, the code block below will change the value of the gpu_ids
to [0]
.
_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)
gpu_ids = [0]
Delete key in dict
¶
Sometimes we not only want to modify or add the keys, but also want to delete them. In this case, we need to set _delete_=True
in the target field (dict
) to delete all the keys that do not appear in the newly defined dictionary.
resnet50_delete_key.py
:
_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)
optimizer = dict(_delete_=True, type='SGD', lr=0.01)
At this point, optimizer
will only have the keys type
and lr
. momentum
and weight_decay
will no longer exist.
cfg = Config.fromfile('resnet50_delete_key.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.01}
Reference of the inherited file¶
Sometimes we want to reuse the field defined in _base_
, we can get a copy of the corresponding variable by using {{_base_.xxxx}}
:
refer_base_var.py
_base_ = ['resnet50.py']
a = {{_base_.model}}
After parsing, the value of a
becomes model
defined in resnet50.py
cfg = Config.fromfile('refer_base_var.py')
print(cfg.a)
{'type': 'ResNet', 'depth': 50}
We can use this way to get the variables defined in _base_
in the json
, yaml
, and python
configuration files.
Although this way is general for all types of files, there are some syntactic limitations that prevent us from taking full advantage of the dynamic nature of the python
configuration file. For example, if we want to modify a variable defined in _base_
:
_base_ = ['resnet50.py']
a = {{_base_.model}}
a['type'] = 'MobileNet'
The Config
is not able to parse such a configuration file (it will raise an error when parsing). The Config
provides a more pythonic
way to modify base variables for python
configuration files.
modify_base_var.py
:
_base_ = ['resnet50.py']
a = _base_.model
a.type = 'MobileNet'
cfg = Config.fromfile('modify_base_var.py')
print(cfg.a)
{'type': 'MobileNet', 'depth': 50}
Dump the configuration file¶
The user may pass some parameters to modify some fields of the configuration file at the entry point of the training script. Therefore, we provide the dump
method to export the changed configuration file.
Similar to reading the configuration file, the user can choose the format of the dumped file by using cfg.dump('config.xxx')
. dump
can also export configuration files with inheritance relationships, and the dumped files can be used independently without the files defined in _base_
.
Based on the resnet50.py
defined above, we can load and dump it like this:
cfg = Config.fromfile('resnet50.py')
cfg.dump('resnet50_dump.py')
resnet50_dump.py
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
model = dict(type='ResNet', depth=50)
Similarly, we can dump configuration files in json
or yaml
format:
resnet50_dump.yaml
model:
depth: 50
type: ResNet
optimizer:
lr: 0.02
momentum: 0.9
type: SGD
weight_decay: 0.0001
resnet50_dump.json
{"optimizer": {"type": "SGD", "lr": 0.02, "momentum": 0.9, "weight_decay": 0.0001}, "model": {"type": "ResNet", "depth": 50}}
In addition, `dump` can also dump `cfg` loaded from a dictionary.
cfg = Config(dict(a=1, b=2))
cfg.dump('dump_dict.py')
dump_dict.py
a=1
b=2
Advanced usage¶
In this section, we’ll introduce some advanced usage of the Config
, and some tips that could make it easier for users to develop and use downstream repositories.
Predefined fields¶
Sometimes we need some fields in the configuration file that are related to the workspace path. For example, we define a working directory in the configuration file that holds the models and logs for this set of experimental configurations. We expect different configuration files to have different working directories. A common choice is to use the configuration file name directly as part of the working directory name.
Taking predefined_var.py
as an example:
work_dir = './work_dir/{{fileBasenameNoExtension}}'
Here {{fileBasenameNoExtension}}
means the filename of the config file without the .py suffix, and the variable in {{}}
will be interpreted as predefined_var:
cfg = Config.fromfile('./predefined_var.py')
print(cfg.work_dir)
./work_dir/predefined_var
Currently, there are 4 predefined fields referenced from the relevant fields defined in VS Code.
{{fileDirname}}
- the directory name of the current file, e.g./home/your-username/your-project/folder
{{fileBasename}}
- the filename of the current file, e.g.file.py
{{fileBasenameNoExtension}}
- the filename of the current file without the extension, e.g.file
{{fileExtname}}
- the extension of the current file, e.g..py
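For instance, these fields can be combined freely in a config (a hypothetical snippet):
# Put the working directory next to the config file, in a folder named after it.
work_dir = '{{fileDirname}}/work_dirs/{{fileBasenameNoExtension}}'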
Modify the fields in command line¶
Sometimes we only want to modify part of the configuration and do not want to modify the configuration file itself. For example, if we want to change the learning rate during the experiment but do not want to write a new configuration file, the common practice is to pass the parameters at the command line to override the relevant configuration.
If we want to modify some internal parameters, such as the learning rate of the optimizer, the number of channels in the convolution layer etc., Config
provides a standard procedure that allows us to modify the parameters at any level easily from the command line.
Training script:
demo_train.py
import argparse
from mmengine.config import Config, DictAction
def parse_args():
parser = argparse.ArgumentParser(description='Train a model')
parser.add_argument('config', help='train config file path')
parser.add_argument(
'--cfg-options',
nargs='+',
action=DictAction,
help='override some settings in the used config, the key-value pair '
'in xxx=yyy format will be merged into config file. If the value to '
'be overwritten is a list, it should be like key="[a,b]" or key=a,b '
'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
'Note that the quotation marks are necessary and that no white space '
'is allowed.')
args = parser.parse_args()
return args
def main():
args = parse_args()
cfg = Config.fromfile(args.config)
if args.cfg_options is not None:
cfg.merge_from_dict(args.cfg_options)
print(cfg)
if __name__ == '__main__':
main()
The sample configuration file is as follows.
example.py
model = dict(type='CustomModel', in_channels=[1, 2, 3])
optimizer = dict(type='SGD', lr=0.01)
We can modify the internal fields from the command line with the --cfg-options option.
For example, if we want to modify the learning rate, we only need to execute the script like this:
python demo_train.py ./example.py --cfg-options optimizer.lr=0.1
Config (path: ./example.py): {'model': {'type': 'CustomModel', 'in_channels': [1, 2, 3]}, 'optimizer': {'type': 'SGD', 'lr': 0.1}}
We successfully modified the learning rate from 0.01 to 0.1. If we want to change a list or a tuple, such as in_channels
in the above example, we need to put double quotes around () and [] when assigning the value on the command line.
python demo_train.py ./example.py --cfg-options model.in_channels="[1, 1, 1]"
Config (path: ./example.py): {'model': {'type': 'CustomModel', 'in_channels': [1, 1, 1]}, 'optimizer': {'type': 'SGD', 'lr': 0.01}}
Note
The standard procedure only supports modifying String, Integer, Floating Point, Boolean, None, List, and Tuple fields from the command line. Each element of a list or tuple must also be one of the above seven types.
Note
The behavior of DictAction
is similar to "extend". It stores a list and extends it with each argument value, for example:
python demo_train.py ./example.py --cfg-options optimizer.type="Adam" --cfg-options model.in_channels="[1, 1, 1]"
Config (path: ./example.py): {'model': {'type': 'CustomModel', 'in_channels': [1, 1, 1]}, 'optimizer': {'type': 'Adam', 'lr': 0.01}}
Import the custom module¶
If we customize a module and register it into the corresponding registry, can we build it directly from the configuration file as in the previous section? Not necessarily, because registration only happens once the module has been imported. To handle this case, Config
provides the custom_imports
field to make sure your module is registered as expected.
For example, we customize an optimizer:
from mmengine.registry import OPTIMIZERS
@OPTIMIZERS.register_module()
class CustomOptim:
pass
A matched config file:
my_module.py
optimizer = dict(type='CustomOptim')
To make sure CustomOptim
will be registered, we should set the custom_imports
field like this:
custom_imports.py
custom_imports = dict(imports=['my_module'], allow_failed_imports=False)
optimizer = dict(type='CustomOptim')
And then, once the custom_imports
can be loaded successfully, we can build the CustomOptim
from the custom_imports.py
.
cfg = Config.fromfile('custom_imports.py')
from mmengine.registry import OPTIMIZERS
custom_optim = OPTIMIZERS.build(cfg.optimizer)
print(custom_optim)
<my_module.CustomOptim object at 0x7f6983a87970>
Inherit configuration files across repositories¶
It is annoying to copy a large number of configuration files when developing a new repository based on some existing repositories. To address this issue, Config
supports inheriting configuration files from other repositories. For example, if we want to develop a repository based on MMDetection, we can use the MMDetection configuration files like this:
cross_repo.py
_base_ = [
'mmdet::_base_/schedules/schedule_1x.py',
'mmdet::_base_/datasets/coco_instance.py',
'mmdet::_base_/default_runtime.py',
'mmdet::_base_/models/faster_rcnn_r50_fpn.py',
]
cfg = Config.fromfile('cross_repo.py')
print(cfg.train_cfg)
{'type': 'EpochBasedTrainLoop', 'max_epochs': 12, 'val_interval': 1, '_scope_': 'mmdet'}
Config
will parse mmdet::
to find the mmdet package and inherit the specified configuration file. In fact, as long as the setup.py
of the repository (package) conforms to the MMEngine installation specification, Config
can use {package_name}::
to inherit the specific configuration file.
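Note that this requires the referenced package (mmdet in this example) to be installed in the current Python environment before its configuration files can be located.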
Get configuration files across repositories¶
Config
also provides get_config
and get_model
to get the configuration file and the trained model from the downstream repositories.
The usage of get_config
and get_model
is similar to the previous section:
An example of get_config
:
from mmengine.hub import get_config
cfg = get_config(
'mmdet::faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py', pretrained=True)
print(cfg.model_path)
https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
An example of get_model
:
from mmengine.hub import get_model
model = get_model(
'mmdet::faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py', pretrained=True)
print(type(model))
http loads checkpoint from path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
<class 'mmdet.models.detectors.faster_rcnn.FasterRCNN'>
BaseDataset¶
Introduction¶
The Dataset class in the algorithm toolbox is responsible for providing input data for the model during the training/testing process. The Dataset class in each algorithm toolbox under OpenMMLab projects has some common characteristics and requirements, such as the need for an efficient internal data storage format, support for concatenating different datasets, repeated dataset sampling, and so on.
Therefore, MMEngine implements BaseDataset which provides some basic interfaces and implements some DatasetWrappers with the same interfaces. Most of the Dataset Classes in the OpenMMLab algorithm toolbox meet the interface defined by the BaseDataset
and use the same DatasetWrappers.
The basic function of the BaseDataset is to load the dataset information. Here, we divide the dataset information into two categories. One is meta information, which represents the information related to the dataset itself and sometimes needs to be obtained by the model or other external components. For example, the meta information of the dataset generally includes the category information classes
in the image classification task, since the classification model usually needs to record the category information of the dataset. The other is data information, which defines the file path and corresponding label information of specific data info. In addition, another function of the BaseDataset is to continuously send data into the data pipeline for data preprocessing.
The standard data annotation file¶
In order to unify the dataset interfaces of different tasks and facilitate multi-task training in one model, OpenMMLab formulates the OpenMMLab 2.0 dataset format specification. Dataset annotation files should conform to this specification, and the BaseDataset
reads and parses data annotation files based on this specification. If the data annotation file provided by the user does not conform to the specified format, the user can choose to convert it to the specified format and use OpenMMLab’s algorithm toolbox to conduct algorithm training and testing based on the converted data annotation file.
The OpenMMLab 2.0 dataset format specification states that annotation files must be in json, yaml/yml, or pickle/pkl format
. The dictionary stored in the annotation file must contain two fields, metainfo
and data_list
. The metainfo
is a dictionary containing meta information about the dataset. The data_list
is a list in which each element is a dictionary and the dictionary defines a raw data info. Each raw data info contains one or more training/test samples.
Here is an example of a JSON annotation file (where each raw data info contains only one training/test sample):
{
'metainfo':
{
'classes': ('cat', 'dog'),
...
},
'data_list':
[
{
'img_path': "xxx/xxx_0.jpg",
'img_label': 0,
...
},
{
'img_path': "xxx/xxx_1.jpg",
'img_label': 1,
...
},
...
]
}
We assume that the data is stored in the following path:
data
├── annotations
│ ├── train.json
├── train
│ ├── xxx/xxx_0.jpg
│ ├── xxx/xxx_1.jpg
│ ├── ...
The initialization process of the BaseDataset¶
The initialization process of the BaseDataset
is shown as follows:

load metainfo
: Obtain the meta information of the dataset. The meta information can be obtained from three sources with the priority from high to low:
The dict of
metainfo
passed by the user in the__init__()
function. The priority is high since the user can pass this argument when theBaseDataset
is instantiated;The dict of
BaseDataset.METAINFO
in the class attributes of BaseDataset. The priority is medium since the user can change the class attributesBaseDataset.METAINFO
in the custom dataset class;The dict of
metainfo
included in the annotation file. The priority is low since the annotation file is generally not changed.
If three sources have the same field, the source with the highest priority determines the value of the field. The priority comparison of these fields is: The fields in the metainfo
dictionary passed by the user > The fields in the BaseDataset.METAINFO
of BaseDataset > the fields in the metainfo
of the annotation file. (A small sketch illustrating this priority is given after the description of the initialization steps below.)
join path
: Process the paths of data info and annotation files;build pipeline
: Build data pipeline for the data preprocessing and data preparation;full init
: Fully initializes the BaseDataset. This step mainly includes the following operations:
load data list
: Read and parse the annotation files that meet the OpenMMLab 2.0 dataset format specification. In this step, theparse_data_info()
method is called. This method is responsible for parsing each raw data info in the annotation file;filter data
(optional): Filters unnecessary data based onfilter_cfg
, such as data samples that do not contain annotations. By default, there is no filtering operation, and downstream subclasses can override it according to their own needs.get subset
(optional): Sample a subset of dataset based on a given index or an integer value, such as only the first 10 samples for training/testing. By default, all data samples are used.serialize data
(optional): Serialize all data samples to save memory. Please see Save memory for more details. We serialize all data samples by default.
The parse_data_info()
method in the BaseDataset is used to process a raw data info in the annotation file into one or more training/test data samples. The user needs to implement the parse_data_info()
method if they want to customize the dataset class.
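As a small sketch of the metainfo priority described above (the class and values below are hypothetical): setting BaseDataset.METAINFO on a subclass provides the medium-priority source, while a metainfo dict passed to __init__() overrides both it and the annotation file.
from mmengine.dataset import BaseDataset

class ExampleDataset(BaseDataset):
    # Medium priority: the class attribute.
    METAINFO = dict(classes=('cat', 'dog'))

# Highest priority: the argument passed at instantiation; it overrides METAINFO and
# the metainfo stored in the annotation file (which has the lowest priority).
# dataset = ExampleDataset(ann_file='annotations/train.json',
#                          metainfo=dict(classes=('cat', 'dog', 'bird')))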
The interface of BaseDataset¶
Once the BaseDataset is initialized, it supports __getitem__
method to index a data info and __len__
method to get the length of dataset, just like torch.utils.data.Dataset
. The BaseDataset provides the following interfaces:
metainfo
: Return the meta information with a dictionary value.get_data_info(idx)
: Return the full data information of the givenidx
, and the return value is a dictionary.__getitem__(idx)
: Return the results of data pipeline(The input data of model) of the given ‘idx’, and the return value is a dictionary.__len__()
: Return the length of the dataset. The return value is an integer.get_subset_(indices)
: Modify the original dataset class in place according to
. Ifindices
isint
, then the original dataset class contains only the first few data samples. Ifindices
isSequence[int]
, the raw dataset class contains data samples specified according toSequence[int]
.get_subset(indices)
: Return a new sub-dataset class according to indices, i.e., re-copies a sub-dataset. Ifindices
isint
, the returned sub-dataset object contains only the first few data samples. Ifindices
isSequence[int]
, the returned sub-dataset object contains the data samples specified according toSequence[int]
.
Customize dataset class based on BaseDataset¶
We can customize the dataset class based on BaseDataset, after we understand the initialization process of BaseDataset and the provided interfaces of BaseDataset.
Annotation files that meet the OpenMMLab 2.0 dataset format specification¶
As mentioned above, users can overload parse_data_info()
to load annotation files that meet the OpenMMLab 2.0 dataset format specification. Here is an example of using BaseDataset to implement a specific dataset.
import os.path as osp
from mmengine.dataset import BaseDataset
class ToyDataset(BaseDataset):
# Take the above annotation file as example. The raw_data_info represents a dictionary in the data_list list:
# {
# 'img_path': "xxx/xxx_0.jpg",
# 'img_label': 0,
# ...
# }
def parse_data_info(self, raw_data_info):
data_info = raw_data_info
img_prefix = self.data_prefix.get('img_path', None)
if img_prefix is not None:
data_info['img_path'] = osp.join(
img_prefix, data_info['img_path'])
return data_info
Using Customized dataset class¶
The ToyDataset
can be instantiated with the following configuration, once it has been defined:
import cv2
class LoadImage:
def __call__(self, results):
results['img'] = cv2.imread(results['img_path'])
return results
class ParseImage:
def __call__(self, results):
results['img_shape'] = results['img'].shape
return results
pipeline = [
LoadImage(),
ParseImage(),
]
toy_dataset = ToyDataset(
data_root='data/',
data_prefix=dict(img_path='train/'),
ann_file='annotations/train.json',
pipeline=pipeline)
At the same time, the external interface provided by the BaseDataset can be used to access specific data sample information:
toy_dataset.metainfo
# dict(classes=('cat', 'dog'))
toy_dataset.get_data_info(0)
# {
# 'img_path': "data/train/xxx/xxx_0.jpg",
# 'img_label': 0,
# ...
# }
len(toy_dataset)
# 2
toy_dataset[0]
# {
# 'img_path': "data/train/xxx/xxx_0.jpg",
# 'img_label': 0,
# 'img': a ndarray with shape (H, W, 3), which denotes the value of the image,
# 'img_shape': (H, W, 3) ,
# ...
# }
# The `get_subset` interface does not modify the original dataset class, i.e., it returns a complete copy of it
sub_toy_dataset = toy_dataset.get_subset(1)
len(toy_dataset), len(sub_toy_dataset)
# 2, 1
# The `get_subset_` interface modifies the original dataset class in place
toy_dataset.get_subset_(1)
len(toy_dataset)
# 1
Following the above steps, we can see how to customize a dataset based on the BaseDataset and how to use the customized dataset.
Customize dataset for videos¶
In the above examples, each raw data info of the annotation file contains only one training/test sample (usually in the image field). If each raw data info contains several training/test samples (usually in the video domain), we only need to ensure that the return value of parse_data_info()
is list[dict]
:
from mmengine.dataset import BaseDataset
class ToyVideoDataset(BaseDataset):
# raw_data_info is still a dict, but it contains multiple samples
def parse_data_info(self, raw_data_info):
data_list = []
...
for ... :
data_info = dict()
...
data_list.append(data_info)
return data_list
The usage of ToyVideoDataset
is similar to that of ToyDataset
, which will not be repeated here.
Annotation files that do not meet the OpenMMLab 2.0 dataset format specification¶
For annotation files that do not meet the OpenMMLab 2.0 dataset format specification, there are two ways to use them:
Convert the annotation files that do not meet the specifications into the annotation files that do meet the specifications, and then use the BaseDataset in the above way.
Implement a new dataset class that inherits from the
BaseDataset
and overloads theload_data_list(self):
function of theBaseDataset
to handle annotation files that don’t meet the specification and guarantee a return value oflist[dict]
, where eachdict
represents a data sample.
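A minimal sketch of the second option, assuming a hypothetical plain-text annotation file in which each line is "<img_path> <label>":
from mmengine.dataset import BaseDataset

class CustomTxtDataset(BaseDataset):

    def load_data_list(self):
        # Parse the non-standard annotation file ourselves and return list[dict],
        # where each dict represents one data sample.
        data_list = []
        with open(self.ann_file) as f:
            for line in f:
                img_path, img_label = line.strip().split()
                data_list.append(dict(img_path=img_path, img_label=int(img_label)))
        return data_list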
Other features of BaseDataset¶
The BaseDataset also contains the following features:
lazy init¶
When the BaseDataset is instantiated, the annotation file needs to be read and parsed, which takes some time. However, in some cases, such as the visualization of predictions, only the meta information of the BaseDataset is required, so reading and parsing the annotation file may not be necessary. To save time on instantiating the BaseDataset in such cases, the BaseDataset supports lazy init:
pipeline = [
LoadImage(),
ParseImage(),
]
toy_dataset = ToyDataset(
data_root='data/',
data_prefix=dict(img_path='train/'),
ann_file='annotations/train.json',
pipeline=pipeline,
# Pass the lazy_init variable in here
lazy_init=True)
When lazy_init=True
, the initialization of ToyDataset only performs steps 1, 2, and 3 of the BaseDataset initialization process. At this time, toy_dataset
is not fully initialized, since toy_dataset
does not read and parse the annotation file. The toy_dataset
only sets the meta information of the dataset (metainfo
).
Naturally, if you need to access specific data information later, you can manually call the toy_dataset.full_init()
interface to perform the complete initialization process, during which the data annotation file will be read and parsed. Calling the get_data_info (independence idx)
, __len__ ()
, __getitem__ (independence idx)
, get_subset_ (indices)
and get_subset(indices)
interface will also automatically call the full_init()
interface to perform the full initialization process (only on the first call, later calls will not call the full_init()
interface repeatedly):
# Full initialization
toy_dataset.full_init()
# After initialization, you can now get the data info
len(toy_dataset)
# 2
toy_dataset[0]
# {
# 'img_path': "data/train/xxx/xxx_0.jpg",
# 'img_label': 0,
# 'img': a ndarray with shape (H, W, 3), which denotes the value of the image,
# 'img_shape': (H, W, 3) ,
# ...
# }
Notice:
Performing full initialization by calling the __getitem__()
interface directly carries some risks: if a dataset object is created with lazy_init=True
and then sent directly to the dataloader, different dataloader workers will read and parse the annotation file at the same time during subsequent data loading. Although this may work normally, it consumes a lot of time and memory. Therefore, it is recommended to manually call the full_init()
interface to perform the full initialization process before you need to access specific data.
This pattern, where the dataset is first left not fully initialized by setting lazy_init=True
and then fully initialized on demand, is called lazy init.
Save memory¶
When reading data, the dataloader usually prefetches data with multiple dataloader workers, and each worker holds a complete backup of the dataset object, so there will be multiple copies of the same data_list
in memory. To reduce this memory consumption, the BaseDataset
can serialize data_list
into memory in advance, so that multiple workers can share the same copy of data_list
and thus save memory.
By default, the BaseDataset stores the serialization of data_list
into memory. It is also possible to control whether the data will be serialized into memory ahead of time by using the serialize_data
argument (default is True
) :
pipeline = [
LoadImage(),
ParseImage(),
]
toy_dataset = ToyDataset(
data_root='data/',
data_prefix=dict(img_path='train/'),
ann_file='annotations/train.json',
pipeline=pipeline,
# Pass the serialize data argument in here
serialize_data=False)
The above example does not store the data_list
serialization into memory in advance, so instantiating the dataset class this way is not recommended when the dataloader uses multiple workers to load the data.
DatasetWrappers¶
In addition to BaseDataset, MMEngine also provides several DatasetWrappers: ConcatDataset
, RepeatDataset
, ClassBalancedDataset
. These dataset wrappers also support lazy init and have memory-saving features.
ConcatDataset¶
MMEngine provides a ConcatDataset
wrapper to concatenate datasets in the following way:
from mmengine.dataset import ConcatDataset
pipeline = [
LoadImage(),
ParseImage(),
]
toy_dataset_1 = ToyDataset(
data_root='data/',
data_prefix=dict(img_path='train/'),
ann_file='annotations/train.json',
pipeline=pipeline)
toy_dataset_2 = ToyDataset(
data_root='data/',
data_prefix=dict(img_path='val/'),
ann_file='annotations/val.json',
pipeline=pipeline)
toy_dataset_12 = ConcatDataset(datasets=[toy_dataset_1, toy_dataset_2])
The above example combines the train
set and the val
set of the dataset into one large dataset.
RepeatDataset¶
MMEngine provides RepeatDataset
wrapper to repeat a dataset several times, as follows:
from mmengine.dataset import RepeatDataset
pipeline = [
LoadImage(),
ParseImage(),
]
toy_dataset = ToyDataset(
data_root='data/',
data_prefix=dict(img_path='train/'),
ann_file='annotations/train.json',
pipeline=pipeline)
toy_dataset_repeat = RepeatDataset(dataset=toy_dataset, times=5)
The above example samples the train
set of the dataset five times.
ClassBalancedDataset¶
MMEngine provides ClassBalancedDataset
wrapper to repeatedly sample the corresponding samples based on the frequency of category occurrence in the dataset.
Notice:
The ClassBalancedDataset
wrapper assumes that the wrapped dataset class supports the get_cat_ids(idx)
method, which returns a list. The list contains the categories of data_info
given by ‘idx’. The usage is as follows:
import os.path as osp
from mmengine.dataset import BaseDataset, ClassBalancedDataset
class ToyDataset(BaseDataset):
def parse_data_info(self, raw_data_info):
data_info = raw_data_info
img_prefix = self.data_prefix.get('img_path', None)
if img_prefix is not None:
data_info['img_path'] = osp.join(
img_prefix, data_info['img_path'])
return data_info
# The necessary method that needs to return the category of data sample
def get_cat_ids(self, idx):
data_info = self.get_data_info(idx)
return [int(data_info['img_label'])]
pipeline = [
LoadImage(),
ParseImage(),
]
toy_dataset = ToyDataset(
data_root='data/',
data_prefix=dict(img_path='train/'),
ann_file='annotations/train.json',
pipeline=pipeline)
toy_dataset_repeat = ClassBalancedDataset(dataset=toy_dataset, oversample_thr=1e-3)
The above example resamples the train
set of the dataset with oversample_thr=1e-3
. Specifically, for categories whose frequency is less than 1e-3
in the dataset, samples corresponding to this category will be sampled repeatedly; otherwise, samples will not be sampled repeatedly. Please refer to the API documentation of ClassBalancedDataset
for specific sampling policies.
Customize DatasetWrapper¶
Since the BaseDataset supports lazy init, some rules need to be followed when customizing a DatasetWrapper. Here is an example showing how to customize a DatasetWrapper:
import copy
import warnings
from mmengine.dataset import BaseDataset, force_full_init
from mmengine.registry import DATASETS
@DATASETS.register_module()
class ExampleDatasetWrapper:
def __init__(self, dataset, lazy_init=False, ...):
# Build the source dataset(self.dataset)
if isinstance(dataset, dict):
self.dataset = DATASETS.build(dataset)
elif isinstance(dataset, BaseDataset):
self.dataset = dataset
else:
raise TypeError(
'elements in datasets sequence should be config or '
f'`BaseDataset` instance, but got {type(dataset)}')
# Record the meta information of source dataset
self._metainfo = self.dataset.metainfo
'''
1. Implement some code here to record some of the hyperparameters used to wrap the dataset.
'''
self._fully_initialized = False
if not lazy_init:
self.full_init()
def full_init(self):
if self._fully_initialized:
return
# Initialize the source dataset completely
self.dataset.full_init()
'''
2. Implement some code here to wrap the source dataset.
'''
self._fully_initialized = True
@force_full_init
def _get_ori_dataset_idx(self, idx: int):
'''
3. Implement some code here to map the wrapped index `idx` to the index of the source dataset 'ori_idx'.
'''
ori_idx = ...
return ori_idx
# Provide the same external interface as `self.dataset `.
@force_full_init
def get_data_info(self, idx):
sample_idx = self._get_ori_dataset_idx(idx)
return self.dataset.get_data_info(sample_idx)
# Provide the same external interface as `self.dataset `.
def __getitem__(self, idx):
if not self._fully_initialized:
warnings.warn('Please call `full_init` method manually to '
'accelerate the speed.')
self.full_init()
sample_idx = self._get_ori_dataset_idx(idx)
return self.dataset[sample_idx]
# Provide the same external interface as `self.dataset `.
@force_full_init
def __len__(self):
'''
4. Implement some code here to calculate the length of the wrapped dataset.
'''
len_wrapper = ...
return len_wrapper
# Provide the same external interface as `self.dataset `.
@property
def metainfo(self):
return copy.deepcopy(self._metainfo)
Data transform¶
In the OpenMMLab repositories, dataset construction and data preparation are decoupled from each other. Usually, the dataset construction only parses the dataset and records the basic information of each sample, while the data preparation is performed by a series of data transforms, such as data loading, preprocessing, and formatting based on the basic information of the samples.
To use Data Transforms¶
In MMEngine, we use various callable data transform classes to perform data manipulation. These data transform classes accept several configuration parameters at instantiation and then process the input data dictionary when called. All data transforms accept a dictionary as input and output the processed data as a dictionary. A simple example is as below:
Note
In MMEngine, we don’t have the implementations of data transforms. You can find the base data transform class and many other data transforms in MMCV. So you need to install MMCV before following this tutorial; see the MMCV installation guide.
>>> import numpy as np
>>> from mmcv.transforms import Resize
>>>
>>> transform = Resize(scale=(224, 224))
>>> data_dict = {'img': np.random.rand(256, 256, 3)}
>>> data_dict = transform(data_dict)
>>> print(data_dict['img'].shape)
(224, 224, 3)
To use in Config Files¶
In config files, we can compose multiple data transforms into a list, called a data pipeline. The data pipeline is an argument of the dataset.
Usually, a data pipeline consists of the following parts:
Data loading, use
LoadImageFromFile
to load image files.Label loading, use
LoadAnnotations
to load the bboxes, semantic segmentation and keypoint annotations.Data processing and augmentation, like
RandomResize
Data formatting, where we use different data transforms for different tasks. The data transform for a specific task is implemented in the corresponding repository. For example, the data formatting transform for the image classification task is
PackClsInputs
and it’s in MMClassification.
Here, taking the classification task as an example, we show a typical data pipeline in the figure below. For each sample, the basic information stored in the dataset is a dictionary as shown on the far left side of the figure, after which, every blue block represents a data transform, and in every data transform, we add some new fields (marked in green) or update some existing fields (marked in orange) in the data dictionary.

If we want to use the above data pipeline in our config file, we can use the settings below:
test_dataloader = dict(
batch_size=32,
dataset=dict(
type='ImageNet',
data_root='data/imagenet',
pipeline = [
dict(type='LoadImageFromFile'),
dict(type='Resize', size=256, keep_ratio=True),
dict(type='CenterCrop', crop_size=224),
dict(type='PackClsInputs'),
]
)
)
Common Data Transforms¶
According to the functionality, the data transform classes can be divided into data loading, data pre-processing & augmentation and data formatting.
Data Loading¶
To support loading large-scale dataset, usually we won’t load all dense data during dataset construction, but only load the file path of these data. Therefore, we need to load these data in the data pipeline.
Data Transforms | Functionality |
---|---|
LoadImageFromFile | Load images according to the path. |
LoadAnnotations | Load and format annotations information, including bbox, segmentation map and others. |
Data Pre-processing & Augmentation¶
Data transforms for pre-processing and augmentation usually manipulate the image and annotation data, like cropping, padding, resizing and others.
Data Transforms | Functionality |
---|---|
Pad | Pad the margin of images. |
CenterCrop | Crop the image and keep the center part. |
Normalize | Normalize the image pixels. |
Resize | Resize images to the specified scale or ratio. |
RandomResize | Resize images to a random scale in the specified range. |
RandomChoiceResize | Resize images to a random scale from several specified scales. |
RandomGrayscale | Randomly grayscale images. |
RandomFlip | Randomly flip images. |
Data Formatting¶
Data formatting transforms will convert the data to some specified type.
Data Transforms | Functionality |
---|---|
ToTensor | Convert the data of specified fields to torch.Tensor. |
ImageToTensor | Convert images to torch.Tensor. |
Custom Data Transform Classes¶
To implement a new data transform class, the class needs to inherit BaseTransform
and implement the transform
method. Here, we use a simple flip transform (MyFlip
) as an example:
import random
import mmcv
from mmcv.transforms import BaseTransform, TRANSFORMS
@TRANSFORMS.register_module()
class MyFlip(BaseTransform):
def __init__(self, direction: str):
super().__init__()
self.direction = direction
def transform(self, results: dict) -> dict:
img = results['img']
results['img'] = mmcv.imflip(img, direction=self.direction)
return results
Then, we can instantiate a MyFlip
object and use it to process our data dictionary.
import numpy as np
transform = MyFlip(direction='horizontal')
data_dict = {'img': np.random.rand(224, 224, 3)}
data_dict = transform(data_dict)
processed_img = data_dict['img']
Or, use it in the data pipeline by modifying our config file:
pipeline = [
...
dict(type='MyFlip', direction='horizontal'),
...
]
Please note that to use the class in our config file, we need to confirm the MyFlip
class is imported
at runtime.
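One way to guarantee this (a sketch, assuming MyFlip is defined in a local module named my_transforms.py) is the custom_imports field described in the Config tutorial:
# Make sure my_transforms.py is imported when the config is loaded, so that the
# @TRANSFORMS.register_module() decorator on MyFlip runs.
custom_imports = dict(imports=['my_transforms'], allow_failed_imports=False)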
Initialization¶
Usually, we’ll customize our module based on nn.Module, which is implemented in native PyTorch. torch.nn.init can also help us initialize the parameters of the model easily. To simplify the process of model construction and initialization, MMEngine designed the BaseModule to help us define and initialize the model from config easily.
Initialize the model from config¶
The core function of BaseModule
is that it could help us to initialize the model from config. Subclasses inherited from BaseModule
could define the init_cfg
in the __init__
function, and we can choose the method of initialization by configuring init_cfg
.
Currently, we support the following initialization methods:
Initializer | Registered name | Function |
---|---|---|
ConstantInit | Constant | Initialize the weight and bias with a constant, commonly used for Convolution |
XavierInit | Xavier | Initialize the weight by Xavier initialization, and initialize the bias with a constant |
NormalInit | Normal | Initialize the weight by normal distribution, and initialize the bias with a constant |
TruncNormalInit | TruncNormal | Initialize the weight by truncated normal distribution, and initialize the bias with a constant, commonly used for Transformer |
UniformInit | Uniform | Initialize the weight by uniform distribution, and initialize the bias with a constant, commonly used for convolution |
KaimingInit | Kaiming | Initialize the weight by Kaiming initialization, and initialize the bias with a constant. Commonly used for convolution |
Caffe2XavierInit | Caffe2Xavier | Xavier initialization in Caffe2, and Kaiming initialization in PyTorch with "fan_in" and "normal" mode. Commonly used for convolution |
PretrainedInit | Pretrained | Initialize the model with the pretrained model |
Initialize the model with pretrained model¶
Defining the ToyNet
as below:
import torch
import torch.nn as nn
from mmengine.model import BaseModule
class ToyNet(BaseModule):
def __init__(self, init_cfg=None):
super().__init__(init_cfg)
self.conv1 = nn.Linear(1, 1)
# Save the checkpoint.
toy_net = ToyNet()
torch.save(toy_net.state_dict(), './pretrained.pth')
pretrained = './pretrained.pth'
toy_net = ToyNet(init_cfg=dict(type='Pretrained', checkpoint=pretrained))
and then we can configure the init_cfg
to make it load the pretrained model by calling init_weights()
after its construction.
# Initialize the model with the saved checkpoint.
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO - load model from: ./pretrained.pth
08/19 16:50:24 - mmengine - INFO - local loads checkpoint from path: ./pretrained.pth
If init_cfg
is a dict
, type
means a kind of initializer registered in WEIGHT_INITIALIZERS
. The Pretrained
means PretrainedInit
, which could help us to load the target checkpoint.
All initializers have the same mapping relationship like Pretrained
-> PretrainedInit
, which strips the suffix Init
of the class name. The checkpoint
argument of PretrainedInit
means the path of the checkpoint. It could be a local path or a URL.
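For example (a sketch with a placeholder URL), a remote checkpoint can be configured directly:
# The URL below is a placeholder, not a real checkpoint; remote checkpoints are
# downloaded and cached before loading.
toy_net = ToyNet(
    init_cfg=dict(
        type='Pretrained',
        checkpoint='https://example.com/checkpoints/toy_net.pth'))
toy_net.init_weights()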
Commonly used initialization methods¶
Similarly, we could use the Kaiming
initialization just like Pretrained
initializer. For example, we could make init_cfg=dict(type='Kaiming', layer='Conv2d')
to initialize all Conv2d
module with Kaiming
initialization.
Sometimes we need to initialize the model with different initialization methods for different modules. For example, we could initialize the Conv2d
module with Kaiming
initialization and initialize the Linear
module with Xavier
initialization. In this case, we can pass init_cfg a list of dicts, each specifying the initializer and the layer type it applies to:
import torch.nn as nn
from mmengine.model import BaseModule
class ToyNet(BaseModule):
def __init__(self, init_cfg=None):
super().__init__(init_cfg)
self.linear = nn.Linear(1, 1)
self.conv = nn.Conv2d(1, 1, 1)
# Apply `Kaiming` initialization to `Conv2d` module and `Xavier` initialization to `Linear` module.
toy_net = ToyNet(
init_cfg=[
dict(type='Kaiming', layer='Conv2d'),
dict(type='Xavier', layer='Linear')
], )
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
linear.weight - torch.Size([1, 1]):
XavierInit: gain=1, distribution=normal, bias=0
08/19 16:50:24 - mmengine - INFO -
linear.bias - torch.Size([1]):
XavierInit: gain=1, distribution=normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
layer
could also be a list, each element of which means a type of applied module.
# Apply Kaiming initialization to `Conv2d` and `Linear` module.
toy_net = ToyNet(init_cfg=[dict(type='Kaiming', layer=['Conv2d', 'Linear'])], )
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
linear.weight - torch.Size([1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
linear.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
More fine-grained initialization¶
Sometimes we need to initialize the same type of module with different types of initialization. For example, we’ve defined conv1
and conv2
submodules, and we want to initialize the conv1
with Kaiming
initialization and conv2
with Xavier
initialization. We could configure the init_cfg with override
:
import torch.nn as nn
from mmengine.model import BaseModule
class ToyNet(BaseModule):
def __init__(self, init_cfg=None):
super().__init__(init_cfg)
self.conv1 = nn.Conv2d(1, 1, 1)
self.conv2 = nn.Conv2d(1, 1, 1)
# Apply `Kaiming` initialization to `conv1` and `Xavier` initialization to `conv2`.
toy_net = ToyNet(
init_cfg=[
dict(
type='Kaiming',
layer=['Conv2d'],
override=dict(name='conv2', type='Xavier')),
], )
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
conv1.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv1.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv2.weight - torch.Size([1, 1, 1, 1]):
XavierInit: gain=1, distribution=normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv2.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
override
could be understood as a nested init_cfg
, which could also be a list
or dict
, and we should also set “type
” for it. The difference is that we must set name
in override
to specify the applied scope of the submodule. As in the example above, we set name='conv2'
to specify that the Xavier
initialization is applied to all submodules of toy_net.conv2
.
Customize the initialization method¶
Although the init_cfg
could control the initialization method for different modules, we would have to register a new initialization method to WEIGHT_INITIALIZERS
if we want to customize the initialization process, which is not very convenient. Alternatively, we can also override the init_weights
method to customize the initialization process.
Assuming we’ve defined the following modules:
ToyConv
inherit fromnn.Module
, implementsinit_weights
which initializecustom_weight
(parameter
ofToyConv
) with 1 and initializecustom_bias
with 0ToyNet
defines aToyConv
submodule.
ToyNet.init_weights
will call init_weights
of all submodules sequentially.
import torch
import torch.nn as nn
from mmengine.model import BaseModule
class ToyConv(nn.Module):
def __init__(self):
super().__init__()
self.custom_weight = nn.Parameter(torch.empty(1, 1, 1, 1))
self.custom_bias = nn.Parameter(torch.empty(1))
def init_weights(self):
with torch.no_grad():
self.custom_weight = self.custom_weight.fill_(1)
self.custom_bias = self.custom_bias.fill_(0)
class ToyNet(BaseModule):
def __init__(self, init_cfg=None):
super().__init__(init_cfg)
self.conv1 = nn.Conv2d(1, 1, 1)
self.conv2 = nn.Conv2d(1, 1, 1)
self.custom_conv = ToyConv()
toy_net = ToyNet(
init_cfg=[
dict(
type='Kaiming',
layer=['Conv2d'],
override=dict(name='conv2', type='Xavier'))
])
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
conv1.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv1.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv2.weight - torch.Size([1, 1, 1, 1]):
XavierInit: gain=1, distribution=normal, bias=0
08/19 16:50:24 - mmengine - INFO -
conv2.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0
08/19 16:50:24 - mmengine - INFO -
custom_conv.custom_weight - torch.Size([1, 1, 1, 1]):
Initialized by user-defined `init_weights` in ToyConv
08/19 16:50:24 - mmengine - INFO -
custom_conv.custom_bias - torch.Size([1]):
Initialized by user-defined `init_weights` in ToyConv
Conclusion¶
1. Configure init_cfg to initialize the model
Commonly used for the initialization of Conv2d, Linear and other low-level modules. All initialization methods must be managed by WEIGHT_INITIALIZERS.
Initialization is controlled dynamically by init_cfg.
2. Customize init_weights
Compared to configuring init_cfg, implementing init_weights is simpler and does not require registration. However, it is not as flexible as init_cfg, and the module cannot be initialized dynamically.
Note
The priority of init_weights is higher than init_cfg.
Runner will call init_weights in Runner.train().
Initialize module with function¶
As mentioned in the previous section, we can customize initialization in init_weights. To make module initialization more convenient, MMEngine provides a series of module initialization functions built on top of torch.nn.init. For example, suppose we want to initialize the weights of a convolutional layer with a normal distribution and its bias with a constant. With torch.nn.init, the implementation is as follows:
from torch.nn.init import normal_, constant_
import torch.nn as nn
model = nn.Conv2d(1, 1, 1)
normal_(model.weight, mean=0, std=0.01)
constant_(model.bias, val=0)
Parameter containing:
tensor([0.], requires_grad=True)
The above is the standard process for initializing a convolutional module with a normal distribution, so MMEngine simplifies it by implementing a series of common module initialization functions. Compared with torch.nn.init, these module initialization functions accept the convolution module directly:
from mmengine.model import normal_init
normal_init(model, mean=0, std=0.01, bias=0)
Similarly, we could also use Kaiming initialization and Xavier initialization:
from mmengine.model import kaiming_init, xavier_init
kaiming_init(model)
xavier_init(model)
Currently, MMEngine provides the following initialization functions:
Initialization function | Function |
---|---|
constant_init | Initialize the weight and bias with a constant, commonly used for convolution |
xavier_init | Initialize the weight by Xavier initialization, and initialize the bias with a constant |
normal_init | Initialize the weight by normal distribution, and initialize the bias with a constant |
trunc_normal_init | Initialize the weight by truncated normal distribution, and initialize the bias with a constant, commonly used for Transformer |
uniform_init | Initialize the weight by uniform distribution, and initialize the bias with a constant, commonly used for convolution |
kaiming_init | Initialize the weight by Kaiming initialization, and initialize the bias with a constant, commonly used for convolution |
caffe2_xavier_init | Xavier initialization in Caffe2, equivalent to Kaiming initialization in PyTorch with "fan_in" and "normal" mode, commonly used for convolution |
bias_init_with_prob | Initialize the bias from a given probability |
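As a quick illustration (a hedged sketch, not part of the table above), the snippet below uses normal_init together with bias_init_with_prob to initialize a toy classification convolution; the channel numbers and the 0.01 prior probability are arbitrary choices for this example:
import torch.nn as nn
from mmengine.model import bias_init_with_prob, normal_init
# A toy classification branch; the shapes are arbitrary.
cls_conv = nn.Conv2d(256, 80, 3, padding=1)
# Convert a prior probability of 0.01 into a bias value, then initialize the
# weights with a small normal distribution and the bias with that value.
bias_value = bias_init_with_prob(0.01)
normal_init(cls_conv, mean=0, std=0.01, bias=bias_value)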
Visualization¶
Visualization provides an intuitive explanation of the training and testing process of the deep learning model.
MMEngine provides Visualizer
to visualize and store the state and intermediate results of the model training and testing process, with the following features:
It supports basic drawing interface and feature map visualization
It enables recording training states (such as loss and lr), performance evaluation metrics, and visualization results to a specified or multiple backends, including local device, TensorBoard, and WandB.
It can be used in any location in the code base.
Basic Drawing APIs¶
Visualizer
provides drawing APIs for common objects such as detection bboxes, points, text, lines, circles, polygons, and binary masks.
These APIs have the following features:
Can be called multiple times to achieve overlay drawing requirements.
All support multiple input types such as Tensor, Numpy array, etc.
Typical usages are as follows.
Draw detection bboxes, masks, text, etc.
import torch
import mmcv
from mmengine.visualization import Visualizer
image = mmcv.imread('docs/en/_static/image/cat_dog.png', channel_order='rgb')
visualizer = Visualizer(image=image)
# single bbox formatted as [xyxy]
visualizer.draw_bboxes(torch.tensor([72, 13, 179, 147]))
# draw multiple bboxes
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.show()

visualizer.set_image(image=image)
visualizer.draw_texts("cat and dog", torch.tensor([10, 20]))
visualizer.show()

You can also customize things like color and width using the parameters in each API.
visualizer.set_image(image=image)
visualizer.draw_bboxes(torch.tensor([72, 13, 179, 147]), edge_colors='r', line_widths=3)
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220]]), line_styles='--')
visualizer.show()

Overlay display
These APIs can be called multiple times to get an overlay result.
visualizer.set_image(image=image)
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.draw_texts("cat and dog",
torch.tensor([10, 20])).draw_circles(torch.tensor([40, 50]), torch.tensor([20]))
visualizer.show()

Feature Map Visualization¶
Feature map visualization provides many options. Currently, we only support the visualization of a single feature map.
@staticmethod
def draw_featmap(featmap: torch.Tensor, # input format must be CHW
overlaid_image: Optional[np.ndarray] = None, # if image data is input at the same time, the feature map will be overlaid on the image
channel_reduction: Optional[str] = 'squeeze_mean', # strategy to reduce multiple channels into a single channel
topk: int = 10, # topk feature maps to show
arrangement: Tuple[int, int] = (5, 2), # the layout when multiple channels are expanded into multiple images
resize_shape:Optional[tuple] = None, # scale the feature map
alpha: float = 0.5) -> np.ndarray: # overlay ratio between input image and generated feature map
The main features can be summarized as follows:
As the input Tensor usually has multiple channels, channel_reduction can reduce them into a single channel and overlay the result on the image.
squeeze_mean reduces the C input channels into a single channel using the mean function, so the output dimension becomes (1, H, W).
select_max selects the channel with the maximum activation, where 'activation' refers to the sum across the spatial dimensions of a channel.
None indicates that no reduction is needed, which allows the user to select the top k feature maps with the highest activation through the topk parameter.
topk is only valid when channel_reduction is None. It selects the top k channels according to the activation and then displays them overlaid with the image. The display layout can be specified using the arrangement parameter.
If topk is not -1, the topk channels with the largest activation will be selected for display.
If topk is -1, the channel number C must be either 1 or 3 to indicate that the input is an image. Otherwise, an error will be raised to prompt the user to reduce the channels with channel_reduction.
Since the input feature map is usually very small, the function can upsample the feature map through resize_shape before the visualization (a short sketch follows).
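For instance, here is a minimal hedged sketch of resize_shape on a dummy feature map (the 256x7x7 shape and the 224x224 target size are arbitrary):
import torch
from mmengine.visualization import Visualizer
# Dummy CHW feature map used only for illustration.
featmap = torch.rand(256, 7, 7)
visualizer = Visualizer()
# Reduce the channels to one and upsample to 224x224 before drawing.
drawn_img = visualizer.draw_featmap(featmap,
                                    channel_reduction='select_max',
                                    resize_shape=(224, 224))
visualizer.show(drawn_img)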
For example, we would like to get the feature map from the layer4 output of a pre-trained ResNet18 model and visualize it.
Reduce the multi-channel feature map into a single channel using select_max and display it.
import numpy as np
from torchvision.models import resnet18
from torchvision.transforms import Compose, Normalize, ToTensor
def preprocess_image(img, mean, std):
preprocessing = Compose([
ToTensor(),
Normalize(mean=mean, std=std)
])
return preprocessing(img.copy()).unsqueeze(0)
model = resnet18(pretrained=True)
def _forward(x):
x = model.conv1(x)
x = model.bn1(x)
x = model.relu(x)
x = model.maxpool(x)
x1 = model.layer1(x)
x2 = model.layer2(x1)
x3 = model.layer3(x2)
x4 = model.layer4(x3)
return x4
model.forward = _forward
image_norm = np.float32(image) / 255
input_tensor = preprocess_image(image_norm,
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225])
feat = model(input_tensor)[0]
visualizer = Visualizer()
drawn_img = visualizer.draw_featmap(feat, channel_reduction='select_max')
visualizer.show(drawn_img)

Since the output feat feature map is only 7x7, the visualization is not informative if we work on it directly. Users can scale the feature map either by overlaying the input image or by setting the resize_shape parameter. If the size of the input image differs from the size of the feature map, the feature map will be resampled to the same spatial size as the input image.
drawn_img = visualizer.draw_featmap(feat, image, channel_reduction='select_max')
visualizer.show(drawn_img)

Select the top five channels with the highest activation in the multi-channel feature map by setting topk=5, then format them into a 2x3 layout.
drawn_img = visualizer.draw_featmap(feat, image, channel_reduction=None, topk=5, arrangement=(2, 3))
visualizer.show(drawn_img)

Users can set their own desired layout through arrangement.
drawn_img = visualizer.draw_featmap(feat, image, channel_reduction=None, topk=5, arrangement=(4, 2))
visualizer.show(drawn_img)

Basic Storage APIs¶
Once the drawing is completed, users can choose to display the result directly or save it to different backends. The backends currently supported by MMEngine include local storage, Tensorboard and WandB. The supported data types include drawn images, scalars, and configurations.
Save the result image
Suppose you want to save to your local device.
visualizer = Visualizer(image=image, vis_backends=[dict(type='LocalVisBackend')], save_dir='temp_dir')
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.draw_texts("cat and dog", torch.tensor([10, 20]))
visualizer.draw_circles(torch.tensor([40, 50]), torch.tensor([20]))
# temp_dir/vis_data/vis_image/demo_0.png will be generated
visualizer.add_image('demo', visualizer.get_image())
The zero in the result file name is used to distinguish different steps.
# temp_dir/vis_data/vis_image/demo_1.png will be generated
visualizer.add_image('demo', visualizer.get_image(), step=1)
# temp_dir/vis_data/vis_image/demo_3.png will be generated
visualizer.add_image('demo', visualizer.get_image(), step=3)
If you want to switch to other backends, you can change the configuration file like this:
# TensorboardVisBackend
visualizer = Visualizer(image=image, vis_backends=[dict(type='TensorboardVisBackend')], save_dir='temp_dir')
# WandbVisBackend
visualizer = Visualizer(image=image, vis_backends=[dict(type='WandbVisBackend')], save_dir='temp_dir')
Store feature maps
visualizer = Visualizer(vis_backends=[dict(type='LocalVisBackend')], save_dir='temp_dir')
drawn_img = visualizer.draw_featmap(feat, image, channel_reduction=None, topk=5, arrangement=(2, 3))
# temp_dir/vis_data/vis_image/feat_0.png will be generated
visualizer.add_image('feat', drawn_img)
Save scalar data such as loss
# temp_dir/vis_data/scalars.json will be generated
# save loss
visualizer.add_scalar('loss', 0.2, step=0)
visualizer.add_scalar('loss', 0.1, step=1)
# save acc
visualizer.add_scalar('acc', 0.7, step=0)
visualizer.add_scalar('acc', 0.8, step=1)
Multiple scalar data can also be saved at once.
# New contents will be added to the temp_dir/vis_data/scalars.json
visualizer.add_scalars({'loss': 0.3, 'acc': 0.8}, step=3)
Save configurations
from mmengine import Config
cfg=Config.fromfile('tests/data/config/py_config/config.py')
# temp_dir/vis_data/config.py will be saved
visualizer.add_config(cfg)
Various Storage Backends¶
Any Visualizer can be configured with any number of storage backends. Visualizer will loop through all the configured backends and save the results to each one.
visualizer = Visualizer(image=image, vis_backends=[dict(type='TensorboardVisBackend'),
dict(type='LocalVisBackend')],
save_dir='temp_dir')
# temp_dir/vis_data/events.out.tfevents.xxx files will be generated
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.draw_texts("cat and dog", torch.tensor([10, 20]))
visualizer.draw_circles(torch.tensor([40, 50]), torch.tensor([20]))
visualizer.add_image('demo', visualizer.get_image())
Note: If there are multiple backends used at the same time, the name field must be specified. Otherwise, it is impossible to distinguish which backend is which.
visualizer = Visualizer(image=image, vis_backends=[dict(type='TensorboardVisBackend', name='tb_1', save_dir='temp_dir_1'),
dict(type='TensorboardVisBackend', name='tb_2', save_dir='temp_dir_2'),
dict(type='LocalVisBackend', name='local')],
save_dir='temp_dir')
Visualize Anywhere¶
During development, users may need to add visualization functions somewhere in their code and save the results to different backends, which is very common for analysis and debugging. Visualizer in MMEngine allows different locations in the code to obtain the same visualizer instance and use it to record data.
Users only need to instantiate the visualizer through get_instance during initialization. The instance obtained this way is unique and globally accessible, and it can be fetched anywhere in the code through Visualizer.get_current_instance().
# call during the initialization stage
visualizer1 = Visualizer.get_instance(name='vis', vis_backends=[dict(type='LocalVisBackend')])
# call anywhere
visualizer2 = Visualizer.get_current_instance()
visualizer2.add_scalar('map', 0.7, step=0)
assert id(visualizer1) == id(visualizer2)
It can also be initialized globally through the config field.
from mmengine.registry import VISUALIZERS
visualizer_cfg=dict(
type='Visualizer',
name='vis_new',
vis_backends=[dict(type='LocalVisBackend')])
VISUALIZERS.build(visualizer_cfg)
Customize Storage Backends and Visualizers¶
Call a specific storage backend
The storage backend only provides basic functions such as saving configurations and scalars. However, users may want to utilize other powerful backend features, like those of WandB and Tensorboard. Therefore, the storage backend provides the experiment attribute so that users can obtain the backend object and use its own features.
For example, WandB provides an API to display tables. Users can obtain the WandB object through the experiment attribute and then call its API to save the data as a table:
visualizer = Visualizer(image=image, vis_backends=[dict(type='WandbVisBackend')],
save_dir='temp_dir')
# get WandB object
wandb = visualizer.get_backend('WandbVisBackend').experiment
# add data to the table
table = wandb.Table(columns=["step", "mAP"])
table.add_data(1, 0.2)
table.add_data(2, 0.5)
table.add_data(3, 0.9)
# save
wandb.log({"table": table})
Customize storage backends
Users only need to inherit BaseVisBackend and implement various add_xx methods to customize the storage backend easily.
from mmengine.registry import VISBACKENDS
from mmengine.visualization import BaseVisBackend
@VISBACKENDS.register_module()
class DemoVisBackend(BaseVisBackend):
def add_image(self, **kwargs):
pass
visualizer = Visualizer(vis_backends=[dict(type='DemoVisBackend')], save_dir='temp_dir')
visualizer.add_image('demo',image)
Customize visualizers
Similarly, users can easily customize the visualizer by inheriting Visualizer and implementing the functions they want to override.
In most cases, users need to override add_datasample. The data usually includes detection bboxes and instance masks from annotations or model predictions. This interface is for drawing datasample data for various downstream libraries. Taking MMDetection as an example, the datasample data usually includes labeled bboxes, labeled masks, predicted bboxes, or predicted masks. MMDetection inherits Visualizer and implements the add_datasample interface, drawing the data related to the detection task.
from mmengine.registry import VISUALIZERS
@VISUALIZERS.register_module()
class DetLocalVisualizer(Visualizer):
def add_datasample(self,
name,
image: np.ndarray,
data_sample: Optional['BaseDataElement'] = None,
draw_gt: bool = True,
draw_pred: bool = True,
show: bool = False,
wait_time: int = 0,
step: int = 0) -> None:
pass
visualizer_cfg = dict(
type='DetLocalVisualizer', vis_backends=[dict(type='WandbVisBackend')], name='visualizer')
# global initialize
VISUALIZERS.build(visualizer_cfg)
# call anywhere in your code
det_local_visualizer = Visualizer.get_current_instance()
det_local_visualizer.add_datasample('det', image, data_sample)
Abstract Data Element¶
Coming soon. Please refer to chinese documentation.
Distribution Communication¶
In distributed training, different processes sometimes need to apply different logic depending on their ranks, local_ranks, etc. They also need to communicate with each other and synchronize data. These demands rely on distributed communication. PyTorch provides a set of basic distributed communication primitives. Based on these primitives, MMEngine provides some higher-level APIs to meet more diverse demands. Using these APIs provided by MMEngine, modules can:
ignore the differences between distributed/non-distributed environment
deliver data in various types apart from Tensor
ignore the frameworks or backends used for communication
These APIs are roughly categorized into 3 types:
Initialization: init_dist for setting up the distributed environment for the runner
Query & control: functions such as get_world_size for querying world_size, rank and other distributed information
Collective communication: collective communication functions such as all_reduce
We will detail on these APIs in the following chapters.
Initialization¶
init_dist: Launch function of distributed training. Currently it supports 3 launchers including pytorch, slurm and MPI. It also sets up the given communication backend, which defaults to NCCL.
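Below is a minimal hedged sketch of calling init_dist before building the runner; the launcher and backend are illustrative choices, and the script is assumed to be started by a distributed launcher such as torchrun:
from mmengine.dist import init_dist
# Set up the distributed environment. The 'pytorch' launcher reads the
# environment variables set by torchrun / torch.distributed.launch.
init_dist(launcher='pytorch', backend='nccl')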
Query and control¶
The query and control functions are all argument-free. They can be used in both distributed and non-distributed environments. Their functionalities are listed below (a short usage sketch follows the list):
get_world_size: Returns the number of processes in the current process group. Returns 1 when non-distributed
get_rank: Returns the global rank of the current process in the current process group. Returns 0 when non-distributed
get_backend: Returns the communication backend used by the current process group. Returns None when non-distributed
get_local_rank: Returns the local rank of the current process in the current process group. Returns 0 when non-distributed
get_local_size: Returns the number of processes which are both in the current process group and on the same machine as the current process. Returns 1 when non-distributed
get_dist_info: Returns the world_size and rank of the current process group. Returns world_size = 1, rank = 0 when non-distributed
is_main_process: Returns True if the current process is rank 0 in the current process group, otherwise False. Always returns True when non-distributed
master_only: A function decorator. Functions decorated by master_only will only execute on the rank 0 process
barrier: A synchronization primitive. Every process will wait until all processes in the current process group reach the same barrier location
Collective communication¶
Collective communication functions are used for data transfer between processes in the same process group. We provide the following APIs based on PyTorch native functions, including all_reduce, all_gather, gather, and broadcast. These APIs are compatible with non-distributed environments and support more data types than Tensor alone (a short sketch follows the list):
all_reduce: AllReduce operation on Tensors in the current process group
all_gather: AllGather operation on Tensors in the current process group
gather: Gather Tensors in the current process group to a destination rank
broadcast: Broadcast a Tensor to all processes in the current process group
sync_random_seed: Synchronize the random seed between processes in the current process group
broadcast_object_list: Broadcast a list of Python objects. It requires that the objects can be serialized by Pickle
all_reduce_dict: AllReduce operation on a dict. It is based on broadcast and all_reduce
all_gather_object: AllGather operation on any Python object that can be serialized by Pickle. It is based on all_gather
gather_object: Gather Python objects that can be serialized by Pickle
collect_results: Unified API for collecting a list of data in the current process group. It supports both CPU and GPU communication
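As a hedged sketch, the snippet below averages a tensor across processes and gathers a small Python dict from every rank; in a non-distributed environment it simply operates on the local data:
import torch
from mmengine.dist import all_reduce, all_gather_object, get_rank
# Average a tensor over all processes (in-place); this acts only on the
# local tensor when running without distributed initialization.
loss = torch.tensor([1.0])
all_reduce(loss, op='mean')
# Gather an arbitrary picklable object from every rank into a list.
results = all_gather_object({'rank': get_rank(), 'num_samples': 10})
print(results)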
Logging¶
Runner produces a lot of logs during the running process, such as loss, iteration time, learning rate, etc. MMEngine implements a flexible logging system that allows us to choose different types of log statistical methods when configuring the runner. It also allows us to set or get the recorded logs at any location in the code.
Flexible Logging System¶
The logging system is configured by passing a LogProcessor to the runner. If no log processor is passed, the runner will use the default log processor, which is equivalent to:
log_processor = dict(window_size=10, by_epoch=True, custom_cfg=None, num_digits=4)
The format of the output log is as follows:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from mmengine.runner import Runner
from mmengine.model import BaseModel
train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)
class ToyModel(BaseModel):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, img, label, mode):
feat = self.linear(img)
loss1 = (feat - label).pow(2)
loss2 = (feat - label).abs()
return dict(loss1=loss1, loss2=loss2)
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01))
)
runner.train()
08/21 02:58:41 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0019 data_time: 0.0004 loss1: 0.8381 loss2: 0.9007 loss: 1.7388
08/21 02:58:41 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0029 data_time: 0.0010 loss1: 0.1978 loss2: 0.4312 loss: 0.6290
LogProcessor will output the log in the following format:
The prefix of the log:
epoch mode (by_epoch=True): Epoch(train) [{current_epoch}][{current_iteration}/{dataloader_length}]
iteration mode (by_epoch=False): Iter(train) [{current_iteration}/{max_iteration}]
Learning rate (lr): The learning rate of the last iteration.
Time:
time: The averaged inference time of the last window_size iterations.
data_time: The averaged data loading time of the last window_size iterations.
eta: The estimated time of arrival to finish the training.
Loss: The averaged loss output by the model over the last window_size iterations.
Note
window_size=10 by default.
The number of significant digits (num_digits) of the log is 4 by default.
The latest value of all custom logs is output by default.
Based on the rules above, the code snippet will count the average value of loss1 and loss2 every 10 iterations. If we want to count the global average value of loss1, we can set custom_cfg like this:
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
log_processor=dict(
custom_cfg=[
dict(data_src='loss1', # original loss name:loss1
method_name='mean', # statistical method:mean
window_size='global')]) # window_size:global
)
runner.train()
08/21 02:58:49 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0026 data_time: 0.0007 loss1: 0.7381 loss2: 0.8446 loss: 1.5827
08/21 02:58:49 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0030 data_time: 0.0012 loss1: 0.4521 loss2: 0.3939 loss: 0.5600
data_src means the original loss name, method_name means the statistical method, and window_size means the window size of the statistical method. Since we want to count the global average value of loss1, we set window_size to global.
Currently, MMEngine supports the following statistical methods:
statistic method | arguments | function |
---|---|---|
mean | window_size | compute the mean of the log over the last `window_size` values |
min | window_size | compute the minimum of the log over the last `window_size` values |
max | window_size | compute the maximum of the log over the last `window_size` values |
current | / | use the latest value of the log |
window_size mentioned above could be:
an int number: the window size of the statistical method
global: equivalent to window_size=cur_iteration
epoch: equivalent to window_size=len(dataloader)
If we want to compute the average value of loss1 over the last 10 iterations and also its global average value, we need to set log_name additionally:
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
log_processor=dict(
custom_cfg=[
# log_name means the second name of loss1
dict(data_src='loss1', log_name='loss1_global', method_name='mean', window_size='global')])
)
runner.train()
08/21 18:39:32 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0016 data_time: 0.0004 loss1: 0.1512 loss2: 0.3751 loss: 0.5264 loss1_global: 0.1512
08/21 18:39:32 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0051 data_time: 0.0036 loss1: 0.0113 loss2: 0.0856 loss: 0.0970 loss1_global: 0.0813
Similarly, we can also compute the global/local maximum value of loss at the same time.
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
log_processor=dict(custom_cfg=[
# statistic loss1 with the local maximum value
dict(data_src='loss1',
log_name='loss1_local_max',
window_size=10,
method_name='max'),
# statistic loss1 with the global maximum value
dict(
data_src='loss1',
log_name='loss1_global_max',
method_name='max',
window_size='global')
]))
runner.train()
08/21 03:17:26 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0021 data_time: 0.0006 loss1: 1.8495 loss2: 1.3427 loss: 3.1922 loss1_local_max: 2.8872 loss1_global_max: 2.8872
08/21 03:17:26 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0024 data_time: 0.0010 loss1: 0.5464 loss2: 0.7251 loss: 1.2715 loss1_local_max: 2.8872 loss1_global_max: 2.8872
More examples can be found in log_processor.
Customize log¶
The logging system can not only log the loss, lr, etc. but also collect and output custom logs. For example, if we want to log the intermediate loss:
from mmengine.logging import MessageHub
class ToyModel(BaseModel):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, img, label, mode):
feat = self.linear(img)
loss_tmp = (feat - label).abs()
loss = loss_tmp.pow(2)
message_hub = MessageHub.get_current_instance()
# update the intermediate `loss_tmp` in the message hub
message_hub.update_scalar('train/loss_tmp', loss_tmp.sum())
return dict(loss=loss)
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
log_processor=dict(
custom_cfg=[
# statistic the loss_tmp with the averaged value
dict(
data_src='loss_tmp',
window_size=10,
method_name='mean')
]
)
)
runner.train()
08/21 03:40:31 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0026 data_time: 0.0008 loss_tmp: 0.0097 loss: 0.0000
08/21 03:40:31 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0028 data_time: 0.0013 loss_tmp: 0.0065 loss: 0.0000
The custom log is recorded by updating the message hub:
Call MessageHub.get_current_instance() to get the message hub of the runner
Call MessageHub.update_scalar to update the custom log. The first argument is the log name with the mode prefix (train/val/test). The output log will only retain the log name without the mode prefix
Configure the statistical method of loss_tmp in log_processor. If it is not configured, only the latest value of loss_tmp will be logged
Export the debug log¶
Set log_level='DEBUG' for the runner, and the debug log will be exported to the work_dir:
runner = Runner(
model=ToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
log_level='DEBUG',
train_cfg=dict(by_epoch=True, max_epochs=1),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)))
runner.train()
08/21 18:16:22 - mmengine - DEBUG - Get class `LocalVisBackend` from "vis_backend" registry in "mmengine"
08/21 18:16:22 - mmengine - DEBUG - An `LocalVisBackend` instance is built from registry, its implementation can be found in mmengine.visualization.vis_backend
08/21 18:16:22 - mmengine - DEBUG - Get class `RuntimeInfoHook` from "hook" registry in "mmengine"
08/21 18:16:22 - mmengine - DEBUG - An `RuntimeInfoHook` instance is built from registry, its implementation can be found in mmengine.hooks.runtime_info_hook
08/21 18:16:22 - mmengine - DEBUG - Get class `IterTimerHook` from "hook" registry in "mmengine"
...
Besides, the logs of different ranks will be saved in debug mode if you are training your model with shared storage. The hierarchy of the logs is as follows:
./tmp
├── tmp.log
├── tmp_rank1.log
├── tmp_rank2.log
├── tmp_rank3.log
├── tmp_rank4.log
├── tmp_rank5.log
├── tmp_rank6.log
└── tmp_rank7.log
...
└── tmp_rank63.log
The logs of multiple machines with independent storage:
# device: 0:
work_dir/
└── exp_name_logs
├── exp_name.log
├── exp_name_rank1.log
├── exp_name_rank2.log
├── exp_name_rank3.log
...
└── exp_name_rank7.log
# device: 7:
work_dir/
└── exp_name_logs
├── exp_name_rank56.log
├── exp_name_rank57.log
├── exp_name_rank58.log
...
└── exp_name_rank63.log
File IO¶
MMEngine implements a unified set of file reading and writing interfaces in the fileio module. With the fileio module, we can use the same function to handle different file formats, such as json, yaml and pickle. Other file formats can also be easily extended.
The fileio module also supports reading and writing files from a variety of file storage backends, including disk, Petrel (for internal use), Memcached, LMDB, and HTTP.
Load and dump data¶
MMEngine provides a universal API for loading and dumping data. The currently supported formats are json, yaml, and pickle.
Load from disk or dump to disk¶
from mmengine import load, dump
# load data from a file
data = load('test.json')
data = load('test.yaml')
data = load('test.pkl')
# load data from a file-like object
with open('test.json', 'r') as f:
data = load(f, file_format='json')
# dump data to a string
json_str = dump(data, file_format='json')
# dump data to a file with a filename (infer format from file extension)
dump(data, 'out.pkl')
# dump data to a file with a file-like object
with open('test.yaml', 'w') as f:
data = dump(data, f, file_format='yaml')
Load from other backends or dump to other backends¶
from mmengine import load, dump
# load data from a file
data = load('s3://bucket-name/test.json')
data = load('s3://bucket-name/test.yaml')
data = load('s3://bucket-name/test.pkl')
# dump data to a file with a filename (infer format from file extension)
dump(data, 's3://bucket-name/out.pkl')
It is also very convenient to extend the API to support more file formats. All you need to do is write a file handler inherited from BaseFileHandler and register it with one or several file formats.
from mmengine import register_handler, BaseFileHandler
# To register multiple file formats, a list can be used as the argument.
# @register_handler(['txt', 'log'])
@register_handler('txt')
class TxtHandler1(BaseFileHandler):
def load_from_fileobj(self, file):
return file.read()
def dump_to_fileobj(self, obj, file):
file.write(str(obj))
def dump_to_str(self, obj, **kwargs):
return str(obj)
Here is an example of PickleHandler
.
from mmengine import BaseFileHandler
import pickle
class PickleHandler(BaseFileHandler):
def load_from_fileobj(self, file, **kwargs):
return pickle.load(file, **kwargs)
def load_from_path(self, filepath, **kwargs):
return super(PickleHandler, self).load_from_path(
filepath, mode='rb', **kwargs)
def dump_to_str(self, obj, **kwargs):
kwargs.setdefault('protocol', 2)
return pickle.dumps(obj, **kwargs)
def dump_to_fileobj(self, obj, file, **kwargs):
kwargs.setdefault('protocol', 2)
pickle.dump(obj, file, **kwargs)
def dump_to_path(self, obj, filepath, **kwargs):
super(PickleHandler, self).dump_to_path(
obj, filepath, mode='wb', **kwargs)
Load a text file as a list or dict¶
For example a.txt
is a text file with 5 lines.
a
b
c
d
e
Load from disk¶
Use list_from_file
to load the list from a.txt
.
from mmengine import list_from_file
print(list_from_file('a.txt'))
# ['a', 'b', 'c', 'd', 'e']
print(list_from_file('a.txt', offset=2))
# ['c', 'd', 'e']
print(list_from_file('a.txt', max_num=2))
# ['a', 'b']
print(list_from_file('a.txt', prefix='/mnt/'))
# ['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
For example b.txt
is a text file with 3 lines.
1 cat
2 dog cow
3 panda
Then use dict_from_file
to load the dict from b.txt
.
from mmengine import dict_from_file
print(dict_from_file('b.txt'))
# {'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
print(dict_from_file('b.txt', key_type=int))
# {1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
Load from other backends¶
Use list_from_file
to load the list from s3://bucket-name/a.txt
.
from mmengine import list_from_file
print(list_from_file('s3://bucket-name/a.txt'))
# ['a', 'b', 'c', 'd', 'e']
print(list_from_file('s3://bucket-name/a.txt', offset=2))
# ['c', 'd', 'e']
print(list_from_file('s3://bucket-name/a.txt', max_num=2))
# ['a', 'b']
print(list_from_file('s3://bucket-name/a.txt', prefix='/mnt/'))
# ['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']
Use dict_from_file
to load the dict from s3://bucket-name/b.txt
.
from mmengine import dict_from_file
print(dict_from_file('s3://bucket-name/b.txt'))
# {'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
print(dict_from_file('s3://bucket-name/b.txt', key_type=int))
# {1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}
Load and dump checkpoints¶
We can read the checkpoints from disk or internet in the following way.
import torch
filepath1 = '/path/of/your/checkpoint1.pth'
filepath2 = 'http://path/of/your/checkpoint3.pth'
# read filepath1 from disk
checkpoint = torch.load(filepath1)
# save checkpoints to disk
torch.save(checkpoint, filepath1)
# read filepath2 from internet
checkpoint = torch.utils.model_zoo.load_url(filepath2)
In MMEngine, reading and writing checkpoints in different storage forms can be uniformly implemented with load_checkpoint and save_checkpoint.
from mmengine import load_checkpoint, save_checkpoint
filepath1 = '/path/of/your/checkpoint1.pth'
filepath2 = 's3://bucket-name/path/of/your/checkpoint1.pth'
filepath3 = 'http://path/of/your/checkpoint3.pth'
# read checkpoints from disk
checkpoint = load_checkpoint(filepath1)
# save checkpoints to disk
save_checkpoint(checkpoint, filepath1)
# read checkpoints from s3
checkpoint = load_checkpoint(filepath2)
# save checkpoints to s3
save_checkpoint(checkpoint, filepath2)
# read checkpoints from internet
checkpoint = load_checkpoint(filepath3)
Global manager (ManagerMixin)¶
During the training process, it is inevitable that we need to access some variables globally. Here are some examples:
Accessing the logger in model to print some initialization information
Accessing the Visualizer anywhere to visualize the predictions and feature maps.
Accessing the scope in Registry to get the current scope.
In order to unify the mechanism to get the global variable built from different classes, MMEngine designs the ManagerMixin.
Interface introduction¶
get_instance(name=’’, **kwargs): Create or get the instance by name.
get_current_instance(): Get the currently built instance.
instance_name: Get the name of the instance.
How to use¶
Define a class inherited from
ManagerMixin
from mmengine.utils import ManagerMixin
class GlobalClass(ManagerMixin):
def __init__(self, name, value):
super().__init__(name)
self.value = value
Note
Subclasses of ManagerMixin must accept the name argument in __init__. The name argument is used to identify the instance, and you can get the instance by get_instance(name).
Instantiate the instance anywhere. Let's take a hook as an example:
from mmengine import Hook
class CustomHook(Hook):
def before_run(self, runner):
GlobalClass.get_instance('mmengine', value=50)
GlobalClass.get_instance(runner.experiment_name, value=100)
GlobalClass.get_instance({name}) will first check whether an instance with the name {name} has been built. If not, it will build a new instance with that name; otherwise, it will return the existing instance. As the above example shows, when we call GlobalClass.get_instance('mmengine') for the first time, it builds a new instance with the name mmengine. Then, when we call GlobalClass.get_instance(runner.experiment_name), it builds another new instance with a different name.
Here we build two instances for the convenience of the subsequent introduction of get_current_instance.
Accessing the instance anywhere
import torch.nn as nn
class CustomModule(nn.Module):
def forward(self, x):
value = GlobalClass.get_current_instance().value
# Since the name of the latest built instance is
# `runner.experiment_name`, value will be 100.
value = GlobalClass.get_instance('mmengine').value
# The value of instance with the name mmengine is 50.
value = GlobalClass.get_instance('mmengine', 1000).value
# `mmengine` instance has been built, an error will be raised
# if `get_instance` accepts other parameters.
We can get the instance with the specified name by get_instance(name), or get the most recently built instance by get_current_instance, anywhere in the code.
Warning
If the instance with the specified name has already been built, get_instance will raise an error if it is called with constructor parameters.
Use modules from other libraries¶
Based on MMEngine’s Registry and Config, users can build modules across libraries. For example, use MMClassification’s backbones in MMDetection, MMDetection’s data transforms in MMRotate, or MMDetection’s detectors in MMTracking.
Modules registered in the same registry tree can be called across libraries by adding the package name prefix before the module’s type in the config. Here are some common examples:
Use backbone across libraries¶
Taking the example of using MMClassification’s ConvNeXt in MMDetection:
First, add the custom_imports field to the config to register the backbones of MMClassification to the registry.
Second, add the package name of MMClassification, mmcls, to the type of the backbone as a prefix: mmcls.ConvNeXt
# Use custom_imports to register mmcls models to the registry
custom_imports = dict(imports=['mmcls.models'], allow_failed_imports=False)
model = dict(
type='MaskRCNN',
data_preprocessor=dict(...),
backbone=dict(
type='mmcls.ConvNeXt', # Add mmcls prefix to enable cross-library mechanism
arch='tiny',
out_indices=[0, 1, 2, 3],
drop_path_rate=0.4,
layer_scale_init_value=1.0,
gap_before_final_norm=False,
init_cfg=dict(
type='Pretrained',
checkpoint=
'https://download.openmmlab.com/mmclassification/v0/convnext/downstream/convnext-tiny_3rdparty_32xb128-noema_in1k_20220301-795e9634.pth',
prefix='backbone.')),
neck=dict(...),
rpn_head=dict(...))
Use data transform across libraries¶
As with the example of backbone above, cross-library calls can be simply achieved by adding custom_imports and prefix in the config:
# Use custom_imports to register mmdet transforms to the registry
custom_imports = dict(imports=['mmdet.datasets.transforms'], allow_failed_imports=False)
# Add mmdet prefix to enable cross-library mechanism
train_pipeline=[
dict(type='mmdet.LoadImageFromFile'),
dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
dict(type='mmdet.Resize', scale=(1024, 2014), keep_ratio=True),
dict(type='mmdet.RandomFlip', prob=0.5),
dict(type='mmdet.PackDetInputs')
]
Use detector across libraries¶
Using an algorithm from another library is a little more complex. An algorithm contains multiple submodules, and each submodule needs to add the prefix to its type. Take using MMDetection’s YOLOX in MMTracking as an example:
# Use custom_imports to register mmdet models to the registry
custom_imports = dict(imports=['mmdet.models'], allow_failed_imports=False)
model = dict(
type='mmdet.YOLOX',
backbone=dict(type='mmdet.CSPDarknet', deepen_factor=1.33, widen_factor=1.25),
neck=dict(
type='mmdet.YOLOXPAFPN',
in_channels=[320, 640, 1280],
out_channels=320,
num_csp_blocks=4),
bbox_head=dict(
type='mmdet.YOLOXHead', num_classes=1, in_channels=320, feat_channels=320),
train_cfg=dict(assigner=dict(type='mmdet.SimOTAAssigner', center_radius=2.5)))
To avoid adding the prefix to all of the submodules manually, the _scope_ keyword is introduced. When the _scope_ keyword is added to the config of a module, the scope of all its submodules will be changed accordingly. Here is an example config:
# Use custom_imports to register mmdet models to the registry
custom_imports = dict(imports=['mmdet.models'], allow_failed_imports=False)
model = dict(
_scope_='mmdet', # use the _scope_ keyword to avoid adding prefix to all submodules
type='YOLOX',
backbone=dict(type='CSPDarknet', deepen_factor=1.33, widen_factor=1.25),
neck=dict(
type='YOLOXPAFPN',
in_channels=[320, 640, 1280],
out_channels=320,
num_csp_blocks=4),
bbox_head=dict(
type='YOLOXHead', num_classes=1, in_channels=320, feat_channels=320),
train_cfg=dict(assigner=dict(type='SimOTAAssigner', center_radius=2.5)))
These two examples are equivalent to each other.
If you want to know more about the registry and config, please refer to the Config Tutorial and the Registry Tutorial.
Test time augmentation¶
Test time augmentation (TTA) is a data augmentation strategy used during the testing phase. It involves applying various augmentations, such as flipping and scaling, to the same image and then merging the predictions of each augmented image to produce a more accurate prediction. To make TTA easier to use, MMEngine provides the BaseTTAModel class, which allows users to implement different TTA strategies by simply extending it according to their needs.
The core implementation of TTA is usually divided into two parts:
Data augmentation: This part is implemented in MMCV, see the api docs TestTimeAug for more information.
Merge the predictions: The subclasses of BaseTTAModel will merge the predictions of the augmented data in the test_step method to improve the accuracy of predictions.
Get started¶
A simple example of TTA is given in examples/test_time_augmentation.py
Prepare test time augmentation pipeline¶
BaseTTAModel needs to be used together with TestTimeAug implemented in MMCV:
tta_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='TestTimeAug',
transforms=[
[dict(type='Resize', img_scale=(1333, 800), keep_ratio=True)],
[dict(type='RandomFlip', flip_ratio=0.),
dict(type='RandomFlip', flip_ratio=1.)],
[dict(type='PackXXXInputs', keys=['img'])],
])
]
The above data augmentation pipeline first performs a scaling enhancement on the image, followed by 2 flipping enhancements (flipped and not flipped). Finally, the image is packed into the final result using PackXXXInputs.
Define the merge strategy¶
Commonly, users only need to inherit BaseTTAModel and override BaseTTAModel.merge_preds to merge the predictions of the augmented data. merge_preds accepts a list of augmented batch data, where each element of the list contains all augmented results of a single sample in the batch.
The BaseTTAModel class requires inferencing on both flipped and unflipped images and then merges the results. The merge_preds method accepts a list where each element represents the results of applying data augmentation to a single element of the batch. For example, if batch_size is 3, and we flip each image in the batch as an augmentation, merge_preds would accept a parameter like the following:
# `data_{i}_{j}` represents the result of applying the jth data augmentation to
# the ith image in the batch. So, if batch_size is 3, i can take on values of
# 0, 1, and 2. If there are 2 augmentation methods
# (such as flipping the image), then j can take on values of 0 and 1.
# For example, data_2_1 would represent the result of applying the second
# augmentation method (flipping) to the third image in the batch.
demo_results = [
[data_0_0, data_0_1],
[data_1_0, data_1_1],
[data_2_0, data_2_1],
]
The merge_preds method will merge the predictions in demo_results into a single batch of results. For example, if we want to merge multiple classification results:
class AverageClsScoreTTA(BaseTTAModel):
def merge_preds(
self,
data_samples_list: List[List[ClsDataSample]],
) -> List[ClsDataSample]:
merged_data_samples = []
for data_samples in data_samples_list:
merged_data_sample: ClsDataSample = data_samples[0].new()
merged_score = sum(data_sample.pred_label.score
for data_sample in data_samples) / len(data_samples)
merged_data_sample.set_pred_score(merged_score)
merged_data_samples.append(merged_data_sample)
return merged_data_samples
The configuration file for the above example is as follows:
tta_model = dict(type='AverageClsScoreTTA')
Changes to test script¶
cfg.model = ConfigDict(**cfg.tta_model, module=cfg.model)
cfg.test_dataloader.dataset.pipeline = cfg.tta_pipeline
Advanced usage¶
In general, users who inherit the BaseTTAModel
class only need to implement the merge_preds method to perform result fusion. However, for more complex cases, such as fusing the results of a multi-stage detector, it may be necessary to override the test_step method. This requires an understanding of the data flow in the BaseTTAModel class and its relationship with other components.
The relationship between BaseTTAModel and other components¶
The BaseTTAModel class acts as an intermediary between the DDPWrapper and Model classes. When the Runner.test() method is executed, it will first call DDPWrapper.test_step(), followed by TTAModel.test_step(), and finally model.test_step().

The following diagram illustrates this sequence of method calls:

Data flow¶
After data augmentation with TestTimeAug, the resulting data will have the following format:
image1 = dict(
    inputs=[data_1_1, data_1_2],
    data_sample=[data_sample1_1, data_sample1_2])
image2 = dict(
    inputs=[data_2_1, data_2_2],
    data_sample=[data_sample2_1, data_sample2_2])
image3 = dict(
    inputs=[data_3_1, data_3_2],
    data_sample=[data_sample3_1, data_sample3_2])
where data_{i}_{j} means the augmented data, and data_sample_{i}_{j} means the ground truth of the augmented data. Then the data will be processed by the Dataloader, which yields the following format:
data_batch = dict(
    inputs = [
        (data_1_1, data_2_1, data_3_1),
        (data_1_2, data_2_2, data_3_2),
    ],
    data_samples=[
        (data_samples1_1, data_samples2_1, data_samples3_1),
        (data_samples1_2, data_samples2_2, data_samples3_2),
    ]
)
To facilitate model inferencing, the BaseTTAModel
will convert the data into the following format:
data_batch_aug1 = dict(
inputs = (data_1_1, data_2_1, data_3_1),
data_samples=(data_samples1_1, data_samples2_1, data_samples3_1)
)
data_batch_aug2 = dict(
inputs = (data_1_2, data_2_2, data_3_2),
data_samples=(data_samples1_2, data_samples2_2, data_samples3_2)
)
At this point, each data_batch_aug
can be passed directly to the model for inferencing. After the model has performed inferencing, the BaseTTAModel
will reorganize the predictions as follows for the convenience of merging:
preds = [
[data_samples1_1, data_samples_1_2],
[data_samples2_1, data_samples_2_2],
[data_samples3_1, data_samples_3_2],
]
Now that we understand the data flow in TTA, we can override the BaseTTAModel.test_step() method to implement more complex fusion strategies based on specific requirements.
Hook¶
Hook programming is a programming pattern in which a mount point is set in one or more locations of a program. When the program runs to a mount point, all methods registered to it at runtime are automatically called. Hook programming can increase the flexibility and extensibility of the program since users can register custom methods to the mount point to be called without modifying the code in the program.
Examples¶
Here is an example of how it works.
pre_hooks = [(print, 'hello')]
post_hooks = [(print, 'goodbye')]
def main():
for func, arg in pre_hooks:
func(arg)
print('do something here')
for func, arg in post_hooks:
func(arg)
main()
Output of the above example.
hello
do something here
goodbye
As we can see, the main
function calls print
defined in hooks in two locations without making any changes.
Hooks are also used everywhere in PyTorch, for example in the neural network module (nn.Module) to get the forward input and output of the module as well as the backward input and output. For example, the register_forward_hook method registers a forward hook with the module, and the hook can get the forward input and output of the module.
The following is an example of the register_forward_hook
usage.
import torch
import torch.nn as nn
def forward_hook_fn(
module, # object to be registered hooks
input, # forward input of module
output, # forward output of module
):
print(f'"forward_hook_fn" is invoked by {module}')
print('weight:', module.weight.data)
print('bias:', module.bias.data)
print('input:', input)
print('output:', output)
class Model(nn.Module):
def __init__(self):
super().__init__()
self.fc = nn.Linear(3, 1)
def forward(self, x):
y = self.fc(x)
return y
model = Model()
# Register forward_hook_fn to each submodule of model
for module in model.children():
module.register_forward_hook(forward_hook_fn)
x = torch.Tensor([[0.0, 1.0, 2.0]])
y = model(x)
Output of the above example.
"forward_hook_fn" is invoked by Linear(in_features=3, out_features=1, bias=True)
weight: tensor([[-0.4077, 0.0119, -0.3606]])
bias: tensor([-0.2943])
input: (tensor([[0., 1., 2.]]),)
output: tensor([[-1.0036]], grad_fn=<AddmmBackward>)
We can see that the forward_hook_fn
hook registered to the nn.Linear
module is called, and in that hook the weights, biases, module inputs, and outputs of the Linear module are printed. For more information on the use of PyTorch hooks you can read nn.Module.
Design on MMEngine¶
Before introducing the design of the Hook
in MMEngine, let’s briefly introduce the basic steps of model training using PyTorch (copied from PyTorch Tutorials).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader
class CustomDataset(Dataset):
pass
class Net(nn.Module):
pass
def main():
transform = transforms.ToTensor()
train_dataset = CustomDataset(transform=transform, ...)
val_dataset = CustomDataset(transform=transform, ...)
test_dataset = CustomDataset(transform=transform, ...)
train_dataloader = DataLoader(train_dataset, ...)
val_dataloader = DataLoader(val_dataset, ...)
test_dataloader = DataLoader(test_dataset, ...)
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
for i in range(max_epochs):
for inputs, labels in train_dataloader:
optimizer.zero_grad()
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
with torch.no_grad():
for inputs, labels in val_dataloader:
outputs = net(inputs)
loss = criterion(outputs, labels)
with torch.no_grad():
for inputs, labels in test_dataloader:
outputs = net(inputs)
accuracy = ...
The above pseudo-code shows the basic steps to train a model. If we want to add custom operations to the above code, we need to modify and extend the main function continuously. To increase the flexibility and extensibility of the main function, we can insert mount points into it and implement the logic of calling hooks at the corresponding mount points. In this case, we only need to insert hooks into these locations to implement custom logic, such as loading model weights, updating model parameters, etc.
def main():
...
call_hooks('before_run', hooks)
call_hooks('after_load_checkpoint', hooks)
call_hooks('before_train', hooks)
for i in range(max_epochs):
call_hooks('before_train_epoch', hooks)
for inputs, labels in train_dataloader:
call_hooks('before_train_iter', hooks)
outputs = net(inputs)
loss = criterion(outputs, labels)
call_hooks('after_train_iter', hooks)
loss.backward()
optimizer.step()
call_hooks('after_train_epoch', hooks)
call_hooks('before_val_epoch', hooks)
with torch.no_grad():
for inputs, labels in val_dataloader:
call_hooks('before_val_iter', hooks)
outputs = net(inputs)
loss = criterion(outputs, labels)
call_hooks('after_val_iter', hooks)
call_hooks('after_val_epoch', hooks)
call_hooks('before_save_checkpoint', hooks)
call_hooks('after_train', hooks)
call_hooks('before_test_epoch', hooks)
with torch.no_grad():
for inputs, labels in test_dataloader:
call_hooks('before_test_iter', hooks)
outputs = net(inputs)
accuracy = ...
call_hooks('after_test_iter', hooks)
call_hooks('after_test_epoch', hooks)
call_hooks('after_run', hooks)
In MMEngine, we encapsulate the training process into an executor (Runner). The Runner calls hooks at specific mount points to complete the customization logic. For more information about Runner, please read the Runner documentation.
To facilitate management, MMEngine defines mount points as methods and integrates them into the base Hook. We just need to inherit the base hook, implement custom logic at specific locations according to our needs, and then register the hooks to the Runner; they will be called automatically (a short sketch is given after the list below).
There are 22 mount points in the Base Hook.
before_run
after_run
before_train
after_train
before_train_epoch
after_train_epoch
before_train_iter
after_train_iter
before_val
after_val
before_val_epoch
after_val_epoch
before_val_iter
after_val_iter
before_test
after_test
before_test_epoch
after_test_epoch
before_test_iter
after_test_iter
before_save_checkpoint
after_load_checkpoint
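As a hedged sketch (the hook name, interval and log message are made up for illustration), a custom hook can be implemented and registered as follows:
from mmengine.hooks import Hook
from mmengine.registry import HOOKS
@HOOKS.register_module()
class PrintLossHook(Hook):
    """Illustrative hook that logs the training outputs every `interval` iterations."""
    def __init__(self, interval=10):
        self.interval = interval
    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        # `every_n_train_iters` is a helper provided by the base Hook.
        if self.every_n_train_iters(runner, self.interval):
            runner.logger.info(f'iter {runner.iter}: {outputs}')
Such a hook could then be registered to the runner, for example via custom_hooks=[dict(type='PrintLossHook', interval=10)].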
Further readings: Hook tutorial and Hook API documentations
Runner¶
Deep learning algorithms usually share similar pipelines for training, validation and testing.
Therefore, MMEngine designed Runner
to simplify the construction of these pipelines.
In most cases, users can use our default Runner
directly.
If you find it not feasible to implement your ideas, you can also modify it or customize your own runner.
Before introducing the design of Runner
, let’s walk through some examples to better understand why we should use runner.
Below is a few lines of pseudo codes for training models in PyTorch:
model = ResNet()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
train_dataset = ImageNetDataset(...)
train_dataloader = DataLoader(train_dataset, ...)
for i in range(max_epochs):
for data_batch in train_dataloader:
optimizer.zero_grad()
outputs = model(data_batch)
loss = loss_func(outputs, data_batch)
loss.backward()
optimizer.step()
Pseudo codes for model validation in PyTorch:
model = ResNet()
model.load_state_dict(torch.load(CKPT_PATH))
model.eval()
test_dataset = ImageNetDataset(...)
test_dataloader = DataLoader(test_dataset, ...)
for data_batch in test_dataloader:
outputs = model(data_batch)
acc = calculate_acc(outputs, data_batch)
Pseudo codes for model inference in PyTorch:
model = ResNet()
model.load_state_dict(torch.load(CKPT_PATH))
model.eval()
for img in imgs:
prediction = model(img)
The observation from the above three pieces of code is that they are similar.
They can all be divided into a few distinct steps, such as model construction, data loading and loop iterations.
Although the above examples are based on image classification, the same holds for many other tasks as well, including object detection, image segmentation, etc.
Based on this observation, we propose the runner, which structures the training, validation and testing pipeline.
With the runner, the only thing you need to do is to prepare the necessary components of your pipeline (models, data, etc.) and leave the scheduling and execution to the Runner.
You are freed from constructing similar pipelines over and over again.
You are freed from annoying details like the differences between distributed and non-distributed training.
You can focus on your own awesome ideas.
These are all achieved by the runner and various practical modules in MMEngine.
The Runner in MMEngine contains the various modules required for training, testing and validation, as well as loop controllers (Loop) and Hooks, as shown in the figure above.
It provides 3 APIs for users: train, val and test, each corresponding to a specific Loop.
You can use the Runner either by providing a config file or by providing manually constructed modules.
Once activated, the Runner will automatically set up the runtime environment, build/compose your modules, execute the loop iterations in the Loop and call the registered hooks during the iterations.
The execution order of the Runner is as follows:
A feature of the Runner is that it always lazily initializes the modules it manages.
To be specific, the Runner won't build every module on initialization; instead, it builds a module only when it is needed in the current Loop.
Therefore, if you are running only one of the train, val, or test pipelines, you only need to provide the relevant configs/modules.
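For example, a test-only run could be sketched roughly as below; the metric type and the checkpoint path are placeholders, and the model and dataloader are assumed to have been built as in the earlier examples. Since no training loop is executed, no optimizer or training dataloader needs to be provided:
from mmengine.runner import Runner

runner = Runner(
    model=model,                            # an already built BaseModel (assumed)
    work_dir='./work_dir',
    load_from='./work_dir/epoch_5.pth',     # hypothetical checkpoint path
    test_dataloader=val_dataloader,         # an already built DataLoader (assumed)
    test_evaluator=dict(type='Accuracy'),   # placeholder metric type
    test_cfg=dict(type='TestLoop'),
)
runner.test()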
Loop¶
In MMEngine, we abstract the execution process of the task into Loop
, based on the observation that most deep learning tasks can be summarized as a model iterating over datasets.
We provide 4 built-in loops in MMEngine:
EpochBasedTrainLoop
IterBasedTrainLoop
ValLoop
TestLoop
The built-in runner and loops can handle most deep learning tasks, but surely not all. Some tasks need extra modifications and refactoring. Therefore, we make it possible for users to customize their own pipelines for model training, validation and testing.
You can write your own pipeline by subclassing BaseLoop, which needs 2 arguments for initialization: 1) runner, the Runner instance, and 2) dataloader, the dataloader used in this loop.
You are free to add more arguments to your own loop subclass.
After defining your own loop subclass, you should register it to LOOPS (mmengine.registry.LOOPS) and specify it in config files via the type field in train_cfg, val_cfg and test_cfg.
In fact, you can use any execution order and any hook positions in your own loop.
However, built-in hooks may not work if you change the hook positions, which may lead to inconsistent behavior during training.
Therefore, we strongly recommend implementing your subclass with an execution order similar to the one illustrated in the figure above, and with the same hook positions defined in the hook documentation.
from mmengine.registry import LOOPS, HOOKS
from mmengine.runner import BaseLoop
from mmengine.hooks import Hook


# Customized validation loop
@LOOPS.register_module()
class CustomValLoop(BaseLoop):
    def __init__(self, runner, dataloader, evaluator, dataloader2):
        super().__init__(runner, dataloader, evaluator)
        self.dataloader2 = runner.build_dataloader(dataloader2)

    def run(self):
        self.runner.call_hooks('before_val_epoch')
        for idx, data_batch in enumerate(self.dataloader):
            self.runner.call_hooks(
                'before_val_iter', batch_idx=idx, data_batch=data_batch)
            outputs = self.run_iter(idx, data_batch)
            self.runner.call_hooks(
                'after_val_iter', batch_idx=idx, data_batch=data_batch, outputs=outputs)
        metric = self.evaluator.evaluate()

        # add extra loop for validation purpose
        for idx, data_batch in enumerate(self.dataloader2):
            # add new hooks
            self.runner.call_hooks(
                'before_valloader2_iter', batch_idx=idx, data_batch=data_batch)
            self.run_iter(idx, data_batch)
            # add new hooks
            self.runner.call_hooks(
                'after_valloader2_iter', batch_idx=idx, data_batch=data_batch, outputs=outputs)
        metric2 = self.evaluator.evaluate()
        ...
        self.runner.call_hooks('after_val_epoch')


# Define a hook with extra hook positions
@HOOKS.register_module()
class CustomValHook(Hook):
    def before_valloader2_iter(self, batch_idx, data_batch):
        ...

    def after_valloader2_iter(self, batch_idx, data_batch, outputs):
        ...
The example above shows how to implement a different validation loop.
The new loop validates on two different validation datasets.
It also defines new hook positions for the second validation run.
You can easily use it by setting type='CustomValLoop' in val_cfg in your config file.
# Customized validation loop
val_cfg = dict(type='CustomValLoop', dataloader2=dict(dataset=dict(type='ValDataset2'), ...))
# Customized hook with extra hook position
custom_hooks = [dict(type='CustomValHook')]
Customize Runner¶
Moreover, you can write your own runner by subclassing Runner if the built-in Runner does not meet your needs.
The method is similar to writing other modules: write a subclass that inherits from Runner, override some of its functions, register it to RUNNERS and access it by assigning runner_type in your config file.
from mmengine.registry import RUNNERS
from mmengine.runner import Runner


@RUNNERS.register_module()
class CustomRunner(Runner):

    def setup_env(self):
        ...
The example above shows how to implement a customized runner that overrides the setup_env function and is registered to RUNNERS.
Now CustomRunner can be used by setting runner_type='CustomRunner' in your config file.
Further reading: Runner tutorial and Runner API documentation
Evaluation¶
Coming soon. Please refer to the Chinese documentation.
Visualization¶
1 Overall Design¶
Visualization provides an intuitive explanation of the training and testing process of a deep learning model. In OpenMMLab, we expect the visualization module to meet the following requirements:
Provide rich out-of-the-box features that cover most computer vision visualization tasks.
Be versatile, expandable, and easy to customize.
Be able to visualize anywhere in the training and testing process.
Provide unified APIs for all OpenMMLab libraries, which is convenient for users to understand and use.
Based on the above requirements, we proposed the Visualizer and various VisBackend classes such as LocalVisBackend, WandbVisBackend, and TensorboardVisBackend in OpenMMLab 2.0. The visualizer can not only visualize image data, but also things like configurations, scalars, and model structure.
For convenience, the APIs provided by the Visualizer implement the drawing and storage functions. As an internal property of the Visualizer, a VisBackend will be called by the Visualizer to write data to different backends. Considering that you may want to write data to multiple backends after drawing, the Visualizer can be configured with multiple backends. When the user calls a storage API of the Visualizer, it will traverse the specified VisBackend instances and call their corresponding APIs internally.
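A small sketch of this behavior, assuming a local directory ./vis_out for the file-based backends:
from mmengine.visualization import Visualizer

# one Visualizer configured with two backends; a single add_scalar call
# writes the value to both of them
visualizer = Visualizer(
    vis_backends=[dict(type='LocalVisBackend'),
                  dict(type='TensorboardVisBackend')],
    save_dir='./vis_out')
visualizer.add_scalar('loss', 0.9, step=1)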
The UML diagram of the two is as follows.

2 Visualizer¶
The external interface of Visualizer
can be divided into three categories.
Drawing APIs
draw_bboxes draws a single or multiple bounding boxes
draw_points draws a single or multiple points
draw_texts draws a single or multiple text boxes
draw_lines draws a single or multiple line segments
draw_circles draws a single or multiple circles
draw_polygons draws a single or multiple polygons
draw_binary_masks draws single or multiple binary masks
draw_featmap draws feature map (static method)
The above APIs can be called in a chain except for draw_featmap
because the image size may change after this method is called. To avoid confusion, draw_featmap
is a static method.
Storage APIs
add_config writes configuration to a specific storage backend
add_graph writes model graph to a specific storage backend
add_image writes image to a specific storage backend
add_scalar writes scalar to a specific storage backend
add_scalars writes multiple scalars to a specific storage backend at once
add_datasample the abstract interface for each repository to draw its data samples
Interfaces beginning with the add prefix represent storage APIs. [datasample](./data_element.md) is the unified data interface of the downstream repositories in OpenMMLab 2.0, and add_datasample can process the data sample directly.
Other APIs
set_image sets the original image data, the default input image format is RGB
get_image gets the image data in Numpy format after drawing, the default output format is RGB
show for visualization
get_backend gets a specific storage backend by name
close closes all resources, including
VisBackend
For more details, you can refer to Visualizer Tutorial.
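As a quick end-to-end illustration of the APIs above, a hedged sketch with dummy image data and a local save directory:
import numpy as np
from mmengine.visualization import Visualizer

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
visualizer = Visualizer(image=image,
                        vis_backends=[dict(type='LocalVisBackend')],
                        save_dir='./vis_out')
# drawing APIs can be chained because each of them returns the visualizer itself
visualizer.draw_bboxes(np.array([[10, 10, 40, 40]])) \
          .draw_texts('demo', positions=np.array([10, 10]))
# store the drawn result through the configured backends
visualizer.add_image('demo_image', visualizer.get_image(), step=0)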
3 VisBackend¶
After drawing, the drawn data can be stored in multiple visualization storage backends. To unify the interfaces, MMEngine provides an abstract class, BaseVisBackend
, and some commonly used backends such as LocalVisBackend
, WandbVisBackend
, and TensorboardVisBackend
.
The main interfaces and properties of BaseVisBackend
are as follows:
add_config writes configuration to a specific storage backend
add_graph writes model graph to a specific backend
add_image writes image to a specific backend
add_scalar writes scalar to a specific backend
add_scalars writes multiple scalars to a specific backend at once
close closes the resource that has been opened
experiment gets the backend object, such as the WandB object or the Tensorboard object
BaseVisBackend defines five common data writing interfaces. Some backends are very powerful, such as WandB, which can also write tables and videos. Users can directly obtain the experiment object for such needs and then call the native APIs of the corresponding backend. LocalVisBackend, WandbVisBackend, and TensorboardVisBackend all inherit from BaseVisBackend and implement the corresponding storage functions according to their features. Users can also subclass BaseVisBackend to add new storage backends and implement custom storage requirements.
For more details, you can refer to Storage Backend Tutorial.
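As a rough sketch of such a customization (the backend name is made up and only add_scalar is implemented; a real backend would usually implement the other write interfaces as well):
from mmengine.registry import VISBACKENDS
from mmengine.visualization import BaseVisBackend


@VISBACKENDS.register_module()
class PrintVisBackend(BaseVisBackend):
    """A toy backend that simply prints scalars to stdout."""

    @property
    def experiment(self):
        # there is no underlying experiment object, so return the backend itself
        return self

    def add_scalar(self, name, value, step=0, **kwargs):
        print(f'[{step}] {name}: {value}')
It can then be used like any other backend, e.g. by passing vis_backends=[dict(type='PrintVisBackend')] to the Visualizer.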
Logging¶
Overview¶
The Runner produces a large number of logs during execution, including dataset information, model initialization, learning rates, losses, etc. In order to make these logs easily accessible to users, MMEngine designs the MessageHub, HistoryBuffer, LogProcessor and MMLogger, which enable:
Configure statistical methods in config files. For example, losses can be globally averaged or smoothed by a sliding window.
Query training states (iterations, epochs, etc.) in any module
Configure whether to save logs from all processes during distributed training.
Each scalar (losses, learning rates, etc.) during training is encapsulated by a HistoryBuffer, managed by the MessageHub in key-value pairs, formatted by the LogProcessor and then exported to various visualization backends by the LoggerHook. In most cases, the statistical methods of these scalars can be configured through the LogProcessor without understanding the data flow. Before diving into the design of the logging system, please read through the logging tutorial first to familiarize yourself with the basic use cases.
HistoryBuffer¶
HistoryBuffer records the history of the corresponding scalar, such as losses, learning rates, and iteration time, in an array. As an internal class, it works with MessageHub, LoggerHook and LogProcessor to make training logs configurable. Meanwhile, HistoryBuffer can also be used alone, which enables users to manage their training logs and compute various statistics in an easy manner.
We will first introduce the usage of HistoryBuffer in the following section. The association between HistoryBuffer and MessageHub will be introduced later in the MessageHub section.
HistoryBuffer Initialization¶
HistoryBuffer accepts log_history, count_history and max_length for initialization.
log_history records the history of the scalar. For example, if the loss in the previous 3 iterations is 0.3, 0.2 and 0.1 respectively, then log_history=[0.3, 0.2, 0.1].
count_history controls the statistical granularity and is used when computing the average. Take the above example: if we average the loss across iterations, we have count_history=[1, 1, 1]. Instead, if we average the loss across images with batch_size=8, then we have count_history=[8, 8, 8].
max_length controls the maximum length of the history. If the length of log_history and count_history exceeds max_length, the earliest elements will be removed.
Besides, we can access the history of the data through history_buffer.data
.
from mmengine.logging import HistoryBuffer
history_buffer = HistoryBuffer() # Default initialization
log_history, count_history = history_buffer.data
# [] []
history_buffer = HistoryBuffer([1, 2, 3], [1, 2, 3]) # Init with lists
log_history, count_history = history_buffer.data
# [1 2 3] [1 2 3]
history_buffer = HistoryBuffer([1, 2, 3], [1, 2, 3], max_length=2)
# The length of history buffer(3) exceeds the max_length(2), the first few elements will be ignored.
log_history, count_history = history_buffer.data
# [2 3] [2 3]
HistoryBuffer Update¶
We can update the log_history
and count_history
through HistoryBuffer.update(log_history, count_history)
.
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.update(4) # count default to 1
log_history, count_history = history_buffer.data
# [1, 2, 3, 4] [1, 1, 1, 1]
history_buffer.update(5, 2)
log_history, count_history = history_buffer.data
# [1, 2, 3, 4, 5] [1, 1, 1, 1, 2]
Basic Statistical Methods¶
HistoryBuffer provides some basic statistical methods:
current(): Get the latest data.
mean(window_size=None): Get the mean value of the previous window_size data. Defaults to None, which means the global mean.
max(window_size=None): Get the max value of the previous window_size data. Defaults to None, which means the global maximum.
min(window_size=None): Get the min value of the previous window_size data. Defaults to None, which means the global minimum.
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.min(2)
# 2, the minimum in [2, 3]
history_buffer.min()
# 1, the global minimum
history_buffer.max(2)
# 3, the maximum in [2, 3]
history_buffer.max()
# 3, the global maximum
history_buffer.mean(2)
# 2.5, the mean value in [2, 3], (2 + 3) / (1 + 1)
history_buffer.mean()
# 2, the global mean, (1 + 2 + 3) / (1 + 1 + 1)
history_buffer = HistoryBuffer([1, 2, 3], [2, 2, 2]) # Cases when counts are not 1
history_buffer.mean()
# 1, (1 + 2 + 3) / (2 + 2 + 2)
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.update(4, 1)
history_buffer.current()
# 4
Statistical Methods Invoking¶
Statistical methods can be accessed through HistoryBuffer.statistics
with method name and arguments. The name
parameter should be a registered method name (i.e. built-in methods like min
and max
), while arguments should be the corresponding method’s arguments.
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.statistics('mean')
# 2, as global mean
history_buffer.statistics('mean', 2)
# 2.5, as the mean of [2, 3]
history_buffer.statistics('mean', 2, 3)
# Error! mismatch arguments given to `mean(window_size)`
history_buffer.statistics('data')
# Error! `data` method not registered
Statistical Methods Registration¶
Custom statistical methods can be registered through @HistoryBuffer.register_statistics
.
from mmengine.logging import HistoryBuffer
import numpy as np
@HistoryBuffer.register_statistics
def weighted_mean(self, window_size, weight):
    assert len(weight) == window_size
    return (self._log_history[-window_size:] * np.array(weight)).sum() / \
            self._count_history[-window_size:]
history_buffer = HistoryBuffer([1, 2], [1, 1])
history_buffer.statistics('weighted_mean', 2, [2, 1]) # get (2 * 1 + 1 * 2) / (1 + 1)
Use Cases¶
logs = dict(lr=HistoryBuffer(), loss=HistoryBuffer())  # different keys for different logs
max_iter = 10
log_interval = 5
for iter in range(1, max_iter+1):
    lr = iter / max_iter * 0.1  # linear scaling of lr
    loss = 1 / iter  # loss
    logs['lr'].update(lr, 1)
    logs['loss'].update(loss, 1)
    if iter % log_interval == 0:
        latest_lr = logs['lr'].statistics('current')  # select statistical methods by name
        mean_loss = logs['loss'].statistics('mean', log_interval)  # mean loss of the latest `log_interval` iterations
        print(f'lr: {latest_lr}\n'
              f'loss: {mean_loss}')
# lr: 0.05
# loss: 0.45666666666666667
# lr: 0.1
# loss: 0.12912698412698415
MessageHub¶
As shown above, HistoryBuffer can easily handle the update and statistics of a single variable. However, there are multiple variables to log during training, each potentially coming from a different module. This makes it an issue to collect and distribute different variables. To address this issue, we provide MessageHub in MMEngine. It is derived from ManagerMixin and thus can be accessed globally. It can be used to simplify the sharing of data across modules.
MessageHub stores data into 2 internal dictionaries, each has its own definition:
log_scalars: Scalars such as losses, learning rates and iteration time are collected from different modules and stored in HistoryBuffers under the corresponding keys in this dict. Values in this dict will be formatted by the LogProcessor and then output to the terminal or saved locally. If you want to customize your logging info, you can add new keys to this dict and update them in the subsequent training steps.
runtime_info: Runtime information such as epochs and iterations is stored in this dict. This dict makes it easy to share necessary information across modules.
Note
You may need to use MessageHub only if you want to add extra data to logs or share custom data across modules.
The following examples show the usage of MessageHub, including scalars update, data sharing and log customization.
Update & get training log¶
HistoryBuffers are stored in MessageHub’s log_scalars
dictionary as values. You can call update_scalars
method to update the HistoryBuffer with the given key. On first call with an unseen key, a HistoryBuffer will be initialized. In the subsequent calls with the same key, the corresponding HistoryBuffer’s update
method will be invoked. You can get values or statistics of a HistoryBuffer by specifying a key in get_scalar
method. You can also get full logs by directly accessing the log_scalars
attribute of a MessageHub.
from mmengine import MessageHub
message_hub = MessageHub.get_instance('task')
message_hub.update_scalar('train/loss', 1, 1)
message_hub.get_scalar('train/loss').current() # 1, the latest updated train/loss
message_hub.update_scalar('train/loss', 3, 1)
message_hub.get_scalar('train/loss').mean() # 2, the mean calculated as (1 + 3) / (1 + 1)
message_hub.update_scalar('train/lr', 0.1, 1)
message_hub.update_scalars({'train/time': {'value': 0.1, 'count': 1},
'train/data_time': {'value': 0.1, 'count': 1}})
train_time = message_hub.get_scalar('train/time') # 1
log_dict = message_hub.log_scalars # return the whole dict
lr_buffer, loss_buffer, time_buffer, data_time_buffer = (
log_dict['train/lr'], log_dict['train/loss'], log_dict['train/time'],
log_dict['train/data_time'])
Note
Losses, learning rates and iteration time are automatically updated by runner and hooks. You are not supposed to manually update them.
Note
MessageHub has no special requirements for keys in log_scalars. However, MMEngine will only export a scalar to the logs if its key is prefixed with 'train/', 'val/' or 'test/'.
Update & get runtime info¶
Runtime information is stored in runtime_info
dict. The dict accepts data in any data types. Different from HistoryBuffer, the value will be overwritten on every update.
message_hub = MessageHub.get_instance('task')
message_hub.update_info('iter', 1)
message_hub.get_info('iter') # 1
message_hub.update_info('iter', 2)
message_hub.get_info('iter') # 2, overwritten by the above command
Add custom logs¶
Users can update scalars in MessageHub anywhere in any module. All data in log_scalars with valid keys is exported to the user-defined backends after the configured statistical methods have been applied.
Note
Only data in log_scalars whose keys are prefixed with 'train/', 'val/' or 'test/' is exported.
class CustomModule:

    def __init__(self):
        self.message_hub = MessageHub.get_current_instance()

    def custom_method(self):
        self.message_hub.update_scalar('train/a', 100)
        self.message_hub.update_scalars({'train/b': 1, 'train/c': 2})
By default, the latest value of the custom data (a, b and c) is exported. Users can also configure the LogProcessor to apply other statistical methods.
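For instance, a hedged sketch that exports a windowed mean of the custom scalar a instead of its latest value (note that the 'train/' prefix is dropped in data_src, as explained in the next section):
log_processor = dict(
    custom_cfg=[
        dict(data_src='a',        # the scalar updated as 'train/a' above
             method_name='mean',
             window_size=10)])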
LogProcessor¶
Users can configure the LogProcessor to specify the statistical methods and extra arguments. By default, the latest learning rate is displayed, while losses and iteration time are smoothed over a window of iterations.
Minimum example¶
log_processor = dict(
window_size=10
)
In this configuration, losses and iteration time will be averaged over the latest 10 iterations. The output might be:
04/15 12:34:24 - mmengine - INFO - Iter [10/12] , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.13
Custom statistical methods¶
Users can configure the custom_cfg
list to specify the statistical method. Each element in custom_cfg
must be a dict consisting of the following keys:
data_src: Required argument representing the data source of the log. A data source may have multiple statistical methods. Default sources, which are automatically added to logs, include all keys in the loss dict (i.e. loss), the learning rate (lr) and the iteration time (time & data_time). Besides, all scalars updated by MessageHub's update_scalar/update_scalars methods with valid keys are configurable data sources, but be aware that the prefix ('train/', 'val/', 'test/') should be removed.
method_name: Required argument representing the statistical method. It supports both built-in methods and custom methods.
log_name: Optional argument representing the output name after statistics. If not specified, the new log will overwrite the old one.
Other arguments: Extra arguments needed by your specified method. window_size is a special key, which can be either an int, 'epoch' or 'global'. LogProcessor will parse these arguments and return the statistical result based on iteration-, epoch- or global-level smoothing.
Overwrite the old statistical method
log_processor = dict(
window_size=10,
by_epoch=True,
custom_cfg=[
dict(data_src='loss',
method_name='mean',
window_size=100)])
In this configuration, LogProcessor will overwrite the default window size of 10 with a larger window size of 100 and output the mean value to the 'loss' field in the logs.
04/15 12:34:24 - mmengine - INFO - Iter [10/12] , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.11
New statistical method without overwriting
log_processor = dict(
window_size=10,
by_epoch=True,
custom_cfg=[
dict(data_src='loss',
log_name='loss_min',
method_name='min',
window_size=100)])
04/15 12:34:24 - mmengine - INFO - Iter [10/12] , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.11, loss_min: 0.08
MMLogger¶
In order to export logs with a clear hierarchy, unified formats and less interference from third-party logging systems, MMEngine implements an MMLogger class based on logging. It is derived from ManagerMixin. Compared with logging.Logger, it enables accessing the logger of the current runner without knowing the logger name.
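For example, once the runner has created its logger, any module can fetch it without knowing its name; a brief sketch:
from mmengine.logging import MMLogger

logger = MMLogger.get_current_instance()
logger.info('accessed from an arbitrary module')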
Instantiate MMLogger¶
Users can create a global logger by calling get_instance. The default log format is shown below:
logger = MMLogger.get_instance('mmengine', log_level='INFO')
logger.info("this is a test")
# 04/15 14:01:11 - mmengine - INFO - this is a test
Apart from user-defined messages, the logger also exports timestamps, the logger name and the log level. ERROR messages are treated specially: they are highlighted in red and carry extra information such as the error location.
logger = MMLogger.get_instance('mmengine', log_level='INFO')
logger.error('division by zero')
# 04/15 14:01:56 - mmengine - ERROR - /mnt/d/PythonCode/DeepLearning/OpenMMLab/mmengine/a.py - <module> - 4 - division by zero
Export logs¶
When get_instance
is invoked with log_file argument, logs will be additionally exported to local storage in text format.
logger = MMLogger.get_instance('mmengine', log_file='tmp.log', log_level='INFO')
logger.info("this is a test")
# 04/15 14:01:11 - mmengine - INFO - this is a test
tmp/tmp.log
:
04/15 14:01:11 - mmengine - INFO - this is a test
Since distributed applications create multiple log files, the exported logs are saved in a directory named after the log file. Logs from different processes are all saved in this directory. Therefore, the actual log file path in the above example is tmp/tmp.log.
Export logs in distributed training¶
When training with PyTorch's distributed methods, users can set distributed=True in the config file to export logs from all processes. If not specified, only the master process exports a log file.
logger = MMLogger.get_instance('mmengine', log_file='tmp.log', distributed=True, log_level='INFO')
In the case of multiple processes in a single node, or multiple processes in multiple nodes with shared storage, the exported log files have the following hierarchy
# shared storage case
./tmp
├── tmp.log
├── tmp_rank1.log
├── tmp_rank2.log
├── tmp_rank3.log
├── tmp_rank4.log
├── tmp_rank5.log
├── tmp_rank6.log
└── tmp_rank7.log
...
└── tmp_rank63.log
In the case of multiple processes in multiple nodes without shared storage, logs are organized as follows
# without shared storage
# node 0:
work_dir/
└── exp_name_logs
├── exp_name.log
├── exp_name_rank1.log
├── exp_name_rank2.log
├── exp_name_rank3.log
...
└── exp_name_rank7.log
# node 7:
work_dir/
└── exp_name_logs
├── exp_name_rank56.log
├── exp_name_rank57.log
├── exp_name_rank58.log
...
└── exp_name_rank63.log
Migrate Runner from MMCV to MMEngine¶
Introduction¶
As MMCV supports more and more deep learning tasks and users' needs become much more complicated, we have higher requirements for the flexibility and versatility of the existing Runner in MMCV. Therefore, MMEngine implements a more general and flexible Runner based on MMCV to support more complicated training processes.
The Runner in MMEngine expands the scope and takes on more functions. We abstracted the training loop controllers (EpochBasedTrainLoop/IterBasedTrainLoop), the validation loop controller (ValLoop) and the testing loop controller (TestLoop) to make it more convenient for users to customize their training process.
First, we will introduce how to migrate the entry point of training from MMCV to MMEngine to simplify and unify the training script. Then, we'll introduce the differences in the instantiation of the Runner between MMCV and MMEngine in detail.
Migrate the entry point¶
Take MMDet as an example, the differences between training scripts in MMCV and MMEngine are as follows:
Migrate the configuration file¶
Configuration file based on MMCV Runner | Configuration file based on MMEngine Runner |
---|---|
# default_runtime.py
checkpoint_config = dict(interval=1)
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)
|
# default_runtime.py
default_scope = 'mmdet'
default_hooks = dict(
timer=dict(type='IterTimerHook'),
logger=dict(type='LoggerHook', interval=50),
param_scheduler=dict(type='ParamSchedulerHook'),
checkpoint=dict(type='CheckpointHook', interval=1),
sampler_seed=dict(type='DistSamplerSeedHook'),
visualization=dict(type='DetVisualizationHook'))
env_cfg = dict(
cudnn_benchmark=False,
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
dist_cfg=dict(backend='nccl'),
)
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
type='DetLocalVisualizer', vis_backends=vis_backends, name='visualizer')
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=True)
log_level = 'INFO'
load_from = None
resume = False
|
# scheduler.py
# optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.001,
step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
|
# scheduler.py
# training schedule for 1x
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
# learning rate
param_scheduler = [
dict(
type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
dict(
type='MultiStepLR',
begin=0,
end=12,
by_epoch=True,
milestones=[8, 11],
gamma=0.1)
]
# optimizer
optim_wrapper = dict(
type='OptimWrapper',
optimizer=dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001))
# Default setting for scaling LR automatically
# - `enable` means enable scaling LR automatically
# or not by default.
# - `base_batch_size` = (8 GPUs) x (2 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=16)
|
# coco_detection.py
# dataset settings
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
data = dict(
samples_per_gpu=2,
workers_per_gpu=2,
train=dict(
type=dataset_type,
ann_file=data_root + 'annotations/instances_train2017.json',
img_prefix=data_root + 'train2017/',
pipeline=train_pipeline),
val=dict(
type=dataset_type,
ann_file=data_root + 'annotations/instances_val2017.json',
img_prefix=data_root + 'val2017/',
pipeline=test_pipeline),
test=dict(
type=dataset_type,
ann_file=data_root + 'annotations/instances_val2017.json',
img_prefix=data_root + 'val2017/',
pipeline=test_pipeline))
evaluation = dict(interval=1, metric='bbox')
|
# coco_detection.py
# dataset settings
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
file_client_args = dict(backend='disk')
train_pipeline = [
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', prob=0.5),
dict(type='PackDetInputs')
]
test_pipeline = [
dict(type='LoadImageFromFile', file_client_args=file_client_args),
dict(type='Resize', scale=(1333, 800), keep_ratio=True),
# If you don't have a gt annotation, delete the pipeline
dict(type='LoadAnnotations', with_bbox=True),
dict(
type='PackDetInputs',
meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
'scale_factor'))
]
train_dataloader = dict(
batch_size=2,
num_workers=2,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
batch_sampler=dict(type='AspectRatioBatchSampler'),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='annotations/instances_train2017.json',
data_prefix=dict(img='train2017/'),
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=train_pipeline))
val_dataloader = dict(
batch_size=1,
num_workers=2,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='annotations/instances_val2017.json',
data_prefix=dict(img='val2017/'),
test_mode=True,
pipeline=test_pipeline))
test_dataloader = val_dataloader
val_evaluator = dict(
type='CocoMetric',
ann_file=data_root + 'annotations/instances_val2017.json',
metric='bbox',
format_only=False)
test_evaluator = val_evaluator
|
The Runner in MMEngine provides more customizable components, including the training/validation/testing processes and DataLoaders. Therefore, the configuration file is a bit longer compared to MMCV.
MMEngine follows the WYSIWYG principle and reorganizes the hierarchy of each component in the configuration, so that most first-level fields of the configuration correspond to core components in the Runner, such as the DataLoader, Evaluator and Hooks. The new configuration format helps users read and understand the core components in the Runner and ignore the relatively unimportant parts.
Migrate the training script¶
Compared with the Runner in MMCV, the Runner in MMEngine takes on more functions, such as building the DataLoader and the distributed model. Therefore, we no longer need to build components like the DataLoader and the distributed model manually. We can configure them during the instantiation of the Runner, and they will be built during the training/validation/testing process. Take the training script of MMDet as an example:
Training script based on MMCV Runner | Training script based on MMEngine Runner |
---|---|
# tools/train.py
args = parse_args()
cfg = Config.fromfile(args.config)
# replace the ${key} with the value of cfg.key
cfg = replace_cfg_vals(cfg)
# update data root according to MMDET_DATASETS
update_data_root(cfg)
if args.cfg_options is not None:
cfg.merge_from_dict(args.cfg_options)
if args.auto_scale_lr:
if 'auto_scale_lr' in cfg and \
'enable' in cfg.auto_scale_lr and \
'base_batch_size' in cfg.auto_scale_lr:
cfg.auto_scale_lr.enable = True
else:
warnings.warn('Can not find "auto_scale_lr" or '
'"auto_scale_lr.enable" or '
'"auto_scale_lr.base_batch_size" in your'
' configuration file. Please update all the '
'configuration files to mmdet >= 2.24.1.')
# set multi-process settings
setup_multi_processes(cfg)
# set cudnn_benchmark
if cfg.get('cudnn_benchmark', False):
torch.backends.cudnn.benchmark = True
# work_dir is determined in this priority: CLI > segment in file > filename
if args.work_dir is not None:
# update configs according to CLI args if args.work_dir is not None
cfg.work_dir = args.work_dir
elif cfg.get('work_dir', None) is None:
# use config filename as default work_dir if cfg.work_dir is None
cfg.work_dir = osp.join('./work_dirs',
osp.splitext(osp.basename(args.config))[0])
if args.resume_from is not None:
cfg.resume_from = args.resume_from
cfg.auto_resume = args.auto_resume
if args.gpus is not None:
cfg.gpu_ids = range(1)
warnings.warn('`--gpus` is deprecated because we only support '
'single GPU mode in non-distributed training. '
'Use `gpus=1` now.')
if args.gpu_ids is not None:
cfg.gpu_ids = args.gpu_ids[0:1]
warnings.warn('`--gpu-ids` is deprecated, please use `--gpu-id`. '
'Because we only support single GPU mode in '
'non-distributed training. Use the first GPU '
'in `gpu_ids` now.')
if args.gpus is None and args.gpu_ids is None:
cfg.gpu_ids = [args.gpu_id]
# init distributed env first, since logger depends on the dist info.
if args.launcher == 'none':
distributed = False
else:
distributed = True
init_dist(args.launcher, **cfg.dist_params)
# re-set gpu_ids with distributed training mode
_, world_size = get_dist_info()
cfg.gpu_ids = range(world_size)
# create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
# dump config
cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
# init the logger before other steps
timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime())
log_file = osp.join(cfg.work_dir, f'{timestamp}.log')
logger = get_root_logger(log_file=log_file, log_level=cfg.log_level)
# init the meta dict to record some important information such as
# environment info and seed, which will be logged
meta = dict()
# log env info
env_info_dict = collect_env()
env_info = '\n'.join([(f'{k}: {v}') for k, v in env_info_dict.items()])
dash_line = '-' * 60 + '\n'
logger.info('Environment info:\n' + dash_line + env_info + '\n' +
dash_line)
meta['env_info'] = env_info
meta['config'] = cfg.pretty_text
# log some basic info
logger.info(f'Distributed training: {distributed}')
logger.info(f'Config:\n{cfg.pretty_text}')
cfg.device = get_device()
# set random seeds
seed = init_random_seed(args.seed, device=cfg.device)
seed = seed + dist.get_rank() if args.diff_seed else seed
logger.info(f'Set random seed to {seed}, '
f'deterministic: {args.deterministic}')
set_random_seed(seed, deterministic=args.deterministic)
cfg.seed = seed
meta['seed'] = seed
meta['exp_name'] = osp.basename(args.config)
model = build_detector(
cfg.model,
train_cfg=cfg.get('train_cfg'),
test_cfg=cfg.get('test_cfg'))
model.init_weights()
datasets = []
train_detector(
model,
datasets,
cfg,
distributed=distributed,
validate=(not args.no_validate),
timestamp=timestamp,
meta=meta)
|
# tools/train.py
args = parse_args()
# register all modules in mmdet into the registries
# do not init the default scope here because it will be init in the runner
register_all_modules(init_default_scope=False)
# load config
cfg = Config.fromfile(args.config)
cfg.launcher = args.launcher
if args.cfg_options is not None:
cfg.merge_from_dict(args.cfg_options)
# work_dir is determined in this priority: CLI > segment in file > filename
if args.work_dir is not None:
# update configs according to CLI args if args.work_dir is not None
cfg.work_dir = args.work_dir
elif cfg.get('work_dir', None) is None:
# use config filename as default work_dir if cfg.work_dir is None
cfg.work_dir = osp.join('./work_dirs',
osp.splitext(osp.basename(args.config))[0])
# enable automatic-mixed-precision training
if args.amp is True:
optim_wrapper = cfg.optim_wrapper.type
if optim_wrapper == 'AmpOptimWrapper':
print_log(
'AMP training is already enabled in your config.',
logger='current',
level=logging.WARNING)
else:
assert optim_wrapper == 'OptimWrapper', (
'`--amp` is only supported when the optimizer wrapper type is '
f'`OptimWrapper` but got {optim_wrapper}.')
cfg.optim_wrapper.type = 'AmpOptimWrapper'
cfg.optim_wrapper.loss_scale = 'dynamic'
# enable automatically scaling LR
if args.auto_scale_lr:
if 'auto_scale_lr' in cfg and \
'enable' in cfg.auto_scale_lr and \
'base_batch_size' in cfg.auto_scale_lr:
cfg.auto_scale_lr.enable = True
else:
raise RuntimeError('Can not find "auto_scale_lr" or '
'"auto_scale_lr.enable" or '
'"auto_scale_lr.base_batch_size" in your'
' configuration file.')
cfg.resume = args.resume
# build the runner from config
if 'runner_type' not in cfg:
# build the default runner
runner = Runner.from_cfg(cfg)
else:
# build customized runner from the registry
# if 'runner_type' is set in the cfg
runner = RUNNERS.build(cfg)
# start training
runner.train()
|
# apis/train.py
def init_random_seed(...):
...
def set_random_seed(...):
...
# define function tools.
...
def train_detector(model,
dataset,
cfg,
distributed=False,
validate=False,
timestamp=None,
meta=None):
cfg = compat_cfg(cfg)
logger = get_root_logger(log_level=cfg.log_level)
# put model on gpus
if distributed:
find_unused_parameters = cfg.get('find_unused_parameters', False)
# Sets the `find_unused_parameters` parameter in
# torch.nn.parallel.DistributedDataParallel
model = build_ddp(
model,
cfg.device,
device_ids=[int(os.environ['LOCAL_RANK'])],
broadcast_buffers=False,
find_unused_parameters=find_unused_parameters)
else:
model = build_dp(model, cfg.device, device_ids=cfg.gpu_ids)
# build optimizer
auto_scale_lr(cfg, distributed, logger)
optimizer = build_optimizer(model, cfg.optimizer)
runner = build_runner(
cfg.runner,
default_args=dict(
model=model,
optimizer=optimizer,
work_dir=cfg.work_dir,
logger=logger,
meta=meta))
# an ugly workaround to make .log and .log.json filenames the same
runner.timestamp = timestamp
# fp16 setting
fp16_cfg = cfg.get('fp16', None)
if fp16_cfg is not None:
optimizer_config = Fp16OptimizerHook(
**cfg.optimizer_config, **fp16_cfg, distributed=distributed)
elif distributed and 'type' not in cfg.optimizer_config:
optimizer_config = OptimizerHook(**cfg.optimizer_config)
else:
optimizer_config = cfg.optimizer_config
# register hooks
runner.register_training_hooks(
cfg.lr_config,
optimizer_config,
cfg.checkpoint_config,
cfg.log_config,
cfg.get('momentum_config', None),
custom_hooks_config=cfg.get('custom_hooks', None))
if distributed:
if isinstance(runner, EpochBasedRunner):
runner.register_hook(DistSamplerSeedHook())
# register eval hooks
if validate:
val_dataloader_default_args = dict(
samples_per_gpu=1,
workers_per_gpu=2,
dist=distributed,
shuffle=False,
persistent_workers=False)
val_dataloader_args = {
**val_dataloader_default_args,
**cfg.data.get('val_dataloader', {})
}
# Support batch_size > 1 in validation
if val_dataloader_args['samples_per_gpu'] > 1:
# Replace 'ImageToTensor' to 'DefaultFormatBundle'
cfg.data.val.pipeline = replace_ImageToTensor(
cfg.data.val.pipeline)
val_dataset = build_dataset(cfg.data.val, dict(test_mode=True))
val_dataloader = build_dataloader(val_dataset, **val_dataloader_args)
eval_cfg = cfg.get('evaluation', {})
eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'
eval_hook = DistEvalHook if distributed else EvalHook
# In this PR (https://github.com/open-mmlab/mmcv/pull/1193), the
# priority of IterTimerHook has been modified from 'NORMAL' to 'LOW'.
runner.register_hook(
eval_hook(val_dataloader, **eval_cfg), priority='LOW')
resume_from = None
if cfg.resume_from is None and cfg.get('auto_resume'):
resume_from = find_latest_checkpoint(cfg.work_dir)
if resume_from is not None:
cfg.resume_from = resume_from
if cfg.resume_from:
runner.resume(cfg.resume_from)
elif cfg.load_from:
runner.load_checkpoint(cfg.load_from)
runner.run(data_loaders, cfg.workflow)
|
# `apis/train.py` is removed in `mmengine`
|
The table above shows the differences between the training scripts of the MMEngine Runner and the MMCV Runner. Repositories in OpenMMLab 1.x organize their own processes to build the Runner, which leads to a large amount of redundant code. MMEngine unifies and standardizes the building process, such as setting the random seed, initializing the distributed environment, building the DataLoader and building the Optimizer. This helps the downstream repositories simplify the preparation of the runner; they only need to configure the parameters of the Runner.
For the downstream repositories, a training script based on the MMEngine Runner not only simplifies tools/train.py, but also makes apis/train.py unnecessary. Similarly, we can set the random seed and initialize the distributed environment by configuring the parameters of the Runner, without implementing the corresponding code ourselves.
Migrate Runner¶
This section describes the differences in the training, validation, and testing processes between the MMCV Runner and the MMEngine Runner.
The following tutorial will describe these differences in detail.
Prepare logger¶
Prepare logger in MMCV
MMCV needs to call the get_logger
to get a formatted logger and use it to output and log the training information.
logger = get_logger(name='custom', log_file=log_file, log_level=cfg.log_level)
env_info_dict = collect_env()
env_info = '\n'.join([(f'{k}: {v}') for k, v in env_info_dict.items()])
dash_line = '-' * 60 + '\n'
logger.info('Environment info:\n' + dash_line + env_info + '\n' +
dash_line)
The instantiation of the Runner also relies on the logger:
runner = Runner(
...
logger=logger
...)
Prepare logger in MMEngine
Configure the log_level
for Runner
, and it will build the logger automatically.
log_level = 'INFO'
Set random seed¶
Set random seed in MMCV
Set random seed manually in training script:
...
seed = init_random_seed(args.seed, device=cfg.device)
seed = seed + dist.get_rank() if args.diff_seed else seed
logger.info(f'Set random seed to {seed}, '
f'deterministic: {args.deterministic}')
set_random_seed(seed, deterministic=args.deterministic)
...
Set random seed in MMEngine
Configure the randomness
for Runner
, see more information in Runner.set_randomness
Configuration changes
Configuration of MMCV | Configuration of MMEngine |
---|---|
seed = 1
deterministic=False
diff_seed=False
|
randomness=dict(seed=1,
deterministic=True,
diff_rank_seed=False)
|
Initialize environment variables¶
Initialize the environment variables
MMCV needs to set up the launcher for distributed training, set environment variables for multi-process communication, initialize the distributed environment and wrap the model with the distributed wrapper, like this:
...
setup_multi_processes(cfg)
init_dist(cfg.launcher, **cfg.dist_params)
model = MMDistributedDataParallel(
model,
device_ids=[int(os.environ['LOCAL_RANK'])],
broadcast_buffers=False,
find_unused_parameters=find_unused_parameters)
As for MMEngine, you can set up the launcher by configuring the launcher argument of the Runner, and configure the other items mentioned above in env_cfg. See more information in the table below:
Configuration changes
MMCV configuration | MMEngine configuration |
---|---|
launcher = 'pytorch' # enable distributed training
dist_params = dict(backend='nccl') # choose communication backend
|
launcher = 'pytorch'
env_cfg = dict(dist_cfg=dict(backend='nccl'))
|
In this tutorial, we set env_cfg
to:
env_cfg = dict(dist_cfg=dict(backend='nccl'))
Prepare data¶
Both the MMEngine and the MMCV Runner can accept an already built DataLoader:
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = CIFAR10(
root='data', train=True, download=True, transform=transform)
train_dataloader = DataLoader(
train_dataset, batch_size=128, shuffle=True, num_workers=2)
val_dataset = CIFAR10(
root='data', train=False, download=True, transform=transform)
val_dataloader = DataLoader(
val_dataset, batch_size=128, shuffle=False, num_workers=2)
Configuration changes
Configuration of MMCV | Configuration of MMEngine |
---|---|
data = dict(
samples_per_gpu=2, # batch_size of single gpu
workers_per_gpu=2, # num_workers of DataLoader
train=dict(
type=dataset_type,
ann_file=data_root + 'annotations/instances_train2017.json',
img_prefix=data_root + 'train2017/',
pipeline=train_pipeline),
val=dict(
type=dataset_type,
ann_file=data_root + 'annotations/instances_val2017.json',
img_prefix=data_root + 'val2017/',
pipeline=test_pipeline),
test=dict(
type=dataset_type,
ann_file=data_root + 'annotations/instances_val2017.json',
img_prefix=data_root + 'val2017/',
pipeline=test_pipeline))
|
train_dataloader = dict(
batch_size=2,
num_workers=2,
persistent_workers=True,
# Configurable sampler
sampler=dict(type='DefaultSampler', shuffle=True),
# Configurable batch_sampler
batch_sampler=dict(type='AspectRatioBatchSampler'),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='annotations/instances_train2017.json',
data_prefix=dict(img='train2017/'),
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=train_pipeline))
val_dataloader = dict(
batch_size=1, # batch_size of validation process
num_workers=2,
persistent_workers=True,
drop_last=False, # whether drop the last batch
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file='annotations/instances_val2017.json',
data_prefix=dict(img='val2017/'),
test_mode=True,
pipeline=test_pipeline))
test_dataloader = val_dataloader
|
Prepare model¶
See Migrate model from mmcv for more information
import torch.nn as nn
import torch.nn.functional as F
from mmengine.model import BaseModel
class Model(BaseModel):
def __init__(self):
super().__init__()
self.conv1 = nn.Conv2d(3, 6, 5)
self.pool = nn.MaxPool2d(2, 2)
self.conv2 = nn.Conv2d(6, 16, 5)
self.fc1 = nn.Linear(16 * 5 * 5, 120)
self.fc2 = nn.Linear(120, 84)
self.fc3 = nn.Linear(84, 10)
self.loss_fn = nn.CrossEntropyLoss()
def forward(self, img, label, mode):
feat = self.pool(F.relu(self.conv1(img)))
feat = self.pool(F.relu(self.conv2(feat)))
feat = feat.view(-1, 16 * 5 * 5)
feat = F.relu(self.fc1(feat))
feat = F.relu(self.fc2(feat))
feat = self.fc3(feat)
if mode == 'loss':
loss = self.loss_fn(feat, label)
return dict(loss=loss)
else:
return [feat.argmax(1)]
model = Model()
Prepare optimizer¶
Prepare optimizer in MMCV
The MMCV Runner can accept an already built optimizer:
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)
For complicated configurations of optimizers, MMCV needs to build optimizers based on the optimizer constructors.
optimizer_cfg = dict(
optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
paramwise_cfg=dict(norm_decay_mult=0))
def build_optimizer_constructor(cfg):
constructor_type = cfg.get('type')
if constructor_type in OPTIMIZER_BUILDERS:
return build_from_cfg(cfg, OPTIMIZER_BUILDERS)
elif constructor_type in MMCV_OPTIMIZER_BUILDERS:
return build_from_cfg(cfg, MMCV_OPTIMIZER_BUILDERS)
else:
raise KeyError(f'{constructor_type} is not registered '
'in the optimizer builder registry.')
def build_optimizer(model, cfg):
optimizer_cfg = copy.deepcopy(cfg)
constructor_type = optimizer_cfg.pop('constructor',
'DefaultOptimizerConstructor')
paramwise_cfg = optimizer_cfg.pop('paramwise_cfg', None)
optim_constructor = build_optimizer_constructor(
dict(
type=constructor_type,
optimizer_cfg=optimizer_cfg,
paramwise_cfg=paramwise_cfg))
optimizer = optim_constructor(model)
return optimizer
optimizer = build_optimizer(model, optimizer_cfg)
Prepare optimizer in MMEngine
MMEngine needs to configure the optim_wrapper for the Runner. For more complicated cases, you can also configure the optim_wrapper in more detail. See more information in the API documents.
Configuration changes
Configuration in MMCV | Configuration in MMEngine |
---|---|
optimizer = dict(
constructor='CustomConstructor',
type='AdamW',
lr=0.0001,
betas=(0.9, 0.999),
weight_decay=0.05,
paramwise_cfg={ # parameters of constructor
'decay_rate': 0.95,
'decay_type': 'layer_wise',
'num_layers': 6
})
# MMCV needs to configure `optim_config` additionally
optimizer_config = dict(grad_clip=None)
|
optim_wrapper = dict(
constructor='CustomConstructor',
type='OptimWrapper', # Specify the type of OptimWrapper
optimizer=dict( # optimizer configuration
type='AdamW',
lr=0.0001,
betas=(0.9, 0.999),
weight_decay=0.05),
paramwise_cfg={
'decay_rate': 0.95,
'decay_type': 'layer_wise',
'num_layers': 6
})
|
Note
For high-level tasks like detection and classification, MMCV needs to configure optim_config to build the OptimizerHook, while this is not necessary for MMEngine.
optim_wrapper
used in this tutorial is as follows:
from torch.optim import SGD
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)
optim_wrapper = dict(optimizer=optimizer)
Prepare hooks¶
Prepare hooks in MMCV
The commonly used hooks configuration in MMCV is as follows:
# learning rate scheduler config
lr_config = dict(policy='step', step=[2, 3])
# configuration of optimizer
optimizer_config = dict(grad_clip=None)
# configuration of saving checkpoints periodically
checkpoint_config = dict(interval=1)
# save log periodically and multiple hooks can be used simultaneously
log_config = dict(interval=100, hooks=[dict(type='TextLoggerHook')])
# register hooks to runner and those hooks will be invoked automatically
runner.register_training_hooks(
lr_config=lr_config,
optimizer_config=optimizer_config,
checkpoint_config=checkpoint_config,
log_config=log_config)
Among them:
lr_config is used for LrUpdaterHook
optimizer_config is used for OptimizerHook
checkpoint_config is used for CheckPointHook
log_config is used for LoggerHook
Besides the hooks mentioned above, the MMCV Runner will build IterTimerHook automatically. The MMCV Runner registers the training hooks after instantiation, while the MMEngine Runner initializes the hooks during instantiation.
Prepare hooks in MMEngine
MMEngine Runner
takes some commonly used hooks in MMCV as the default hooks.
Compared with the example of MMCV:
LrUpdaterHook corresponds to the ParamSchedulerHook; find more details in migrate scheduler.
MMEngine optimizes the model in train_step, therefore we do not need the OptimizerHook in MMEngine anymore.
MMEngine takes CheckPointHook as a default hook.
MMEngine takes LoggerHook as a default hook.
Therefore, we can achieve the same effect as the MMCV example as long as we configure the param_scheduler correctly.
We can also register custom hooks in the MMEngine runner, as sketched below; find more details in the runner tutorial and migrate hook.
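A minimal sketch of such a registration, reusing the hypothetical PrintLossHook from the hook design section above and the modules built earlier in this tutorial:
from mmengine.runner import Runner

custom_hooks = [dict(type='PrintLossHook')]

runner = Runner(
    model=model,
    work_dir='./work_dir',
    optim_wrapper=optim_wrapper,
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=4),
    custom_hooks=custom_hooks)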
Commonly used hooks in MMCV | Default hooks in MMEngine |
---|---|
# Configure training hooks
# Configure LrUpdaterHook
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.001,
step=[8, 11])
# Configure OptimizerHook
optimizer_config = dict(grad_clip=None)
# Configure LoggerHook
log_config = dict( # LoggerHook
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
# Configure CheckPointHook
checkpoint_config = dict(interval=1) # CheckPointHook
|
# Configure parameter scheduler
param_scheduler = [
dict(
type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
dict(
type='MultiStepLR',
begin=0,
end=12,
by_epoch=True,
milestones=[8, 11],
gamma=0.1)
]
# Configure default hooks
default_hooks = dict(
timer=dict(type='IterTimerHook'),
logger=dict(type='LoggerHook', interval=50),
param_scheduler=dict(type='ParamSchedulerHook'),
checkpoint=dict(type='CheckpointHook', interval=1),
sampler_seed=dict(type='DistSamplerSeedHook'),
visualization=dict(type='DetVisualizationHook'))
|
The parameter scheduler used in this tutorial is as follows:
param_scheduler = dict(type='MultiStepLR', milestones=[2, 3], gamma=0.1)
Prepare testing/validation components¶
MMCV implements the validation process by EvalHook
, and we’ll not talk too much about it here. Given that validation is a common process in training, MMEngine abstracts validation as two independent modules: Evaluator and ValLoop. We can customize the metric or the validation process by defining a new loop or a new metric.
import torch
from mmengine.evaluator import BaseMetric
from mmengine.registry import METRICS


@METRICS.register_module(force=True)
class ToyAccuracyMetric(BaseMetric):

    def process(self, label, pred) -> None:
        self.results.append((label[1], pred, len(label[1])))

    def compute_metrics(self, results: list) -> dict:
        num_sample = 0
        acc = 0
        for label, pred, batch_size in results:
            acc += (label == torch.stack(pred)).sum()
            num_sample += batch_size
        return dict(Accuracy=acc / num_sample)
After defining the metric, we should also configure the evaluator and loop for Runner
. The example used in this tutorial is as follows:
val_evaluator = dict(type='ToyAccuracyMetric')
val_cfg = dict(type='ValLoop')
Configure validation in MMCV | Configure validation in MMEngine |
---|---|
eval_cfg = cfg.get('evaluation', {})
eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'
eval_hook = DistEvalHook if distributed else EvalHook
runner.register_hook(
eval_hook(val_dataloader, **eval_cfg), priority='LOW')
|
val_dataloader = val_dataloader
val_evaluator = dict(type='ToyAccuracyMetric')
val_cfg = dict(type='ValLoop')
|
Build Runner¶
Building Runner in MMCV
runner = EpochBasedRunner(
model=model,
optimizer=optimizer,
work_dir=work_dir,
logger=logger,
max_epochs=4
)
Building Runner in MMEngine
The EpochBasedRunner and max_epochs arguments in MMCV are moved into train_cfg in MMEngine. All parameters configurable in train_cfg are listed below:
by_epoch: True is equivalent to EpochBasedRunner, False is equivalent to IterBasedRunner
max_epochs/max_iters: Equivalent to max_epochs and max_iters in MMCV
val_interval: Equivalent to interval in MMCV
from mmengine.runner import Runner
runner = Runner(
model=model, # model to be optimized
work_dir='./work_dir', # working directory
randomness=randomness, # random seed
env_cfg=env_cfg, # environment config
launcher='none', # launcher for distributed training
optim_wrapper=optim_wrapper, # configure optimizer wrapper
param_scheduler=param_scheduler, # configure parameter scheduler
train_dataloader=train_dataloader, # configure train dataloader
train_cfg=dict(by_epoch=True, max_epochs=4, val_interval=1), # Configure training loop
val_dataloader=val_dataloader, # Configure validation dataloader
val_evaluator=val_evaluator, # Configure evaluator and metrics
val_cfg=val_cfg) # Configure validation loop
Load checkpoint¶
Loading checkpoint in MMCV
if cfg.resume_from:
runner.resume(cfg.resume_from)
elif cfg.load_from:
runner.load_checkpoint(cfg.load_from)
Loading checkpoint in MMEngine
runner = Runner(
...
load_from='/path/to/checkpoint',
resume=True
)
Configuration of loading checkpoint in MMCV | Configuration of loading checkpoint in MMEngine |
---|---|
load_from = 'path/to/ckpt'
|
load_from = 'path/to/ckpt'
resume = False
|
resume_from = 'path/to/ckpt'
|
load_from = 'path/to/ckpt'
resume = True
|
Training process¶
Training process in MMCV
Resume from or load a checkpoint first, and then start training.
if cfg.resume_from:
runner.resume(cfg.resume_from)
elif cfg.load_from:
runner.load_checkpoint(cfg.load_from)
runner.run(data_loaders, cfg.workflow)
Training process in MMEngine
The process mentioned above is handled by Runner.__init__ and Runner.train, so we only need to call
runner.train()
Testing process¶
Since MMCV Runner does not integrate the test function, we need to implement the test scripts by ourselves.
For MMEngine Runner, as long as we have configured the test_dataloader
, test_cfg
and test_evaluator
for the Runner
, we can call Runner.test
to start the testing process.
work_dir is the same as the one used for training
runner = Runner(
model=model,
work_dir='./work_dir',
randomness=randomness,
env_cfg=env_cfg,
    launcher='none', # disable distributed training
optim_wrapper=optim_wrapper,
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
val_dataloader=val_dataloader,
val_evaluator=val_evaluator,
val_cfg=val_cfg,
    test_dataloader=val_dataloader, # assume testing and validation use the same data and evaluator
test_evaluator=val_evaluator,
test_cfg=dict(type='TestLoop'),
)
runner.test()
work_dir is different from the one used for training, so load_from needs to be configured manually
runner = Runner(
model=model,
work_dir='./test_work_dir',
load_from='./work_dir/epoch_5.pth', # set load_from additionally
randomness=randomness,
env_cfg=env_cfg,
launcher='none',
optim_wrapper=optim_wrapper,
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
val_dataloader=val_dataloader,
val_evaluator=val_evaluator,
val_cfg=val_cfg,
test_dataloader=val_dataloader,
test_evaluator=val_evaluator,
test_cfg=dict(type='TestLoop'),
)
runner.test()
Customize training process¶
If we want to customize the training/validation process, we need to override Runner.val or Runner.train in a custom Runner. Taking Runner.train as an example, suppose we need to train with the same batch twice in each iteration; we can override Runner.train like this:
class CustomRunner(EpochBasedRunner):

    def train(self, data_loader, **kwargs):
        self.model.train()
        self.mode = 'train'
        self.data_loader = data_loader
        self._max_iters = self._max_epochs * len(self.data_loader)
        self.call_hook('before_train_epoch')
        time.sleep(2)  # Prevent possible deadlock during epoch transition
        for i, data_batch in enumerate(self.data_loader):
            self.data_batch = data_batch
            self._inner_iter = i
            for _ in range(2):
                self.call_hook('before_train_iter')
                self.run_iter(data_batch, train_mode=True, **kwargs)
                self.call_hook('after_train_iter')
            del self.data_batch
            self._iter += 1
        self.call_hook('after_train_epoch')
        self._epoch += 1
In MMEngine, we need to customize a train loop.
from mmengine.registry import LOOPS
from mmengine.runner import EpochBasedTrainLoop
@LOOPS.register_module()
class CustomEpochBasedTrainLoop(EpochBasedTrainLoop):

    def run_iter(self, idx, data_batch) -> None:
        for _ in range(2):
            super().run_iter(idx, data_batch)
Then, we need to set type
as CustomEpochBasedTrainLoop
in train_cfg
. Note that by_epoch
and type
cannot be configured at the same time. Once by_epoch
is configured, the type of the training loop will be inferred as EpochBasedTrainLoop
.
runner = Runner(
model=model,
work_dir='./test_work_dir',
randomness=randomness,
env_cfg=env_cfg,
launcher='none',
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
train_dataloader=train_dataloader,
train_cfg=dict(
type='CustomEpochBasedTrainLoop',
max_epochs=5,
val_interval=1),
val_dataloader=val_dataloader,
val_evaluator=val_evaluator,
val_cfg=val_cfg,
test_dataloader=val_dataloader,
test_evaluator=val_evaluator,
test_cfg=dict(type='TestLoop'),
)
runner.train()
For more complicated migration needs of Runner
, you can refer to the runner tutorials and runner design.
Migrate Hook from MMCV to MMEngine¶
Coming soon. Please refer to chinese documentation.
Migrate Model from MMCV to MMEngine¶
Introduction¶
The early computer vision tasks supported by MMCV, such as detection and classification, used a general process to optimize the model. It can be summarized as the following four steps (a plain-PyTorch sketch follows the list):
Calculate the loss
Calculate the gradients
Update the model parameters
Clean the gradients of the last iteration
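As a minimal sketch in plain PyTorch (model, data, label and optimizer are assumed to already exist, and the model is assumed to return a dict containing the key 'loss'), the four steps correspond to:
loss = model(data, label)['loss']  # 1. calculate the loss
loss.backward()                    # 2. calculate the gradients
optimizer.step()                   # 3. update the model parameters
optimizer.zero_grad()              # 4. clean the gradients of the last iteration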
For most high-level tasks, “where” and “when” to perform the above steps are usually fixed, so it seems reasonable to implement them with a Hook. MMCV implements a series of hooks, such as OptimizerHook, Fp16OptimizerHook and GradientCumulativeFp16OptimizerHook, to provide various optimization strategies.
On the other hand, tasks like GANs (generative adversarial networks) and self-supervised learning require more flexible training processes that do not fit this pattern, and they can be hard to implement with hooks. To meet the needs of such tasks, MMCV passes the optimizer to train_step so that users can customize the optimization process as they want. Although this works, it cannot reuse the various OptimizerHook implementations in MMCV, and downstream repositories have to implement mixed-precision training and gradient accumulation on their own.
To unify the training process of various deep learning tasks, MMEngine designed the OptimWrapper, which integrates mixed-precision training, gradient accumulation and other optimization strategies into a unified interface.
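As a rough sketch (assuming an OptimWrapper instance named optim_wrapper and a model that returns a dict containing the key 'loss'), the unified interface boils down to a single call:
loss = model(inputs, labels, mode='loss')['loss']
optim_wrapper.update_params(loss)  # backward, step and zero_grad, plus mixed precision
                                   # and gradient accumulation if they are configured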
Migrate optimization process¶
Since MMEngine designs the OptimWrapper and deprecates the series of OptimizerHook classes, there are some differences between the optimization process in MMCV and that in MMEngine.
Commonly used optimization process¶
Considering tasks like detection and classification, the optimization process is usually the same, so BaseModel
integrates the process into train_step
.
Model based on MMCV
Before describing how to migrate the model, let's look at a minimal example of training a model based on MMCV.
import torch
import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader
from mmcv.runner import Runner
from mmcv.utils.logging import get_logger
train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)
class MMCVToyModel(nn.Module):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, img, label, return_loss=False):
feat = self.linear(img)
loss1 = (feat - label).pow(2)
loss2 = (feat - label).abs()
loss = (loss1 + loss2).sum()
return dict(loss=loss,
num_samples=len(img),
log_vars=dict(
loss1=loss1.sum().item(),
loss2=loss2.sum().item()))
def train_step(self, data, optimizer=None):
return self(*data, return_loss=True)
def val_step(self, data, optimizer=None):
return self(*data, return_loss=False)
model = MMCVToyModel()
optimizer = SGD(model.parameters(), lr=0.01)
logger = get_logger('demo')
lr_config = dict(policy='step', step=[2, 3])
optimizer_config = dict(grad_clip=None)
log_config = dict(interval=10, hooks=[dict(type='TextLoggerHook')])
runner = Runner(
model=model,
work_dir='tmp_dir',
optimizer=optimizer,
logger=logger,
max_epochs=5)
runner.register_training_hooks(
lr_config=lr_config,
optimizer_config=optimizer_config,
log_config=log_config)
runner.run([train_dataloader], [('train', 1)])
A model based on MMCV must implement train_step and return a dict that contains the following keys:
loss: passed to OptimizerHook to calculate the gradients
num_samples: passed to LogBuffer to count the averaged loss
log_vars: passed to LogBuffer to count the averaged loss
Model based on MMEngine
The same model based on MMEngine
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from mmengine.runner import Runner
from mmengine.model import BaseModel
train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)
class MMEngineToyModel(BaseModel):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, img, label, mode):
feat = self.linear(img)
# Called by train_step and return the loss dict
if mode == 'loss':
loss1 = (feat - label).pow(2)
loss2 = (feat - label).abs()
return dict(loss1=loss1, loss2=loss2)
# Called by val_step and return the predictions
elif mode == 'predict':
return [_feat for _feat in feat]
# tensor model, find more details in tutorials/model.md
else:
pass
runner = Runner(
model=MMEngineToyModel(),
work_dir='tmp_dir',
train_dataloader=train_dataloader,
train_cfg=dict(by_epoch=True, max_epochs=5),
optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)))
runner.train()
In MMEngine, users can customize their models based on BaseModel, whose train_step implements the same logic as OptimizerHook. For high-level tasks, train_step will be called in the training loop with specific arguments, and users do not need to care about the optimization process. For low-level tasks, users can override train_step to customize the optimization process.
Model in MMCV | Model in MMEngine |
---|---|
class MMCVToyModel(nn.Module):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
def forward(self, img, label, return_loss=False):
feat = self.linear(img)
loss1 = (feat - label).pow(2)
loss2 = (feat - label).abs()
loss = (loss1 + loss2).sum()
return dict(loss=loss,
num_samples=len(img),
log_vars=dict(
loss1=loss1.sum().item(),
loss2=loss2.sum().item()))
def train_step(self, data, optimizer=None):
return self(*data, return_loss=True)
def val_step(self, data, optimizer=None):
return self(*data, return_loss=False)
|
class MMEngineToyModel(BaseModel):
def __init__(self) -> None:
super().__init__()
self.linear = nn.Linear(1, 1)
    def forward(self, img, label, mode):
        feat = self.linear(img)
        if mode == 'loss':
            loss1 = (feat - label).pow(2)
            loss2 = (feat - label).abs()
            return dict(loss1=loss1, loss2=loss2)
        elif mode == 'predict':
            return [_feat for _feat in feat]
        else:
            pass
# The equivalent code snippet of `train_step`
# def train_step(self, data, optim_wrapper):
# data = self.data_preprocessor(data)
# loss_dict = self(*data, mode='loss')
# loss_dict['loss1'] = loss_dict['loss1'].sum()
# loss_dict['loss2'] = loss_dict['loss2'].sum()
# loss = (loss_dict['loss1'] + loss_dict['loss2']).sum()
# Call the optimizer wrapper to update parameters.
# optim_wrapper.update_params(loss)
# return loss_dict
|
Note
See more information about data_preprocessor and optim_wrapper in the optim_wrapper and data_preprocessor docs.
The main differences between the models in MMCV and MMEngine can be summarized as follows:
MMCVToyModel inherits from nn.Module, while MMEngineToyModel inherits from BaseModel.
MMCVToyModel must implement a train_step method and return a dict with the keys loss, log_vars and num_samples. MMEngineToyModel only needs to implement the forward method for high-level tasks and return a dict of differentiable losses.
MMCVToyModel.forward and MMEngineToyModel.forward must match the train_step that calls them. Since MMEngineToyModel does not override train_step, BaseModel.train_step is called directly, which requires that forward accept a mode parameter. Find more details in the model tutorials.
Custom optimization process¶
Take training a GAN model as an example: the generator and the discriminator need to be optimized in turn, and the optimization strategy can change as training progresses. Therefore, it could be hard to meet such requirements with OptimizerHook in MMCV. A GAN model based on MMCV accepts an optimizer in train_step and updates the parameters there. MMEngine borrows this approach and simplifies it by passing an optim_wrapper rather than an optimizer.
Taking the training of a GAN model as a reference, the differences between MMCV and MMEngine are as follows:
Training gan in MMCV | Training gan in MMEngine |
---|---|
def train_discriminator(self, inputs, optimizer):
real_imgs = inputs['inputs']
z = torch.randn(
(real_imgs.shape[0], self.noise_size)).type_as(real_imgs)
with torch.no_grad():
fake_imgs = self.generator(z)
disc_pred_fake = self.discriminator(fake_imgs)
disc_pred_real = self.discriminator(real_imgs)
parsed_losses, log_vars = self.disc_loss(disc_pred_fake,
disc_pred_real)
parsed_losses.backward()
optimizer.step()
optimizer.zero_grad()
return log_vars
def train_generator(self, inputs, optimizer):
real_imgs = inputs['inputs']
z = torch.randn(inputs['inputs'].shape[0], self.noise_size).type_as(
real_imgs)
fake_imgs = self.generator(z)
disc_pred_fake = self.discriminator(fake_imgs)
parsed_loss, log_vars = self.gen_loss(disc_pred_fake)
    parsed_loss.backward()
optimizer.step()
optimizer.zero_grad()
return log_vars
|
def train_discriminator(self, inputs, optimizer_wrapper):
real_imgs = inputs['inputs']
z = torch.randn(
(real_imgs.shape[0], self.noise_size)).type_as(real_imgs)
with torch.no_grad():
fake_imgs = self.generator(z)
disc_pred_fake = self.discriminator(fake_imgs)
disc_pred_real = self.discriminator(real_imgs)
parsed_losses, log_vars = self.disc_loss(disc_pred_fake,
disc_pred_real)
optimizer_wrapper.update_params(parsed_losses)
return log_vars
def train_generator(self, inputs, optimizer_wrapper):
real_imgs = inputs['inputs']
z = torch.randn(real_imgs.shape[0], self.noise_size).type_as(real_imgs)
fake_imgs = self.generator(z)
disc_pred_fake = self.discriminator(fake_imgs)
parsed_loss, log_vars = self.gen_loss(disc_pred_fake)
optimizer_wrapper.update_params(parsed_loss)
return log_vars
|
Apart from the differences mentioned in the previous section, the main difference between the optimization processes in MMCV and MMEngine is that the latter can use optim_wrapper in a simpler way. The convenience of optim_wrapper becomes even more obvious when gradient accumulation and mixed-precision training are applied.
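For example, a sketch of an optim_wrapper config enabling both features might look like this (the values are placeholders; AmpOptimWrapper turns on mixed-precision training and accumulative_counts turns on gradient accumulation):
optim_wrapper = dict(
    type='AmpOptimWrapper',  # mixed-precision training
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9),
    accumulative_counts=2)   # accumulate gradients over 2 iterations before updating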
Migrate validation/testing process¶
A model based on MMCV usually does not need to provide test_step or val_step for testing/validation. However, MMEngine performs testing/validation with ValLoop and TestLoop, which call runner.model.val_step and runner.model.test_step. Therefore, a model based on MMEngine needs to provide val_step and test_step, whose input data and output predictions should be compatible with the DataLoader and Evaluator.process respectively. You can find more details in the model tutorial. This is why MMEngineToyModel.forward slices feat and returns the predictions as a list.
class MMEngineToyModel(BaseModel):
...
def forward(self, img, label, mode):
if mode == 'loss':
...
elif mode == 'predict':
# Slice the data to a list
return [_feat for _feat in feat]
else:
...
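Since MMEngineToyModel inherits from BaseModel, it does not have to implement val_step by hand. Roughly speaking (a simplified sketch of the default behavior, not the exact source), BaseModel.val_step does something like:
def val_step(self, data):
    data = self.data_preprocessor(data, training=False)
    return self(*data, mode='predict')  # calls forward with mode='predict'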
Migrate the distributed training¶
MMCV wraps the model with a distributed wrapper before building the runner, while MMEngine wraps the model inside the Runner. Therefore, we need to configure the launcher and model_wrapper_cfg for the Runner. Migrate Runner from MMCV to MMEngine introduces this in detail.
Commonly used training process
For the high-level tasks mentioned in the introduction, the default distributed model wrapper is enough, so we only need to configure the launcher for the MMEngine Runner.
Distributed training in MMCV | Distributed training in MMEngine |
---|---|
model = MMDistributedDataParallel(
    model,
    device_ids=[int(os.environ['LOCAL_RANK'])],
    broadcast_buffers=False,
    find_unused_parameters=find_unused_parameters)
...
runner = Runner(model=model, ...)
|
runner = Runner(
    model=model,
    launcher='pytorch',  # enable distributed training
    ...,
)
|
Optimize modules independently with a custom optimization process
Again, taking training a GAN model as an example, the generator and the discriminator need to be optimized separately. Therefore, the model needs to be wrapped by MMSeparateDistributedDataParallel, which needs to be specified when building the runner.
cfg = dict(model_wrapper_cfg='MMSeparateDistributedDataParallel')
runner = Runner(
    model=model,
    ...,  # other configs
    launcher='pytorch',
    cfg=cfg)
Optimize a model with a custom optimization process
Sometimes we need to optimize the whole model with a custom optimization process, where we cannot reuse BaseModel.train_step but need to override it. For example, we may want to optimize the model twice with the same batch of images: the first time with batch data augmentation enabled, and the second time with it disabled:
class CustomModel(BaseModel):
def train_step(self, data, optim_wrapper):
data = self.data_preprocessor(data, training=True) # Enable batch augmentation
loss = self(data, mode='loss')
optim_wrapper.update_params(loss)
data = self.data_preprocessor(data, training=False) # Disable batch augmentation
loss = self(data, mode='loss')
optim_wrapper.update_params(loss)
In this case, we need to customize a model wrapper that overrides the train_step
and performs the same process as CustomModel.train_step
.
class CustomDistributedDataParallel(MMSeparateDistributedDataParallel):
def train_step(self, data, optim_wrapper):
data = self.data_preprocessor(data, training=True) # Enable batch augmentation
loss = self(data, mode='loss')
optim_wrapper.update_params(loss)
data = self.data_preprocessor(data, training=False) # Disable batch augmentation
loss = self(data, mode='loss')
optim_wrapper.update_params(loss)
Then we can specify it when building Runner:
cfg = dict(model_wrapper_cfg=dict(type='CustomDistributedDataParallel'))
runner = Runner(
model=model,
...,
launcher='pytorch',
cfg=cfg
)
Migrate parameter scheduler from MMCV to MMEngine¶
MMCV 1.x version uses LrUpdaterHook and MomentumUpdaterHook to adjust the learning rate and momentum. However, the design of LrUpdaterHook has been difficult to meet more abundant customization requirements due to the development of the training strategies. Hence, MMEngine proposes parameter schedulers (ParamScheduler).
The interface of the parameter scheduler is consistent with PyTorch's learning rate scheduler (LRScheduler). In addition, the parameter scheduler provides stronger functionality. For details, please refer to the Parameter Scheduler User Guide.
Learning rate scheduler (LrUpdater) migration¶
MMEngine uses LRScheduler instead of LrUpdaterHook. The field in the config file is changed from the original lr_config
to param_scheduler
.
The learning rate config in MMCV corresponds to the parameter scheduler config in MMEngine as follows:
Learning rate warm-up migration¶
The learning rate warm-up can be achieved through the combination of schedulers by specifying the effective range begin
and end
. There are 3 learning rate warm-up methods in MMCV, namely 'constant'
, 'linear'
, 'exp'
. The corresponding config in MMEngine should be modified as follows:
Constant warm-up¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
warmup='constant',
warmup_ratio=0.1,
warmup_iters=500,
warmup_by_epoch=False
)
|
param_scheduler = [
dict(type='ConstantLR',
factor=0.1,
begin=0,
end=500,
by_epoch=False),
dict(...) # the main learning rate scheduler
]
|
Linear warm-up¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
warmup='linear',
warmup_ratio=0.1,
warmup_iters=500,
warmup_by_epoch=False
)
|
param_scheduler = [
dict(type='LinearLR',
start_factor=0.1,
begin=0,
end=500,
by_epoch=False),
dict(...) # the main learning rate scheduler
]
|
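Putting the two parts together, a complete param_scheduler that combines a 500-iteration linear warm-up with a MultiStepLR main scheduler could be sketched as follows (the milestones and factors are placeholder values):
param_scheduler = [
    dict(type='LinearLR',
         start_factor=0.1,
         begin=0,
         end=500,
         by_epoch=False),  # warm up by iteration
    dict(type='MultiStepLR',
         milestones=[8, 11],
         gamma=0.1,
         by_epoch=True)    # main schedule by epoch
]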
Exponential warm-up¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
warmup='exp',
warmup_ratio=0.1,
warmup_iters=500,
warmup_by_epoch=False
)
|
param_scheduler = [
dict(type='ExponentialLR',
gamma=0.1,
begin=0,
end=500,
by_epoch=False),
dict(...) # the main learning rate scheduler
]
|
Fixed learning rate (FixedLrUpdaterHook) migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(policy='fixed')
|
param_scheduler = [
dict(type='ConstantLR', factor=1)
]
|
Step learning rate (StepLrUpdaterHook) migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
policy='step',
step=[8, 11],
gamma=0.1,
by_epoch=True
)
|
param_scheduler = [
dict(type='MultiStepLR',
         milestones=[8, 11],
gamma=0.1,
by_epoch=True)
]
|
Poly learning rate (PolyLrUpdaterHook) migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
policy='poly',
power=0.7,
min_lr=0.001,
by_epoch=True
)
|
param_scheduler = [
dict(type='PolyLR',
power=0.7,
eta_min=0.001,
begin=0,
end=num_epochs,
by_epoch=True)
]
|
Exponential learning rate (ExpLrUpdaterHook) migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
policy='exp',
power=0.5,
by_epoch=True
)
|
param_scheduler = [
dict(type='ExponentialLR',
gamma=0.5,
begin=0,
end=num_epochs,
by_epoch=True)
]
|
Cosine annealing learning rate (CosineAnnealingLrUpdaterHook) migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
policy='CosineAnnealing',
min_lr=0.5,
by_epoch=True
)
|
param_scheduler = [
dict(type='CosineAnnealingLR',
eta_min=0.5,
T_max=num_epochs,
begin=0,
end=num_epochs,
by_epoch=True)
]
|
FlatCosineAnnealingLrUpdaterHook migration¶
The learning rate strategy combined by multiple phases like FlatCosineAnnealing originally needs to be achieved by rewriting a Hook. But in MMEngine, it can be achieved with combining two parameter scheduler configs:
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
policy='FlatCosineAnnealing',
start_percent=0.5,
min_lr=0.005,
by_epoch=True
)
|
param_scheduler = [
    dict(type='ConstantLR', factor=1, begin=0, end=num_epochs * 0.75),
dict(type='CosineAnnealingLR',
eta_min=0.005,
begin=num_epochs * 0.75,
end=num_epochs,
T_max=num_epochs * 0.25,
by_epoch=True)
]
|
CosineRestartLrUpdaterHook migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(policy='CosineRestart',
periods=[5, 10, 15],
restart_weights=[1, 0.7, 0.3],
min_lr=0.001,
by_epoch=True)
|
param_scheduler = [
dict(type='CosineRestartLR',
periods=[5, 10, 15],
restart_weights=[1, 0.7, 0.3],
eta_min=0.001,
by_epoch=True)
]
|
OneCycleLrUpdaterHook migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(policy='OneCycle',
max_lr=0.02,
total_steps=90000,
pct_start=0.3,
anneal_strategy='cos',
div_factor=25,
final_div_factor=1e4,
three_phase=True,
by_epoch=False)
|
param_scheduler = [
dict(type='OneCycleLR',
eta_max=0.02,
total_steps=90000,
pct_start=0.3,
anneal_strategy='cos',
div_factor=25,
final_div_factor=1e4,
three_phase=True,
by_epoch=False)
]
|
Notice: by_epoch
defaults to False
in MMCV. It now defaults to True
in MMEngine.
LinearAnnealingLrUpdaterHook migration¶
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(
policy='LinearAnnealing',
min_lr_ratio=0.01,
by_epoch=True
)
|
param_scheduler = [
dict(type='LinearLR',
start_factor=1,
end_factor=0.01,
begin=0,
end=num_epochs,
by_epoch=True)
]
|
MomentumUpdater migration¶
MMCV uses the momentum_config field and MomentumUpdaterHook to adjust the momentum. In MMEngine, the momentum is also controlled by parameter schedulers. Users can simply replace the LR in a learning rate scheduler's type name with Momentum to apply the same strategy to the momentum. The momentum schedulers share the same param_scheduler field in the config with the learning rate schedulers:
MMCV-1.x | MMEngine |
---|---|
lr_config = dict(...)
momentum_config = dict(
policy='CosineAnnealing',
min_momentum=0.1,
by_epoch=True
)
|
param_scheduler = [
# config of learning rate schedulers
dict(...),
# config of momentum schedulers
dict(type='CosineAnnealingMomentum',
eta_min=0.1,
T_max=num_epochs,
begin=0,
end=num_epochs,
by_epoch=True)
]
|
Migrate Data Transform to OpenMMLab 2.0¶
Introduction¶
According to the data transform interface convention of TorchVision, all data transform classes need to implement the __call__ method. And in the convention of OpenMMLab 1.0, we require the input and output of the __call__ method to be dictionaries.
In OpenMMLab 2.0, to make the data transform classes more extensible, we use the transform method instead of the __call__ method to implement data transformation, and all data transform classes should inherit from the mmcv.transforms.BaseTransform class. You can still use these data transform classes by calling them directly.
A tutorial to implement a data transform class can be found in the Data Transform.
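As a minimal sketch of the new convention (the MyFlip class and its direction argument are made up for illustration, assuming BaseTransform and the TRANSFORMS registry can be imported from mmcv.transforms):
import mmcv
from mmcv.transforms import TRANSFORMS, BaseTransform

@TRANSFORMS.register_module()
class MyFlip(BaseTransform):

    def __init__(self, direction: str = 'horizontal'):
        super().__init__()
        self.direction = direction

    def transform(self, results: dict) -> dict:
        # Flip the image and put it back into the results dict
        results['img'] = mmcv.imflip(results['img'], direction=self.direction)
        return results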
In addition, we moved some common data transform classes from the individual repositories to MMCV, and in this document we compare the functionalities, usages and implementations between the original data transform classes (in MMClassification v0.23.2 and MMDetection v2.25.1) and the new data transform classes (in MMCV v2.0.0rc1).
Functionality Differences¶
MMClassification (original) | MMDetection (original) | MMCV (new) | |
---|---|---|---|
LoadImageFromFile |
Join the 'img_prefix' and 'img_info.filename' field to find the path of images and loading. | Join the 'img_prefix' and 'img_info.filename' field to find the path of images and loading. Support specifying the order of channels. | Load images from 'img_path'. Support ignoring failed loading and specifying decode backend. |
LoadAnnotations |
Not available. | Load bbox, label, mask (include polygon masks), semantic segmentation. Support converting bbox coordinate system. | Load bbox, label, mask (not include polygon masks), semantic segmentation. |
Pad |
Pad all images in the "img_fields" field. | Pad all images in the "img_fields" field. Support padding to integer multiple size. | Pad the image in the "img" field. Support padding to integer multiple size. |
CenterCrop |
Crop all images in the "img_fields" field. Support cropping as EfficientNet style. | Not available. | Crop the image in the "img" field, the bbox in the "gt_bboxes" field, the semantic segmentation in the "gt_seg_map" field, the keypoints in the "gt_keypoints" field. Support padding the margin of the cropped image. |
Normalize |
Normalize the image. | No differences. | No differences, but we recommend to use data preprocessor to normalize the image. |
Resize |
Resize all images in the "img_fields" field. Support resizing proportionally according to the specified edge. | Use Resize with ratio_range=None, img_scale has a single scale, and multiscale_mode="value". |
Resize the image in the "img" field, the bbox in the "gt_bboxes" field, the semantic segmentation in the "gt_seg_map" field, the keypoints in the "gt_keypoints" field. Support specifying the ratio of new scale to original scale and support resizing proportionally. |
RandomResize |
Not available | Use Resize with ratio_range=None, img_scale has two scales and multiscale_mode="range", or ratio_range is not None.
Resize( img_scale=[(640, 480), (960, 720)], mode="range", ) |
Have the same resize function as Resize . Support sampling the scale from a scale range or scale ratio range.
RandomResize(scale=[(640, 480), (960, 720)]) |
RandomChoiceResize |
Not available | Use Resize with ratio_range=None, img_scale has multiple scales, and multiscale_mode="value".
Resize( img_scale=[(640, 480), (960, 720)], mode="value", ) |
Have the same resize function as Resize . Support randomly choosing the scale from multiple scales or multiple scale ratios.
RandomChoiceResize(scales=[(640, 480), (960, 720)]) |
RandomGrayscale |
Randomly grayscale all images in the "img_fields" field. Support keeping channels after grayscale. | Not available | Randomly grayscale the image in the "img" field. Support specifying the weight of each channel, and support keeping channels after grayscale. |
RandomFlip |
Randomly flip all images in the "img_fields" field. Support flipping horizontally and vertically. | Randomly flip all values in the "img_fields", "bbox_fields", "mask_fields" and "seg_fields". Support flipping horizontally, vertically and diagonally, and support specifying the probability of every kind of flipping. | Randomly flip the values in the "img", "gt_bboxes", "gt_seg_map", "gt_keypoints" field. Support flipping horizontally, vertically and diagonally, and support specifying the probability of every kind of flipping. |
MultiScaleFlipAug |
Not available | Used for test-time-augmentation. | Use TestTimeAug |
ToTensor |
Convert the values in the specified fields to torch.Tensor . |
No differences | No differences |
ImageToTensor |
Convert the values in the specified fields to torch.Tensor and transpose the channels to CHW. |
No differences. | No differences. |
Implementation Differences¶
Take RandomFlip as an example. The new version of RandomFlip in MMCV inherits from BaseTransform and moves the functionality implementation from __call__ to the transform method. In addition, the randomness-related code is placed in separate methods, and these methods need to be wrapped by the cache_randomness decorator.
MMDetection (original version)
class RandomFlip:

    def __call__(self, results):
        """Randomly flip images."""
        ...
        # Randomly choose the flip direction
        cur_dir = np.random.choice(direction_list, p=flip_ratio_list)
        ...
        return results
MMCV (new version)
class RandomFlip(BaseTransform):

    def transform(self, results):
        """Randomly flip images."""
        ...
        cur_dir = self._random_direction()
        ...
        return results

    @cache_randomness
    def _random_direction(self):
        """Randomly choose the flip direction."""
        ...
        return np.random.choice(direction_list, p=flip_ratio_list)
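It is worth noting that instances of the new-style transforms remain callable, since BaseTransform implements __call__ by delegating to transform. A rough usage sketch (assuming RandomFlip accepts a prob argument and results is a dict containing an 'img' entry):
flip = RandomFlip(prob=0.5)
results = flip(results)  # __call__ delegates to transform()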
mmengine.registry¶
A registry to map strings to classes or functions. |
|
Scope of current task used to reset the current registry, which can be accessed globally. |
Build a module from config dict when it is a class configuration, or call a function from config dict when it is a function configuration. |
|
Build a PyTorch model from config dict(s). |
|
Build a Runner object. |
|
Builds a |
|
Scan all modules in MMEngine’s root and child registries and dump to json. |
|
Traverse the whole registry tree from any given node, and collect information of all registered modules in this registry tree. |
|
Initialize the given default scope. |
mmengine.config¶
A facility for config and config files. |
|
A dictionary for config which has the same interface as python’s built- in dictionary and can be used as a normal dictionary. |
|
argparse action to split an argument into KEY=VALUE form on the first = and append to a dictionary. |
mmengine.runner¶
mmengine.runner
Loop¶
Base loop class. |
|
Loop for epoch-based training. |
|
Loop for iter-based training. |
|
Loop for validation. |
|
Loop for test. |
Checkpoints¶
A general checkpoint loader to manage all schemes. |
Find the latest checkpoint from the given path. |
|
Returns a dictionary containing a whole state of the module. |
|
Load checkpoint from a file or URI. |
|
Load state_dict to a module. |
|
Save checkpoint to file. |
|
Copy a model state_dict to cpu. |
Miscellaneous¶
A log processor used to format log information collected from |
|
Hook priority levels. |
Get priority value. |
mmengine.hooks¶
Base hook class. |
|
Save checkpoints periodically. |
|
A Hook to apply Exponential Moving Average (EMA) on the model during training. |
|
Collect logs from different components of |
|
Show or Write the predicted results during the process of testing. |
|
A hook to update some hyper-parameters in optimizer, e.g., learning rate and momentum. |
|
A hook that updates runtime information into message hub. |
|
Data-loading sampler for distributed training. |
|
A hook that logs the time spent during iteration. |
|
Synchronize model buffers such as running_mean and running_var in BN at the end of each epoch. |
|
Releases all unoccupied cached GPU memory during the process of training. |
|
A hook to analyze performance during training and inference. |
|
Wraps runner.model with subclass of |
mmengine.model¶
mmengine.model
Module¶
Base module for all modules in openmmlab. |
|
ModuleDict in openmmlab. |
|
ModuleList in openmmlab. |
|
Sequential module in openmmlab. |
Model¶
Base class for all algorithmic models. |
|
Base data pre-processor used for copying data to the target device. |
|
Image pre-processor for normalization and bgr to rgb conversion. |
|
Base model for inference with test-time augmentation. |
EMA¶
A base class for averaging model weights. |
|
Implements the exponential moving average (EMA) of the model. |
|
Exponential moving average (EMA) with momentum annealing strategy. |
|
Implements the stochastic weight averaging (SWA) of the model. |
Model Wrapper¶
A distributed model wrapper used for training,testing and validation in loop. |
|
A DistributedDataParallel wrapper for models in MMGeneration. |
|
A wrapper for sharding Module parameters across data parallel workers. |
Check if a module is a model wrapper. |
Weight Initialization¶
Initialize module parameters with constant values. |
|
Initialize module parameters with the values according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - He, K. |
|
Initialize module parameters with the values drawn from the normal distribution \(\mathcal{N}(\text{mean}, \text{std}^2)\). |
|
Initialize module by loading a pretrained model. |
|
Initialize module parameters with the values drawn from the normal distribution \(\mathcal{N}(\text{mean}, \text{std}^2)\) with values outside \([a, b]\). |
|
Initialize module parameters with values drawn from the uniform distribution \(\mathcal{U}(a, b)\). |
|
Initialize module parameters with values according to the method described in Understanding the difficulty of training deep feedforward neural networks - Glorot, X. |
initialize conv/fc bias value according to a given probability value. |
|
Initialize a module. |
|
Update the _params_init_info in the module if the value of parameters are changed. |
|
Utils¶
Merge all dictionaries into one dictionary. |
|
Stack multiple tensors to form a batch and pad the tensor to the max shape use the right bottom padding mode in these images. |
|
Helper function to convert all SyncBatchNorm (SyncBN) and mmcv.ops.sync_bn.SyncBatchNorm (MMSyncBN) layers in the model to BatchNormXd layers. |
|
Helper function to convert all BatchNorm layers in the model to SyncBatchNorm (SyncBN) or mmcv.ops.sync_bn.SyncBatchNorm (MMSyncBN) layers. Adapted from https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm.convert_sync_batchnorm. |
mmengine.optim¶
Optimizer¶
A subclass of |
|
Optimizer wrapper provides a common interface for updating parameters. |
|
A dictionary container of |
|
Default constructor for optimizers. |
Build function of OptimWrapper. |
Scheduler¶
Base class for parameter schedulers. |
|
Decays the learning rate value of each parameter group by a small constant factor until the number of epoch reaches a pre-defined milestone: |
|
Decays the momentum value of each parameter group by a small constant factor until the number of epoch reaches a pre-defined milestone: |
|
Decays the parameter value of each parameter group by a small constant factor until the number of epoch reaches a pre-defined milestone: |
|
Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial value and \(T_{cur}\) is the number of epochs since the last restart in SGDR: |
|
Set the momentum of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial value and \(T_{cur}\) is the number of epochs since the last restart in SGDR: |
|
Set the parameter value of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial value and \(T_{cur}\) is the number of epochs since the last restart in SGDR: |
|
Decays the learning rate of each parameter group by gamma every epoch. |
|
Decays the momentum of each parameter group by gamma every epoch. |
|
Decays the parameter value of each parameter group by gamma every epoch. |
|
Decays the learning rate of each parameter group by linearly changing small multiplicative factor until the number of epoch reaches a pre-defined milestone: |
|
Decays the momentum of each parameter group by linearly changing small multiplicative factor until the number of epoch reaches a pre-defined milestone: |
|
Decays the parameter value of each parameter group by linearly changing small multiplicative factor until the number of epoch reaches a pre-defined milestone: |
|
Decays the specified learning rate in each parameter group by gamma once the number of epoch reaches one of the milestones. |
|
Decays the specified momentum in each parameter group by gamma once the number of epoch reaches one of the milestones. |
|
Decays the specified parameter in each parameter group by gamma once the number of epoch reaches one of the milestones. |
|
Sets the learning rate of each parameter group according to the 1cycle learning rate policy. |
|
Sets the parameters of each parameter group according to the 1cycle learning rate policy. |
|
Decays the learning rate of each parameter group in a polynomial decay scheme. |
|
Decays the momentum of each parameter group in a polynomial decay scheme. |
|
Decays the parameter value of each parameter group in a polynomial decay scheme. |
|
Decays the learning rate of each parameter group by gamma every step_size epochs. |
|
Decays the momentum of each parameter group by gamma every step_size epochs. |
|
Decays the parameter value of each parameter group by gamma every step_size epochs. |
mmengine.evaluator¶
Evaluator¶
Wrapper class to compose multiple |
mmengine.structures¶
A base data interface that supports Tensor-like and dict-like operations. |
|
Data structure for instance-level annotations or predictions. |
|
Data structure for label-level annotations or predictions. |
|
Data structure for pixel-level annotations or predictions. |
mmengine.dataset¶
mmengine.dataset
Dataset¶
BaseDataset for open source projects in OpenMMLab. |
|
Compose multiple transforms sequentially. |
Dataset Wrapper¶
A wrapper of class balanced dataset. |
|
A wrapper of concatenated dataset. |
|
A wrapper of repeated dataset. |
Sampler¶
The default data sampler for both distributed and non-distributed environment. |
|
It’s designed for iteration-based runner and yields a mini-batch indices each time. |
Utils¶
Convert list of data sampled from dataset into a batch of data, of which type consistent with the type of each data_itement in |
|
Convert list of data sampled from dataset into a batch of data, of which type consistent with the type of each data_itement in |
|
This function will be called on each worker subprocess after seeding and before data loading. |
mmengine.device¶
Returns the currently existing device type. |
|
Returns the maximum GPU memory occupied by tensors in megabytes (MB) for a given device. |
|
Returns True if cuda devices exist. |
|
Returns True if Ascend PyTorch and npu devices exist. |
|
Returns True if Cambricon PyTorch and mlu devices exist. |
|
Return True if mps devices exist. |
mmengine.hub¶
Get config from external package. |
|
Get built model from external package. |
mmengine.logging¶
Formatted logger used to record messages. |
|
Message hub for component interaction. |
|
Unified storage format for different log types. |
Print a log message. |
mmengine.visualization¶
mmengine.visualization
Visualizer¶
MMEngine provides a Visualizer class that uses the Matplotlib library as the backend. |
visualization Backend¶
Base class for visualization backend. |
|
Local visualization backend class. |
|
Tensorboard visualization backend class. |
|
Wandb visualization backend class. |
mmengine.fileio¶
mmengine.fileio
File Backend¶
Abstract class of storage backends. |
|
A general file client to access files in different backends. |
|
Raw hard disks storage backend. |
|
Raw local storage backend. |
|
HTTP and HTTPS storage bachend. |
|
Lmdb storage backend. |
|
Memcached storage backend. |
|
Petrel storage backend (for internal usage). |
Register a backend. |
File IO¶
Dump data to json/yaml/pickle strings or files. |
|
Load data from json/yaml/pickle files. |
|
Create a symbolic link pointing to src named dst. |
|
Copy a file src to dst and return the destination file. |
|
Copy a local file src to dst and return the destination file. |
|
Copy the file src to local dst and return the destination file. |
|
Recursively copy an entire directory tree rooted at src to a directory named dst and return the destination directory. |
|
Recursively copy an entire directory tree rooted at src to a directory named dst and return the destination directory. |
|
Recursively copy an entire directory tree rooted at src to a local directory named dst and return the destination directory. |
|
Check whether a file path exists. |
|
Generate the presigned url of video stream which can be passed to mmcv.VideoReader. |
|
Read bytes from a given |
|
Return a file backend based on the prefix of uri or backend_args. |
|
Download data from |
|
Read text from a given |
|
Check whether a file path is a directory. |
|
Check whether a file path is a file. |
|
Concatenate all file paths. |
|
Scan a directory to find the interested directories or files in arbitrary order. |
|
Write bytes to a given |
|
Write text to a given |
|
Remove a file. |
|
Recursively delete a directory tree. |
Parse File¶
Load a text file and parse the content as a dict. |
|
Load a text file and parse the content as a list of strings. |
mmengine.dist¶
dist¶
Gather data from the whole group to |
|
Gathers picklable objects from the whole group in a single process. |
|
Gather data from the whole group in a list. |
|
Gather picklable objects from the whole group into a list. |
|
Reduces the tensor data across all machines in such a way that all get the final result. |
|
Reduces the dict across all machines in such a way that all get the final result. |
|
All-reduce parameters. |
|
Broadcast the data from |
|
Synchronize a random seed to all processes. |
|
Broadcasts picklable objects in |
|
Collected results in distributed environments. |
|
Collect results under cpu mode. |
|
Collect results under gpu mode. |
utils¶
Get distributed information of the given process group. |
|
Initialize distributed environment. |
|
Setup the local process group. |
|
Return the backend of the given process group. |
|
Return the number of the given process group. |
|
Return the rank of the given process group. |
|
Return the number of the current node. |
|
Return the rank of current process in the current node. |
|
Whether the current rank of the given process group is equal to 0. |
|
Decorate those methods which should be executed in master process. |
|
Synchronize all processes from the given process group. |
|
Return True if distributed environment has been initialized. |
|
Return local process group. |
|
Return default process group. |
|
Return the device of |
|
Return the device for communication among groups. |
|
Recursively convert Tensor in |
mmengine.utils¶
mmengine.utils
Manager¶
The metaclass for global accessible class. |
|
|
Path¶
Check if path is an absolute path in different backends. |
|
Scan a directory to find the interested files. |
|
Progress Bar¶
A progress bar which can print the progress. |
Track the progress of tasks iteration or enumeration with a progress bar. |
|
Track the progress of parallel task execution with a progress bar. |
|
Track the progress of tasks execution with a progress bar. |
Miscellaneous¶
A flexible Timer class. |
|
Check whether it is a list of some type. |
|
Check whether it is a tuple of some type. |
|
Check whether it is a sequence of some type. |
|
Whether the input is an string instance. |
|
Cast elements of an iterable object into some type. |
|
Cast elements of an iterable object into a list of some type. |
|
Cast elements of an iterable object into a tuple of some type. |
|
Concatenate a list of list into a single list. |
|
Slice a list into several sub lists by a list of given length. |
|
A decorator factory to check if prerequisites are satisfied. |
|
A decorator to check if some arguments are deprecate and try to replace deprecate src_arg_name to dst_arg_name. |
|
Marks functions as deprecated. |
|
Check whether the object has a method. |
|
Check if a method of base class is overridden in derived class. |
|
Import modules from the given list of strings. |
|
A decorator to check if some executable files are installed. |
|
A decorator to check if some python packages are installed. |
|
Add check points in a single line. |
mmengine.utils.dl_utils¶
A tool that counts the average running time of a function or a method. |
Collect the information of the running environments. |
|
Loads the Torch serialized object at the given URL. |
|
Detect whether model has a BatchNormalization layer. |
|
Check if a layer is a normalization layer. |
|
Check whether mmcv-full is installed. |
|
Convert tensor to 3-channel images or 1-channel gray images. |
|
A string with magic powers to compare to both Version and iterables! Prior to 1.10.0 torch.__version__ was stored as a str and so many did comparisons against torch.__version__ as if it were a str. |
|
Set multi-processing related environment. |
|
A wrapper of torch.meshgrid to compat different PyTorch versions. |
|
Changelog of v0.x¶
v0.4.0 (12/28/2022)¶
Highlights¶
Registry supports importing modules automatically
Upgrade the documentation and provide the English documentation
Provide
ProfileHook
to profile the running process
New Features & Enhancements¶
Add
conf_path
in PetrelBackend by @sunyc11 in https://github.com/open-mmlab/mmengine/pull/774Support multiple
--cfg-options
. by @mzr1996 in https://github.com/open-mmlab/mmengine/pull/759Support passing arguments to
OptimWrapper.update_params
by @twmht in https://github.com/open-mmlab/mmengine/pull/796Make
get_torchvision_model
compatible with torch 1.13 by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/793Support
flat_decay_mult
and fixbias_decay_mult
of depth-wise-conv inDefaultOptimWrapperConstructor
by @RangiLyu in https://github.com/open-mmlab/mmengine/pull/771Registry supports importing modules automatically. by @RangiLyu in https://github.com/open-mmlab/mmengine/pull/643
Add profiler hook functionality by @BayMaxBHL in https://github.com/open-mmlab/mmengine/pull/768
Make TTAModel compatible with FSDP. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/611
Bug Fixes¶
hub.get_model
fails on some MMCls models by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/784Fix
BaseModel.to
andBaseDataPreprocessor.to
to make them consistent withtorch.nn.Module
by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/783Fix creating a new logger at PretrainedInit by @xiexinch in https://github.com/open-mmlab/mmengine/pull/791
Fix
ZeroRedundancyOptimizer
ambiguous error with param groups when PyTorch < 1.12.0 by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/818Fix MessageHub set resumed key repeatedly by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/839
Add
progress
argument toload_from_http
by @austinmw in https://github.com/open-mmlab/mmengine/pull/770Ensure metrics is not empty when saving best checkpoint by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/849
Docs¶
Add
contributing.md
by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/754Add gif to 15 min tutorial by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/748
Refactor documentations and translate them to English by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/786
Fix document link by @MambaWong in https://github.com/open-mmlab/mmengine/pull/775
Fix typos in EN
contributing.md
by @RangeKing in https://github.com/open-mmlab/mmengine/pull/792Translate data transform docs. by @mzr1996 in https://github.com/open-mmlab/mmengine/pull/737
Replace markdown table with html table by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/800
Fix wrong example in
Visualizer.draw_polygons
by @lyviva in https://github.com/open-mmlab/mmengine/pull/798Fix docstring format and rescale the images by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/802
Fix failed link in registry by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/811
Fix typos by @shanmo in https://github.com/open-mmlab/mmengine/pull/814
Fix wrong links and typos in docs by @shanmo in https://github.com/open-mmlab/mmengine/pull/815
Translate
save_gpu_memory.md
by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/803Translate the documentation of hook design by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/780
Fix docstring format by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/816
Translate
registry.md
by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/817Update docstring of
BaseDataElement
by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/836Fix typo by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/841
Update docstring of
structures
by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/840Translate
optim_wrapper.md
by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/833Fix link error in initialize tutorial. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/843
Fix table in
initialized.md
by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/844
Contributors¶
A total of 16 developers contributed to this release. Thanks @BayMaxBHL, @RangeKing, @Xiangxu-0103, @xin-li-67, @twmht, @shanmo, @sunyc11, @lyviva, @austinmw, @xiexinch, @mzr1996, @RangiLyu, @MambaWong, @C1rN09, @zhouzaida, @HAOCHENYE
v0.3.2 (11/24/2022)¶
New Features & Enhancements¶
Send git errors to subprocess.PIPE by @austinmw in https://github.com/open-mmlab/mmengine/pull/717
Add a common
TestRunnerTestCase
to build a Runner instance. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/631Align the log by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/436
Log the called order of hooks during training process by @songyuc in https://github.com/open-mmlab/mmengine/pull/672
Support setting
eta_min_ratio
inCosineAnnealingParamScheduler
by @cir7 in https://github.com/open-mmlab/mmengine/pull/725Enhance compatibility of
revert_sync_batchnorm
by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/695
Bug Fixes¶
Fix
distributed_training.py
in examples by @PingHGao in https://github.com/open-mmlab/mmengine/pull/700Format the log of
CheckpointLoader.load_checkpoint
by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/685Fix bug of CosineAnnealingParamScheduler by @fangyixiao18 in https://github.com/open-mmlab/mmengine/pull/735
Fix
add_graph
is not called bug by @shenmishajing in https://github.com/open-mmlab/mmengine/pull/632Fix .pre-commit-config-zh-cn.yaml pyupgrade-repo github->gitee by @BayMaxBHL in https://github.com/open-mmlab/mmengine/pull/756
Docs¶
Add English docs of BaseDataset by @GT9505 in https://github.com/open-mmlab/mmengine/pull/713
Fix
BaseDataset
typo about lazy initialization by @MengzhangLI in https://github.com/open-mmlab/mmengine/pull/733Fix typo by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/734
Translate visualization docs by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/692
v0.3.1 (11/09/2022)¶
Highlights¶
Fix error when saving best checkpoint in ddp-training
New Features & Enhancements¶
Replace
print
withprint_log
for those functions called by runner by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/686
Bug Fixes¶
Fix error when saving best checkpoint in ddp-training by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/682
Docs¶
Refine Chinese tutorials by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/694
Add MMEval in README by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/669
Fix error URL in runner docstring by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/668
Fix error evaluator type name in
evaluator.md
by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/675Fix typo in
utils.md
@sanbuphy in https://github.com/open-mmlab/mmengine/pull/702
v0.3.0 (11/02/2022)¶
New Features & Enhancements¶
Support running on Ascend chip by @wangjiangben-hw in https://github.com/open-mmlab/mmengine/pull/572
Support torch
ZeroRedundancyOptimizer
by @nijkah in https://github.com/open-mmlab/mmengine/pull/551Add non-blocking feature to
BaseDataPreprocessor
by @shenmishajing in https://github.com/open-mmlab/mmengine/pull/618Add documents for
clip_grad
, and support clip grad by value. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/513Add ROCm info when collecting env by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/633
Add a function to mark the deprecated function. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/609
Call
register_all_modules
inRegistry.get()
by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/541Deprecate
_save_to_state_dict
implemented in mmengine by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/610Add
ignore_keys
in ConcatDataset by @BIGWangYuDong in https://github.com/open-mmlab/mmengine/pull/556
Docs¶
Fix cannot show
changelog.md
in chinese documents. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/606Fix Chinese docs whitespaces by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/521
Translate installation and 15_min by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/629
Refine chinese doc by @Tau-J in https://github.com/open-mmlab/mmengine/pull/516
Add MMYOLO link in README by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/634
Add MMEngine logo in docs by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/641
Fix docstring of
BaseDataset
by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/656Fix docstring and documentation used for
hub.get_model
by @zengyh1900 in https://github.com/open-mmlab/mmengine/pull/659Fix typo in
docs/zh_cn/advanced_tutorials/visualization.md
by @MambaWong in https://github.com/open-mmlab/mmengine/pull/616Fix typo docstring of
DefaultOptimWrapperConstructor
by @triple-Mu in https://github.com/open-mmlab/mmengine/pull/644Fix typo in advanced tutorial by @cxiang26 in https://github.com/open-mmlab/mmengine/pull/650
Fix typo in
Config
docstring by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/654Fix typo in
docs/zh_cn/tutorials/config.md
by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/596Fix typo in
docs/zh_cn/tutorials/model.md
by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/598
Bug Fixes¶
Fix error calculation of
eta_min
inCosineRestartParamScheduler
by @Z-Fran in https://github.com/open-mmlab/mmengine/pull/639Fix
BaseDataPreprocessor.cast_data
could not handle string data by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/602Make
autocast
compatible with mps by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/587Fix error format of log message by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/508
Fix error implementation of
is_model_wrapper
by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/640Fix
VisBackend.add_config
is not called by @shenmishajing in https://github.com/open-mmlab/mmengine/pull/613Change
strict_load
of EMAHook to False by default by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/642Fix
open
encoding problem of Config in Windows by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/648Fix the total number of iterations in log is a float number. by @jbwang1997 in https://github.com/open-mmlab/mmengine/pull/604
Fix
pip upgrade
CI by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/622
New Contributors¶
@shenmishajing made their first contribution in https://github.com/open-mmlab/mmengine/pull/618
@Xiangxu-0103 made their first contribution in https://github.com/open-mmlab/mmengine/pull/596
@Tau-J made their first contribution in https://github.com/open-mmlab/mmengine/pull/516
@wangjiangben-hw made their first contribution in https://github.com/open-mmlab/mmengine/pull/572
@triple-Mu made their first contribution in https://github.com/open-mmlab/mmengine/pull/644
@sanbuphy made their first contribution in https://github.com/open-mmlab/mmengine/pull/648
@Z-Fran made their first contribution in https://github.com/open-mmlab/mmengine/pull/639
@BIGWangYuDong made their first contribution in https://github.com/open-mmlab/mmengine/pull/556
@zengyh1900 made their first contribution in https://github.com/open-mmlab/mmengine/pull/659
v0.2.0 (11/10/2022)¶
New Features & Enhancements¶
Add SMDDP backend and support running on AWS by @austinmw in https://github.com/open-mmlab/mmengine/pull/579
Refactor
FileIO
but without breaking bc by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/533Add test time augmentation base model by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/538
Use
torch.lerp\_()
to speed up EMA by @RangiLyu in https://github.com/open-mmlab/mmengine/pull/519Support converting
BN
toSyncBN
by config by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/506Support defining metric name in wandb backend by @okotaku in https://github.com/open-mmlab/mmengine/pull/509
Add dockerfile by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/347
Docs¶
Fix API files of English documentation by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/525
Fix typo in
instance_data.py
by @Dai-Wenxun in https://github.com/open-mmlab/mmengine/pull/530Fix the docstring of the model sub-package by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/573
Fix a spelling error in docs/zh_cn by @cxiang26 in https://github.com/open-mmlab/mmengine/pull/548
Fix typo in docstring by @MengzhangLI in https://github.com/open-mmlab/mmengine/pull/527
Update config.md by @Zhengfei-0311 in https://github.com/open-mmlab/mmengine/pull/562
Bug Fixes¶
Fix LogProcessor does not smooth loss if the name of loss doesn't start with loss by @liuyanyi in https://github.com/open-mmlab/mmengine/pull/539
Fix failed to enable detect_anomalous_params in MMSeparateDistributedDataParallel by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/588
Fix unexpected CheckpointHook behavior when the filename_tmpl argument is given by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/518
Fix error argument sequence in FSDP by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/520
Fix uploading image in wandb backend by @okotaku in https://github.com/open-mmlab/mmengine/pull/510
Fix loading state dictionary in EMAHook by @okotaku in https://github.com/open-mmlab/mmengine/pull/507
Fix circular import in EMAHook by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/523
Fix unit test failure caused by MultiProcessTestCase by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/535
Remove unnecessary "if statement" in Registry by @MambaWong in https://github.com/open-mmlab/mmengine/pull/536
Fix _save_to_state_dict by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/542
Support comparing NumPy array dataset meta in Runner.resume by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/511
Use get instead of pop to dump runner_type in build_runner_from_cfg by @nijkah in https://github.com/open-mmlab/mmengine/pull/549
Upgrade pre-commit hooks by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/576
Delete the error comment in registry.md by @vansin in https://github.com/open-mmlab/mmengine/pull/514
Fix some out-of-date unit tests by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/586
Fix typo in MMFullyShardedDataParallel by @yhna940 in https://github.com/open-mmlab/mmengine/pull/569
Update GitHub Action CI and CircleCI by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/512
Fix unit test in Windows by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/515
Fix merge CI & multiprocessing unit test by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/529
New Contributors¶
@okotaku made their first contribution in https://github.com/open-mmlab/mmengine/pull/510
@MengzhangLI made their first contribution in https://github.com/open-mmlab/mmengine/pull/527
@MambaWong made their first contribution in https://github.com/open-mmlab/mmengine/pull/536
@cxiang26 made their first contribution in https://github.com/open-mmlab/mmengine/pull/548
@nijkah made their first contribution in https://github.com/open-mmlab/mmengine/pull/549
@Zhengfei-0311 made their first contribution in https://github.com/open-mmlab/mmengine/pull/562
@austinmw made their first contribution in https://github.com/open-mmlab/mmengine/pull/579
@yhna940 made their first contribution in https://github.com/open-mmlab/mmengine/pull/569
@liuyanyi made their first contribution in https://github.com/open-mmlab/mmengine/pull/539
Contributing to OpenMMLab¶
Welcome to the MMEngine community! We are committed to building a cutting-edge computer vision foundational library, and all kinds of contributions are welcome, including but not limited to
Fix bug
You can directly post a Pull Request to fix typos in code or documents
The steps to fix a bug in the code implementation are as follows.
If the modification involves significant changes, you should create an issue first and describe the error information and how to trigger the bug. Other developers will discuss it with you and propose a proper solution.
Post a pull request after fixing the bug and adding the corresponding unit test.
New Feature or Enhancement
If the modification involves significant changes, you should create an issue to discuss with our developers to propose a proper design.
Post a Pull Request after implementing the new feature or enhancement and add the corresponding unit test.
Document
You can directly post a pull request to fix documents. If you want to add a document, you should first create an issue to check if it is reasonable.
Pull Request Workflow¶
If you’re not familiar with Pull Requests, don’t worry! The following guidance will tell you how to create a Pull Request step by step. If you want to dive into the development mode of Pull Requests, you can refer to the official documents.
1. Fork and clone¶
If you are posting a pull request for the first time, you should fork the OpenMMLab repositories by clicking the Fork button in the top right corner of the GitHub page, and the forked repositories will appear under your GitHub profile.

Then, you can clone the repositories to local:
git clone git@github.com:{username}/mmengine.git
After that, you should add the official repository as the upstream repository.
git remote add upstream git@github.com:open-mmlab/mmengine
Check whether the remote repository has been added successfully by running git remote -v.
origin git@github.com:{username}/mmengine.git (fetch)
origin git@github.com:{username}/mmengine.git (push)
upstream git@github.com:open-mmlab/mmengine (fetch)
upstream git@github.com:open-mmlab/mmengine (push)
Note
Here’s a brief introduction to origin and upstream. When we use “git clone”, we create an “origin” remote by default, which points to the repository we cloned from. As for “upstream”, we add it ourselves to point to the official repository. Of course, if you don’t like the name “upstream”, you could name it as you wish. Usually, we push code to “origin”. If the pushed code conflicts with the latest code in the official repository (“upstream”), we should pull the latest code from upstream to resolve the conflicts and then push to “origin” again. The posted Pull Request will be updated automatically.
2. Configure pre-commit¶
You should configure pre-commit in the local development environment to make sure the code style matches that of OpenMMLab. Note: The following code should be executed under the mmengine directory.
pip install -U pre-commit
pre-commit install
Check that pre-commit is configured successfully, and install the hooks defined in .pre-commit-config.yaml.
pre-commit run --all-files


If the installation process is interrupted, you can run pre-commit run ... repeatedly to continue the installation.
If the code does not conform to the code style specification, pre-commit will raise a warning and fix some of the errors automatically.

If we want to commit our code bypassing the pre-commit hook, we can use the --no-verify option (only for temporary commits).
git commit -m "xxx" --no-verify
3. Create a development branch¶
After configuring pre-commit, we should create a branch based on the master branch to develop the new feature or fix the bug. The proposed branch name is username/pr_name:
git checkout -b yhc/refactor_contributing_doc
In subsequent development, if the master branch of the local repository is behind the master branch of “upstream”, we need to pull the upstream for synchronization, and then execute the above command:
git pull upstream master
4. Commit the code and pass the unit test¶
MMEngine introduces mypy to do static type checking to increase the robustness of the code. Therefore, we need to add Type Hints to our code and pass the mypy check. If you are not familiar with Type Hints, you can refer to this tutorial.
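For example, a fully annotated function that would pass the mypy check might look like the following minimal sketch (the function scale_tensor is purely illustrative and not part of the MMEngine codebase):
from typing import Optional

import torch


def scale_tensor(inputs: torch.Tensor,
                 factor: float,
                 out: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Scale a tensor by a constant factor.

    Both the argument types and the return type are annotated, so mypy
    can verify every call site statically.
    """
    if out is None:
        return inputs * factor
    # Write the result into the provided output tensor in place.
    torch.mul(inputs, factor, out=out)
    return out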
The committed code should pass the unit tests:
# Pass all unit tests
pytest tests

# Pass the unit test of runner
pytest tests/test_runner/test_runner.py
If the unit test fails due to missing dependencies, you can install the dependencies referring to the guidance.
If the documents are modified/added, we should check the rendering result referring to the guidance.
5. Push the code to remote¶
We can push the local commits to the remote repository after passing the unit tests and pre-commit checks. You can associate the local branch with the remote branch by adding the -u option.
git push -u origin {branch_name}
This will allow you to use the git push command to push code directly next time, without having to specify a branch or the remote repository.
6. Create a Pull Request¶
(1) Create a pull request in GitHub’s Pull request interface

(2) Modify the PR description according to the guidelines so that other developers can better understand your changes

Find more details about Pull Request description in pull request guidelines.
Note
(a) The Pull Request description should contain the reason for the change, the content of the change, and the impact of the change, and be associated with the relevant Issue (see documentation).
(b) If it is your first contribution, please sign the CLA

(c) Check whether the Pull Request passes the CI

MMEngine will run unit tests for the posted Pull Request on different platforms (Linux, Windows, Mac) and with different versions of Python, PyTorch, and CUDA to make sure the code is correct. We can see the specific test information by clicking Details in the above image so that we can modify the code.
(3) If the Pull Request passes the CI, you can wait for reviews from other developers. You’ll modify the code based on the reviewers’ comments and repeat steps 4-5 until all reviewers approve it. Then, we will merge it ASAP.

7. Resolve conflicts¶
If your local branch conflicts with the latest master branch of “upstream”, you’ll need to resolve the conflicts. There are two ways to do this:
git fetch --all --prune
git rebase upstream/master
or
git fetch --all --prune
git merge upstream/master
If you are very good at handling conflicts, you can use rebase to resolve conflicts, as this will keep your commit logs tidy. If you are not familiar with rebase, you can use merge to resolve conflicts.
Guidance¶
Unit test¶
We should also make sure that the committed code does not decrease the unit test coverage. We can run the following commands to check the coverage:
python -m coverage run -m pytest /path/to/test_file
python -m coverage html
# check file in htmlcov/index.html
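As an illustration, a minimal pytest-style test for the hypothetical scale_tensor function sketched earlier might look as follows (the test path and the module it imports from are assumptions for this sketch, not part of MMEngine):
# tests/test_utils/test_scale.py (hypothetical location)
import torch

from my_package.utils import scale_tensor  # assumed module from the earlier sketch


def test_scale_tensor():
    x = torch.ones(2, 3)
    # Scaling by 2 should double every element.
    assert torch.allclose(scale_tensor(x, 2.0), x * 2)


def test_scale_tensor_out():
    x = torch.ones(2, 3)
    out = torch.empty_like(x)
    result = scale_tensor(x, 0.5, out=out)
    # When `out` is given, the result should be written in place.
    assert result is out
    assert torch.allclose(out, x * 0.5)
Running python -m coverage run -m pytest tests/test_utils/test_scale.py followed by python -m coverage html would then show which lines of scale_tensor the tests exercise.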
Document rendering¶
If the documents are modified/added, we should check the rendering result. We can install the dependencies and run the following commands to render the documents and check the results:
pip install -r requirements/docs.txt
cd docs/zh_cn/
# or docs/en
make html
# check file in ./docs/zh_cn/_build/html/index.html
Python Code style¶
We adopt PEP8 as the preferred code style.
We use the following tools for linting and formatting:
flake8: A wrapper around some linter tools.
isort: A Python utility to sort imports.
yapf: A formatter for Python files.
codespell: A Python utility to fix common misspellings in text files.
mdformat: Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.
docformatter: A formatter to format docstring.
Style configurations of yapf and isort can be found in setup.cfg.
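For instance, isort groups imports into standard-library, third-party, and first-party blocks, while yapf handles indentation and line wrapping. A module header formatted this way might look like the following sketch (the particular imports are only an example):
# Standard-library imports come first.
import os.path as osp

# Third-party imports form the second block.
import torch

# First-party (project) imports come last.
from mmengine.config import Config
from mmengine.registry import MODELS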
We use a pre-commit hook that checks and formats for flake8, yapf, isort, trailing whitespaces and markdown files, fixes end-of-files, double-quoted-strings, python-encoding-pragma and mixed-line-ending, and sorts requirements.txt automatically on every commit.
The config for the pre-commit hook is stored in .pre-commit-config.yaml.
PR Specs¶
Use pre-commit hook to avoid issues of code style
One short-time branch should be matched with only one PR
Accomplish a detailed change in one PR. Avoid large PRs.
Bad: Support Faster R-CNN
Acceptable: Add a box head to Faster R-CNN
Good: Add a parameter to box head to support custom conv-layer number
Provide clear and significant commit message
Provide clear and meaningful PR description
The task name should be clarified in the title. The general format is: [Prefix] Short description of the PR (Suffix)
Prefix: add new feature [Feature], fix bug [Fix], related to documents [Docs], work in progress [WIP] (which will not be reviewed temporarily)
Introduce main changes, results and influences on other modules in short description
Associate related issues and pull requests with a milestone