
Welcome to MMEngine’s documentation!

You can switch between the Chinese and English documentation in the lower-left corner of the layout.

Introduction

Coming soon. Please refer to the Chinese documentation for now.

Installation

Prerequisites

  • Python 3.7+

  • PyTorch 1.6+

  • CUDA 9.2+

  • GCC 5.4+

Prepare the Environment

  1. Use conda and activate the environment:

    conda create -n open-mmlab python=3.7 -y
    conda activate open-mmlab
    
  2. Install PyTorch

    Before installing MMEngine, please make sure that PyTorch has been successfully installed in the environment. You can refer to PyTorch official installation documentation. Verify the installation with the following command:

    python -c 'import torch;print(torch.__version__)'
    

Install MMEngine

Install with mim

mim is a package management tool for OpenMMLab projects, which makes it easy to install OpenMMLab projects.

pip install -U openmim
mim install mmengine

Install with pip

pip install mmengine

Use docker images

  1. Build the image

    docker build -t mmengine https://github.com/open-mmlab/mmengine.git#main:docker/release
    

    More information can be found in mmengine/docker.

  2. Run the image

    docker run --gpus all --shm-size=8g -it mmengine
    
Build from source

# if cloning speed is too slow, you can switch the source to https://gitee.com/open-mmlab/mmengine.git
git clone https://github.com/open-mmlab/mmengine.git
cd mmengine
pip install -e . -v

Verify the Installation

To verify whether MMEngine and the necessary environment are installed successfully, we can run this command:

python -c 'import mmengine;print(mmengine.__version__)'

15 minutes to get started with MMEngine

In this tutorial, we’ll take training a ResNet-50 model on the CIFAR-10 dataset as an example. We will build a complete, configurable pipeline for both training and validation in only 80 lines of code with MMEngine. The whole process includes the following steps:

  1. Build a Model

  2. Build a Dataset and DataLoader

  3. Build an Evaluation Metric

  4. Build a Runner and Run the Task

Build a Model

First, we need to build a model. In MMEngine, the model should inherit from BaseModel. Aside from parameters representing inputs from the dataset, its forward method needs to accept an extra argument called mode:

  • for training, the value of mode is “loss”, and the forward method should return a dict containing the key “loss”.

  • for validation, the value of mode is “predict”, and the forward method should return results containing both predictions and labels.

import torch.nn.functional as F
import torchvision
from mmengine.model import BaseModel


class MMResNet50(BaseModel):
    def __init__(self):
        super().__init__()
        self.resnet = torchvision.models.resnet50()

    def forward(self, imgs, labels, mode):
        x = self.resnet(imgs)
        if mode == 'loss':
            return {'loss': F.cross_entropy(x, labels)}
        elif mode == 'predict':
            return x, labels
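
As a quick sanity check outside the Runner (a sketch; the input shapes are chosen to match CIFAR-10, and torchvision's ResNet-50 keeps its default 1000-way classifier here), both modes can be exercised directly:

import torch

model = MMResNet50()
imgs = torch.randn(2, 3, 32, 32)
labels = torch.randint(0, 10, (2,))
print(model(imgs, labels, mode='loss'))    # {'loss': tensor(...)}
scores, gts = model(imgs, labels, mode='predict')
print(scores.shape, gts.shape)             # torch.Size([2, 1000]) torch.Size([2])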

Build a Dataset and DataLoader

Next, we need to create Dataset and DataLoader for training and validation. For basic training and validation, we can simply use built-in datasets supported in TorchVision.

import torchvision.transforms as transforms
from torch.utils.data import DataLoader

norm_cfg = dict(mean=[0.491, 0.482, 0.447], std=[0.202, 0.199, 0.201])
train_dataloader = DataLoader(batch_size=32,
                              shuffle=True,
                              dataset=torchvision.datasets.CIFAR10(
                                  'data/cifar10',
                                  train=True,
                                  download=True,
                                  transform=transforms.Compose([
                                      transforms.RandomCrop(32, padding=4),
                                      transforms.RandomHorizontalFlip(),
                                      transforms.ToTensor(),
                                      transforms.Normalize(**norm_cfg)
                                  ])))

val_dataloader = DataLoader(batch_size=32,
                            shuffle=False,
                            dataset=torchvision.datasets.CIFAR10(
                                'data/cifar10',
                                train=False,
                                download=True,
                                transform=transforms.Compose([
                                    transforms.ToTensor(),
                                    transforms.Normalize(**norm_cfg)
                                ])))

Build an Evaluation Metric

To validate and test the model, we need to define a Metric called accuracy to evaluate the model. This metric needs to inherit from BaseMetric and implement the process and compute_metrics methods. The process method accepts a batch of data from the dataset together with the model outputs produced when mode="predict", extracts the information we care about from that batch, and saves it to the self.results property. The compute_metrics method accepts a results parameter, which contains all the information saved by process (in a distributed environment, results is the information collected from all the processes). It uses this information to calculate and return a dict that holds the results of the evaluation metrics.

from mmengine.evaluator import BaseMetric

class Accuracy(BaseMetric):
    def process(self, data_batch, data_samples):
        score, gt = data_samples
        # save the middle result of a batch to `self.results`
        self.results.append({
            'batch_size': len(gt),
            'correct': (score.argmax(dim=1) == gt).sum().cpu(),
        })

    def compute_metrics(self, results):
        total_correct = sum(item['correct'] for item in results)
        total_size = sum(item['batch_size'] for item in results)
        # return the dict containing the eval results
        # the key is the name of the metric
        return dict(accuracy=100 * total_correct / total_size)
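
The metric can likewise be sanity-checked on its own with made-up tensors (a minimal sketch, independent of the Runner):

import torch

metric = Accuracy()
scores = torch.tensor([[0.1, 0.9], [0.8, 0.2]])  # fake predictions for 2 samples
labels = torch.tensor([1, 1])
metric.process(data_batch=None, data_samples=(scores, labels))
# argmax picks classes [1, 0], so 1 of the 2 predictions is correct
print(metric.compute_metrics(metric.results))    # {'accuracy': tensor(50.)}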

Build a Runner and Run the Task

Now we can build a Runner with the previously defined Model, DataLoader, and Metric, along with some other configs, as shown below:

from torch.optim import SGD
from mmengine.runner import Runner

runner = Runner(
    # the model used for training and validation.
    # Needs to meet specific interface requirements
    model=MMResNet50(),
    # working directory which saves training logs and weight files
    work_dir='./work_dir',
    # train dataloader needs to meet the PyTorch data loader protocol
    train_dataloader=train_dataloader,
    # optimizer wrapper for optimization with additional features like
    # AMP, gradient accumulation, etc.
    optim_wrapper=dict(optimizer=dict(type=SGD, lr=0.001, momentum=0.9)),
    # training configs for specifying training epochs, validation intervals, etc.
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    # validation dataloader also needs to meet the PyTorch data loader protocol
    val_dataloader=val_dataloader,
    # validation configs for specifying additional parameters required for validation
    val_cfg=dict(),
    # validation evaluator. The default one is used here
    val_evaluator=dict(type=Accuracy),
)

runner.train()

Finally, let’s put all the code above together into a complete script that uses the MMEngine Runner for training and validation:


import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms
from torch.optim import SGD
from torch.utils.data import DataLoader

from mmengine.evaluator import BaseMetric
from mmengine.model import BaseModel
from mmengine.runner import Runner


class MMResNet50(BaseModel):
    def __init__(self):
        super().__init__()
        self.resnet = torchvision.models.resnet50()

    def forward(self, imgs, labels, mode):
        x = self.resnet(imgs)
        if mode == 'loss':
            return {'loss': F.cross_entropy(x, labels)}
        elif mode == 'predict':
            return x, labels


class Accuracy(BaseMetric):
    def process(self, data_batch, data_samples):
        score, gt = data_samples
        self.results.append({
            'batch_size': len(gt),
            'correct': (score.argmax(dim=1) == gt).sum().cpu(),
        })

    def compute_metrics(self, results):
        total_correct = sum(item['correct'] for item in results)
        total_size = sum(item['batch_size'] for item in results)
        return dict(accuracy=100 * total_correct / total_size)


norm_cfg = dict(mean=[0.491, 0.482, 0.447], std=[0.202, 0.199, 0.201])
train_dataloader = DataLoader(batch_size=32,
                              shuffle=True,
                              dataset=torchvision.datasets.CIFAR10(
                                  'data/cifar10',
                                  train=True,
                                  download=True,
                                  transform=transforms.Compose([
                                      transforms.RandomCrop(32, padding=4),
                                      transforms.RandomHorizontalFlip(),
                                      transforms.ToTensor(),
                                      transforms.Normalize(**norm_cfg)
                                  ])))

val_dataloader = DataLoader(batch_size=32,
                            shuffle=False,
                            dataset=torchvision.datasets.CIFAR10(
                                'data/cifar10',
                                train=False,
                                download=True,
                                transform=transforms.Compose([
                                    transforms.ToTensor(),
                                    transforms.Normalize(**norm_cfg)
                                ])))

runner = Runner(
    model=MMResNet50(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader,
    optim_wrapper=dict(optimizer=dict(type=SGD, lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    val_dataloader=val_dataloader,
    val_cfg=dict(),
    val_evaluator=dict(type=Accuracy),
)
runner.train()

Training log would be similar to this:

2022/08/22 15:51:53 - mmengine - INFO -
------------------------------------------------------------
System environment:
    sys.platform: linux
    Python: 3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]
    CUDA available: True
    numpy_random_seed: 1513128759
    GPU 0: NVIDIA GeForce GTX 1660 SUPER
    CUDA_HOME: /usr/local/cuda
...

2022/08/22 15:51:54 - mmengine - INFO - Checkpoints will be saved to /home/mazerun/work_dir by HardDiskBackend.
2022/08/22 15:51:56 - mmengine - INFO - Epoch(train) [1][10/1563]  lr: 1.0000e-03  eta: 0:18:23  time: 0.1414  data_time: 0.0077  memory: 392  loss: 5.3465
2022/08/22 15:51:56 - mmengine - INFO - Epoch(train) [1][20/1563]  lr: 1.0000e-03  eta: 0:11:29  time: 0.0354  data_time: 0.0077  memory: 392  loss: 2.7734
2022/08/22 15:51:56 - mmengine - INFO - Epoch(train) [1][30/1563]  lr: 1.0000e-03  eta: 0:09:10  time: 0.0352  data_time: 0.0076  memory: 392  loss: 2.7789
2022/08/22 15:51:57 - mmengine - INFO - Epoch(train) [1][40/1563]  lr: 1.0000e-03  eta: 0:08:00  time: 0.0353  data_time: 0.0073  memory: 392  loss: 2.5725
2022/08/22 15:51:57 - mmengine - INFO - Epoch(train) [1][50/1563]  lr: 1.0000e-03  eta: 0:07:17  time: 0.0347  data_time: 0.0073  memory: 392  loss: 2.7382
2022/08/22 15:51:57 - mmengine - INFO - Epoch(train) [1][60/1563]  lr: 1.0000e-03  eta: 0:06:49  time: 0.0347  data_time: 0.0072  memory: 392  loss: 2.5956
2022/08/22 15:51:58 - mmengine - INFO - Epoch(train) [1][70/1563]  lr: 1.0000e-03  eta: 0:06:28  time: 0.0348  data_time: 0.0072  memory: 392  loss: 2.7351
...
2022/08/22 15:52:50 - mmengine - INFO - Saving checkpoint at 1 epochs
2022/08/22 15:52:51 - mmengine - INFO - Epoch(val) [1][10/313]    eta: 0:00:03  time: 0.0122  data_time: 0.0047  memory: 392
2022/08/22 15:52:51 - mmengine - INFO - Epoch(val) [1][20/313]    eta: 0:00:03  time: 0.0122  data_time: 0.0047  memory: 308
2022/08/22 15:52:51 - mmengine - INFO - Epoch(val) [1][30/313]    eta: 0:00:03  time: 0.0123  data_time: 0.0047  memory: 308
...
2022/08/22 15:52:54 - mmengine - INFO - Epoch(val) [1][313/313]  accuracy: 35.7000

The corresponding implementations in PyTorch and MMEngine are compared below:

(Figure: side-by-side comparison of the PyTorch and MMEngine implementations)

In addition to these basic components, you can also use the Runner to easily combine and configure various training techniques, such as enabling mixed-precision training and gradient accumulation (see OptimWrapper), configuring the learning rate decay curve (see Parameter Scheduler), etc.
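
For example, here is a sketch that combines these techniques in the Runner from this tutorial (the specific values are illustrative, not recommendations):

runner = Runner(
    model=MMResNet50(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader,
    # mixed-precision training via AmpOptimWrapper, with gradients
    # accumulated over 2 steps before each parameter update
    optim_wrapper=dict(
        type='AmpOptimWrapper',
        accumulative_counts=2,
        optimizer=dict(type=SGD, lr=0.001, momentum=0.9)),
    # decay the learning rate by 10x at epoch 3
    param_scheduler=dict(
        type='MultiStepLR', by_epoch=True, milestones=[3], gamma=0.1),
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    val_dataloader=val_dataloader,
    val_cfg=dict(),
    val_evaluator=dict(type=Accuracy),
)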

Resume Training

Resuming training means continuing training from the state saved during some previous run, where the state includes the model’s weights, the state of the optimizer, and the state of the parameter scheduler.

Automatically resume training

Users can set the resume parameter of Runner to enable automatic resumption. When resume is set to True, the Runner will try to resume from the latest checkpoint in work_dir automatically. If there is a latest checkpoint in work_dir (e.g. the training was interrupted during the last run), training resumes from that checkpoint; otherwise (e.g. the last run did not get far enough to save a checkpoint, or a new training task is started), training starts from scratch. Here is an example of how to enable automatic resumption of training.

runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
    resume=True,
)
runner.train()

Specify the checkpoint path

If you want to specify the path to resume training, you need to set load_from in addition to resume=True. Note that if only load_from is set without resume=True, then only the weights in the checkpoint will be loaded and training will be restarted, instead of continuing with the previous state.

runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
    load_from='./work_dir/epoch_2.pth',
    resume=True,
)
runner.train()

Speed up Training

Distributed Training

MMEngine supports training models on a CPU, a single GPU, multiple GPUs on a single machine, and multiple machines. When multiple GPUs are available in the environment, we can use the following commands to enable multiple GPUs on a single machine or across multiple machines, shortening the training time of the model.

  • multiple GPUs on a single machine

    Assuming the current machine has 8 GPUs, you can enable multi-GPU training with the following command:

    python -m torch.distributed.launch --nproc_per_node=8 examples/train.py --launcher pytorch
    

    If you need to specify the GPU indices, you can set the CUDA_VISIBLE_DEVICES environment variable, e.g. to use the 0th and 3rd GPUs:

    CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/train.py --launcher pytorch
    
  • multiple machines

    Assuming there are 2 machines connected via Ethernet, you can simply run the following commands.

    On the first machine:

    python -m torch.distributed.launch \
        --nnodes 2 \
        --node_rank 0 \
        --master_addr 127.0.0.1 \
        --master_port 29500 \
        --nproc_per_node=8 \
        examples/train.py --launcher pytorch
    

    On the second machine:

    python -m torch.distributed.launch \
        --nnodes 2 \
        --node_rank 1 \
        --master_addr 127.0.0.1 \
        --master_port 29500 \
        --nproc_per_node=8 \
        examples/train.py --launcher pytorch

    If you are running MMEngine on a Slurm cluster, simply run the following command to enable training with 2 machines and 16 GPUs.

    srun -p mm_dev \
        --job-name=test \
        --gres=gpu:8 \
        --ntasks=16 \
        --ntasks-per-node=8 \
        --cpus-per-task=5 \
        --kill-on-bad-exit=1 \
        python examples/train.py --launcher="slurm"
    

Mixed Precision Training

Nvidia introduced the Tensor Core unit into the Volta and Turing architectures to support FP32 and FP16 mixed precision computing. With automatic mixed precision training enabled, some operators operate at FP16 and the rest operate at FP32, which reduces training time and storage requirements without changing the model or degrading its training precision, thus supporting training with larger batch sizes, larger models, and larger input sizes.

PyTorch officially supports AMP from version 1.6. If you are interested in the implementation of automatic mixed precision, you can refer to Mixed Precision Training.

MMEngine provides the wrapper AmpOptimWrapper for automatic mixed precision training. Just set type='AmpOptimWrapper' in optim_wrapper to enable it; no other code changes are needed.

runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(type='AmpOptimWrapper', optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
)
runner.train()

Save Memory on GPU

Memory capacity is critical in deep learning training and inference and determines whether the model can run successfully. Common memory saving approaches include:

  • Gradient Accumulation

    Gradient accumulation is a mechanism that accumulates the gradients for a configured number of steps instead of updating the parameters at every step; after those steps, the network parameters are updated once and the gradients are cleared. With this technique of delayed parameter updates, the result is similar to using a large batch size, while the memory for activations is saved. Note, however, that if the model contains a batch normalization layer, using gradient accumulation may impact its performance.

  • Gradient Checkpointing

    Gradient checkpointing is a time-for-space trade-off that reduces memory by saving fewer activations; the unsaved activations must then be recomputed when calculating the gradients. The corresponding functionality has been implemented in the torch.utils.checkpoint package. Briefly, in the forward phase, the function passed to checkpoint runs in torch.no_grad mode and only its input and output are saved; the intermediate activations are recomputed in the backward phase (see the sketch after this list).

  • Large Model Training Techniques

    Recent research has shown that training a large model can help improve performance, but training a model at such a scale requires huge resources, and it is hard to store the entire model in the memory of a single graphics card. Therefore large model training techniques have been introduced, typically DeepSpeed ZeRO and the Fully Sharded Data Parallel (FSDP) technique introduced in FairScale. These techniques allow slicing the parameters, gradients, and optimizer states among the parallel processes, while still maintaining the simplicity of data parallelism.
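
Here is a minimal sketch of gradient checkpointing with the torch.utils.checkpoint package mentioned above (the toy module and shapes are hypothetical):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
x = torch.randn(8, 64, requires_grad=True)
# the forward of `block` runs under no_grad; only its input and output are
# saved, and the intermediate activations are recomputed during backward
y = checkpoint(block, x)
y.sum().backward()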

MMEngine now supports gradient accumulation and the FSDP large model training technique; their usage is described as follows.

Gradient Accumulation

The configuration can be written in this way:

optim_wrapper_cfg = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.001, momentum=0.9),
    # update every four times
    accumulative_counts=4)

The full example working with Runner is as follows.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from mmengine.runner import Runner
from mmengine.model import BaseModel

train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)


class ToyModel(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        loss1 = (feat - label).pow(2)
        loss2 = (feat - label).abs()
        return dict(loss1=loss1, loss2=loss2)


runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01),
                       accumulative_counts=4)
)
runner.train()

Large Model Training

FSDP is officially supported from PyTorch 1.11. The config can be written in this way:

# located in cfg file
model_wrapper_cfg = dict(type='MMFullyShardedDataParallel', cpu_offload=True)

The full example working with Runner is as follows.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from mmengine.runner import Runner
from mmengine.model import BaseModel

train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)


class ToyModel(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        loss1 = (feat - label).pow(2)
        loss2 = (feat - label).abs()
        return dict(loss1=loss1, loss2=loss2)


runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    cfg=dict(model_wrapper_cfg=dict(type='MMFullyShardedDataParallel', cpu_offload=True))
)
runner.train()

Please note that FSDP works only in distributed training environments.

Train a GAN

Generative Adversarial Networks (GANs) can be used to generate data such as images and videos. This tutorial will show you how to train a GAN with MMEngine step by step!

It will be divided into the following steps:

  • Building a DataLoader

  • Build a Generator Network and a Discriminator Network

  • Build a Generative Adversarial Network Model

  • Building an Optimizer

  • Training with Runner

Building a DataLoader

Building a Dataset

First, we will build a dataset class MNISTDataset for the MNIST dataset, inheriting from the base dataset class BaseDataset, and override the load_data_list function of the base class to ensure that the return value is a list[dict], where each dict represents a data sample. For more details about using datasets in MMEngine, refer to the Dataset tutorial.

import numpy as np
from mmcv.transforms import to_tensor
from torch.utils.data import random_split
from torchvision.datasets import MNIST

from mmengine.dataset import BaseDataset


class MNISTDataset(BaseDataset):

    def __init__(self, data_root, pipeline, test_mode=False):
        # Download MNIST Dataset
        if not test_mode:
            mnist_full = MNIST(data_root, train=True, download=True)
            self.mnist_dataset, _ = random_split(mnist_full, [55000, 5000])
        else:
            self.mnist_dataset = MNIST(data_root, train=False, download=True)

        super().__init__(
            data_root=data_root, pipeline=pipeline, test_mode=test_mode)

    @staticmethod
    def totensor(img):
        if len(img.shape) < 3:
            img = np.expand_dims(img, -1)
        img = np.ascontiguousarray(img.transpose(2, 0, 1))
        return to_tensor(img)

    def load_data_list(self):
        return [
            dict(inputs=self.totensor(np.array(x[0]))) for x in self.mnist_dataset
        ]


dataset = MNISTDataset("./data", [])

Use the function build_dataloader in Runner to build the dataloader.

import os
import torch
from mmengine.runner import Runner

NUM_WORKERS = int(os.cpu_count() / 2)
BATCH_SIZE = 256 if torch.cuda.is_available() else 64

train_dataloader = dict(
    batch_size=BATCH_SIZE,
    num_workers=NUM_WORKERS,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dataset)
train_dataloader = Runner.build_dataloader(train_dataloader)

Build a Generator Network and a Discriminator Network

The following code builds and instantiates a Generator and a Discriminator.

import numpy as np
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_size, img_shape):
        super().__init__()
        self.img_shape = img_shape
        self.noise_size = noise_size

        def block(in_feat, out_feat, normalize=True):
            layers = [nn.Linear(in_feat, out_feat)]
            if normalize:
                layers.append(nn.BatchNorm1d(out_feat, 0.8))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers

        self.model = nn.Sequential(
            *block(noise_size, 128, normalize=False),
            *block(128, 256),
            *block(256, 512),
            *block(512, 1024),
            nn.Linear(1024, int(np.prod(img_shape))),
            nn.Tanh(),
        )

    def forward(self, z):
        img = self.model(z)
        img = img.view(img.size(0), *self.img_shape)
        return img


class Discriminator(nn.Module):
    def __init__(self, img_shape):
        super().__init__()

        self.model = nn.Sequential(
            nn.Linear(int(np.prod(img_shape)), 512),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, img):
        img_flat = img.view(img.size(0), -1)
        validity = self.model(img_flat)

        return validity


generator = Generator(100, (1, 28, 28))
discriminator = Discriminator((1, 28, 28))

Build a Generative Adversarial Network Model

In MMEngine, we use ImgDataPreprocessor to normalize the data and convert the color channels.

from mmengine.model import ImgDataPreprocessor

data_preprocessor = ImgDataPreprocessor(mean=[127.5], std=[127.5])

The following code implements the basic algorithm of a GAN. To implement the algorithm using MMEngine, you need to inherit from BaseModel and implement the training process in train_step. A GAN alternates training of the generator and the discriminator, implemented here by train_generator and train_discriminator, while disc_loss and gen_loss compute the discriminator and generator loss functions. For more details about BaseModel, refer to the Model tutorial.

import torch
import torch.nn.functional as F
from mmengine.model import BaseModel

class GAN(BaseModel):

    def __init__(self, generator, discriminator, noise_size,
                 data_preprocessor):
        super().__init__(data_preprocessor=data_preprocessor)
        assert generator.noise_size == noise_size
        self.generator = generator
        self.discriminator = discriminator
        self.noise_size = noise_size

    def train_step(self, data, optim_wrapper):
        # Acquiring and preprocessing data
        inputs_dict = self.data_preprocessor(data, True)
        # Training the discriminator
        disc_optimizer_wrapper = optim_wrapper['discriminator']
        with disc_optimizer_wrapper.optim_context(self.discriminator):
            log_vars = self.train_discriminator(inputs_dict,
                                                disc_optimizer_wrapper)

        # Training the generator
        set_requires_grad(self.discriminator, False)
        gen_optimizer_wrapper = optim_wrapper['generator']
        with gen_optimizer_wrapper.optim_context(self.generator):
            log_vars_gen = self.train_generator(inputs_dict,
                                                gen_optimizer_wrapper)

        set_requires_grad(self.discriminator, True)
        log_vars.update(log_vars_gen)

        return log_vars

    def forward(self, batch_inputs, data_samples=None, mode=None):
        return self.generator(batch_inputs)

    def disc_loss(self, disc_pred_fake, disc_pred_real):
        losses_dict = dict()
        losses_dict['loss_disc_fake'] = F.binary_cross_entropy(
            disc_pred_fake, 0. * torch.ones_like(disc_pred_fake))
        losses_dict['loss_disc_real'] = F.binary_cross_entropy(
            disc_pred_real, 1. * torch.ones_like(disc_pred_real))

        loss, log_var = self.parse_losses(losses_dict)
        return loss, log_var

    def gen_loss(self, disc_pred_fake):
        losses_dict = dict()
        losses_dict['loss_gen'] = F.binary_cross_entropy(
            disc_pred_fake, 1. * torch.ones_like(disc_pred_fake))
        loss, log_var = self.parse_losses(losses_dict)
        return loss, log_var

    def train_discriminator(self, inputs, optimizer_wrapper):
        real_imgs = inputs['inputs']
        z = torch.randn(
            (real_imgs.shape[0], self.noise_size)).type_as(real_imgs)
        with torch.no_grad():
            fake_imgs = self.generator(z)

        disc_pred_fake = self.discriminator(fake_imgs)
        disc_pred_real = self.discriminator(real_imgs)

        parsed_losses, log_vars = self.disc_loss(disc_pred_fake,
                                                 disc_pred_real)
        optimizer_wrapper.update_params(parsed_losses)
        return log_vars

    def train_generator(self, inputs, optimizer_wrapper):
        real_imgs = inputs['inputs']
        z = torch.randn(real_imgs.shape[0], self.noise_size).type_as(real_imgs)

        fake_imgs = self.generator(z)

        disc_pred_fake = self.discriminator(fake_imgs)
        parsed_loss, log_vars = self.gen_loss(disc_pred_fake)

        optimizer_wrapper.update_params(parsed_loss)
        return log_vars

The function, set_requires_grad, is used to lock the weights of the discriminator when training the generator.

def set_requires_grad(nets, requires_grad=False):
    """Set requires_grad for all the networks.

    Args:
        nets (nn.Module | list[nn.Module]): A list of networks or a single
            network.
        requires_grad (bool): Whether the networks require gradients or not.
    """
    if not isinstance(nets, list):
        nets = [nets]
    for net in nets:
        if net is not None:
            for param in net.parameters():
                param.requires_grad = requires_grad

model = GAN(generator, discriminator, 100, data_preprocessor)

Building an Optimizer

MMEngine uses OptimWrapper to wrap optimizers. For multiple optimizers, we use OptimWrapperDict to further wrap the OptimWrappers. For more details about optimizers, refer to the Optimizer tutorial.

from mmengine.optim import OptimWrapper, OptimWrapperDict

opt_g = torch.optim.Adam(generator.parameters(), lr=0.0001, betas=(0.5, 0.999))
opt_g_wrapper = OptimWrapper(opt_g)

opt_d = torch.optim.Adam(
    discriminator.parameters(), lr=0.0001, betas=(0.5, 0.999))
opt_d_wrapper = OptimWrapper(opt_d)

opt_wrapper_dict = OptimWrapperDict(
    generator=opt_g_wrapper, discriminator=opt_d_wrapper)

Training with Runner

The following code demonstrates how to use the Runner for model training. For more details about the Runner, please refer to the Runner tutorial.

train_cfg = dict(by_epoch=True, max_epochs=220)
runner = Runner(
    model,
    work_dir='runs/gan/',
    train_dataloader=train_dataloader,
    train_cfg=train_cfg,
    optim_wrapper=opt_wrapper_dict)
runner.train()

By now, we have completed an example of training a GAN. The following code can be used to view the results generated by the GAN we just trained.
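
A minimal sketch for sampling (assumptions: the trained model object is still in scope, tensors live on CPU, and the output file name is arbitrary; add .cuda() calls if you trained on GPU):

import torch
import torchvision

z = torch.randn(64, 100)   # 100 is the noise_size used when building the GAN
fake_imgs = model(z)       # forward() in default mode returns generator output
# the generator ends with Tanh, so outputs lie in [-1, 1]; normalize for saving
torchvision.utils.save_image(fake_imgs, 'gan_samples.png', normalize=True)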

(Image: samples generated by the trained GAN)

If you want to learn more about using MMEngine to implement GANs and generative models, we highly recommend you try MMGeneration, a generative framework based on MMEngine.

Runner

Welcome to the tutorial of runner, the core of MMEngine’s user interface!

The runner, as an “integrator” in MMEngine, covers all aspects of the framework and shoulders the responsibility of organizing and scheduling nearly all modules. Therefore, the code logic in it has to take into account various situations, making it relatively hard to understand. But don’t worry! In this tutorial, we will leave out some messy details and have a quick overview of commonly used APIs, functionalities, and examples. Hopefully, this should provide you with a clear and easy-to-understand user interface. After reading through this tutorial, you will be able to:

  • Master the common usage and configuration of the runner

  • Learn the best practice of the runner: writing config files

  • Know about the basic dataflow and execution order

  • Experience for yourself the advantages of using the runner (perhaps)

Example code of the runner

To build your training pipeline with a runner, there are typically two ways to get started:

  • Refer to the runner’s API documentation for an argument-by-argument configuration

  • Make your custom modifications based on some existing configurations, such as the one in 15 minutes to get started with MMEngine or those in downstream repositories

Pros and cons lie in both approaches. For the former one, beginners may be lost in a vast number of configurable arguments. For the latter one, beginners may find it hard to get a good reference, since neither an over-simplified nor an over-detailed reference is conducive to them.

We argue that the key to learning the runner is to use it as a memo. You should remember its most commonly used arguments and only focus on the less used ones when needed, since default values usually work fine. In the following, we will provide a beginner-friendly example to illustrate the most commonly used arguments of the runner, along with advanced guidelines for those less used.

A beginner-friendly example

Hint

In this tutorial, we hope you can focus more on overall architecture instead of implementation details. This “top-down” way of thinking is exactly what we advocate. Don’t worry, you will definitely have plenty of opportunities and guidance afterward to focus on modules you want to improve.

Before running the actual example below, you should first run this piece of code to prepare the model, dataset, and metric. However, these implementations are not important in this tutorial, so you can simply skim through them.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset

from mmengine.model import BaseModel
from mmengine.evaluator import BaseMetric
from mmengine.registry import MODELS, DATASETS, METRICS


@MODELS.register_module()
class MyAwesomeModel(BaseModel):
    def __init__(self, layers=4, activation='relu') -> None:
        super().__init__()
        if activation == 'relu':
            act_type = nn.ReLU
        elif activation == 'silu':
            act_type = nn.SiLU
        elif activation == 'none':
            act_type = nn.Identity
        else:
            raise NotImplementedError
        sequence = [nn.Linear(2, 64), act_type()]
        for _ in range(layers-1):
            sequence.extend([nn.Linear(64, 64), act_type()])
        self.mlp = nn.Sequential(*sequence)
        self.classifier = nn.Linear(64, 2)

    def forward(self, data, labels, mode):
        x = self.mlp(data)
        x = self.classifier(x)
        if mode == 'tensor':
            return x
        elif mode == 'predict':
            return F.softmax(x, dim=1), labels
        elif mode == 'loss':
            return {'loss': F.cross_entropy(x, labels)}


@DATASETS.register_module()
class MyDataset(Dataset):
    def __init__(self, is_train, size):
        self.is_train = is_train
        if self.is_train:
            torch.manual_seed(0)
            self.labels = torch.randint(0, 2, (size,))
        else:
            torch.manual_seed(3407)
            self.labels = torch.randint(0, 2, (size,))
        r = 3 * (self.labels+1) + torch.randn(self.labels.shape)
        theta = torch.rand(self.labels.shape) * 2 * torch.pi
        self.data = torch.vstack([r*torch.cos(theta), r*torch.sin(theta)]).T

    def __getitem__(self, index):
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)


@METRICS.register_module()
class Accuracy(BaseMetric):
    def __init__(self):
        super().__init__()

    def process(self, data_batch, data_samples):
        score, gt = data_samples
        self.results.append({
            'batch_size': len(gt),
            'correct': (score.argmax(dim=1) == gt).sum().cpu(),
        })

    def compute_metrics(self, results):
        total_correct = sum(r['correct'] for r in results)
        total_size = sum(r['batch_size'] for r in results)
        return dict(accuracy=100*total_correct/total_size)

Now, be well prepared for a rather long example:

from torch.utils.data import DataLoader, default_collate
from torch.optim import Adam
from mmengine.runner import Runner


runner = Runner(
    # your model
    model=MyAwesomeModel(
        layers=2,
        activation='relu'),
    # work directory for saving checkpoints and logs
    work_dir='exp/my_awesome_model',

    # training data
    train_dataloader=DataLoader(
        dataset=MyDataset(
            is_train=True,
            size=10000),
        shuffle=True,
        collate_fn=default_collate,
        batch_size=64,
        pin_memory=True,
        num_workers=2),
    # training configurations
    train_cfg=dict(
        by_epoch=True,   # display in epoch number instead of iterations
        max_epochs=10,
        val_begin=2,     # start validation from the 2nd epoch
        val_interval=1), # do validation every 1 epoch

    # OptimWrapper, a new concept in MMEngine for richer optimization options.
    # Default values work fine for most cases. You may check our documentation
    # for more details, e.g. 'AmpOptimWrapper' for enabling mixed-precision
    # training.
    optim_wrapper=dict(
        optimizer=dict(
            type=Adam,
            lr=0.001)),
    # ParamScheduler to adjust learning rates or momentums during training
    param_scheduler=dict(
        type='MultiStepLR',
        by_epoch=True,
        milestones=[4, 8],
        gamma=0.1),

    # validation data
    val_dataloader=DataLoader(
        dataset=MyDataset(
            is_train=False,
            size=1000),
        shuffle=False,
        collate_fn=default_collate,
        batch_size=1000,
        pin_memory=True,
        num_workers=2),
    # validation configurations, usually leave it an empty dict
    val_cfg=dict(),
    # evaluation metrics and evaluator
    val_evaluator=dict(type=Accuracy),

    # the following are advanced configurations; leave them as defaults
    # when not needed. Hooks are an advanced usage.
    default_hooks=dict(
        # the most commonly used hook for modifying checkpoint saving interval
        checkpoint=dict(type='CheckpointHook', interval=1)),

    # `launcher` and `env_cfg` are responsible for the distributed environment
    launcher='none',
    env_cfg=dict(
        cudnn_benchmark=False,   # whether enable cudnn_benchmark
        backend='nccl',   # distributed communication backend
        mp_cfg=dict(mp_start_method='fork')),  # multiprocessing configs
    log_level='INFO',

    # load model weights from the given path. None means no loading.
    load_from=None,
    # resume training from the given path
    resume=False
)

# start training your model
runner.train()

Explanations on example codes

Really a long piece of code, isn’t it! However, if you read through the above example, you may have already understood the training process in general even without knowing any implementation details, thanks to the compactness and readability of the runner code (probably). This is what MMEngine expects: a structured, modular, and standardized training process that allows for more reliable reproductions and clearer comparisons.

The above example may lead you to the following confusion:

There are too many arguments!

Don’t worry. As we mentioned before, use the runner as a memo. The runner covers all aspects just to ensure you won’t miss something important. You don’t actually need to configure everything. The simple example in 15 minutes still works fine, and it can be simplified even further by removing val_evaluator, val_dataloader, and val_cfg without anything breaking. All configurable arguments are driven by your demands. Those not in your focus usually work fine by default.

Why are some arguments passed as dicts?

Well, this is related to MMEngine’s style. In MMEngine, we provide 2 different styles of runner construction: a) manual construction and b) construction via registry. If you are confused, the following example will give a good illustration:

from mmengine.model import BaseModel
from mmengine.runner import Runner
from mmengine.registry import MODELS # root registry for your custom model

@MODELS.register_module() # decorator for registration
class MyAwesomeModel(BaseModel): # your custom model
    def __init__(self, layers=18, activation='silu'):
        ...

# An example of construction via registry
runner = Runner(
    model=dict(
        type='MyAwesomeModel',
        layers=50,
        activation='relu'),
    ...
)

# An example of manual construction
model = MyAwesomeModel(layers=18, activation='relu')
runner = Runner(
    model=model,
    ...
)

Similar to the above example, most arguments in the runner accept both types of inputs. They are conceptually equivalent. The difference is that, in the former style, the module (passed in as a dict) will be built in the runner when actually needed, while in the latter style, the module has been built before being passed to the runner. The following figure illustrates the core idea of the registry: it maintains the mapping between a module’s build method and its registry name. If you want to learn more about the full usage of the registry, you are recommended to read the Registry tutorial.

(Figure: illustration of the registry in the Runner)

You might still be confused after the explanation. Why should we let the Runner build modules from dicts? What are the benefits? If you have such questions, then we are proud to answer: “Absolutely - no benefits!” In fact, module construction via registry only works to its best advantage when combined with a configuration file. It is still far from the best practice to write as the above example. We provide it here just to make sure you can read and get used to this writing style, which may facilitate your understanding of the actual best practice we will soon talk about - the configuration file. Stay tuned!

If, as a beginner, you do not immediately understand, it doesn’t matter too much, because manual construction is still a good choice, especially for small-scale development and trial-and-error, due to its being IDE friendly. However, you are still expected to read and get used to the registry writing style, so that you can avoid being unnecessarily confused in subsequent tutorials.

Where can I find the possible configuration options for the xxx argument?

You will find extensive instructions and examples in the tutorials of the corresponding modules. You can also find all possible arguments in Runner’s API documentation. If neither of the above resolves your query, you are always encouraged to start a topic in our discussion forum. It also helps us improve the documentation.

I come from repositories like MMDet/MMCls... Why does this example differ from what I've been exposed to?

Downstream repositories in OpenMMLab have widely adopted the writing style of config files. In the following chapter, we will show the usage of config files, the best practice of the runner in MMEngine, based on the above example with a slight variation.

Best practice of the Runner - config files

MMEngine provides a powerful config file system that supports Python syntax. You can almost seamlessly (which we will illustrate below) convert from the previous sample code to a config file. Here is an example:

# Save the following codes in example_config.py
# Almost copied from the above example, with some commas removed
model = dict(type='MyAwesomeModel',
    layers=2,
    activation='relu')
work_dir = 'exp/my_awesome_model'

train_dataloader = dict(
    dataset=dict(type='MyDataset',
        is_train=True,
        size=10000),
    sampler=dict(
        type='DefaultSampler',
        shuffle=True),
    collate_fn=dict(type='default_collate'),
    batch_size=64,
    pin_memory=True,
    num_workers=2)
train_cfg = dict(
    by_epoch=True,
    max_epochs=10,
    val_begin=2,
    val_interval=1)
optim_wrapper = dict(
    optimizer=dict(
        type='Adam',
        lr=0.001))
param_scheduler = dict(
    type='MultiStepLR',
    by_epoch=True,
    milestones=[4, 8],
    gamma=0.1)

val_dataloader = dict(
    dataset=dict(type='MyDataset',
        is_train=False,
        size=1000),
    sampler=dict(
        type='DefaultSampler',
        shuffle=False),
    collate_fn=dict(type='default_collate'),
    batch_size=1000,
    pin_memory=True,
    num_workers=2)
val_cfg = dict()
val_evaluator = dict(type='Accuracy')

default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=1))
launcher = 'none'
env_cfg = dict(
    cudnn_benchmark=False,
    backend='nccl',
    mp_cfg=dict(mp_start_method='fork'))
log_level = 'INFO'
load_from = None
resume = False

Given the above config file, we can simply load the configurations and run the training pipeline in a few lines of code as follows:

from mmengine.config import Config
from mmengine.runner import Runner
config = Config.fromfile('example_config.py')
runner = Runner.from_cfg(config)
runner.train()

Note

Although it supports Python syntax, a valid config file needs to meet the condition that all variables must be Python built-in types such as str, dict and int. Therefore, the config system is highly dependent on the registry mechanism to enable construction from built-in types to other types such as nn.Module.

Note

When using config files, you typically don’t need to manually register every module. For instance, all optimizers in torch.optim, including Adam and SGD, have already been registered in mmengine.optim. The rule of thumb is: try to directly access modules provided by PyTorch, and only start to register them manually after an error occurs.

Note

When using config files, the implementations of your custom modules may be stored in separate files and thus not registered properly, which will lead to errors in the build process. You may find solutions in Registry tutorial by searching for custom_imports.

Writing config files for the runner has been widely adopted in downstream repositories in OpenMMLab projects. It has become a de facto convention and best practice. The config files are far more featured than illustrated above. You can refer to the Config tutorial for more advanced features, including inheritance and overriding of keywords.
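
As a small taste of inheritance (a sketch; the file names are hypothetical), a new config can inherit everything from the previous one and override a single field:

# save as example_config_smaller_lr.py
_base_ = ['./example_config.py']
# everything else is inherited from the base config; only the optimizer changes
optim_wrapper = dict(optimizer=dict(type='Adam', lr=0.0001))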

Basic dataflow

Hint

In this chapter, we’ll dive deeper into the runner to illustrate dataflow and data format convention between modules managed by the runner. It may be relatively abstract and dry if you haven’t built a training pipeline with MMEngine. Therefore, you are free to skip for now and read it in conjunction with practice in the future when in need.

Now let’s dive slightly deeper into the runner, and illustrate the dataflow and data format convention under the hood (or, under the engine)!

(Diagram: basic dataflow of the runner)

The diagram above illustrates the basic dataflow of the runner, where dashed-border, gray-filled shapes represent different data formats, while solid boxes represent modules/methods. Due to the great flexibility and extensibility of MMEngine, you can always inherit some key base classes and override their methods, so the above diagram doesn’t always hold. It only holds when you are not customizing your own Runner or TrainLoop, and you are not overriding the train_step, val_step or test_step method in your custom model. Actually, this is common for most tasks like detection and segmentation, as described in the Model tutorial.

Can you state the exact type of each data item shown in the diagram?

Unfortunately, this is not possible. Although we did heavy type annotations in MMEngine, Python is still a highly dynamic programming language, and deep learning as a data-centric system needs to be flexible enough to deal with a wide range of complex data sources. You always have full freedom to decide when you need (and sometimes must) break type conventions. Therefore, when you are customizing your module (e.g. val_evaluator), you need to make sure its input is compatible with the upstream (e.g. model) output and its output can be parsed by downstream modules. MMEngine puts the flexibility of handling data in the hands of the user, and thus also requires the user to ensure the compatibility of the dataflow, which, in fact, is not that difficult once you get started.

The uniformity of data formats has always been a problem in deep learning. We are trying to improve it in MMEngine in our own way. If you are interested, you can refer to BaseDataset and BaseDataElement - but please note that they are mainly geared towards advanced users.

What's the data format convention between dataloader, model and evaluator?

For the basic dataflow shown in the diagram above, the data transfer between the above three modules can be represented by the following pseudo-code:

# training
for data_batch in train_dataloader:
    data_batch = data_preprocessor(data_batch)
    if isinstance(data_batch, dict):
        losses = model.forward(**data_batch, mode='loss')
    elif isinstance(data_batch, (list, tuple)):
        losses = model.forward(*data_batch, mode='loss')
    else:
        raise TypeError()

# validation
for data_batch in val_dataloader:
    data_batch = data_preprocessor(data_batch)
    if isinstance(data_batch, dict):
        outputs = model.forward(**data_batch, mode='predict')
    elif isinstance(data_batch, (list, tuple)):
        outputs = model.forward(*data_batch, mode='predict')
    else:
        raise TypeError()
    evaluator.process(data_samples=outputs, data_batch=data_batch)
metrics = evaluator.evaluate(len(val_dataloader.dataset))

The key points of the above pseudo-code are:

  • Outputs of data_preprocessor are passed to model after unpacking

  • The data_samples argument of the evaluator receives the prediction results of the model, while the data_batch argument receives the raw data coming from dataloader

What is data_preprocessor? Can I do image pre-processing such as crop and resize in it?

Though drawn separately in the diagram, data_preprocessor is a part of the model and can thus be found in the Model tutorial, in the DataPreprocessor chapter.

In most cases, data_preprocessor needs no special attention or manual configuration. The default data_preprocessor only does data transfer between host and GPU devices. However, if your model's input format is incompatible with the dataloader's output, you can customize your own data_preprocessor for data formatting.

Image pre-processing such as crop and resize is better placed in the data transforms module, but batch-related transforms (e.g. batch-resize) can be implemented here, as sketched below.
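
For illustration, here is a minimal sketch of a custom data_preprocessor (assuming the dataloader yields (imgs, labels) tuples and the model's forward expects keyword arguments, as in the 15-minute example; the class name is hypothetical):

from mmengine.model import BaseDataPreprocessor

class TupleToDictPreprocessor(BaseDataPreprocessor):
    """A hypothetical preprocessor adapting tuple batches to dict inputs."""

    def forward(self, data, training=False):
        imgs, labels = self.cast_data(data)  # move the batch to the right device
        return dict(imgs=imgs, labels=labels)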

Why does the model produce 3 different outputs? What is the meaning of "loss", "predict" and "tensor"?

As described in 15 minutes to get started with MMEngine, you need to implement 3 data paths in your custom model’s forward function to suit the different pipelines for training, validation, and testing. This is discussed further in the Model tutorial.

I can see that the red line is for training process and the blue line for validation/testing, but what is the green line?

Currently, model outputs in “tensor” mode are not officially used in the runner. The “tensor” mode can output intermediate results, which facilitates debugging, as shown below.
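
For example, with the MyAwesomeModel defined earlier in this tutorial, "tensor" mode returns the raw, un-softmaxed classifier output, which can be inspected directly:

import torch

model = MyAwesomeModel(layers=2, activation='relu')
data = torch.randn(4, 2)                   # MyAwesomeModel takes 2-d inputs
logits = model(data, None, mode='tensor')  # raw network output for debugging
print(logits.shape)                        # torch.Size([4, 2])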

What if I override methods such as train_step? Will the diagram totally fail?

The default train_step, val_step and test_step cover the dataflow from data_preprocessor to model outputs and optim_wrapper. The rest of the diagram will not be affected.

Why use the runner? (Optional reading)

Hint

Contents in this chapter will not teach you how to use the runner and MMEngine. If you are being pushed by your employer/advisor/DDL to work out a result in a few hours, it may not help you and you can feel free to skip it. However, we highly recommend taking time to read through this chapter, since it will help you better understand the aim and style of MMEngine.

Relax, time for some philosophy

Congratulations on reading through the runner tutorial, a long, long but kind of interesting (hope so) tutorial! Please believe that all of these - this tutorial, the runner, MMEngine - are intended to make things easier for you.

The runner is the “manager” of all modules in MMEngine. In the runner, all the distinct modules - whether visible ones like model and dataset, or obscure ones like logging, distributed environment and random seed - are getting organized and scheduled. The runner deals with the complex relationship between different modules and provides you with a clear, easy-to-understand and configurable interface. The benefits of this design are:

  1. You can modify or add your code without spoiling your whole codebase. For example, you may start with single-GPU training, and you can always add a few lines of configuration code to enable multi-GPU or even multi-node training.

  2. You can continuously benefit from new features without worrying about backward compatibility. Mixed-precision training, visualization, state-of-the-art distributed training methods, various device backends… We will continue to absorb the best suggestions and cutting-edge technologies from the community while ensuring backward compatibility, and provide them to you in a clear interface.

  3. You can focus on your own awesome ideas without being bothered by other annoying and irrelevant details. The default values will handle most cases.

So, MMEngine and the runner will truly make things easier for you. With only a little effort on migration, your code and experiments will evolve with MMEngine. With a little more effort, the config file system allows you to manage your data, model, and experiments more efficiently. Convenience and reliability are the aims we strive for.

The blue one, or the red one - are you prepared to use MMEngine?

Suggestions on next steps

If you want to:

Write your own model structure

Refer to Model tutorial

Use your own datasets

Refer to Dataset and DataLoader tutorial

Change evaluation metrics

Refer to Evaluation tutorial

Do something related to optimizers or mixed-precision training

Refer to OptimWrapper tutorial

Schedule learning rates or other parameters during training

Refer to Parameter Scheduler tutorial

Something not mentioned above

  • The “Common Usage” section to the left contains more example code

  • “Advanced tutorials” to the left contains more content for experienced developers to make more flexible extensions to the training pipeline

  • Hook provides some flexible modifications without spoiling your code

  • If none of the above solves your problem, you are always welcome to start a topic in our discussion forum!

Dataset and DataLoader

Hint

If you have never been exposed to PyTorch’s Dataset and DataLoader classes, you are recommended to read through PyTorch official tutorial to get familiar with some basic concepts.

Datasets and DataLoaders are necessary components in MMEngine’s training pipeline. They are conceptually derived from and consistent with PyTorch. Typically, a dataset defines the quantity, parsing, and pre-processing of the data, while a dataloader iteratively loads data according to settings such as batch_size, shuffle, num_workers, etc. Datasets are encapsulated by dataloaders, and together they constitute the data source.

In this tutorial, we will step through their usage in MMEngine runner from the outside (dataloader) to the inside (dataset) and give some practical examples. After reading through this tutorial, you will be able to:

  • Master the configuration of dataloaders in MMEngine

  • Learn to use existing datasets (e.g. those from torchvision) from config files

  • Know about building and using your own dataset

Details on dataloader

Dataloaders can be configured in MMEngine’s Runner with 3 arguments:

  • train_dataloader: Used in Runner.train() to provide training data for models

  • val_dataloader: Used in Runner.val() or in Runner.train() at regular intervals for model evaluation

  • test_dataloader: Used in Runner.test() for the final test

MMEngine has full support for PyTorch native DataLoader objects. Therefore, you can simply pass your valid, already built dataloaders to the runner, as shown in getting started in 15 minutes. Meanwhile, thanks to the Registry Mechanism of MMEngine, those arguments also accept dicts as inputs, as illustrated in the following example (referred to as example 1). The keys in the dictionary correspond to arguments in DataLoader’s init function.

runner = Runner(
    train_dataloader=dict(
        batch_size=32,
        sampler=dict(
            type='DefaultSampler',
            shuffle=True),
        dataset=torchvision.datasets.CIFAR10(...),
        collate_fn=dict(type='default_collate')
    )
)

When passed to the runner in the form of a dict, the dataloader will be lazily built in the runner when actually needed.

Note

For more configurable arguments of the DataLoader, please refer to PyTorch API documentation

Note

If you are interested in the details of the building procedure, you may refer to build_dataloader

You may find example 1 differs from that in getting started in 15 minutes in some arguments. Indeed, due to some obscure conventions in MMEngine, you can’t seamlessly switch it to a dict by simply replacing DataLoader with dict. We will discuss the differences between our convention and PyTorch’s in the following sections, in case you run into trouble when using config files.

sampler and shuffle

One obvious difference is that we add a sampler argument to the dict. This is because we require sampler to be explicitly specified when using a dict as a dataloader. Meanwhile, shuffle is also removed from DataLoader arguments, because it conflicts with sampler in PyTorch, as referred to in PyTorch DataLoader API documentation.

Note

In fact, shuffle is just a notation for convenience in PyTorch implementation. If shuffle is set to True, the dataloader will automatically switch to RandomSampler

With a sampler argument, the code in example 1 is nearly equivalent to the code block below

from torch.utils.data import DataLoader, default_collate
from mmengine.dataset import DefaultSampler

dataset = torchvision.datasets.CIFAR10(...)
sampler = DefaultSampler(dataset, shuffle=True)

runner = Runner(
    train_dataloader=DataLoader(
        batch_size=32,
        sampler=sampler,
        dataset=dataset,
        collate_fn=default_collate
    )
)

Warning

The equivalence of the above code holds only if: 1) you are training with a single process, and 2) no randomness argument is passed to the runner. This is because the sampler must be built after the distributed environment is set up in order to be correct. The runner guarantees the correct order and the proper random seed by applying lazy initialization techniques, which is only possible for dict inputs. Building a sampler manually instead requires extra work and is highly error-prone. Therefore, the code block above is just for illustration and definitely not recommended. We strongly suggest passing sampler as a dict to avoid potential problems.

DefaultSampler

The above example may make you wonder what DefaultSampler is, why use it, and whether there are other options. In fact, DefaultSampler is a built-in sampler in MMEngine that eliminates the gap between distributed and non-distributed training, thus enabling a seamless switch between them. If you have experience using DistributedDataParallel in PyTorch, you may remember having to change the sampler argument to make it work correctly. With DefaultSampler in MMEngine, you don’t need to bother with this.

DefaultSampler accepts the following arguments:

  • shuffle: Set to True to load data in the dataset in random order

  • seed: Random seed used to shuffle the dataset. Typically it doesn’t require manual configuration here because the runner will handle it with randomness configuration

  • round_up: When set to True, this behaves the same as setting drop_last=False in PyTorch’s DataLoader. You should take care of it when migrating from PyTorch.
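Putting these together, here is a minimal sketch of a dataloader config that spells out these arguments (the dataset dict assumes the Cifar10 registration shown later in this tutorial; the values are illustrative):

train_dataloader = dict(
    batch_size=32,
    sampler=dict(
        type='DefaultSampler',
        shuffle=True,    # load data in random order
        round_up=True),  # same behavior as drop_last=False in PyTorch
    dataset=dict(type='Cifar10', root='data/cifar10', train=True),
    collate_fn=dict(type='default_collate'))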

Note

For more details about DefaultSampler, please refer to its API docs

DefaultSampler handles most of the cases. We ensure that error-prone details such as random seeds are handled properly when you are using it in a runner. This prevents you from getting into trouble with distributed training. Apart from DefaultSampler, you may also be interested in InfiniteSampler for iteration-based training pipelines. If you have more advanced needs, you may want to refer to the code of these two built-in samplers to implement your own and register it to the DATA_SAMPLERS registry.

from mmengine.registry import DATA_SAMPLERS
from torch.utils.data import Sampler

@DATA_SAMPLERS.register_module()
class MySampler(Sampler):
    pass

runner = Runner(
    train_dataloader=dict(
        sampler=dict(type='MySampler'),
        ...
    )
)

The obscure collate_fn

Among the arguments of PyTorch DataLoader, collate_fn is often ignored by users, but in MMEngine you must pay special attention to it. When you pass the dataloader argument as a dict, MMEngine will use the built-in pseudo_collate by default, which behaves significantly differently from PyTorch’s default_collate. Therefore, when migrating from PyTorch, you have to explicitly specify the collate_fn in config files to keep the behavior consistent.

Note

MMEngine uses pseudo_collate as the default value mainly for historical compatibility reasons. You don’t have to look deeply into it; just be aware of it to avoid potential errors.

MMEngine provides two built-in collate_fn options:

  • pseudo_collate: the default value in MMEngine. It won’t concatenate data through the batch index. Detailed explanations can be found in the pseudo_collate API doc

  • default_collate: behaves almost identically to PyTorch’s default_collate. It transfers data into tensors and concatenates them through the batch index. More details and slight differences from PyTorch can be found in the default_collate API doc
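To make the difference concrete, here is a small sketch comparing the two on a toy batch (assuming a recent PyTorch in which default_collate is importable from torch.utils.data):

import torch
from torch.utils.data import default_collate  # torch >= 1.11
from mmengine.dataset import pseudo_collate

# a toy batch of two (image, label) samples
batch = [(torch.zeros(3, 2, 2), 0), (torch.ones(3, 2, 2), 1)]

# default_collate stacks samples through the batch index:
# one image tensor of shape (2, 3, 2, 2) and one label tensor of shape (2,)
print(default_collate(batch))

# pseudo_collate does not concatenate through the batch index:
# the images and labels remain separate per-sample entries
print(pseudo_collate(batch))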

If you want to use a custom collate_fn, you can register it to COLLATE_FUNCTIONS registry.

from typing import Any, Sequence

from mmengine.registry import COLLATE_FUNCTIONS

@COLLATE_FUNCTIONS.register_module()
def my_collate_func(data_batch: Sequence) -> Any:
    pass

runner = Runner(
    train_dataloader=dict(
        ...
        collate_fn=dict(type='my_collate_func')
    )
)

Details on dataset

Typically, a dataset defines the quantity, parsing, and pre-processing of the data. It is encapsulated in the dataloader, which loads data from it in batches. Since we fully support PyTorch’s DataLoader, datasets are also fully compatible. Meanwhile, thanks to the registry mechanism, when a dataloader is given as a dict, its dataset argument can also be given as a dict, which enables lazy initialization in the runner. This mechanism is what makes writing config files possible.

Use torchvision datasets

torchvision provides various open datasets. They can be directly used in MMEngine as shown in getting started in 15 minutes, where a CIFAR10 dataset is used together with torchvision’s built-in data transforms.

However, if you want to use the dataset in config files, registration is needed. What’s more, if you also require data transforms in torchvision, some more registrations are required. The following example illustrates how to do it.

import torchvision
import torchvision.transforms as tvt
from mmengine.registry import DATASETS, TRANSFORMS
from mmengine.dataset.base_dataset import Compose

# register CIFAR10 dataset in torchvision
# data transforms should also be built here
@DATASETS.register_module(name='Cifar10', force=False)
def build_torchvision_cifar10(transform=None, **kwargs):
    if isinstance(transform, dict):
        transform = [transform]
    if isinstance(transform, (list, tuple)):
        transform = Compose(transform)
    return torchvision.datasets.CIFAR10(**kwargs, transform=transform)

# register data transforms in torchvision
TRANSFORMS.register_module('RandomCrop', module=tvt.RandomCrop)
TRANSFORMS.register_module('RandomHorizontalFlip', module=tvt.RandomHorizontalFlip)
TRANSFORMS.register_module('ToTensor', module=tvt.ToTensor)
TRANSFORMS.register_module('Normalize', module=tvt.Normalize)

# specify in runner
runner = Runner(
    train_dataloader=dict(
        batch_size=32,
        sampler=dict(
            type='DefaultSampler',
            shuffle=True),
        dataset=dict(type='Cifar10',
            root='data/cifar10',
            train=True,
            download=True,
            transform=[
                dict(type='RandomCrop', size=32, padding=4),
                dict(type='RandomHorizontalFlip'),
                dict(type='ToTensor'),
                dict(type='Normalize', **norm_cfg)])
    )
)

Note

The above example makes extensive use of the registry mechanism and borrows the Compose module from MMEngine. If you are eager to use torchvision datasets in your config files, you can refer to it and make some slight modifications. However, we recommend borrowing datasets from downstream repos such as MMDet, MMCls, etc., which may give you a better experience.

Customize your dataset

You are free to customize your own datasets, as you would with PyTorch, or copy existing datasets from your previous PyTorch projects. If you want to learn how to customize your dataset, please refer to the PyTorch official tutorials.

Use MMEngine BaseDataset

Apart from directly using PyTorch’s native Dataset class, you can also use MMEngine’s built-in BaseDataset class to customize your own, as described in the BaseDataset tutorial. It establishes some conventions on the format of annotation files, which makes the data interface more unified and multi-task training more convenient. Meanwhile, BaseDataset cooperates easily with the built-in data transforms in MMEngine, which saves you from writing your own from scratch.
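As a hedged sketch of that convention (the annotation structure follows the BaseDataset format; the file path below is hypothetical):

from mmengine.dataset import BaseDataset

# BaseDataset expects an annotation file shaped roughly like:
# {
#     "metainfo": {"classes": ["cat", "dog"]},
#     "data_list": [
#         {"img_path": "images/0001.jpg", "gt_label": 0},
#         ...
#     ]
# }
dataset = BaseDataset(
    ann_file='annotations/train.json',  # hypothetical annotation file
    data_root='data/toy',
    pipeline=[],
    lazy_init=True)  # defer parsing until the data is actually accessed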

Currently, BaseDataset has been widely used in downstream repos of OpenMMLab 2.0 projects.

Model

Runner and model

As mentioned in basic dataflow, the dataflow between DataLoader, model and evaluator follows some rules. Don’t remember clearly? Let’s review it:

# Training process
for data_batch in train_dataloader:
    data_batch = model.data_preprocessor(data_batch, training=True)
    if isinstance(data_batch, dict):
        losses = model(**data_batch, mode='loss')
    elif isinstance(data_batch, (list, tuple)):
        losses = model(*data_batch, mode='loss')
    else:
        raise TypeError()
# Validation process
for data_batch in val_dataloader:
    data_batch = model.data_preprocessor(data_batch, training=False)
    if isinstance(data_batch, dict):
        outputs = model(**data_batch, mode='predict')
    elif isinstance(data_batch, (list, tuple)):
        outputs = model(*data_batch, mode='predict')
    else:
        raise TypeError()
    evaluator.process(data_samples=outputs, data_batch=data_batch)
metrics = evaluator.evaluate(len(val_dataloader.dataset))

In the runner tutorial, we briefly mentioned the relationship between DataLoader, model, and evaluator, and introduced the concept of data_preprocessor. You may already have a general understanding of the model. However, while the Runner is running, the situation is far more complex than the above pseudo-code suggests.

In order to let you focus on the algorithm itself and ignore the complex relationships between the model, DataLoader, and evaluator, we designed BaseModel. In most cases, the only thing you need to do is make your model inherit from BaseModel and implement forward as required to perform the training, validation, and testing processes.

Before continuing with the model tutorial, let’s pose two questions that we hope you will be able to answer after reading it:

  1. When do we update the parameters of the model, and how can we update them with a custom optimization process?

  2. Why is the concept of data_preprocessor necessary? What functions can it perform?

Interface introduction

Usually, we should define a model to implement the body of the algorithm. In MMEngine, the model is managed by the Runner and needs to implement some interfaces, such as train_step, val_step, and test_step. For high-level tasks like detection, classification, and segmentation, these interfaces commonly implement a standard workflow. For example, train_step calculates the loss and updates the parameters of the model, while val_step/test_step calculate the metrics and return the predictions. Therefore, MMEngine abstracts BaseModel to implement this common workflow.

Benefiting from BaseModel, we only need to make the model inherit from BaseModel and implement the forward function to perform the training, testing, and validation processes.

Note

BaseModel inherits from BaseModule, which can be used to initialize the model parameters dynamically.

forward: The arguments of forward need to match the data given by the DataLoader. If the DataLoader samples tuple data, forward needs to accept the unpacked *data. If the DataLoader returns dict data, forward needs to accept the unpacked **data. forward also accepts a mode parameter, which is used to control the running branch:

  • mode='loss': loss mode is enabled during training, and forward returns a differentiable loss dict. Each key-value pair in the loss dict will be used to log the training status and optimize the parameters of the model. This branch will be called by train_step

  • mode='predict': predict mode is enabled during validation/testing, and forward returns predictions, which match the arguments of process. OpenMMLab repositories enforce stricter rules: the predictions must be a list, and each of its elements must be a BaseDataElement. This branch will be called by val_step

  • mode='tensor': In tensor and predict modes, forward returns predictions. The difference is that in tensor mode, forward returns a tensor or a container of tensors that has not been processed by post-processing methods such as non-maximum suppression (NMS). You can customize your post-processing after getting the result of tensor mode.

train_step: Get the loss dict by calling forward with loss mode. BaseModel implements a standard optimization process as follows:

def train_step(self, data, optim_wrapper):
    # See details in the next section
    data = self.data_preprocessor(data, training=True)
    # `loss` mode returns a loss dict. train_step actually accepts
    # both tuple and dict input, unpacking it with * or **
    loss = self(**data, mode='loss')
    # Parse the loss dict and return the parsed losses for optimization
    # and log_vars for logging
    parsed_losses, log_vars = self.parse_losses(loss)
    optim_wrapper.update_params(parsed_losses)  # update the parameters
    return log_vars

val_step: Get the predictions by calling forward with predict mode.

def val_step(self, data):
    data = self.data_preprocessor(data, training=False)
    outputs = self(**data, mode='predict')
    return outputs

test_step: There is no difference between val_step and test_step in BaseModel, but we can customize them in subclasses; for example, you can compute the validation loss in val_step.
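For instance, here is a hedged sketch of such a customization (it assumes the DataLoader yields dict data and that validation batches still carry labels, so that loss mode works; the caller must handle the extra return value):

from mmengine.model import BaseModel

class ValLossModel(BaseModel):
    def val_step(self, data):
        data = self.data_preprocessor(data, training=False)
        outputs = self(**data, mode='predict')
        # additionally compute a validation loss for monitoring
        losses = self(**data, mode='loss')
        return outputs, losses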

Having understood the interfaces of BaseModel, we can now come up with a more complete pseudo-code:

# training
for data_batch in train_dataloader:
    loss_dict = model.train_step(data_batch)
# validation
for data_batch in val_dataloader:
    outputs = model.val_step(data_batch)
    evaluator.process(data_samples=outputs, data_batch=data_batch)
metrics = evaluator.evaluate(len(val_dataloader.dataset))

Great! Ignoring Hooks, the pseudo-code above almost implements the main logic of the loop! Let’s go back to 15 minutes to get started with MMEngine, where we can now truly understand what MMResNet50 has done:

import torch.nn.functional as F
import torchvision
from mmengine.model import BaseModel

class MMResNet50(BaseModel):
    def __init__(self):
        super().__init__()
        self.resnet = torchvision.models.resnet50()

    def forward(self, imgs, labels, mode):
        x = self.resnet(imgs)
        if mode == 'loss':
            return {'loss': F.cross_entropy(x, labels)}
        elif mode == 'predict':
            return x, labels

    # train_step, val_step and test_step have been implemented in BaseModel.
    # We list the equivalent code here for better understanding
    def train_step(self, data, optim_wrapper):
        data = self.data_preprocessor(data)
        loss = self(*data, mode='loss')
        parsed_losses, log_vars = self.parse_losses(loss)
        optim_wrapper.update_params(parsed_losses)
        return log_vars

    def val_step(self, data):
        data = self.data_preprocessor(data)
        outputs = self(*data, mode='predict')
        return outputs

    def test_step(self, data):
        data = self.data_preprocessor(data)
        outputs = self(*data, mode='predict')
        return outputs

Now you may have a deeper understanding of the dataflow and can answer the first question raised in Runner and model.

BaseModel.train_step implements the standard optimization process. If we want to customize a new optimization process, we can override it in a subclass. However, note that train_step must still return a loss dict.

DataPreprocessor

If your computer is equipped with a GPU (or other hardware that can accelerate training, such as an MPS or IPU device), when you run the 15 minutes tutorial, you will see that the program runs on the GPU. But when does MMEngine move the data and the model from the CPU to the GPU?

In fact, the Runner moves the model to the specified device during construction, while the data is moved to the specified device at the self.data_preprocessor(data) call mentioned in the code snippet of the previous section. The moved data is then passed to the model.

It makes sense, but it’s a little weird, isn’t it? At this point you may be wondering:

  1. MMResNet50 does not define a data_preprocessor, so why can it still access data_preprocessor and move data to the GPU?

  2. Why doesn’t BaseModel move data with data = data.to(device), instead relying on a DataPreprocessor to do it?

The answer to the first question is that MMResNet50 inherits from BaseModel, and super().__init__ builds a default data_preprocessor for it. The equivalent implementation of the default one looks like this:

class BaseDataPreprocessor(nn.Module):
    def forward(self, data, training=True):  # ignore the training parameter here
        # suppose data given by CIFAR10 is a tuple. Actually
        # BaseDataPreprocessor could move various type of data
        # to target device.
        return tuple(_data.cuda() for _data in data)

BaseDataPreprocessor will move the data to the specified device.

Before answering the second question, let’s think about a few more questions first:

  1. Where should we perform normalization, in the transform or in the model?

    It sounds reasonable to put it in the transform to take advantage of the DataLoader’s multi-process acceleration, or in the model to move it to the GPU and use GPU resources to accelerate normalization. However, while we are debating whether CPU normalization is faster than GPU normalization, the time spent moving data from CPU to GPU is much longer than either.

    In fact, for less computationally intensive operations like normalization, the operation itself takes much less time than the data transfer, so the transfer has a higher priority for being optimized. If we could move the data to the specified device while it is still uint8 and before it is normalized (normalized float32 data is 4 times larger than uint8 data), it would greatly reduce the bandwidth and improve the efficiency of data transfer. This “lagged” normalization behavior is one of the main reasons why we designed the DataPreprocessor: it moves the data first and then normalizes it.

  2. How do we implement data augmentations like MixUp and Mosaic?

    It may seem that MixUp and Mosaic are just special data transformations that should be implemented in a transform. However, these two transformations fuse multiple images into one, which is very difficult to implement in a transform, since the current transform paradigm applies enhancements to a single image: reading additional images would be hard because the dataset is not accessible in the transform. If we instead implement Mosaic or MixUp based on the batch data sampled from the DataLoader, everything becomes easy: we can access multiple images at the same time and easily perform the image fusion operation.

    class MixUpDataPreprocessor(nn.Module):
        def __init__(self, num_class, alpha):
            super().__init__()
            self.num_class = num_class
            self.alpha = alpha

        def forward(self, data, training=True):
            data = tuple(_data.cuda() for _data in data)
            # Only perform MixUp in training mode
            if not training:
                return data

            img, label = data
            label = F.one_hot(label, self.num_class)  # label to one-hot
            batch_size = len(label)
            index = torch.randperm(batch_size)  # index of the fused images
            lam = np.random.beta(self.alpha, self.alpha)  # fusion factor

            # MixUp
            img = lam * img + (1 - lam) * img[index, :]
            label = lam * label + (1 - lam) * label[index, :]
            # Since the returned label is one-hot encoded, the `forward` of the
            # model should also be adjusted.
            return (img, label)
    

    Therefore, besides data transfer and normalization, another major function of the data_preprocessor is batch augmentation. The modularity of the data preprocessor also helps us achieve a free combination of algorithms and data augmentations.

  3. What should we do if the data sampled from the DataLoader does not match the model input? Should we modify the DataLoader or the model interface?

    The answer is: neither. The ideal solution is to do the adaptation without breaking the existing interface between the model and the DataLoader. The DataPreprocessor can handle this as well: you can customize your DataPreprocessor to convert the incoming data to the target format, as sketched below.
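    A minimal sketch of such an adapter (the dict keys 'img' and 'gt_label' below are hypothetical, not a fixed MMEngine convention):

    import torch.nn as nn

    class DictToTupleDataPreprocessor(nn.Module):
        """Convert dict batches from the DataLoader into the positional
        (imgs, labels) tuple that the model's forward expects."""

        def forward(self, data, training=True):
            imgs = data['img'].cuda()         # hypothetical key
            labels = data['gt_label'].cuda()  # hypothetical key
            return imgs, labels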

By now, you should understand the rationale of the data preprocessor and be able to confidently answer the two questions posed at the beginning of this tutorial! But you may still wonder what the optim_wrapper passed to train_step is, and how the predictions returned by test_step and val_step relate to the evaluator. You will find more in the evaluation tutorial and the optimizer wrapper tutorial.

Evaluation

Coming soon. Please refer to chinese documentation.

OptimWrapper

In the previous runner and model tutorials, we have more or less mentioned the concept of OptimWrapper, but we have not yet introduced why we need it and what the advantages of OptimWrapper are compared to PyTorch’s native optimizer. In this tutorial, we will help you understand these advantages and demonstrate how to use the wrapper.

As its name suggests, OptimWrapper is a high-level abstraction of PyTorch’s native optimizer, which provides a unified set of interfaces while adding more functionality. OptimWrapper supports different training strategies, including mixed precision training, gradient accumulation, and gradient clipping. We can choose the appropriate training strategy according to our needs. OptimWrapper also defines a standard process for parameter updating based on which users can switch between different training strategies for the same set of code.

OptimWrapper vs Optimizer

Now we use both PyTorch’s native optimizer and MMEngine’s OptimWrapper to perform single-precision training, mixed-precision training, and gradient accumulation, to show the differences in implementation.

Model training

1.1 Single-precision training with SGD in PyTorch

import torch
from torch.optim import SGD
import torch.nn as nn
import torch.nn.functional as F

inputs = [torch.zeros(10, 1, 1)] * 10
targets = [torch.ones(10, 1, 1)] * 10
model = nn.Linear(1, 1)
optimizer = SGD(model.parameters(), lr=0.01)
optimizer.zero_grad()

for input, target in zip(inputs, targets):
    output = model(input)
    loss = F.l1_loss(output, target)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

1.2 Single-precision training with OptimWrapper in MMEngine

from mmengine.optim import OptimWrapper

optim_wrapper = OptimWrapper(optimizer=optimizer)

for input, target in zip(inputs, targets):
    output = model(input)
    loss = F.l1_loss(output, target)
    optim_wrapper.update_params(loss)


OptimWrapper.update_params performs the standard process of gradient computation, parameter updating, and gradient zeroing in a single call, so it can be used to update the model parameters directly.

2.1 Mixed-precision training with SGD in PyTorch

from torch.cuda.amp import autocast

model = model.cuda()
inputs = [torch.zeros(10, 1, 1, 1)] * 10
targets = [torch.ones(10, 1, 1, 1)] * 10

for input, target in zip(inputs, targets):
    with autocast():
        output = model(input.cuda())
    loss = F.l1_loss(output, target.cuda())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

2.2 Mixed-precision training with OptimWrapper in MMEngine

from mmengine.optim import AmpOptimWrapper

optim_wrapper = AmpOptimWrapper(optimizer=optimizer)

for input, target in zip(inputs, targets):
    with optim_wrapper.optim_context(model):
        output = model(input.cuda())
    loss = F.l1_loss(output, target.cuda())
    optim_wrapper.update_params(loss)


To enable mixed-precision training, use AmpOptimWrapper.optim_context, which, like autocast, enables the context for mixed-precision training. In addition, AmpOptimWrapper.optim_context can accelerate gradient accumulation during distributed training, which will be introduced in the next example.

3.1 Mixed-precision training and gradient accumulation with SGD in PyTorch

for idx, (input, target) in enumerate(zip(inputs, targets)):
    with autocast():
        output = model(input.cuda())
    loss = F.l1_loss(output, target.cuda())
    loss.backward()
    if idx % 2 == 0:
        optimizer.step()
        optimizer.zero_grad()

3.2 Mixed-precision training and gradient accumulation with OptimWrapper in MMEngine

optim_wrapper = AmpOptimWrapper(optimizer=optimizer, accumulative_counts=2)

for input, target in zip(inputs, targets):
    with optim_wrapper.optim_context(model):
        output = model(input.cuda())
    loss = F.l1_loss(output, target.cuda())
    optim_wrapper.update_params(loss)


We only need to configure the accumulative_counts parameter and call the update_params interface to achieve gradient accumulation. Besides, in distributed training scenarios, if gradient accumulation is configured with the optim_context context enabled, unnecessary gradient synchronization can be avoided during the accumulation steps.

The OptimWrapper also provides a more fine-grained interface for users to customize their own parameter update logic:

  • backward: Accepts a loss tensor and computes the gradients of the parameters.

  • step: Same as optimizer.step; updates the parameters.

  • zero_grad: Same as optimizer.zero_grad; zeroes the gradients of the parameters.

We can use the above interfaces to implement the same parameter-update logic as with the PyTorch optimizer:

for idx, (input, target) in enumerate(zip(inputs, targets)):
    with optim_wrapper.optim_context(model):
        output = model(input.cuda())
    loss = F.l1_loss(output, target.cuda())
    optim_wrapper.backward(loss)
    if idx % 2 == 0:
        optim_wrapper.step()
        optim_wrapper.zero_grad()

We can also configure a gradient clipping strategy for the OptimWrapper.

# based on torch.nn.utils.clip_grad_norm_ method
optim_wrapper = AmpOptimWrapper(
    optimizer=optimizer, clip_grad=dict(max_norm=1))

# based on torch.nn.utils.clip_grad_value_ method
optim_wrapper = AmpOptimWrapper(
    optimizer=optimizer, clip_grad=dict(clip_value=0.2))

Get learning rate/momentum

The OptimWrapper provides get_lr and get_momentum to conveniently get the learning rate and momentum of the first parameter group in the optimizer.

import torch.nn as nn
from torch.optim import SGD

from mmengine.optim import OptimWrapper

model = nn.Linear(1, 1)
optimizer = SGD(model.parameters(), lr=0.01)
optim_wrapper = OptimWrapper(optimizer)

print(optimizer.param_groups[0]['lr'])  # 0.01
print(optimizer.param_groups[0]['momentum'])  # 0
print(optim_wrapper.get_lr())  # {'lr': [0.01]}
print(optim_wrapper.get_momentum())  # {'momentum': [0]}
0.01
0
{'lr': [0.01]}
{'momentum': [0]}

Export/load state dicts

Similar to the optimizer, the OptimWrapper provides the state_dict and load_state_dict interfaces for exporting and loading optimizer states. The AmpOptimWrapper can additionally export mixed-precision training parameters.

import torch.nn as nn
from torch.optim import SGD
from mmengine.optim import OptimWrapper, AmpOptimWrapper

model = nn.Linear(1, 1)
optimizer = SGD(model.parameters(), lr=0.01)

optim_wrapper = OptimWrapper(optimizer=optimizer)
amp_optim_wrapper = AmpOptimWrapper(optimizer=optimizer)

# export state dicts
optim_state_dict = optim_wrapper.state_dict()
amp_optim_state_dict = amp_optim_wrapper.state_dict()

print(optim_state_dict)
print(amp_optim_state_dict)
optim_wrapper_new = OptimWrapper(optimizer=optimizer)
amp_optim_wrapper_new = AmpOptimWrapper(optimizer=optimizer)

# load state dicts
amp_optim_wrapper_new.load_state_dict(amp_optim_state_dict)
optim_wrapper_new.load_state_dict(optim_state_dict)
{'state': {}, 'param_groups': [{'lr': 0.01, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'params': [0, 1]}]}
{'state': {}, 'param_groups': [{'lr': 0.01, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'params': [0, 1]}], 'loss_scaler': {'scale': 65536.0, 'growth_factor': 2.0, 'backoff_factor': 0.5, 'growth_interval': 2000, '_growth_tracker': 0}}

Use multiple optimizers

Considering that algorithms like GANs usually need to use multiple optimizers to train the generator and the discriminator, MMEngine provides a container class called OptimWrapperDict to manage them. OptimWrapperDict stores the sub-OptimWrapper in the form of dict, and can be accessed and traversed just like a dict.

Unlike regular OptimWrapper, OptimWrapperDict does not provide methods such as update_params, optim_context, backward, step, etc. Therefore, it cannot be used directly to train models. We suggest implementing the parameter-update logic by accessing the sub-OptimWrappers in OptimWrapperDict directly.

Users may wonder why not just use a plain dict to manage multiple optimizers, since OptimWrapperDict has no training capabilities. Actually, the core function of OptimWrapperDict is to support exporting and loading the state dictionaries of all its sub-OptimWrappers, as well as getting their learning rates and momentums. Without OptimWrapperDict, MMEngine would need a lot of if-else branches in OptimWrapper to collect the states of multiple OptimWrappers.

from torch.optim import SGD
import torch.nn as nn

from mmengine.optim import OptimWrapper, OptimWrapperDict

gen = nn.Linear(1, 1)
disc = nn.Linear(1, 1)
optimizer_gen = SGD(gen.parameters(), lr=0.01)
optimizer_disc = SGD(disc.parameters(), lr=0.01)

optim_wrapper_gen = OptimWrapper(optimizer=optimizer_gen)
optim_wrapper_disc = OptimWrapper(optimizer=optimizer_disc)
optim_dict = OptimWrapperDict(gen=optim_wrapper_gen, disc=optim_wrapper_disc)

print(optim_dict.get_lr())  # {'gen.lr': [0.01], 'disc.lr': [0.01]}
print(optim_dict.get_momentum())  # {'gen.momentum': [0], 'disc.momentum': [0]}
{'gen.lr': [0.01], 'disc.lr': [0.01]}
{'gen.momentum': [0], 'disc.momentum': [0]}

As shown in the above example, OptimWrapperDict easily exports learning rates and momentums for all OptimWrappers, and it can export and load all the state dicts in a similar way.
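For example, continuing the snippet above (the exported dict is keyed by the names given to OptimWrapperDict, so we expect 'gen' and 'disc' entries):

# export one state dict covering all sub-OptimWrappers
state_dict = optim_dict.state_dict()
print(state_dict.keys())

# load them back in a single call
optim_dict.load_state_dict(state_dict)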

Configure the OptimWrapper in the Runner

We first need to configure the optimizer for the OptimWrapper. MMEngine automatically adds all optimizers in PyTorch to the OPTIMIZERS registry, and users can specify the optimizers they need in the form of a dict. All supported optimizers in PyTorch are listed here.

Now let’s take setting up an SGD OptimWrapper as an example.

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer)

Here we have set up an OptimWrapper with an SGD optimizer using the specified learning rate and momentum. Since OptimWrapper is designed for standard single-precision training, we can also omit the type field in the configuration:

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(optimizer=optimizer)

To enable mixed-precision training and gradient accumulation, we change type to AmpOptimWrapper and specify the accumulative_counts parameter.

optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optim_wrapper = dict(type='AmpOptimWrapper', optimizer=optimizer, accumulative_counts=2)

Note

If you are new to reading the MMEngine tutorial and are not familiar with concepts such as configs and registries, it is recommended to skip the following advanced tutorials for now and read other documents first. Of course, if you already have a good understanding of this prerequisite knowledge, we highly recommend reading the advanced part which covers:

  1. How to customize the learning rate, decay coefficient, and other hyperparameters of the model parameters in the OptimWrapper configuration.

  2. How to customize the construction policy of the optimizer.

Apart from the prerequisite knowledge of configs and registries, it is recommended to thoroughly understand the native construction of PyTorch optimizers before starting the advanced tutorials.

Advanced usages

PyTorch’s optimizer allows different hyperparameters to be set for each parameter group in the model, such as using different learning rates for the backbone and the head of a classification model.

from torch.optim import SGD
import torch.nn as nn

model = nn.ModuleDict(dict(backbone=nn.Linear(1, 1), head=nn.Linear(1, 1)))
optimizer = SGD([{'params': model.backbone.parameters()},
     {'params': model.head.parameters(), 'lr': 1e-3}],
    lr=0.01,
    momentum=0.9)

In the above example, we set a learning rate of 0.01 for the backbone and a learning rate of 1e-3 for the head. Users can pass a list of dictionaries containing different parts of the model’s parameters and their corresponding hyperparameters to the optimizer, allowing fine-grained adjustment of the model optimization.

In MMEngine, the optimizer wrapper constructor allows users to set hyperparameters in different parts of the model directly by setting the paramwise_cfg in the configuration file rather than by modifying the code of building the optimizer.

Set different hyperparameters for different types of parameters

The default optimizer wrapper constructor in MMEngine supports setting different hyperparameters for different types of parameters in the model. For example, we can set norm_decay_mult=0 in paramwise_cfg to set the weight decay factor of the weights and biases of normalization layers to 0, implementing the trick of not decaying normalization layers as mentioned in Bag of Tricks.

Here, we set the weight decay coefficient in all normalization layers (head.bn) in ToyModel to 0 as follows.

from mmengine.optim import build_optim_wrapper
from collections import OrderedDict

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.ModuleDict(
            dict(layer0=nn.Linear(1, 1), layer1=nn.Linear(1, 1)))
        self.head = nn.Sequential(
            OrderedDict(
                linear=nn.Linear(1, 1),
                bn=nn.BatchNorm1d(1)))


optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    paramwise_cfg=dict(norm_decay_mult=0))
optimizer = build_optim_wrapper(ToyModel(), optim_wrapper)
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:weight_decay=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:weight_decay=0.0

In addition to configuring the weight decay, paramwise_cfg of MMEngine’s default optimizer wrapper constructor supports the following hyperparameters as well.

  • lr_mult: Learning rate coefficient for all parameters.

  • decay_mult: Weight decay coefficient for all parameters.

  • bias_lr_mult: Learning rate coefficient of the bias (excluding the bias of normalization layers and the offset of deformable convolution).

  • bias_decay_mult: Weight decay coefficient of the bias (excluding the bias of normalization layers and the offset of deformable convolution).

  • norm_decay_mult: Weight decay coefficient of the weight and bias of normalization layers.

  • flat_decay_mult: Weight decay coefficient of one-dimensional parameters.

  • dwconv_decay_mult: Weight decay coefficient of depth-wise convolution.

  • bypass_duplicate: Whether to skip duplicated parameters; defaults to False.

  • dcn_offset_lr_mult: Learning rate coefficient of the offset of deformable convolution.
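For instance, here is a hedged example combining several of these options in one config (the coefficient values are purely illustrative):

optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    paramwise_cfg=dict(
        bias_lr_mult=2.0,      # biases learn at twice the base rate
        norm_decay_mult=0.0,   # no weight decay for normalization layers
        flat_decay_mult=0.0))  # no weight decay for 1-D parameters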

Set different hyperparameters for different model modules

In addition, as shown in the PyTorch code above, in MMEngine we can also set different hyperparameters for any module in the model by setting custom_keys in paramwise_cfg.

Suppose we want to set the learning rate and the decay coefficient to 0 for backbone.layer0, keep the learning rate of 0.01 for the rest of the backbone, and set the learning rate to 0.001 for the head module. We can do it this way:

optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.layer0': dict(lr_mult=0, decay_mult=0),
            'backbone': dict(lr_mult=1),
            'head': dict(lr_mult=0.1)
        }))
optimizer = build_optim_wrapper(ToyModel(), optim_wrapper)
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:lr=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:weight_decay=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:lr_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.weight:decay_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:lr=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:weight_decay=0.0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:lr_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer0.bias:decay_mult=0
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.weight:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.weight:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.weight:lr_mult=1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:lr=0.01
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- backbone.layer1.bias:lr_mult=1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.weight:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.weight:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.weight:lr_mult=0.1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.linear.bias:lr_mult=0.1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.weight:lr_mult=0.1
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:lr=0.001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:weight_decay=0.0001
08/23 22:02:43 - mmengine - INFO - paramwise_options -- head.bn.bias:lr_mult=0.1

The named parameters of the above model can be printed as follows:

for name, val in ToyModel().named_parameters():
    print(name)
backbone.layer0.weight
backbone.layer0.bias
backbone.layer1.weight
backbone.layer1.bias
head.linear.weight
head.linear.bias
head.bn.weight
head.bn.bias

Each field in custom_keys is defined as follows.

  1. 'backbone': dict(lr_mult=1): Set the learning rate coefficient of parameters whose names are prefixed with backbone to 1.

  2. 'backbone.layer0': dict(lr_mult=0, decay_mult=0): Set the learning rate coefficient of parameters prefixed with backbone.layer0 to 0 and the decay coefficient to 0. This configuration has higher priority than the first one.

  3. 'head': dict(lr_mult=0.1): Set the learning rate coefficient of parameters whose names are prefixed with head to 0.1.

Customize optimizer construction policies

Like other modules in MMEngine, the optimizer wrapper constructor is also managed by the registry. We can customize the hyperparameter policies by implementing custom optimizer wrapper constructors.

For example, we can implement an optimizer wrapper constructor called LayerDecayOptimWrapperConstructor that automatically sets decreasing learning rates for layers at different depths of the model.

from mmengine.optim import DefaultOptimWrapperConstructor
from mmengine.registry import OPTIM_WRAPPER_CONSTRUCTORS
from mmengine.logging import print_log


@OPTIM_WRAPPER_CONSTRUCTORS.register_module(force=True)
class LayerDecayOptimWrapperConstructor(DefaultOptimWrapperConstructor):

    def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
        super().__init__(optim_wrapper_cfg, paramwise_cfg)
        self.decay_factor = paramwise_cfg.get('decay_factor', 0.5)

    def add_params(self, params, module, prefix='', lr=None):
        if lr is None:
            lr = self.base_lr

        for name, param in module.named_parameters(recurse=False):
            param_group = dict()
            param_group['params'] = [param]
            param_group['lr'] = lr
            params.append(param_group)
            full_name = f'{prefix}.{name}' if prefix else name
            print_log(f'{full_name} : lr={lr}', logger='current')

        for name, module in module.named_children():
            child_prefix = f'{prefix}.{name}' if prefix else name
            self.add_params(
                params, module, child_prefix, lr=lr * self.decay_factor)


class ToyModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.layer = nn.ModuleDict(dict(linear=nn.Linear(1, 1)))
        self.linear = nn.Linear(1, 1)


model = ToyModel()

optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    paramwise_cfg=dict(decay_factor=0.5),
    constructor='LayerDecayOptimWrapperConstructor')

optimizer = build_optim_wrapper(model, optim_wrapper)
08/23 22:20:26 - mmengine - INFO - layer.linear.weight : lr=0.0025
08/23 22:20:26 - mmengine - INFO - layer.linear.bias : lr=0.0025
08/23 22:20:26 - mmengine - INFO - linear.weight : lr=0.005
08/23 22:20:26 - mmengine - INFO - linear.bias : lr=0.005

When add_params is called for the first time, the params argument is an empty list and the module is the ToyModel instance. Please refer to the Optimizer Wrapper Constructor Documentation for detailed explanations on overloading.

Similarly, if we want to construct multiple optimizers, we also need to implement a custom constructor.

@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
class MultipleOptimWrapperConstructor:
    ...

Adjust hyperparameters during training

The hyperparameters of an optimizer can only be set to fixed values at construction time, and you cannot adjust parameters such as the learning rate during training just by using the optimizer wrapper. In MMEngine, we have implemented a parameter scheduler that tunes parameters during training. For its usage, please refer to the Parameter Scheduler tutorial.

Parameter Scheduler

During neural network training, optimization hyperparameters (e.g. learning rate) are usually adjusted along with the training process. One of the simplest and most common learning rate adjustment strategies is multi-step learning rate decay, which reduces the learning rate to a fraction at regular intervals. PyTorch provides LRScheduler to implement various learning rate adjustment strategies. In MMEngine, we have extended it and implemented a more general ParamScheduler. It can adjust optimization hyperparameters such as learning rate and momentum. It also supports the combination of multiple schedulers to create more complex scheduling strategies.

Usage

We first introduce how to use PyTorch’s torch.optim.lr_scheduler to adjust the learning rate.

How to use PyTorch's builtin learning rate scheduler?

Here is an example adapted from the PyTorch official documentation:

Initialize an ExponentialLR object, and call the step method after each training epoch.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import ExponentialLR

model = torch.nn.Linear(1, 1)
dataset = [torch.randn((1, 1, 1)) for _ in range(20)]
optimizer = SGD(model.parameters(), lr=0.1)
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(10):
    for data in dataset:
        optimizer.zero_grad()
        output = model(data)
        loss = (1 - output).mean()
        loss.backward()
        optimizer.step()
    scheduler.step()

mmengine.optim.scheduler supports most of PyTorch’s learning rate schedulers such as ExponentialLR, LinearLR, StepLR, MultiStepLR, etc. Please refer to parameter scheduler API documentation for all of the supported schedulers.

MMEngine also supports adjusting momentum with parameter schedulers. To use momentum schedulers, replace LR in the class name with Momentum, e.g. ExponentialMomentum or LinearMomentum. Further, we implemented the general parameter scheduler ParamScheduler, which can adjust specified hyperparameters in the optimizer, such as weight_decay. This feature makes it easier to apply complex hyperparameter tuning strategies.

Unlike the above example, in MMEngine you usually do not need to implement the training loop manually or call optimizer.step(); the runner automatically manages the training progress and controls the execution of the parameter scheduler through ParamSchedulerHook.

Use a single LRScheduler

If only one scheduler needs to be used for the entire training process, there is no difference from PyTorch’s learning rate schedulers.

# build the scheduler manually
from torch.optim import SGD
from mmengine.runner import Runner
from mmengine.optim.scheduler import MultiStepLR

optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
param_scheduler = MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

runner = Runner(
    model=model,
    optim_wrapper=dict(
        optimizer=optimizer),
    param_scheduler=param_scheduler,
    ...
    )


If using the runner with the registry and config file, we can specify the scheduler by setting the param_scheduler field in the config. The runner will automatically build a parameter scheduler based on this field:

# build the scheduler with config file
param_scheduler = dict(type='MultiStepLR', by_epoch=True, milestones=[8, 11], gamma=0.1)

Note that the parameter by_epoch is added here, which controls the frequency of learning rate adjustment. When set to True, it means adjusting by epoch. When set to False, it means adjusting by iteration. The default value is True.

In the above example, the learning rate is adjusted by epochs, so the unit of the parameters is epochs. For example, [8, 11] in milestones means that the learning rate will be multiplied by 0.1 at the end of the 8th and the 11th epoch.

When the frequency is modified, the meaning of the count-related settings of the scheduler changes accordingly: when by_epoch=True, the numbers in milestones indicate at which epochs the learning rate decay is performed, and when by_epoch=False, at which iterations.

Here is an example of adjusting by iterations: at the end of the 600th and 800th iterations, the learning rate will be multiplied by 0.1.

param_scheduler = dict(type='MultiStepLR', by_epoch=False, milestones=[600, 800], gamma=0.1)


If users want iteration-based updates while writing the scheduler config in epochs, MMEngine’s schedulers provide an automatic conversion method. Users can call the build_iter_from_epoch method and provide the number of iterations per training epoch to construct a scheduler object that updates by iterations:

epoch_length = len(train_dataloader)
param_scheduler = MultiStepLR.build_iter_from_epoch(optimizer, milestones=[8, 11], gamma=0.1, epoch_length=epoch_length)

If using config to build a scheduler, just add convert_to_iter_based=True to the field. The runner will automatically call build_iter_from_epoch to convert the epoch-based config to an iteration-based scheduler object:

param_scheduler = dict(type='MultiStepLR', by_epoch=True, milestones=[8, 11], gamma=0.1, convert_to_iter_based=True)

Below is a Cosine Annealing learning rate scheduler that is updated by epoch, where the learning rate is only modified after each epoch:

param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, T_max=12)


After automatic conversion, the learning rate is updated by iteration, and the learning rate curve changes more smoothly.

param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, T_max=12, convert_to_iter_based=True)


Combine multiple LRSchedulers (e.g. learning rate warm-up)

In some algorithms, the learning rate is not adjusted according to a single scheduling strategy from beginning to end. The most common example is learning rate warm-up.

For example, in the first few iterations, a linear strategy is used to increase the learning rate from a small value to normal, and then another strategy is applied.

MMEngine supports combining multiple schedulers. Just change the param_scheduler field in the config file to a list of scheduler configs, and the ParamSchedulerHook will automatically process the scheduler list. The following example implements learning rate warm-up.

param_scheduler = [
    # Linear learning rate warm-up scheduler
    dict(type='LinearLR',
         start_factor=0.001,
         by_epoch=False,  # Updated by iterations
         begin=0,
         end=50),  # Warm up for the first 50 iterations
    # The main LRScheduler
    dict(type='MultiStepLR',
         by_epoch=True,  # Updated by epochs
         milestones=[8, 11],
         gamma=0.1)
]


Note that the begin and end parameters are added here. These two parameters specify the valid interval of the scheduler. The valid interval usually only needs to be set when multiple schedulers are combined, and can be ignored when using a single scheduler. When the begin and end parameters are specified, it means that the scheduler only takes effect in the [begin, end) interval, and the unit is determined by the by_epoch parameter.

In the above example, the by_epoch of LinearLR in the warm-up phase is False, which means the scheduler only takes effect during the first 50 iterations. After the first 50 iterations, it no longer takes effect, and the second scheduler, MultiStepLR, controls the learning rate. When combining different schedulers, the by_epoch parameter does not have to be the same for each scheduler.

Here is another example:

param_scheduler = [
    # Use a linear warm-up at [0, 100) iterations
    dict(type='LinearLR',
         start_factor=0.001,
         by_epoch=False,
         begin=0,
         end=100),
    # Use a cosine learning rate at [100, 900) iterations
    dict(type='CosineAnnealingLR',
         T_max=800,
         by_epoch=False,
         begin=100,
         end=900)
]


The above example uses a linear learning rate warm-up for the first 100 iterations, and then uses a cosine annealing learning rate scheduler with a period of 800 from the 100th to the 900th iteration.

Users can combine any number of schedulers. If the valid intervals of two schedulers are not connected, leaving an interval that is not covered, the learning rate remains unchanged in that interval. If the valid intervals of two schedulers overlap, the adjustments are applied in the order of the scheduler configs (similar to ChainedScheduler).

We recommend using different learning rate scheduling strategies in different stages of training to avoid overlapping valid intervals. Be careful if you really need to stack two overlapping schedulers: we recommend using the learning rate visualization tool to check the resulting learning rate after stacking, to avoid adjustments that differ from your expectations.

How to adjust other hyperparameters

Momentum

Like learning rate, momentum is a schedulable hyperparameter in the optimizer’s parameter group. The momentum scheduler is used in exactly the same way as the learning rate scheduler. Just add the momentum scheduler config to the list in the param_scheduler field.

Example:

param_scheduler = [
    # the lr scheduler
    dict(type='LinearLR', ...),
    # the momentum scheduler
    dict(type='LinearMomentum',
         start_factor=0.001,
         by_epoch=False,
         begin=0,
         end=1000)
]

Generic parameter scheduler

MMEngine also provides a set of generic parameter schedulers for scheduling other hyperparameters in the param_groups of the optimizer. Change LR in the class name of the learning rate scheduler to Param, such as LinearParamScheduler. Users can schedule the specific hyperparameters by setting the param_name variable of the scheduler.

Here is an example:

param_scheduler = [
    dict(type='LinearParamScheduler',
         param_name='lr',  # adjust the 'lr' in `optimizer.param_groups`
         start_factor=0.001,
         by_epoch=False,
         begin=0,
         end=1000)
]

By setting param_name to 'lr', this parameter scheduler is equivalent to LinearLR.

In addition to learning rate and momentum, users can also schedule other parameters in optimizer.param_groups. The schedulable parameters depend on the optimizer used. For example, when using the SGD optimizer with weight_decay, the weight_decay can be adjusted as follows:

param_scheduler = [
    dict(type='LinearParamScheduler',
         param_name='weight_decay',  # adjust 'weight_decay' in `optimizer.param_groups`
         start_factor=0.001,
         by_epoch=False,
         begin=0,
         end=1000)
]

Hook

Hook programming is a programming pattern in which a mount point is set in one or more locations of a program. When the program runs to a mount point, all methods registered to it at runtime are automatically called. Hook programming can increase the flexibility and extensibility of the program, since users can register custom methods to the mount point to be called without modifying the code in the program.
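As a hedged sketch of what this pattern looks like in MMEngine (the hook name and the printed message are illustrative; registration via custom_hooks is shown later in this section):

from mmengine.hooks import Hook
from mmengine.registry import HOOKS

@HOOKS.register_module()
class MyToyHook(Hook):
    """A toy hook whose method runs at the 'after train epoch' mount point."""

    def after_train_epoch(self, runner):
        print(f'epoch {runner.epoch} finished')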

Built-in Hooks

MMEngine encapsulates many utilities as built-in hooks. These hooks are divided into two categories, namely default hooks and custom hooks. The former are registered with the Runner by default, while the latter are registered by the user on demand.

Each hook has a corresponding priority. At each mount point, hooks with higher priority are called earlier by the Runner. When sharing the same priority, the hooks are called in their registration order. The priority list is as follows.

  • HIGHEST (0)

  • VERY_HIGH (10)

  • HIGH (30)

  • ABOVE_NORMAL (40)

  • NORMAL (50)

  • BELOW_NORMAL (60)

  • LOW (70)

  • VERY_LOW (90)

  • LOWEST (100)

default hooks

  • RuntimeInfoHook: Update runtime information into the message hub. Priority: VERY_HIGH (10)

  • IterTimerHook: Update the time spent during each iteration into the message hub. Priority: NORMAL (50)

  • DistSamplerSeedHook: Ensure the distributed Sampler’s shuffle is active. Priority: NORMAL (50)

  • LoggerHook: Collect logs from different components of the Runner and write them to the terminal, JSON files, TensorBoard, wandb, etc. Priority: BELOW_NORMAL (60)

  • ParamSchedulerHook: Update some hyperparameters of the optimizer. Priority: LOW (70)

  • CheckpointHook: Save checkpoints periodically. Priority: VERY_LOW (90)

custom hooks

  • EMAHook: Apply Exponential Moving Average (EMA) on the model during training. Priority: NORMAL (50)

  • EmptyCacheHook: Release all unoccupied cached GPU memory during training. Priority: NORMAL (50)

  • SyncBuffersHook: Synchronize model buffers at the end of each epoch. Priority: NORMAL (50)

Note

It is not recommended to modify the priority of the default hooks, as hooks with lower priority may depend on hooks with higher priority. For example, CheckpointHook needs to have a lower priority than ParamSchedulerHook so that the saved optimizer state is correct. Also, the priority of custom hooks defaults to NORMAL (50).

The two types of hooks are set differently in the Runner, with the configuration of default hooks being passed to the default_hooks parameter of the Runner and the configuration of custom hooks being passed to the custom_hooks parameter, as follows.

from mmengine.runner import Runner
default_hooks = dict(
    runtime_info=dict(type='RuntimeInfoHook'),
    timer=dict(type='IterTimerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    logger=dict(type='LoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1),
)
custom_hooks = [dict(type='EmptyCacheHook')]
runner = Runner(default_hooks=default_hooks, custom_hooks=custom_hooks, ...)
runner.train()

CheckpointHook

CheckpointHook saves the checkpoints at a given interval. In the case of distributed training, only the master process will save the checkpoints. The main features of CheckpointHook are as follows.

  • Save checkpoints by interval, and support saving them by epoch or iteration

  • Save the most recent checkpoints

  • Save the best checkpoints

  • Specify the path to save the checkpoints

For more features, please read the CheckpointHook API documentation.

The four features mentioned above are described below.

  • Save checkpoints by interval, and support saving them by epoch or iteration

    Suppose we train for a total of 20 epochs and want to save the checkpoints every 5 epochs; the following configuration achieves this requirement.

    # the default value of by_epoch is True
    default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, by_epoch=True))
    

    If you want to save checkpoints by iteration, you can set by_epoch to False and interval=5 to save them every 5 iterations.

    default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, by_epoch=False))
    
  • Save the most recent checkpoints

    If you only want to keep a certain number of checkpoints, you can set the max_keep_ckpts parameter. When the number of checkpoints saved exceeds max_keep_ckpts, the previous checkpoints will be deleted.

    default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, max_keep_ckpts=2))
    

    The above config shows that if a total of 20 epochs are trained, the model will be saved at epochs 5, 10, 15, and 20, but the checkpoint epoch_5.pth will be deleted at epoch 15, and at epoch 20 the checkpoint epoch_10.pth will be deleted, so that only the epoch_15.pth and epoch_20.pth will be saved.

  • Save the best checkpoints

    If you want to save the best checkpoints of the validation set during training, you can set the save_best parameter. If it is set to 'auto', the current checkpoint is judged to be the best based on the first evaluation metric of the validation set (the evaluation metrics returned by the evaluator are an ordered dictionary).

    default_hooks = dict(checkpoint=dict(type='CheckpointHook', save_best='auto'))
    

    You can also directly specify the value of save_best as an evaluation metric. For example, in a classification task, you can specify save_best='top-1'; the current checkpoint is then judged as the best based on the value of 'top-1'.

    In addition to the save_best parameter, other parameters related to saving the best checkpoint are rule, greater_keys and less_keys, which indicate whether a larger metric value is better. For example, if you specify save_best='top-1', you can specify rule='greater' to indicate that the larger the value, the better the checkpoint.
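
    For example, a minimal sketch combining these parameters, keeping only the checkpoint with the highest 'top-1' accuracy:

    # judge the best checkpoint by 'top-1'; larger is better
    default_hooks = dict(checkpoint=dict(type='CheckpointHook', save_best='top-1', rule='greater'))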

  • Specify the path to save the checkpoints

    The checkpoints are saved in work_dir by default, but the path can be changed by setting out_dir.

    default_hooks = dict(checkpoint=dict(type='CheckpointHook', interval=5, out_dir='/path/of/directory'))
    

LoggerHook

LoggerHook collects logs from different components of the Runner and writes them to the terminal, JSON files, tensorboard, wandb, etc.

If we want to output (or save) the logs every 20 iterations, we can set the interval parameter and configure it as follows.

default_hooks = dict(logger=dict(type='LoggerHook', interval=20))

If you are interested in how MMEngine manages logging, you can refer to logging.

ParamSchedulerHook

ParamSchedulerHook iterates through all optimizer parameter schedulers of the Runner and calls their step method to update the optimizer parameters in order. See Parameter Schedulers for more details about what parameter schedulers are.

ParamSchedulerHook is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.
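
Conceptually, what ParamSchedulerHook does at its mount points is roughly the following (a simplified sketch, not the exact implementation, which also distinguishes epoch-based from iteration-based schedulers):

# step every parameter scheduler attached to the runner
for scheduler in runner.param_schedulers:
    scheduler.step()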

IterTimerHook

IterTimerHook is used to record the time taken to load data and iterate once.

IterTimerHook is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.

DistSamplerSeedHook

DistSamplerSeedHook calls the step method of the Sampler during distributed training to ensure that the shuffle operation takes effect.

DistSamplerSeedHook is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.

RuntimeInfoHook

RuntimeInfoHook will update the current runtime information (e.g. epoch, iter, max_epochs, max_iters, lr, metrics, etc.) to the message hub at different mount points in the Runner so that other modules without access to the Runner can obtain this information.

RuntimeInfoHook is registered to the Runner by default and has no configurable parameters, so there is no need to configure it.
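
Other modules can then query the information written by RuntimeInfoHook through the message hub. A minimal sketch, assuming 'iter' is among the runtime fields mentioned above:

from mmengine.logging import MessageHub

# the message hub is a global singleton shared across modules
message_hub = MessageHub.get_current_instance()
current_iter = message_hub.get_info('iter')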

EMAHook

EMAHook performs an exponential moving average operation on the model during training, with the aim of improving the robustness of the model. Note that the model generated by exponential moving average is only used for validation and testing, and does not affect training.

custom_hooks = [dict(type='EMAHook')]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train()

EMAHook uses ExponentialMovingAverage by default. Other averaging strategies, such as StochasticWeightAverage and MomentumAnnealingEMA, can be used by setting ema_type.

custom_hooks = [dict(type='EMAHook', ema_type='StochasticWeightAverage')]

See EMAHook API Reference for more usage.

EmptyCacheHook

EmptyCacheHook calls torch.cuda.empty_cache() to release all unoccupied cached GPU memory. The timing of releasing memory can be controlled by setting parameters like before_epoch, after_iter, and after_epoch, meaning before the start of each epoch, after each iteration, and after each epoch respectively.

# The release operation is performed at the end of each epoch
custom_hooks = [dict(type='EmptyCacheHook', after_epoch=True)]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train()

SyncBuffersHook

SyncBuffersHook synchronizes the buffer of the model at the end of each epoch during distributed training, e.g. running_mean and running_var of the BN layer.

custom_hooks = [dict(type='SyncBuffersHook')]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train()

Customize Your Hooks

If the built-in hooks provided by MMEngine do not cover your needs, you are encouraged to customize your own hooks by simply inheriting the base hook class and overriding the corresponding mount point methods.

For example, if you want to check whether the loss value is valid, i.e. not infinite, during training, you can simply override the after_train_iter method as below. The check will be performed after each training iteration.

import torch
from mmengine.registry import HOOKS
from mmengine.hooks import Hook
@HOOKS.register_module()
class CheckInvalidLossHook(Hook):
    """Check invalid loss hook.
    This hook will regularly check whether the loss is valid
    during training.
    Args:
        interval (int): Checking interval (every k iterations).
            Defaults to 50.
    """
    def __init__(self, interval=50):
        self.interval = interval
    def after_train_iter(self, runner, batch_idx, data_batch=None, outputs=None):
        """All subclasses should override this method, if they need any
        operations after each training iteration.
        Args:
            runner (Runner): The runner of the training process.
            batch_idx (int): The index of the current batch in the train loop.
            data_batch (dict or tuple or list, optional): Data from dataloader.
            outputs (dict, optional): Outputs from model.
        """
        if self.every_n_train_iters(runner, self.interval):
            assert torch.isfinite(outputs['loss']), \
                'loss becomes infinite or NaN!'

We simply pass the hook config to the custom_hooks parameter of the Runner, which will register the hooks when the Runner is initialized.

from mmengine.runner import Runner
custom_hooks = [
    dict(type='CheckInvalidLossHook', interval=50)
]
runner = Runner(custom_hooks=custom_hooks, ...)
runner.train()  # start training

The loss value is then checked after each training iteration.

Note that the priority of a custom hook is NORMAL (50) by default. If you want to change the priority of the hook, you can set the priority key in the config.

custom_hooks = [
    dict(type='CheckInvalidLossHook', interval=50, priority='ABOVE_NORMAL')
]

You can also set priority when defining classes.

@HOOKS.register_module()
class CheckInvalidLossHook(Hook):
    priority = 'ABOVE_NORMAL'

Registry

OpenMMLab supports a rich collection of algorithms and datasets, therefore, many modules with similar functionality are implemented. For example, the implementations of ResNet and SE-ResNet are based on the classes ResNet and SEResNet, respectively, which have similar functions and interfaces and belong to the model components of the algorithm library. To manage these functionally similar modules, MMEngine implements the registry. Most of the algorithm libraries in OpenMMLab use registry to manage their modules, including MMDetection, MMDetection3D, MMClassification and MMEditing, etc.

What is a registry

The registry in MMEngine can be considered as a union of a mapping table and a build function of modules. The mapping table maintains a mapping from strings to classes or functions, allowing the user to find the corresponding class or function with its name/notation. For example, the mapping from the string "ResNet" to the ResNet class. The module build function defines how to find the corresponding class or function based on a string and how to instantiate the class or call the function. For example, finding nn.BatchNorm2d and instantiating the BatchNorm2d module by the string "bn", or finding the build_batchnorm2d function by the string "build_batchnorm2d" and then returning the result. The registries in MMEngine use the build_from_cfg function by default to find and instantiate the class or function corresponding to the string.

The classes or functions managed by a registry usually have similar interfaces and functionality, so the registry can be treated as an abstraction of those classes or functions. For example, the registry MODELS can be treated as an abstraction of all models, which manages classes such as ResNet, SEResNet and RegNetX and constructors such as build_ResNet, build_SEResNet and build_RegNetX.

Getting started

There are three steps required to use the registry to manage modules in the codebase.

  1. Create a registry.

  2. Create a build method for instantiating the class (optional because in most cases you can just use the default method).

  3. Add the module to the registry

Suppose we want to implement a series of activation modules and want to be able to switch to different modules by just modifying the configuration without modifying the code.

Let’s create a registry first.

from mmengine import Registry
# `scope` represents the domain of the registry. If not set, the default value is the package name.
# e.g. in mmdetection, the scope is mmdet
# `locations` indicates the location where the modules in this registry are defined.
# The Registry will automatically import the modules when building them according to these predefined locations.
ACTIVATION = Registry('activation', scope='mmengine', locations=['mmengine.models.activations'])

The module mmengine.models.activations specified by locations corresponds to the mmengine/models/activations.py file. When building modules with registry, the ACTIVATION registry will automatically import implemented modules from this file. Therefore, we can implement different activation layers in the mmengine/models/activations.py file, such as Sigmoid, ReLU, and Softmax.

import torch.nn as nn

# use the register_module
@ACTIVATION.register_module()
class Sigmoid(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        print('call Sigmoid.forward')
        return x

@ACTIVATION.register_module()
class ReLU(nn.Module):
    def __init__(self, inplace=False):
        super().__init__()

    def forward(self, x):
        print('call ReLU.forward')
        return x

@ACTIVATION.register_module()
class Softmax(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        print('call Softmax.forward')
        return x

The key to using the registry is to register the implemented modules into the ACTIVATION registry. With the @ACTIVATION.register_module() decorator added before the implemented module, the mapping between strings and classes or functions can be built and maintained by ACTIVATION. We can achieve the same functionality with ACTIVATION.register_module(module=ReLU) as well, as shown below.
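
For instance, an already-defined class can be registered without the decorator (a minimal sketch; GELU is a hypothetical module following the same pattern as above):

class GELU(nn.Module):
    def forward(self, x):
        print('call GELU.forward')
        return x

# equivalent to decorating GELU with @ACTIVATION.register_module()
ACTIVATION.register_module(module=GELU)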

By registering, we can create a mapping between strings and classes or functions via ACTIVATION.

print(ACTIVATION.module_dict)
# {
#     'Sigmoid': __main__.Sigmoid,
#     'ReLU': __main__.ReLU,
#     'Softmax': __main__.Softmax
# }

Note

The key to triggering the registry mechanism is to make sure the module is imported. There are three ways to register a module into the registry:

  1. Implement the module in the locations. The registry will automatically import modules in the predefined locations. This is to ease the usage of algorithm libraries so that users can directly use REGISTRY.build(cfg).

  2. Import the file manually. This is common when developers implement a new module inside or outside the algorithm library.

  3. Use custom_imports field in config. Please refer to Importing custom Python modules for more details.

Once the implemented module is successfully registered, we can use the activation module in the configuration file.

import torch

input = torch.randn(2)

act_cfg = dict(type='Sigmoid')
activation = ACTIVATION.build(act_cfg)
output = activation(input)
# call Sigmoid.forward
print(output)

We can switch to ReLU by just changing this configuration.

act_cfg = dict(type='ReLU', inplace=True)
activation = ACTIVATION.build(act_cfg)
output = activation(input)
# call ReLU.forward
print(output)

If we want to check the type of input parameters (or any other operations) before creating an instance, we can implement a build method and pass it to the registry to implement a custom build process.

Create a build_activation function.

def build_activation(cfg, registry, *args, **kwargs):
    cfg_ = cfg.copy()
    act_type = cfg_.pop('type')
    print(f'build activation: {act_type}')
    act_cls = registry.get(act_type)
    act = act_cls(*args, **kwargs, **cfg_)
    return act

Pass build_activation to the build_func argument.

ACTIVATION = Registry('activation', build_func=build_activation, scope='mmengine', locations=['mmengine.models.activations'])

@ACTIVATION.register_module()
class Tanh(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        print('call Tanh.forward')
        return x

act_cfg = dict(type='Tanh')
activation = ACTIVATION.build(act_cfg)
output = activation(input)
# build activation: Tanh
# call Tanh.forward
print(output)

Note

In the above example, we demonstrate how to customize the method of building an instance of a class using the build_func. This is similar to the default build_from_cfg method. In most cases, using the default method will be fine.

MMEngine’s registry can register classes as well as functions.

FUNCTION = Registry('function', scope='mmengine')

@FUNCTION.register_module()
def print_args(**kwargs):
    print(kwargs)

func_cfg = dict(type='print_args', a=1, b=2)
func_res = FUNCTION.build(func_cfg)
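
Building this config calls print_args(a=1, b=2), so it prints {'a': 1, 'b': 2}; when a function is registered, build calls it with the remaining fields of the config as keyword arguments and returns its result.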

Advanced usage

The registry in MMEngine supports hierarchical registration, which enables cross-project calls, meaning that modules from one project can be used in another project. Though there are other ways to implement this, the registry provides a much easier solution.

To easily make cross-library calls, MMEngine provides twenty root registries, including:

  • RUNNERS: the registry for Runner.

  • RUNNER_CONSTRUCTORS: the constructors for Runner.

  • LOOPS: manages training, validation and testing processes, such as EpochBasedTrainLoop.

  • HOOKS: the hooks, such as CheckpointHook and ParamSchedulerHook.

  • DATASETS: the datasets.

  • DATA_SAMPLERS: Sampler of DataLoader, used to sample the data.

  • TRANSFORMS: various data preprocessing methods, such as Resize and Reshape.

  • MODELS: various modules of the model.

  • MODEL_WRAPPERS: model wrappers for parallelizing distributed data, such as MMDistributedDataParallel.

  • WEIGHT_INITIALIZERS: the tools for weight initialization.

  • OPTIMIZERS: registers all Optimizers and custom Optimizers in PyTorch.

  • OPTIM_WRAPPER: the wrappers for Optimizer-related operations, such as OptimWrapper and AmpOptimWrapper.

  • OPTIM_WRAPPER_CONSTRUCTORS: the constructors for optimizer wrappers.

  • PARAM_SCHEDULERS: various parameter schedulers, such as MultiStepLR.

  • METRICS: the evaluation metrics for computing model accuracy, such as Accuracy.

  • EVALUATOR: one or more evaluation metrics used to calculate the model accuracy.

  • TASK_UTILS: the task-specific components, such as AnchorGenerator and BboxCoder.

  • VISUALIZERS: the visualization modules that draw prediction boxes on images, such as DetVisualizer.

  • VISBACKENDS: the backends for storing training logs, such as LocalVisBackend and TensorboardVisBackend.

  • LOG_PROCESSORS: controls the log statistics window and statistics methods, by default we use LogProcessor. You may customize LogProcessor if you have special needs.

Use the module of the parent node

Let’s define a RReLU module in MMEngine and register it to the MODELS root registry.

import torch.nn as nn
from mmengine import Registry, MODELS

@MODELS.register_module()
class RReLU(nn.Module):
    def __init__(self, lower=0.125, upper=0.333, inplace=False):
        super().__init__()

    def forward(self, x):
        print('call RReLU.forward')
        return x

Now suppose there is a project called MMAlpha, which also defines MODELS and sets its parent node to the MODELS of MMEngine, creating a hierarchical structure.

from mmengine import Registry, MODELS as MMENGINE_MODELS

MODELS = Registry('model', parent=MMENGINE_MODELS, scope='mmalpha', locations=['mmalpha.models'])

The following figure shows the hierarchy of MMEngine and MMAlpha.

The count_registered_modules function can be used to print the modules that have been registered to MMEngine and their hierarchy.

from mmengine.registry import count_registered_modules

count_registered_modules()

We define a customized LogSoftmax module in MMAlpha and register it to the MODELS in MMAlpha.

@MODELS.register_module()
class LogSoftmax(nn.Module):
    def __init__(self, dim=None):
        super().__init__()

    def forward(self, x):
        print('call LogSoftmax.forward')
        return x

Here we use the LogSoftmax in the configuration of MMAlpha.

model = MODELS.build(cfg=dict(type='LogSoftmax'))

We can also use the modules of the parent node MMEngine here in MMAlpha.

model = MODELS.build(cfg=dict(type='RReLU', lower=0.2))
# scope is optional
model = MODELS.build(cfg=dict(type='mmengine.RReLU'))

If no prefix is added, the build method will first find out if the module exists in the current node and return it if there is one. Otherwise, it will continue to look up the parent nodes or even the ancestor node until it finds the module. If the same module exists in both the current node and the parent nodes, we need to specify the scope prefix to indicate that we want to use the module of the parent nodes.

import torch

input = torch.randn(2)
output = model(input)
# call RReLU.forward
print(output)

Use the module of a sibling node

In addition to using the module of the parent nodes, users can also call the module of a sibling node.

Suppose there is another project called MMBeta, which, like MMAlpha, defines MODELS and sets its parent node to the MODELS of MMEngine.

from mmengine import Registry, MODELS as MMENGINE_MODELS

MODELS = Registry('model', parent=MMENGINE_MODELS, scope='mmbeta')

The following figure shows the registry structure of MMAlpha and MMBeta.

Now we call the modules of MMAlpha in MMBeta.

model = MODELS.build(cfg=dict(type='mmalpha.LogSoftmax'))
output = model(input)
# call LogSoftmax.forward
print(output)

Calling a module of a sibling node requires the scope prefix to be specified in type, so the above configuration requires the prefix mmalpha.

However, if you need to call several modules of a sibling node, each with a prefix, this requires a lot of modification. Therefore, MMEngine introduces the DefaultScope, with which Registry can easily support temporary switching of the current node to the specified node.

If you need to switch the current node to the specified node temporarily, just set _scope_ to the scope of the specified node in cfg.

model = MODELS.build(cfg=dict(type='LogSoftmax', _scope_='mmalpha'))
output = model(input)
# call LogSoftmax.forward
print(output)

Config

MMEngine implements an abstract configuration class (Config) to provide a unified configuration access interface for users. Config supports different types of configuration files, including python, json and yaml, and you can choose the type according to your preference. Config overrides some magic methods, which help you access the data stored in Config just like getting values from a dict, or getting attributes from an instance. Besides, Config also provides an inheritance mechanism, which helps you better organize and manage configuration files.

Before starting the tutorial, let’s download the configuration files needed in the tutorial (it is recommended to execute them in a temporary directory to facilitate deleting these files later):

wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/config_sgd.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/cross_repo.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/custom_imports.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/demo_train.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/example.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/learn_read_config.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/my_module.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/optimizer_cfg.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/predefined_var.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/refer_base_var.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50_delete_key.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50_lr0.01.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50_runtime.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/resnet50.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/runtime_cfg.py
wget https://raw.githubusercontent.com/open-mmlab/mmengine/main/docs/resources/config/modify_base_var.py

Read the configuration file

Config provides a uniform interface Config.fromfile() to read and parse configuration files.

A valid configuration file should define a set of key-value pairs, and here are a few examples:

Python:

test_int = 1
test_list = [1, 2, 3]
test_dict = dict(key1='value1', key2=0.1)

Json:

{
  "test_int": 1,
  "test_list": [1, 2, 3],
  "test_dict": {"key1": "value1", "key2": 0.1}
}

YAML:

test_int: 1
test_list: [1, 2, 3]
test_dict:
  key1: "value1"
  key2: 0.1

For the above three formats, assume the file names are config.py, config.json, and config.yml. Loading any of them with Config.fromfile('config.xxx') returns the same result, containing the three variables test_int, test_list, and test_dict.

Let’s take config.py as an example:

from mmengine.config import Config

cfg = Config.fromfile('learn_read_config.py')
print(cfg)
Config (path: learn_read_config.py): {'test_int': 1, 'test_list': [1, 2, 3], 'test_dict': {'key1': 'value1', 'key2': 0.1}}

How to use Config

After loading the configuration file, we can access the data stored in the Config instance just like getting/setting values in a dict, or getting/setting attributes of an instance.

print(cfg.test_int)
print(cfg.test_list)
print(cfg.test_dict)
cfg.test_int = 2

print(cfg['test_int'])
print(cfg['test_list'])
print(cfg['test_dict'])
cfg['test_list'][1] = 3
print(cfg['test_list'])
1
[1, 2, 3]
{'key1': 'value1', 'key2': 0.1}
2
[1, 2, 3]
{'key1': 'value1', 'key2': 0.1}
[1, 3, 3]

Note

The dict object parsed by Config will be converted to ConfigDict, and then we can access the value of the dict the same as accessing the attribute of an instance.
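
For example, nested dictionaries parsed from the file above also support attribute access:

print(cfg.test_dict.key1)
# value1
cfg.test_dict.key2 = 0.2
print(cfg['test_dict']['key2'])
# 0.2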

We can use Config in combination with the Registry to build registered instances easily.

Here is an example of defining optimizers in a configuration file.

config_sgd.py

optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001)

Suppose we have defined a registry OPTIMIZERS, which includes various optimizers. Then we can build the optimizer as below.

from mmengine import Config, optim
from mmengine.registry import OPTIMIZERS

import torch.nn as nn

cfg = Config.fromfile('config_sgd.py')

model = nn.Conv2d(1, 1, 1)
cfg.optimizer.params = model.parameters()
optimizer = OPTIMIZERS.build(cfg.optimizer)
print(optimizer)
SGD (
Parameter Group 0
    dampening: 0
    foreach: None
    lr: 0.1
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0.0001
)

Inheritance between configuration files

Sometimes, the difference between two configuration files is so small that only one field may be changed. Therefore, it’s unwise to copy and paste everything only to modify one line, which makes it hard for us to locate the specific difference after a long time.

In another case, multiple configuration files may share the same batch of fields, and we have to copy and paste them into different configuration files. These fields will also be hard to maintain over time.

We address these issues with the inheritance mechanism, detailed below.

Overview of inheritance mechanism

Here is an example to illustrate the inheritance mechanism.

optimizer_cfg.py

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)

resnet50.py

_base_ = ['optimizer_cfg.py']
model = dict(type='ResNet', depth=50)

Although we don’t define optimizer in resnet50.py, since we wrote _base_ = ['optimizer_cfg.py'], it will inherit the fields defined in optimizer_cfg.py.

cfg = Config.fromfile('resnet50.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0001}

_base_ is a reserved field for the configuration file. It specifies the inherited base files for the current file. Inheriting multiple files will get all the fields at the same time, but it requires that there are no repeated fields defined in all base files.

runtime_cfg.py

gpu_ids = [0, 1]

resnet50_runtime.py

_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)

In this case, reading resnet50_runtime.py gives you the three fields model, optimizer, and gpu_ids.

cfg = Config.fromfile('resnet50_runtime.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.02, 'momentum': 0.9, 'weight_decay': 0.0001}

In this way, we can split the configuration, define some general configuration files, and inherit them in specific configuration files. This avoids defining a lot of duplicated content in multiple configuration files.

Modify the inherited fields

Sometimes, we want to modify some of the fields in the inherited files. For example, we want to modify the learning rate from 0.02 to 0.01 after inheriting optimizer_cfg.py.

In this case, you can simply redefine the fields in the new configuration file. Note that since the optimizer field is a dictionary, we only need to redefine the modified fields. This rule also applies to adding fields.

resnet50_lr0.01.py

_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)
optimizer = dict(lr=0.01)

After reading this configuration file, you can get the desired result.

cfg = Config.fromfile('resnet50_lr0.01.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.01, 'momentum': 0.9, 'weight_decay': 0.0001}

For non-dictionary fields, such as integers, strings, lists, etc., they can be completely overwritten by redefining them. For example, the code block below will change the value of the gpu_ids to [0].

_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)
gpu_ids = [0]

Delete key in dict

Sometimes we not only want to modify or add keys, but also want to delete them. In this case, we need to set _delete_=True in the target field (dict) to delete all the keys that do not appear in the newly defined dictionary.

resnet50_delete_key.py

_base_ = ['optimizer_cfg.py', 'runtime_cfg.py']
model = dict(type='ResNet', depth=50)
optimizer = dict(_delete_=True, type='SGD', lr=0.01)

At this point, optimizer will only have the keys type and lr. momentum and weight_decay will no longer exist.

cfg = Config.fromfile('resnet50_delete_key.py')
print(cfg.optimizer)
{'type': 'SGD', 'lr': 0.01}

Reference of the inherited file

Sometimes we want to reuse fields defined in _base_. We can get a copy of the corresponding variable by using {{_base_.xxxx}}:

refer_base_var.py

_base_ = ['resnet50.py']
a = {{_base_.model}}

After parsing, the value of a becomes the model defined in resnet50.py:

cfg = Config.fromfile('refer_base_var.py')
print(cfg.a)
{'type': 'ResNet', 'depth': 50}

This approach to getting the variables defined in _base_ works for json, yaml, and python configuration files alike.

Although this way is general for all types of files, there are some syntactic limitations that prevent us from taking full advantage of the dynamic nature of the python configuration file. For example, if we want to modify a variable defined in _base_:

_base_ = ['resnet50.py']
a = {{_base_.model}}
a['type'] = 'MobileNet'

Config is not able to parse such a configuration file (it raises an error when parsing). Config provides a more pythonic way to modify base variables for python configuration files.

modify_base_var.py

_base_ = ['resnet50.py']
a = _base_.model
a.type = 'MobileNet'
cfg = Config.fromfile('modify_base_var.py')
print(cfg.a)
{'type': 'MobileNet', 'depth': 50}

Dump the configuration file

The user may pass some parameters to modify some fields of the configuration file at the entry point of the training script. Therefore, we provide the dump method to export the changed configuration file.

Similar to reading the configuration file, the user can choose the format of the dumped file by using cfg.dump('config.xxx'). dump can also export configuration files with inheritance relationships, and the dumped files can be used independently without the files defined in _base_.

Based on the resnet50.py defined above, we can load and dump it like this:

cfg = Config.fromfile('resnet50.py')
cfg.dump('resnet50_dump.py')

resnet50_dump.py

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
model = dict(type='ResNet', depth=50)

Similarly, we can dump configuration files in json, yaml format:

resnet50_dump.yaml

model:
  depth: 50
  type: ResNet
optimizer:
  lr: 0.02
  momentum: 0.9
  type: SGD
  weight_decay: 0.0001

resnet50_dump.json

{"optimizer": {"type": "SGD", "lr": 0.02, "momentum": 0.9, "weight_decay": 0.0001}, "model": {"type": "ResNet", "depth": 50}}

In addition, dump can also dump the cfg loaded from a dictionary.

cfg = Config(dict(a=1, b=2))
cfg.dump('dump_dict.py')

dump_dict.py

a=1
b=2

Advanced usage

In this section, we’ll introduce some advanced usage of the Config, and some tips that could make it easier for users to develop and use downstream repositories.

Predefined fields

Sometimes we need some fields in the configuration file, which are related to the path to the workspace. For example, we define a working directory in the configuration file that holds the models and logs for this set of experimental configurations. We expect to have different working directories for different configuration files. A common choice is to use the configuration file name directly as part of the working directory name. Taking predefined_var.py as an example:

work_dir = './work_dir/{{fileBasenameNoExtension}}'

Here {{fileBasenameNoExtension}} means the filename of the config file without the .py suffix, and the variable in {{}} will be interpreted as predefined_var:

cfg = Config.fromfile('./predefined_var.py')
print(cfg.work_dir)
./work_dir/predefined_var

Currently, there are 4 predefined fields referenced from the relevant fields defined in VS Code.

  • {{fileDirname}} - the directory name of the current file, e.g. /home/your-username/your-project/folder

  • {{fileBasename}} - the filename of the current file, e.g. file.py

  • {{fileBasenameNoExtension}} - the filename of the current file without the extension, e.g. file

  • {{fileExtname}} - the extension of the current file, e.g. .py
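
These fields can also be combined. For example, a hypothetical configuration could place its log file next to the config file itself (log_file is an illustrative field name, not a reserved one):

# resolves to '<config dir>/<config name>.log'
log_file = '{{fileDirname}}/{{fileBasenameNoExtension}}.log'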

Modify the fields in command line

Sometimes we only want to modify part of the configuration and do not want to modify the configuration file itself. For example, if we want to change the learning rate during the experiment but do not want to write a new configuration file, the common practice is to pass the parameters at the command line to override the relevant configuration.

If we want to modify some internal parameters, such as the learning rate of the optimizer, the number of channels in the convolution layer etc., Config provides a standard procedure that allows us to modify the parameters at any level easily from the command line.

Training script:

demo_train.py

import argparse

from mmengine.config import Config, DictAction


def parse_args():
    parser = argparse.ArgumentParser(description='Train a model')
    parser.add_argument('config', help='train config file path')
    parser.add_argument(
        '--cfg-options',
        nargs='+',
        action=DictAction,
        help='override some settings in the used config, the key-value pair '
        'in xxx=yyy format will be merged into config file. If the value to '
        'be overwritten is a list, it should be like key="[a,b]" or key=a,b '
        'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
        'Note that the quotation marks are necessary and that no white space '
        'is allowed.')

    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    cfg = Config.fromfile(args.config)
    if args.cfg_options is not None:
        cfg.merge_from_dict(args.cfg_options)
    print(cfg)


if __name__ == '__main__':
    main()

The sample configuration file is as follows.

example.py

model = dict(type='CustomModel', in_channels=[1, 2, 3])
optimizer = dict(type='SGD', lr=0.01)

We can modify the internal fields from the command line with --cfg-options. For example, if we want to modify the learning rate, we only need to execute the script like this:

python demo_train.py ./example.py --cfg-options optimizer.lr=0.1
Config (path: ./example.py): {'model': {'type': 'CustomModel', 'in_channels': [1, 2, 3]}, 'optimizer': {'type': 'SGD', 'lr': 0.1}}

We successfully modified the learning rate from 0.01 to 0.1. If we want to change a list or a tuple, such as in_channels in the above example, we need to put double quotes around (), [] when assigning the value on the command line.

python demo_train.py ./example.py --cfg-options model.in_channels="[1, 1, 1]"
Config (path: ./example.py): {'model': {'type': 'CustomModel', 'in_channels': [1, 1, 1]}, 'optimizer': {'type': 'SGD', 'lr': 0.01}}

Note

The standard procedure only supports modifying String, Integer, Floating Point, Boolean, None, List, and Tuple fields from the command line. For the elements of list and tuple instance, each of them must be one of the above seven types.

Note

The behavior of DictAction is similar to "extend". It stores a list and extends each argument value to the list, like:

python demo_train.py ./example.py --cfg-options optimizer.type="Adam" --cfg-options model.in_channels="[1, 1, 1]"
Config (path: ./example.py): {'model': {'type': 'CustomModel', 'in_channels': [1, 1, 1]}, 'optimizer': {'type': 'Adam', 'lr': 0.01}}

Import the custom module

If we customize a module and register it into the corresponding registry, could we build it directly from the configuration file as in the previous section? Not necessarily, since there is no guarantee that the registration process has been triggered: if the module is never imported, it is never registered. To solve this, Config provides the custom_imports mechanism to make sure your module can be registered as expected.

For example, we customize an optimizer:

from mmengine.registry import OPTIMIZERS

@OPTIMIZERS.register_module()
class CustomOptim:
    pass

A matched config file:

my_module.py

optimizer = dict(type='CustomOptim')

To make sure CustomOptim will be registered, we should set the custom_imports field like this:

custom_imports.py

custom_imports = dict(imports=['my_module'], allow_failed_imports=False)
optimizer = dict(type='CustomOptim')

Then, once custom_imports.py is loaded successfully, we can build the CustomOptim from it.

cfg = Config.fromfile('custom_imports.py')

from mmengine.registry import OPTIMIZERS

custom_optim = OPTIMIZERS.build(cfg.optimizer)
print(custom_optim)
<my_module.CustomOptim object at 0x7f6983a87970>

Inherit configuration files across repositories

It is annoying to copy a large number of configuration files when developing a new repository based on an existing one. To address this issue, Config supports inheriting configuration files from other repositories. For example, if we develop a repository based on MMDetection, we can use its configuration files like this:

cross_repo.py

_base_ = [
    'mmdet::_base_/schedules/schedule_1x.py',
    'mmdet::_base_/datasets/coco_instance.py',
    'mmdet::_base_/default_runtime.py',
    'mmdet::_base_/models/faster_rcnn_r50_fpn.py',
]
cfg = Config.fromfile('cross_repo.py')
print(cfg.train_cfg)
{'type': 'EpochBasedTrainLoop', 'max_epochs': 12, 'val_interval': 1, '_scope_': 'mmdet'}

Config will parse mmdet:: to find the mmdet package and inherit the specified configuration file. Actually, as long as the setup.py of the repository (package) conforms to the MMEngine Installation specification, Config can use {package_name}:: to inherit its configuration files.

Get configuration files across repositories

Config also provides get_config and get_model to get the configuration file and the trained model from the downstream repositories.

The usage of get_config and get_model is similar to that of the previous section:

An example of get_config:

from mmengine.hub import get_config

cfg = get_config(
    'mmdet::faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py', pretrained=True)
print(cfg.model_path)
https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth

An example of get_model:

from mmengine.hub import get_model

model = get_model(
    'mmdet::faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py', pretrained=True)
print(type(model))
http loads checkpoint from path: https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
<class 'mmdet.models.detectors.faster_rcnn.FasterRCNN'>

BaseDataset

Introduction

The Dataset class in the algorithm toolbox is responsible for providing input data for the model during the training/testing process. The Dataset classes in the algorithm toolboxes under the OpenMMLab projects share some common characteristics and requirements, such as the need for an efficient internal data storage format, and support for concatenating different datasets, repeated sampling of a dataset, and so on.

Therefore, MMEngine implements BaseDataset which provides some basic interfaces and implements some DatasetWrappers with the same interfaces. Most of the Dataset Classes in the OpenMMLab algorithm toolbox meet the interface defined by the BaseDataset and use the same DatasetWrappers.

The basic function of the BaseDataset is to load the dataset information. Here, we divide the dataset information into two categories. One is meta information, which represents information related to the dataset itself that sometimes needs to be obtained by the model or other external components. For example, the meta information of a dataset generally includes the category information classes in an image classification task, since the classification model usually needs to record the category information of the dataset. The other is data information, which defines the file path and corresponding label information of each specific sample. In addition, another function of the BaseDataset is to continuously feed data into the data pipeline for preprocessing.

The standard data annotation file

In order to unify the dataset interfaces of different tasks and facilitate training multiple tasks in one model, OpenMMLab formulates the OpenMMLab 2.0 dataset format specification. Dataset annotation files should conform to this specification, and the BaseDataset reads and parses data annotation files based on it. If the data annotation file provided by the user does not conform to the specified format, the user can choose to convert it to the specified format and use OpenMMLab’s algorithm toolbox to conduct algorithm training and testing based on the converted file.

The OpenMMLab 2.0 dataset format specification states that annotation files must be in json, yaml (yml), or pickle (pkl) format. The dictionary stored in the annotation file must contain two fields, metainfo and data_list. The metainfo is a dictionary containing meta information about the dataset. The data_list is a list in which each element is a dictionary defining a raw data info. Each raw data info contains one or more training/test samples.

Here is an example of a JSON annotation file (where each raw data info contains only one training/test sample):


{
    'metainfo':
        {
            'classes': ('cat', 'dog'),
            ...
        },
    'data_list':
        [
            {
                'img_path': "xxx/xxx_0.jpg",
                'img_label': 0,
                ...
            },
            {
                'img_path': "xxx/xxx_1.jpg",
                'img_label': 1,
                ...
            },
            ...
        ]
}

We assume that the data is stored in the following path:

data
├── annotations
│   ├── train.json
├── train
│   ├── xxx/xxx_0.jpg
│   ├── xxx/xxx_1.jpg
│   ├── ...

The initialization process of the BaseDataset

The initialization process of the BaseDataset is shown as follows:

  1. load metainfo: Obtain the meta information of the dataset. The meta information can be obtained from three sources with the priority from high to low:

  • The dict of metainfo passed by the user in the __init__() function. The priority is high since the user can pass this argument when the BaseDataset is instantiated;

  • The dict of BaseDataset.METAINFO in the class attributes of BaseDataset. The priority is medium since the user can change the class attributes BaseDataset.METAINFO in the custom dataset class;

  • The dict of metainfo included in the annotation file. The priority is low since the annotation file is generally not changed.

If the three sources contain the same field, the source with the highest priority determines its value: the fields in the metainfo dictionary passed by the user > the fields in BaseDataset.METAINFO > the fields in the metainfo of the annotation file.

  2. join path: Process the paths of the data information and annotation files;

  3. build pipeline: Build the data pipeline for data preprocessing and data preparation;

  4. full init: Fully initialize the BaseDataset. This step mainly includes the following operations:

  • load data list: Read and parse the annotation files that meet the OpenMMLab 2.0 dataset format specification. In this step, the parse_data_info() method is called. This method is responsible for parsing each raw data info in the annotation file;

  • filter data (optional): Filters unnecessary data based on filter_cfg, such as data samples that do not contain annotations. By default, there is no filtering operation, and downstream subclasses can override it according to their own needs.

  • get subset (optional): Sample a subset of dataset based on a given index or an integer value, such as only the first 10 samples for training/testing. By default, all data samples are used.

  • serialize data (optional): Serialize all data samples to save memory. We serialize all data samples by default; please see Save memory for more details.

The parse_data_info() method in the BaseDataset is used to process a raw data info in the annotation file into one or more training/test data samples. Users need to implement the parse_data_info() method if they want to customize the dataset class.

The interface of BaseDataset

Once the BaseDataset is initialized, it supports the __getitem__ method to index a data info and the __len__ method to get the length of the dataset, just like torch.utils.data.Dataset. The BaseDataset provides the following interfaces:

  • metainfo: Return the meta information with a dictionary value.

  • get_data_info(idx): Return the full data information of the given idx, and the return value is a dictionary.

  • __getitem__(idx): Return the result of the data pipeline (the input data of the model) for the given idx; the return value is a dictionary.

  • __len__(): Return the length of the dataset. The return value is an integer.

  • get_subset_(indices): Modify the original dataset class in place according to indices. If indices is an int, the original dataset then contains only the first indices data samples. If indices is a Sequence[int], the original dataset contains the data samples specified by that sequence.

  • get_subset(indices): Return a new sub-dataset according to indices, i.e., re-copy a sub-dataset. If indices is an int, the returned sub-dataset contains only the first indices data samples. If indices is a Sequence[int], the returned sub-dataset contains the data samples specified by that sequence.

Customize dataset class based on BaseDataset

After understanding the initialization process and the provided interfaces of the BaseDataset, we can customize a dataset class based on it.

Annotation files that meet the OpenMMLab 2.0 dataset format specification

As mentioned above, users can overload parse_data_info() to load annotation files that meet the OpenMMLab 2.0 dataset format specification. Here is an example of using BaseDataset to implement a specific dataset.

import os.path as osp

from mmengine.dataset import BaseDataset


class ToyDataset(BaseDataset):

    # Take the above annotation file as example. The raw_data_info represents a dictionary in the data_list list:
    # {
    #    'img_path': "xxx/xxx_0.jpg",
    #    'img_label': 0,
    #    ...
    # }
    def parse_data_info(self, raw_data_info):
        data_info = raw_data_info
        img_prefix = self.data_prefix.get('img_path', None)
        if img_prefix is not None:
            data_info['img_path'] = osp.join(
                img_prefix, data_info['img_path'])
        return data_info

Using the customized dataset class

Once defined, the ToyDataset can be instantiated with the following configuration:


import cv2


class LoadImage:

    def __call__(self, results):
        results['img'] = cv2.imread(results['img_path'])
        return results

class ParseImage:

    def __call__(self, results):
        results['img_shape'] = results['img'].shape
        return results

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)

At the same time, the external interface provided by the BaseDataset can be used to access specific data sample information:

toy_dataset.metainfo
# dict(classes=('cat', 'dog'))

toy_dataset.get_data_info(0)
# {
#     'img_path': "data/train/xxx/xxx_0.jpg",
#     'img_label': 0,
#     ...
# }

len(toy_dataset)
# 2

toy_dataset[0]
# {
#     'img_path': "data/train/xxx/xxx_0.jpg",
#     'img_label': 0,
#     'img': a ndarray with shape (H, W, 3), which denotes the value of the image,
#     'img_shape': (H, W, 3) ,
#     ...
# }

# The `get_subset` interface does not modify the original dataset class, i.e. it makes a complete copy of it
sub_toy_dataset = toy_dataset.get_subset(1)
len(toy_dataset), len(sub_toy_dataset)
# 2, 1

# The `get_subset_` interface modifies the original dataset class in place
toy_dataset.get_subset_(1)
len(toy_dataset)
# 1

Following the above steps, we can see how to customize a dataset based on the BaseDataset and how to use the customized dataset.

Customize dataset for videos

In the above examples, each raw data info of the annotation file contains only one training/test sample (usually in the image field). If each raw data info contains several training/test samples (usually in the video domain), we only need to ensure that the return value of parse_data_info() is list[dict]:

from mmengine.dataset import BaseDataset


class ToyVideoDataset(BaseDataset):

    # raw_data_info is still a dict, but it contains multiple samples
    def parse_data_info(self, raw_data_info):
        data_list = []

        ...

        for ... :

            data_info = dict()

            ...

            data_list.append(data_info)

        return data_list

The usage of ToyVideoDataset is similar to that of ToyDataset, which will not be repeated here.

Annotation files that do not meet the OpenMMLab 2.0 dataset format specification

For annotation files that do not meet the OpenMMLab 2.0 dataset format specification, there are two ways to use them:

  1. Convert the annotation files that do not meet the specifications into the annotation files that do meet the specifications, and then use the BaseDataset in the above way.

  2. Implement a new dataset class that inherits from the BaseDataset and overloads the load_data_list(self) function of the BaseDataset to handle annotation files that don’t meet the specification, guaranteeing a return value of list[dict], where each dict represents a data sample, as shown in the sketch below.
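
Here is a minimal sketch of the second approach, assuming a hypothetical plain-text annotation file with one "img_path img_label" pair per line (PlainTextDataset and this file format are illustrative, not part of MMEngine):

from mmengine.dataset import BaseDataset


class PlainTextDataset(BaseDataset):

    def load_data_list(self):
        # parse the hypothetical annotation format line by line and
        # return a list of dicts, one dict per data sample
        data_list = []
        with open(self.ann_file) as f:
            for line in f:
                img_path, img_label = line.strip().split()
                data_list.append(
                    dict(img_path=img_path, img_label=int(img_label)))
        return data_list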

Other features of BaseDataset

The BaseDataset also contains the following features:

lazy init

When the BaseDataset is instantiated, the annotation file needs to be read and parsed, therefore it will take some time. However, in some cases, such as the visualization of prediction, only the meta information of the BaseDataset is required, and reading and parsing the annotation file may not be necessary. To save time on instantiating the BaseDataset in this case, the BaseDataset supports lazy init:

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline,
    # Pass the lazy_init variable in here
    lazy_init=True)

When lazy_init=True, the initialization of ToyDataset only performs steps 1, 2, and 3 of the BaseDataset initialization process. At this point, toy_dataset is not fully initialized, since it will not read and parse the annotation file; it only sets the meta information of the dataset (metainfo).

Naturally, if you need to access specific data information later, you can manually call the toy_dataset.full_init() interface to perform the complete initialization process, during which the data annotation file will be read and parsed. Calling the get_data_info(idx), __len__(), __getitem__(idx), get_subset_(indices), and get_subset(indices) interfaces will also automatically trigger full_init() to perform the full initialization process (only on the first call; later calls will not invoke full_init() repeatedly):

# Full initialization
toy_dataset.full_init()

# After initialization, you can now get the data info
len(toy_dataset)
# 2
toy_dataset[0]
# {
#     'img_path': "data/train/xxx/xxx_0.jpg",
#     'img_label': 0,
#     'img': a ndarray with shape (H, W, 3), which denotes the value the image,
#     'img_shape': (H, W, 3) ,
#     ...
# }

Notice:

Performing full initialization by calling the __getitem__() interface directly carries some risks: if a dataset object is first created with lazy_init=True and then sent directly to the dataloader, different dataloader workers will read and parse the annotation file at the same time during subsequent data loading. Although this may work normally, it consumes a lot of time and memory. Therefore, it is recommended to manually call the full_init() interface before you need to access specific data.

This pattern of skipping full initialization by setting lazy_init=True and then completing initialization on demand is called lazy init.

Save memory

When reading data, the dataloader usually prefetches data with multiple dataloader workers, and each worker holds a complete copy of the dataset object, so there can be multiple copies of the same data_list in memory. In order to save this memory, the BaseDataset can serialize data_list into memory in advance, so that multiple workers can share the same copy of data_list.

By default, the BaseDataset stores the serialized data_list in memory. You can also control whether the data will be serialized into memory ahead of time with the serialize_data argument (the default is True):

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline,
    # Pass the serialize data argument in here
    serialize_data=False)

The above example does not serialize data_list into memory in advance, so it is not recommended to instantiate the dataset class this way when the dataloader uses multiple workers to load the data.

DatasetWrappers

In addition to BaseDataset, MMEngine also provides several DatasetWrappers: ConcatDataset, RepeatDataset, ClassBalancedDataset. These dataset wrappers also support lazy init and have memory-saving features.

ConcatDataset

MMEngine provides a ConcatDataset wrapper to concatenate datasets in the following way:

from mmengine.dataset import ConcatDataset

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset_1 = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)

toy_dataset_2 = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='val/'),
    ann_file='annotations/val.json',
    pipeline=pipeline)

toy_dataset_12 = ConcatDataset(datasets=[toy_dataset_1, toy_dataset_2])

The above example combines the train set and the val set of the dataset into one large dataset.

RepeatDataset

MMEngine provides RepeatDataset wrapper to repeat a dataset several times, as follows:

from mmengine.dataset import RepeatDataset

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)

toy_dataset_repeat = RepeatDataset(dataset=toy_dataset, times=5)

The above example repeats the train set of the dataset five times.

ClassBalancedDataset

MMEngine provides ClassBalancedDataset wrapper to repeatedly sample the corresponding samples based on the frequency of category occurrence in the dataset.

Notice:

The ClassBalancedDataset wrapper assumes that the wrapped dataset class supports the get_cat_ids(idx) method, which returns a list containing the categories of the data_info indexed by idx. The usage is as follows:

import os.path as osp

from mmengine.dataset import BaseDataset, ClassBalancedDataset

class ToyDataset(BaseDataset):

    def parse_data_info(self, raw_data_info):
        data_info = raw_data_info
        img_prefix = self.data_prefix.get('img_path', None)
        if img_prefix is not None:
            data_info['img_path'] = osp.join(
                img_prefix, data_info['img_path'])
        return data_info

    # Required method that returns the categories of the data sample
    def get_cat_ids(self, idx):
        data_info = self.get_data_info(idx)
        return [int(data_info['img_label'])]

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)

toy_dataset_repeat = ClassBalancedDataset(dataset=toy_dataset, oversample_thr=1e-3)

The above example resamples the train set with oversample_thr=1e-3. Specifically, samples of categories whose frequency in the dataset is lower than 1e-3 will be sampled repeatedly; samples of other categories will not. Please refer to the API documentation of ClassBalancedDataset for the specific sampling policy.
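
As a rough illustration of this policy, here is a sketch of the repeat-factor heuristic that ClassBalancedDataset is based on (the exact formula used by MMEngine may differ; treat this as an approximation):

import math

def repeat_factor(category_freq, oversample_thr=1e-3):
    # Categories rarer than the threshold get a repeat factor greater
    # than 1; frequent categories keep a factor of 1 (no oversampling)
    return max(1.0, math.sqrt(oversample_thr / category_freq))

print(repeat_factor(1e-4))  # rare category: ~3.16, sampled about 3x more often
print(repeat_factor(1e-2))  # frequent category: 1.0, unchanged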

Customize DatasetWrapper

Since BaseDataset supports lazy init, some rules need to be followed when customizing a DatasetWrapper. Here is an example showing how to customize one:

import copy
import warnings

from mmengine.dataset import BaseDataset, force_full_init
from mmengine.registry import DATASETS


@DATASETS.register_module()
class ExampleDatasetWrapper:

    def __init__(self, dataset, lazy_init=False, ...):
        # Build the source dataset(self.dataset)
        if isinstance(dataset, dict):
            self.dataset = DATASETS.build(dataset)
        elif isinstance(dataset, BaseDataset):
            self.dataset = dataset
        else:
            raise TypeError(
                'elements in datasets sequence should be config or '
                f'`BaseDataset` instance, but got {type(dataset)}')
        # Record the meta information of source dataset
        self._metainfo = self.dataset.metainfo

        '''
        1. Implement some code here to record some of the hyperparameters used to wrap the dataset.
        '''

        self._fully_initialized = False
        if not lazy_init:
            self.full_init()

    def full_init(self):
        if self._fully_initialized:
            return

        # Initialize the source dataset completely
        self.dataset.full_init()

        '''
        2. Implement some code here to wrap the source dataset.
        '''

        self._fully_initialized = True

    @force_full_init
    def _get_ori_dataset_idx(self, idx: int):

        '''
        3. Implement some code here to map the wrapped index `idx` to the index of the source dataset 'ori_idx'.
        '''
        ori_idx = ...

        return ori_idx

    # Provide the same external interface as `self.dataset`.
    @force_full_init
    def get_data_info(self, idx):
        sample_idx = self._get_ori_dataset_idx(idx)
        return self.dataset.get_data_info(sample_idx)

    # Provide the same external interface as `self.dataset`.
    def __getitem__(self, idx):
        if not self._fully_initialized:
            warnings.warn('Please call `full_init` method manually to '
                          'accelerate the speed.')
            self.full_init()

        sample_idx = self._get_ori_dataset_idx(idx)
        return self.dataset[sample_idx]

    # Provide the same external interface as `self.dataset`.
    @force_full_init
    def __len__(self):

        '''
        4. Implement some code here to calculate the length of the wrapped dataset.
        '''
        len_wrapper = ...

        return len_wrapper

    # Provide the same external interface as `self.dataset`.
    @property
    def metainfo(self):
        return copy.deepcopy(self._metainfo)
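
Once the pieces above are implemented, the wrapper behaves like the built-in ones. A minimal usage sketch, assuming ExampleDatasetWrapper takes no extra hyperparameters:

wrapped_dataset = ExampleDatasetWrapper(dataset=toy_dataset, lazy_init=True)
wrapped_dataset.full_init()  # also fully initializes the source dataset
print(len(wrapped_dataset))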

Data transform

In the OpenMMLab repositories, dataset construction and data preparation are decoupled from each other. Usually, the dataset construction only parses the dataset and records the basic information of each sample, while the data preparation is performed by a series of data transforms, such as data loading, preprocessing, and formatting based on the basic information of the samples.

To use Data Transforms

In MMEngine, we use various callable data transform classes to perform data manipulation. These classes accept several configuration parameters at instantiation and then process the input data dictionary when called. All data transforms accept a dictionary as input and output the processed data as a dictionary. A simple example is as below:

Note

In MMEngine, we don’t have the implementations of data transforms. You can find the base data transform class and many other data transforms in MMCV. So you need to install MMCV before learning this tutorial; see the MMCV installation guide.

>>> import numpy as np
>>> from mmcv.transforms import Resize
>>>
>>> transform = Resize(scale=(224, 224))
>>> data_dict = {'img': np.random.rand(256, 256, 3)}
>>> data_dict = transform(data_dict)
>>> print(data_dict['img'].shape)
(224, 224, 3)

To use in Config Files

In config files, we can compose multiple data transforms as a list, called a data pipeline, which is an argument of the dataset.

Usually, a data pipeline consists of the following parts:

  1. Data loading, use LoadImageFromFile to load image files.

  2. Label loading, use LoadAnnotations to load the bboxes, semantic segmentation and keypoint annotations.

  3. Data processing and augmentation, like RandomResize.

  4. Data formatting, where we use different data transforms for different tasks. The data transform for a specific task is implemented in the corresponding repository. For example, the data formatting transform for the image classification task is PackClsInputs, which lives in MMClassification.

Here, taking the classification task as an example, we show a typical data pipeline in the figure below. For each sample, the basic information stored in the dataset is a dictionary, as shown on the far left of the figure. Every blue block represents a data transform, and each data transform adds new fields (marked in green) to the data dictionary or updates existing fields (marked in orange).

To use the above data pipeline in our config file, use the settings below:

test_dataloader = dict(
    batch_size=32,
    dataset=dict(
        type='ImageNet',
        data_root='data/imagenet',
        pipeline = [
            dict(type='LoadImageFromFile'),
            dict(type='Resize', size=256, keep_ratio=True),
            dict(type='CenterCrop', crop_size=224),
            dict(type='PackClsInputs'),
        ]
    )
)

Common Data Transforms

According to the functionality, the data transform classes can be divided into data loading, data pre-processing & augmentation and data formatting.

Data Loading

To support loading large-scale datasets, we usually don't load all the dense data during dataset construction, but only the file paths of these data. Therefore, we need to load the actual data in the data pipeline.

Data Transforms     Functionality
LoadImageFromFile   Load images according to the path.
LoadAnnotations     Load and format annotation information, including bboxes, segmentation maps and others.

Data Pre-processing & Augmentation

Data transforms for pre-processing and augmentation usually manipulate the image and annotation data, like cropping, padding, resizing and others.

Data Transforms     Functionality
Pad                 Pad the margin of images.
CenterCrop          Crop the image and keep the center part.
Normalize           Normalize the image pixels.
Resize              Resize images to the specified scale or ratio.
RandomResize        Resize images to a random scale in the specified range.
RandomChoiceResize  Resize images to a random scale from several specified scales.
RandomGrayscale     Randomly grayscale images.
RandomFlip          Randomly flip images.

Data Formatting

Data formatting transforms will convert the data to some specified type.

Data Transforms  Functionality
ToTensor         Convert the data of specified field to torch.Tensor.
ImageToTensor    Convert images to torch.Tensor in PyTorch format.

Custom Data Transform Classes

To implement a new data transform class, the class needs to inherit BaseTransform and implement the transform method. Here, we use a simple flip transform (MyFlip) as an example:

import mmcv
from mmcv.transforms import BaseTransform, TRANSFORMS

@TRANSFORMS.register_module()
class MyFlip(BaseTransform):
    def __init__(self, direction: str):
        super().__init__()
        self.direction = direction

    def transform(self, results: dict) -> dict:
        img = results['img']
        results['img'] = mmcv.imflip(img, direction=self.direction)
        return results

Then, we can instantiate a MyFlip object and use it to process our data dictionary.

import numpy as np

transform = MyFlip(direction='horizontal')
data_dict = {'img': np.random.rand(224, 224, 3)}
data_dict = transform(data_dict)
processed_img = data_dict['img']

Or, use it in the data pipeline by modifying our config file:

pipeline = [
    ...
    dict(type='MyFlip', direction='horizontal'),
    ...
]

Please note that to use the class in our config file, we need to make sure the MyFlip class is imported at runtime so that it gets registered.
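
One common way to ensure this is the custom_imports field of the config, which makes MMEngine import the module (and thus trigger the registration) at startup. A sketch, where the module path is a made-up placeholder:

custom_imports = dict(
    imports=['my_project.my_flip'],  # hypothetical module that defines MyFlip
    allow_failed_imports=False)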

Weight initialization

Usually, we customize our module based on nn.Module from native PyTorch, and torch.nn.init can help us initialize the model's parameters. To simplify model construction and initialization, MMEngine designs BaseModule to help us define and initialize models from configs easily.

Initialize the model from config

The core function of BaseModule is that it can help us initialize the model from a config. Subclasses of BaseModule can define init_cfg in their __init__ function, and we can choose the initialization method by configuring init_cfg.

Currently, we support the following initialization methods:

Initializer       Registered name  Function
ConstantInit      Constant         Initialize the weight and bias with a constant, commonly used for convolution
XavierInit        Xavier           Initialize the weight by Xavier initialization, and initialize the bias with a constant
NormalInit        Normal           Initialize the weight by normal distribution, and initialize the bias with a constant
TruncNormalInit   TruncNormal      Initialize the weight by truncated normal distribution, and initialize the bias with a constant, commonly used for Transformer
UniformInit       Uniform          Initialize the weight by uniform distribution, and initialize the bias with a constant, commonly used for convolution
KaimingInit       Kaiming          Initialize the weight by Kaiming initialization, and initialize the bias with a constant, commonly used for convolution
Caffe2XavierInit  Caffe2Xavier     Xavier initialization in Caffe2, i.e. Kaiming initialization in PyTorch with "fan_in" and "normal" mode, commonly used for convolution
PretrainedInit    Pretrained       Initialize the model with a pretrained model

Initialize the model with pretrained model

Define a ToyNet as below:

import torch
import torch.nn as nn

from mmengine.model import BaseModule


class ToyNet(BaseModule):

    def __init__(self, init_cfg=None):
        super().__init__(init_cfg)
        self.conv1 = nn.Linear(1, 1)


# Save the checkpoint.
toy_net = ToyNet()
torch.save(toy_net.state_dict(), './pretrained.pth')
pretrained = './pretrained.pth'

toy_net = ToyNet(init_cfg=dict(type='Pretrained', checkpoint=pretrained))

By configuring init_cfg as above, the pretrained model is loaded when init_weights() is called after construction:

# Initialize the model with the saved checkpoint.
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO - load model from: ./pretrained.pth
08/19 16:50:24 - mmengine - INFO - local loads checkpoint from path: ./pretrained.pth

If init_cfg is a dict, type means a kind of initializer registered in WEIGHT_INITIALIZERS. Pretrained refers to PretrainedInit, which helps us load the target checkpoint. All initializers follow the same mapping, like Pretrained -> PretrainedInit, stripping the Init suffix from the class name. The checkpoint argument of PretrainedInit specifies the path of the checkpoint; it can be a local path or a URL.

Note

PretrainedInit has a higher priority than any other initializer. The loaded pretrained weights will overwrite the previously initialized weights.

Commonly used initialization methods

Similarly, we could use Kaiming initialization just like the Pretrained initializer. For example, we could set init_cfg=dict(type='Kaiming', layer='Conv2d') to initialize all Conv2d modules with Kaiming initialization.

Sometimes we need to initialize the model with different methods for different modules. For example, we could initialize the Conv2d modules with Kaiming initialization and the Linear modules with Xavier initialization, by setting init_cfg to a list of configs:

import torch.nn as nn

from mmengine.model import BaseModule


class ToyNet(BaseModule):

    def __init__(self, init_cfg=None):
        super().__init__(init_cfg)
        self.linear = nn.Linear(1, 1)
        self.conv = nn.Conv2d(1, 1, 1)


# Apply `Kaiming` initialization to `Conv2d` module and `Xavier` initialization to `Linear` module.
toy_net = ToyNet(
    init_cfg=[
        dict(type='Kaiming', layer='Conv2d'),
        dict(type='Xavier', layer='Linear')
    ], )
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
linear.weight - torch.Size([1, 1]):
XavierInit: gain=1, distribution=normal, bias=0

08/19 16:50:24 - mmengine - INFO -
linear.bias - torch.Size([1]):
XavierInit: gain=1, distribution=normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

layer can also be a list, each element of which is a type of module to apply the initialization to.

# Apply Kaiming initialization to `Conv2d` and `Linear` module.
toy_net = ToyNet(init_cfg=[dict(type='Kaiming', layer=['Conv2d', 'Linear'])], )
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
linear.weight - torch.Size([1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
linear.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

More fine-grained initialization

Sometimes we need to initialize modules of the same type with different initialization methods. For example, we've defined conv1 and conv2 submodules, and we want to initialize conv1 with Kaiming initialization and conv2 with Xavier initialization. We can configure init_cfg with override:

import torch.nn as nn

from mmengine.model import BaseModule


class ToyNet(BaseModule):

    def __init__(self, init_cfg=None):
        super().__init__(init_cfg)
        self.conv1 = nn.Conv2d(1, 1, 1)
        self.conv2 = nn.Conv2d(1, 1, 1)


# Apply `Kaiming` initialization to `conv1` and `Xavier` initialization to `conv2`.
toy_net = ToyNet(
    init_cfg=[
        dict(
            type='Kaiming',
            layer=['Conv2d'],
            override=dict(name='conv2', type='Xavier')),
    ], )
toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
conv1.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv1.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv2.weight - torch.Size([1, 1, 1, 1]):
XavierInit: gain=1, distribution=normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv2.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

override can be understood as a nested init_cfg, which can also be a list or a dict and must also set the type field. The difference is that name must be set in override to specify the scope it applies to. In the example above, we set name='conv2' to specify that Xavier initialization is applied to toy_net.conv2.

Customize the initialization method

Although init_cfg can control the initialization method for different modules, we would have to register a new initialization method in WEIGHT_INITIALIZERS to customize the initialization process, which is not very convenient. Actually, we can also override the init_weights method to customize the initialization process.

Assuming we’ve defined the following modules:

  • ToyConv inherits from nn.Module and implements init_weights, which initializes custom_weight (a parameter of ToyConv) with 1 and custom_bias with 0.

  • ToyNet defines a ToyConv submodule.

ToyNet.init_weights will call init_weights of all submodules sequentially.

import torch
import torch.nn as nn

from mmengine.model import BaseModule


class ToyConv(nn.Module):

    def __init__(self):
        super().__init__()
        self.custom_weight = nn.Parameter(torch.empty(1, 1, 1, 1))
        self.custom_bias = nn.Parameter(torch.empty(1))

    def init_weights(self):
        with torch.no_grad():
            self.custom_weight = self.custom_weight.fill_(1)
            self.custom_bias = self.custom_bias.fill_(0)


class ToyNet(BaseModule):

    def __init__(self, init_cfg=None):
        super().__init__(init_cfg)
        self.conv1 = nn.Conv2d(1, 1, 1)
        self.conv2 = nn.Conv2d(1, 1, 1)
        self.custom_conv = ToyConv()


toy_net = ToyNet(
    init_cfg=[
        dict(
            type='Kaiming',
            layer=['Conv2d'],
            override=dict(name='conv2', type='Xavier'))
    ])

toy_net.init_weights()
08/19 16:50:24 - mmengine - INFO -
conv1.weight - torch.Size([1, 1, 1, 1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv1.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv2.weight - torch.Size([1, 1, 1, 1]):
XavierInit: gain=1, distribution=normal, bias=0

08/19 16:50:24 - mmengine - INFO -
conv2.bias - torch.Size([1]):
KaimingInit: a=0, mode=fan_out, nonlinearity=relu, distribution =normal, bias=0

08/19 16:50:24 - mmengine - INFO -
custom_conv.custom_weight - torch.Size([1, 1, 1, 1]):
Initialized by user-defined `init_weights` in ToyConv

08/19 16:50:24 - mmengine - INFO -
custom_conv.custom_bias - torch.Size([1]):
Initialized by user-defined `init_weights` in ToyConv

Conclusion

1. Configure init_cfg to initialize model

  • Commonly used for the initialization of Conv2d, Linear and other low-level modules. All initialization methods are managed by WEIGHT_INITIALIZERS

  • Dynamic initialization controlled by init_cfg

2. Customize init_weights

  • Compared to configuring init_cfg, implementing init_weights is simpler and does not require registration. However, it is less flexible, and it is not possible to initialize modules dynamically.

Note

  • The priority of init_weights is higher than that of init_cfg

  • Runner will call init_weights in Runner.train()

Initialize module with function

As mentioned in the previous section, we can customize initialization in init_weights. To make initializing modules more convenient, MMEngine provides a series of module initialization functions built on torch.nn.init. For example, suppose we want to initialize the weights of a convolutional layer with a normal distribution and its bias with a constant. With torch.nn.init, the implementation is as follows:

from torch.nn.init import normal_, constant_
import torch.nn as nn

model = nn.Conv2d(1, 1, 1)
normal_(model.weight, mean=0, std=0.01)
constant_(model.bias, val=0)
Parameter containing:
tensor([0.], requires_grad=True)

The above is the standard process for initializing a convolutional module with a normal distribution, so MMEngine simplifies it by implementing a series of common module initialization functions. Compared with torch.nn.init, these functions accept the module directly:

from mmengine.model import normal_init

normal_init(model, mean=0, std=0.01, bias=0)

Similarly, we could also use Kaiming initialization and Xavier initialization:

from mmengine.model import kaiming_init, xavier_init

kaiming_init(model)
xavier_init(model)

Currently, MMEngine provides the following initialization functions:

Initialization function  Function
constant_init            Initialize the weight and bias with a constant, commonly used for convolution
xavier_init              Initialize the weight by Xavier initialization, and initialize the bias with a constant
normal_init              Initialize the weight by normal distribution, and initialize the bias with a constant
trunc_normal_init        Initialize the weight by truncated normal distribution, and initialize the bias with a constant, commonly used for Transformer
uniform_init             Initialize the weight by uniform distribution, and initialize the bias with a constant, commonly used for convolution
kaiming_init             Initialize the weight by Kaiming initialization, and initialize the bias with a constant, commonly used for convolution
caffe2_xavier_init       Xavier initialization in Caffe2, i.e. Kaiming initialization in PyTorch with "fan_in" and "normal" mode, commonly used for convolution
bias_init_with_prob      Initialize the bias according to a given probability

Visualization

Visualization provides an intuitive explanation of the training and testing process of the deep learning model.

MMEngine provides Visualizer to visualize and store the state and intermediate results of the model training and testing process, with the following features:

  • It supports basic drawing interfaces and feature map visualization

  • It enables recording training states (such as loss and lr), performance evaluation metrics, and visualization results to a specified or multiple backends, including local device, TensorBoard, and WandB.

  • It can be used in any location in the code base.

Basic Drawing APIs

Visualizer provides drawing APIs for common objects such as detection bboxes, points, text, lines, circles, polygons, and binary masks.

These APIs have the following features:

  • Can be called multiple times to achieve overlay drawing requirements.

  • All support multiple input types such as Tensor, Numpy array, etc.

Typical usages are as follows.

  1. Draw detection bboxes, masks, text, etc.

import torch
import mmcv
from mmengine.visualization import Visualizer

image = mmcv.imread('docs/en/_static/image/cat_dog.png', channel_order='rgb')
visualizer = Visualizer(image=image)
# single bbox formatted as [xyxy]
visualizer.draw_bboxes(torch.tensor([72, 13, 179, 147]))
# draw multiple bboxes
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.show()
visualizer.set_image(image=image)
visualizer.draw_texts("cat and dog", torch.tensor([10, 20]))
visualizer.show()

You can also customize things like color and width using the parameters in each API.

visualizer.set_image(image=image)
visualizer.draw_bboxes(torch.tensor([72, 13, 179, 147]), edge_colors='r', line_widths=3)
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220]]),line_styles='--')
visualizer.show()

  2. Overlay display

These APIs can be called multiple times to get an overlay result.

visualizer.set_image(image=image)
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.draw_texts("cat and dog",
                      torch.tensor([10, 20])).draw_circles(torch.tensor([40, 50]), torch.tensor([20]))
visualizer.show()

Feature Map Visualization

Feature map visualization has many use cases. Currently, we only support the visualization of a single feature map.

@staticmethod
def draw_featmap(featmap: torch.Tensor, # input format must be CHW
                 overlaid_image: Optional[np.ndarray] = None, # if image data is input at the same time, the feature map will be overlaid on the image
                 channel_reduction: Optional[str] = 'squeeze_mean', # strategy to reduce multiple channels into a single channel
                 topk: int = 10, # topk feature maps to show
                 arrangement: Tuple[int, int] = (5, 2), # the layout when multiple channels are expanded into multiple images
                 resize_shape: Optional[tuple] = None, # scale the feature map
                 alpha: float = 0.5) -> np.ndarray: # overlay ratio between input image and generated feature map

The main features can be concluded as follows:

  • As the input Tensor usually includes multiple channels, channel_reduction can reduce them into a single channel and overlay the result on the image.

    • squeeze_mean reduces the input channel C into a single channel using the mean function, so the output dimension becomes (1, H, W)

    • select_max selects the channel with the maximum activation, where ‘activation’ refers to the sum across the spatial dimensions of a channel.

    • None indicates that no reduction is needed, which allows the user to select the top k feature maps with the highest activation degree through the topk parameter.

  • topk is only valid when channel_reduction is None. It selects the top k channels according to the activation degree and then displays them overlaid with the image. The display layout can be specified with the arrangement parameter.

    • If topk is not -1, topk channels with the largest activation will be selected for display.

    • If topk is -1, channel number C must be either 1 or 3 to indicate if the input is a picture. Otherwise, an error will be raised to prompt the user to reduce the channel with channel_reduction.

  • Considering that the input feature map is usually very small, the function can upsample the feature map through resize_shape before the visualization.

For example, we would like to get the feature map from the layer4 output of a pre-trained ResNet18 model and visualize it.

  1. Reduce the multi-channel feature map into a single channel using select_max and display it.

import numpy as np
from torchvision.models import resnet18
from torchvision.transforms import Compose, Normalize, ToTensor

def preprocess_image(img, mean, std):
    preprocessing = Compose([
        ToTensor(),
        Normalize(mean=mean, std=std)
    ])
    return preprocessing(img.copy()).unsqueeze(0)

model = resnet18(pretrained=True)

def _forward(x):
    x = model.conv1(x)
    x = model.bn1(x)
    x = model.relu(x)
    x = model.maxpool(x)

    x1 = model.layer1(x)
    x2 = model.layer2(x1)
    x3 = model.layer3(x2)
    x4 = model.layer4(x3)
    return x4

model.forward = _forward

image_norm = np.float32(image) / 255
input_tensor = preprocess_image(image_norm,
                                mean=[0.485, 0.456, 0.406],
                                std=[0.229, 0.224, 0.225])
feat = model(input_tensor)[0]

visualizer = Visualizer()
drawn_img = visualizer.draw_featmap(feat, channel_reduction='select_max')
visualizer.show(drawn_img)

Since the output feat feature map is only 7x7, visualizing it directly gives poor results. Users can upscale it by overlaying it on the input image or via the resize_shape parameter. If the size of the given image differs from that of the feature map, the feature map will be resampled to the same spatial size as the input image.

drawn_img = visualizer.draw_featmap(feat, image, channel_reduction='select_max')
visualizer.show(drawn_img)

  2. Select the top five channels with the highest activation in the multi-channel feature map by setting topk=5, then format them into a 2x3 layout.

drawn_img = visualizer.draw_featmap(feat, image, channel_reduction=None, topk=5, arrangement=(2, 3))
visualizer.show(drawn_img)

Users can set their own desired layout through arrangement.

drawn_img = visualizer.draw_featmap(feat, image, channel_reduction=None, topk=5, arrangement=(4, 2))
visualizer.show(drawn_img)

Basic Storage APIs

Once the drawing is completed, users can choose to display the result directly or save it to different backends. The backends currently supported by MMEngine include local storage, Tensorboard and WandB. The data supported include drawn pictures, scalars, and configurations.

  1. Save the result image

Suppose you want to save to your local device.

visualizer = Visualizer(image=image, vis_backends=[dict(type='LocalVisBackend')], save_dir='temp_dir')

visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.draw_texts("cat and dog", torch.tensor([10, 20]))
visualizer.draw_circles(torch.tensor([40, 50]), torch.tensor([20]))

# temp_dir/vis_data/vis_image/demo_0.png will be generated
visualizer.add_image('demo', visualizer.get_image())

The zero in the result file name is used to distinguish different steps.

# temp_dir/vis_data/vis_image/demo_1.png will be generated
visualizer.add_image('demo', visualizer.get_image(), step=1)
# temp_dir/vis_data/vis_image/demo_3.png will be generated
visualizer.add_image('demo', visualizer.get_image(), step=3)

If you want to switch to other backends, you can change the configuration file like this:

# TensorboardVisBackend
visualizer = Visualizer(image=image, vis_backends=[dict(type='TensorboardVisBackend')], save_dir='temp_dir')
# WandbVisBackend
visualizer = Visualizer(image=image, vis_backends=[dict(type='WandbVisBackend')], save_dir='temp_dir')

  2. Store feature maps

visualizer = Visualizer(vis_backends=[dict(type='LocalVisBackend')], save_dir='temp_dir')
drawn_img = visualizer.draw_featmap(feat, image, channel_reduction=None, topk=5, arrangement=(2, 3))
# temp_dir/vis_data/vis_image/feat_0.png will be generated
visualizer.add_image('feat', drawn_img)

  3. Save scalar data such as loss

# temp_dir/vis_data/scalars.json will be generated
# save loss
visualizer.add_scalar('loss', 0.2, step=0)
visualizer.add_scalar('loss', 0.1, step=1)
# save acc
visualizer.add_scalar('acc', 0.7, step=0)
visualizer.add_scalar('acc', 0.8, step=1)

Multiple scalar data can also be saved at once.

# New contents will be added to the temp_dir/vis_data/scalars.json
visualizer.add_scalars({'loss': 0.3, 'acc': 0.8}, step=3)

  4. Save configurations

from mmengine import Config
cfg = Config.fromfile('tests/data/config/py_config/config.py')
# temp_dir/vis_data/config.py will be saved
visualizer.add_config(cfg)

Various Storage Backends

Any Visualizer can be configured with any number of storage backends. Visualizer will loop through all the configured backends and save the results to each one.

visualizer = Visualizer(image=image, vis_backends=[dict(type='TensorboardVisBackend'),
                                                   dict(type='LocalVisBackend')],
                        save_dir='temp_dir')
# temp_dir/vis_data/events.out.tfevents.xxx files will be generated
visualizer.draw_bboxes(torch.tensor([[33, 120, 209, 220], [72, 13, 179, 147]]))
visualizer.draw_texts("cat and dog", torch.tensor([10, 20]))
visualizer.draw_circles(torch.tensor([40, 50]), torch.tensor([20]))

visualizer.add_image('demo', visualizer.get_image())

Note: if multiple backends are used at the same time, the name field must be specified; otherwise, it is impossible to distinguish the backends.

visualizer = Visualizer(image=image, vis_backends=[dict(type='TensorboardVisBackend', name='tb_1', save_dir='temp_dir_1'),
                                                   dict(type='TensorboardVisBackend', name='tb_2', save_dir='temp_dir_2'),
                                                   dict(type='LocalVisBackend', name='local')],
                        save_dir='temp_dir')

Visualize Anywhere

During development, users may need to add visualization functionality somewhere in their code and save the results to different backends, which is very common for analysis and debugging. The Visualizer in MMEngine is globally accessible: the same visualizer instance can be obtained anywhere in the code and used to visualize data.

Users only need to instantiate the visualizer through get_instance during initialization. The visualizer obtained this way is unique and globally accessible. Then it can be accessed anywhere in the code through Visualizer.get_current_instance().

# call during the initialization stage
visualizer1 = Visualizer.get_instance(name='vis', vis_backends=[dict(type='LocalVisBackend')])

# call anywhere
visualizer2 = Visualizer.get_current_instance()
visualizer2.add_scalar('map', 0.7, step=0)

assert id(visualizer1) == id(visualizer2)

It can also be initialized globally through the config field.

from mmengine.registry import VISUALIZERS

visualizer_cfg=dict(
                type='Visualizer',
                name='vis_new',
                vis_backends=[dict(type='LocalVisBackend')])
VISUALIZERS.build(visualizer_cfg)
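
Assuming the instance built from the config is registered globally as described above, it can then be fetched anywhere in the same way:

visualizer = Visualizer.get_current_instance()
visualizer.add_scalar('acc', 0.9, step=0)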

Customize Storage Backends and Visualizers

  1. Call a specific storage backend

The storage backends only provide basic functions such as saving configurations and scalars. However, users may want to use the more powerful features of backends like WandB and Tensorboard. Therefore, each storage backend provides an experiment attribute through which users can obtain the underlying backend object and use its full functionality.

For example, WandB provides an API to display tables. Users can obtain the WandB object through the experiment attribute and then call its API to save the data as a table:

visualizer = Visualizer(image=image, vis_backends=[dict(type='WandbVisBackend')],
                        save_dir='temp_dir')

# get WandB object
wandb = visualizer.get_backend('WandbVisBackend').experiment
# add data to the table
table = wandb.Table(columns=["step", "mAP"])
table.add_data(1, 0.2)
table.add_data(2, 0.5)
table.add_data(3, 0.9)
# save
wandb.log({"table": table})

  2. Customize storage backends

Users only need to inherit BaseVisBackend and implement various add_xx methods to customize the storage backend easily.

from mmengine.registry import VISBACKENDS
from mmengine.visualization import BaseVisBackend

@VISBACKENDS.register_module()
class DemoVisBackend(BaseVisBackend):
    def add_image(self, **kwargs):
        pass

visualizer = Visualizer(vis_backends=[dict(type='DemoVisBackend')], save_dir='temp_dir')
visualizer.add_image('demo', image)

  3. Customize visualizers

Similarly, users can easily customize the visualizer by inheriting Visualizer and implementing the functions they want to override.

In most cases, users need to override add_datasample. The data usually includes detection bboxes and instance masks from annotations or model predictions. This interface draws datasample data for the various downstream libraries. Taking MMDetection as an example, the datasample data usually includes labeled bboxes, labeled masks, predicted bboxes, and predicted masks. MMDetection inherits Visualizer and implements the add_datasample interface to draw the data related to the detection task.

from mmengine.registry import VISUALIZERS

@VISUALIZERS.register_module()
class DetLocalVisualizer(Visualizer):
    def add_datasample(self,
                       name,
                       image: np.ndarray,
                       data_sample: Optional['BaseDataElement'] = None,
                       draw_gt: bool = True,
                       draw_pred: bool = True,
                       show: bool = False,
                       wait_time: int = 0,
                       step: int = 0) -> None:
        pass

visualizer_cfg = dict(
    type='DetLocalVisualizer', vis_backends=[dict(type='WandbVisBackend')], name='visualizer')

# global initialize
VISUALIZERS.build(visualizer_cfg)

# call anywhere in your code
det_local_visualizer = Visualizer.get_current_instance()
det_local_visualizer.add_datasample('det', image, data_sample)

Abstract Data Element

Coming soon. Please refer to chinese documentation.

Distribution Communication

In distributed training, different processes sometimes need to apply different logic depending on their rank, local_rank, etc. They also need to communicate with each other and synchronize data. These demands rely on distributed communication. PyTorch provides a set of basic distributed communication primitives, and MMEngine builds higher-level APIs on top of them to meet more diverse demands. Using the APIs provided by MMEngine, modules can:

  • ignore the differences between distributed/non-distributed environment

  • deliver data in various types apart from Tensor

  • ignore the frameworks or backends used for communication

These APIs are roughly categorized into 3 types:

  • Initialization: init_dist for setting up distributed environment for the runner

  • Query & control: functions including get_world_size for querying world_size, rank and other distributed information

  • Collective communication: collective communication functions such as all_reduce

We will detail on these APIs in the following chapters.

Initialization

  • init_dist: The launch function for distributed training. Currently it supports 3 launchers: pytorch, slurm and MPI. It also sets up the given communication backend, which defaults to NCCL.
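
For example, a minimal sketch for a script started with the pytorch launcher (e.g. via torchrun), which sets the required environment variables:

from mmengine.dist import init_dist

# Set up the distributed environment; the communication backend defaults to NCCL
init_dist(launcher='pytorch', backend='nccl')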

Query and control

The query and control functions are all argument-free. They can be used in both distributed and non-distributed environments. Their functionalities are listed below:

  • get_world_size: Returns the number of processes in current process group. Returns 1 when non-distributed

  • get_rank: Returns the global rank of current process in current process group. Returns 0 when non-distributed

  • get_backend: Returns the communication backends used by current process group. Returns None when non-distributed

  • get_local_rank: Returns the local rank of current process in current process group. Returns 0 when non-distributed

  • get_local_size: Returns the number of processes which are both in current process group and on the same machine as the current process. Returns 1 when non-distributed

  • get_dist_info: Returns the world_size and rank of the current process group. Returns world_size = 1, rank = 0 when non-distributed

  • is_main_process: Returns True if the current process is rank 0 in the current process group, otherwise False. Always returns True when non-distributed

  • master_only: A function decorator. Functions decorated by master_only will only execute on rank 0 process.

  • barrier: A synchronization primitive. Every process will hold until all processes in the current process group reach the same barrier location
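
A small usage sketch of these functions (runnable as-is in a non-distributed environment, where they fall back to the default values listed above):

from mmengine.dist import get_rank, get_world_size, is_main_process, master_only

@master_only
def save_checkpoint_once():
    # Executed only on the rank 0 process
    print('saving checkpoint on the main process')

print(f'rank {get_rank()} of {get_world_size()} process(es)')
if is_main_process():
    save_checkpoint_once()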

Collective communication

Collective communication functions are used for data transfer between processes in the same process group. We provide the following APIs based on PyTorch native functions including all_reduce, all_gather, gather, broadcast. These APIs are compatible with non-distributed environment and support more data types apart from Tensor.

  • all_reduce: AllReduce operation on Tensors in the current process group

  • all_gather: AllGather operation on Tensors in the current process group

  • gather: Gather Tensors in the current process group to a destination rank

  • broadcast: Broadcast a Tensor to all processes in the current process group

  • sync_random_seed: Synchronize random seed between processes in the current process group

  • broadcast_object_list: Broadcast a list of Python objects. It requires the objects to be serializable by Pickle.

  • all_reduce_dict: AllReduce operation on dict. It is based on broadcast and all_reduce.

  • all_gather_object: AllGather operation on any Python object that can be serialized by Pickle. It is based on all_gather

  • gather_object: Gather Python objects that can be serialized by Pickle

  • collect_results: Unified API for collecting a list of data from the current process group. It supports both CPU and GPU communication
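
A small sketch of the collective functions (in a non-distributed environment they degenerate to no-ops or single-element results):

import torch
from mmengine.dist import all_gather, all_reduce, all_reduce_dict

data = torch.ones(2)
all_reduce(data)             # in-place sum across the process group
gathered = all_gather(data)  # a list containing one tensor per process

stats = {'loss': torch.tensor(1.0)}
all_reduce_dict(stats)       # AllReduce every value of the dict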

Logging

Runner produces a lot of logs during running, such as loss, iteration time and learning rate. MMEngine implements a flexible logging system that allows us to choose different types of log statistical methods when configuring the runner, and to set/get the recorded logs at any location in the code.

Flexible Logging System

The logging system is configured by passing a LogProcessor to the runner. If no log processor is passed, the runner uses the default one, which is equivalent to:

log_processor = dict(window_size=10, by_epoch=True, custom_cfg=None, num_digits=4)

The format of the output log is as follows:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from mmengine.runner import Runner
from mmengine.model import BaseModel

train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)


class ToyModel(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        loss1 = (feat - label).pow(2)
        loss2 = (feat - label).abs()
        return dict(loss1=loss1, loss2=loss2)

runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01))
)
runner.train()
08/21 02:58:41 - mmengine - INFO - Epoch(train) [1][10/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0019  data_time: 0.0004  loss1: 0.8381  loss2: 0.9007  loss: 1.7388
08/21 02:58:41 - mmengine - INFO - Epoch(train) [1][20/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0029  data_time: 0.0010  loss1: 0.1978  loss2: 0.4312  loss: 0.6290

LogProcessor will output the log in the following format:

  • The prefix of the log:

    • epoch mode (by_epoch=True): Epoch(train) [{current_epoch}][{current_iteration}/{dataloader_length}]

    • iteration mode (by_epoch=False): Iter(train) [{current_iteration}/{max_iteration}]

  • Learning rate (lr): The learning rate of the last iteration.

  • Time:

    • time: The averaged time for inference of the last window_size iterations.

    • data_time: The averaged time for loading data of the last window_size iterations.

    • eta: The estimated time of arrival to finish the training.

  • Loss: The averaged loss output by model of the last window_size iterations.

Note

window_size=10 by default.

The significant digits (num_digits) of the log are 4 by default.

The latest values of all custom logs are output by default.

Based on the rules above, the code snippet reports the average values of loss1 and loss2 over every 10 iterations.

If we want to count the global average value of loss1, we can set custom_cfg like this:

runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(
        custom_cfg=[
            dict(data_src='loss1',  # original loss name: loss1
                 method_name='mean',  # statistical method: mean
                 window_size='global')])  # window_size: global
)
runner.train()
08/21 02:58:49 - mmengine - INFO - Epoch(train) [1][10/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0026  data_time: 0.0007  loss1: 0.7381  loss2: 0.8446  loss: 1.5827
08/21 02:58:49 - mmengine - INFO - Epoch(train) [1][20/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0030  data_time: 0.0012  loss1: 0.4521  loss2: 0.3939  loss: 0.5600

data_src is the original loss name, method_name is the statistical method, and window_size is the window size of the statistical method. Since we want the global average of loss1, we set window_size to global.

Currently, MMEngine supports the following statistical methods:

statistic method  arguments    function
mean              window_size  compute the mean of the log over the last window_size iterations
min               window_size  compute the minimum of the log over the last window_size iterations
max               window_size  compute the maximum of the log over the last window_size iterations
current           /            use the latest value of the log

window_size mentioned above could be:

  • int number: The window size of the statistic method.

  • global: Equivalent to window_size=cur_iteration.

  • epoch: Equivalent to window_size=len(dataloader).

If we want the average value of loss1 over the last 10 iterations and its global average at the same time, we need to additionally set log_name:

runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(
        custom_cfg=[
            # log_name is an additional output name for loss1
            dict(data_src='loss1', log_name='loss1_global', method_name='mean', window_size='global')])
)
runner.train()
08/21 18:39:32 - mmengine - INFO - Epoch(train) [1][10/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0016  data_time: 0.0004  loss1: 0.1512  loss2: 0.3751  loss: 0.5264  loss1_global: 0.1512
08/21 18:39:32 - mmengine - INFO - Epoch(train) [1][20/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0051  data_time: 0.0036  loss1: 0.0113  loss2: 0.0856  loss: 0.0970  loss1_global: 0.0813

Similarly, we can also compute the local and global maximum values of loss1 at the same time.

runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(custom_cfg=[
        # statistic loss1 with the local maximum value
        dict(data_src='loss1',
             log_name='loss1_local_max',
             window_size=10,
             method_name='max'),
        # statistic loss1 with the global maximum value
        dict(
            data_src='loss1',
            log_name='loss1_global_max',
            method_name='max',
            window_size='global')
    ]))
runner.train()
08/21 03:17:26 - mmengine - INFO - Epoch(train) [1][10/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0021  data_time: 0.0006  loss1: 1.8495  loss2: 1.3427  loss: 3.1922  loss1_local_max: 2.8872  loss1_global_max: 2.8872
08/21 03:17:26 - mmengine - INFO - Epoch(train) [1][20/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0024  data_time: 0.0010  loss1: 0.5464  loss2: 0.7251  loss: 1.2715  loss1_local_max: 2.8872  loss1_global_max: 2.8872

More examples can be found in log_processor.

Customize log

The logging system can not only record the loss, lr, etc., but also collect and output custom logs. For example, to compute statistics of an intermediate loss:

from mmengine.logging import MessageHub


class ToyModel(BaseModel):

    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        loss_tmp = (feat - label).abs()
        loss = loss_tmp.pow(2)

        message_hub = MessageHub.get_current_instance()
        # update the intermediate `loss_tmp` in the message hub
        message_hub.update_scalar('train/loss_tmp', loss_tmp.sum())
        return dict(loss=loss)


runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(
        custom_cfg=[
        # statistic the loss_tmp with the averaged value
            dict(
                data_src='loss_tmp',
                window_size=10,
                method_name='mean')
        ]
    )
)
runner.train()
08/21 03:40:31 - mmengine - INFO - Epoch(train) [1][10/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0026  data_time: 0.0008  loss_tmp: 0.0097  loss: 0.0000
08/21 03:40:31 - mmengine - INFO - Epoch(train) [1][20/25]  lr: 1.0000e-02  eta: 0:00:00  time: 0.0028  data_time: 0.0013  loss_tmp: 0.0065  loss: 0.0000

The custom log is recorded by updating the message hub:

  1. Call MessageHub.get_current_instance() to get the message hub of the runner

  2. Call MessageHub.update_scalar to update the custom log. The first argument is the log name with the mode prefix (train/val/test). The output log only retains the log name without the mode prefix.

  3. Configure the statistical method of loss_tmp in log_processor. If it is not configured, only the latest value of loss_tmp will be logged.
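
The recorded values can also be read back from the same message hub elsewhere in the code. A sketch, assuming the history buffer returned by get_scalar exposes the statistics methods of HistoryBuffer:

from mmengine.logging import MessageHub

message_hub = MessageHub.get_current_instance()
log_buffer = message_hub.get_scalar('train/loss_tmp')
print(log_buffer.current())  # the latest recorded value
print(log_buffer.mean(10))   # mean over the last 10 records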

Export the debug log

Set log_level='DEBUG' for the runner, and the debug log will be exported to the work_dir:

runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    log_level='DEBUG',
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)))
runner.train()
08/21 18:16:22 - mmengine - DEBUG - Get class `LocalVisBackend` from "vis_backend" registry in "mmengine"
08/21 18:16:22 - mmengine - DEBUG - An `LocalVisBackend` instance is built from registry, its implementation can be found in mmengine.visualization.vis_backend
08/21 18:16:22 - mmengine - DEBUG - Get class `RuntimeInfoHook` from "hook" registry in "mmengine"
08/21 18:16:22 - mmengine - DEBUG - An `RuntimeInfoHook` instance is built from registry, its implementation can be found in mmengine.hooks.runtime_info_hook
08/21 18:16:22 - mmengine - DEBUG - Get class `IterTimerHook` from "hook" registry in "mmengine"
...

Besides, if you are training with shared storage, the logs of different ranks will be saved in debug mode. The hierarchy of the logs is as follows:

./tmp
├── tmp.log
├── tmp_rank1.log
├── tmp_rank2.log
├── tmp_rank3.log
├── tmp_rank4.log
├── tmp_rank5.log
├── tmp_rank6.log
└── tmp_rank7.log
...
└── tmp_rank63.log

The logs of multiple machines with independent storage:

# device: 0:
work_dir/
└── exp_name_logs
    ├── exp_name.log
    ├── exp_name_rank1.log
    ├── exp_name_rank2.log
    ├── exp_name_rank3.log
    ...
    └── exp_name_rank7.log

# device: 7:
work_dir/
└── exp_name_logs
    ├── exp_name_rank56.log
    ├── exp_name_rank57.log
    ├── exp_name_rank58.log
    ...
    └── exp_name_rank63.log

File IO

MMEngine implements a unified set of file reading and writing interfaces in the fileio module. With the fileio module, we can use the same function to handle different file formats, such as json, yaml and pickle. Other file formats can also be extended easily.

The fileio module also supports reading and writing files from a variety of file storage backends, including disk, Petrel (for internal use), Memcached, LMDB, and HTTP.

Load and dump data

MMEngine provides a universal API for loading and dumping data, currently supported formats are json, yaml, and pickle.

Load from disk or dump to disk

from mmengine import load, dump

# load data from a file
data = load('test.json')
data = load('test.yaml')
data = load('test.pkl')
# load data from a file-like object
with open('test.json', 'r') as f:
    data = load(f, file_format='json')

# dump data to a string
json_str = dump(data, file_format='json')

# dump data to a file with a filename (infer format from file extension)
dump(data, 'out.pkl')

# dump data to a file with a file-like object
with open('test.yaml', 'w') as f:
    dump(data, f, file_format='yaml')

Load from other backends or dump to other backends

from mmengine import load, dump

# load data from a file
data = load('s3://bucket-name/test.json')
data = load('s3://bucket-name/test.yaml')
data = load('s3://bucket-name/test.pkl')

# dump data to a file with a filename (infer format from file extension)
dump(data, 's3://bucket-name/out.pkl')

It is also very convenient to extend the API to support more file formats. All you need to do is write a file handler inheriting from BaseFileHandler and register it with one or several file formats.

from mmengine import register_handler, BaseFileHandler

# To register multiple file formats, a list can be used as the argument.
# @register_handler(['txt', 'log'])
@register_handler('txt')
class TxtHandler1(BaseFileHandler):

    def load_from_fileobj(self, file):
        return file.read()

    def dump_to_fileobj(self, obj, file):
        file.write(str(obj))

    def dump_to_str(self, obj, **kwargs):
        return str(obj)
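
After registration, load and dump dispatch to the new handler based on the file extension. A quick sketch:

from mmengine import dump, load

dump('hello world', 'demo.txt')  # handled by TxtHandler1.dump_to_fileobj
print(load('demo.txt'))          # 'hello world'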

Here is an example of PickleHandler:

from mmengine import BaseFileHandler
import pickle

class PickleHandler(BaseFileHandler):

    def load_from_fileobj(self, file, **kwargs):
        return pickle.load(file, **kwargs)

    def load_from_path(self, filepath, **kwargs):
        return super(PickleHandler, self).load_from_path(
            filepath, mode='rb', **kwargs)

    def dump_to_str(self, obj, **kwargs):
        kwargs.setdefault('protocol', 2)
        return pickle.dumps(obj, **kwargs)

    def dump_to_fileobj(self, obj, file, **kwargs):
        kwargs.setdefault('protocol', 2)
        pickle.dump(obj, file, **kwargs)

    def dump_to_path(self, obj, filepath, **kwargs):
        super(PickleHandler, self).dump_to_path(
            obj, filepath, mode='wb', **kwargs)

Load a text file as a list or dict

For example a.txt is a text file with 5 lines.

a
b
c
d
e

Load from disk

Use list_from_file to load the list from a.txt:

from mmengine import list_from_file

print(list_from_file('a.txt'))
# ['a', 'b', 'c', 'd', 'e']
print(list_from_file('a.txt', offset=2))
# ['c', 'd', 'e']
print(list_from_file('a.txt', max_num=2))
# ['a', 'b']
print(list_from_file('a.txt', prefix='/mnt/'))
# ['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']

For example b.txt is a text file with 3 lines.

1 cat
2 dog cow
3 panda

Then use dict_from_file to load the dict from b.txt:

from mmengine import dict_from_file

print(dict_from_file('b.txt'))
# {'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
print(dict_from_file('b.txt', key_type=int))
# {1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}

Load from other backends

Use list_from_file to load the list from s3://bucket-name/a.txt:

from mmengine import list_from_file

print(list_from_file('s3://bucket-name/a.txt'))
# ['a', 'b', 'c', 'd', 'e']
print(list_from_file('s3://bucket-name/a.txt', offset=2))
# ['c', 'd', 'e']
print(list_from_file('s3://bucket-name/a.txt', max_num=2))
# ['a', 'b']
print(list_from_file('s3://bucket-name/a.txt', prefix='/mnt/'))
# ['/mnt/a', '/mnt/b', '/mnt/c', '/mnt/d', '/mnt/e']

Use dict_from_file to load the dict from s3://bucket-name/b.txt.

from mmengine import dict_from_file

print(dict_from_file('s3://bucket-name/b.txt'))
# {'1': 'cat', '2': ['dog', 'cow'], '3': 'panda'}
print(dict_from_file('s3://bucket-name/b.txt', key_type=int))
# {1: 'cat', 2: ['dog', 'cow'], 3: 'panda'}

Load and dump checkpoints

We can read checkpoints from disk or from the Internet in the following way:

import torch
import torch.utils.model_zoo

filepath1 = '/path/of/your/checkpoint1.pth'
filepath2 = 'http://path/of/your/checkpoint3.pth'

# read checkpoints from disk
checkpoint = torch.load(filepath1)
# save checkpoints to disk
torch.save(checkpoint, filepath1)

# read checkpoints from internet
checkpoint = torch.utils.model_zoo.load_url(filepath2)

In MMEngine, reading and writing checkpoints in different storage forms can be uniformly implemented with load_checkpoint and save_checkpoint:

from mmengine import load_checkpoint, save_checkpoint

filepath1 = '/path/of/your/checkpoint1.pth'
filepath2 = 's3://bucket-name/path/of/your/checkpoint1.pth'
filepath3 = 'http://path/of/your/checkpoint3.pth'

# read checkpoints from disk
checkpoint = load_checkpoint(filepath1)
# save checkpoints to disk
save_checkpoint(checkpoint, filepath1)

# read checkpoints from s3
checkpoint = load_checkpoint(filepath2)
# save checkpoints to s3
save_checkpoint(checkpoint, filepath2)

# read checkpoints from internet
checkpoint = load_checkpoint(filepath3)

Global manager (ManagerMixin)

During the training process, it is inevitable that we need to access some variables globally. Here are some examples:

  • Accessing the logger in model to print some initialization information

  • Accessing the Visualizer anywhere to visualize the predictions and feature maps.

  • Accessing the scope in Registry to get the current scope.

To unify the mechanism for getting global variables built from different classes, MMEngine designs the ManagerMixin.

Interface introduction

  • get_instance(name='', **kwargs): Create or get the instance by name.

  • get_current_instance(): Get the currently built instance.

  • instance_name: Get the name of the instance.

How to use

  1. Define a class inherited from ManagerMixin

from mmengine.utils import ManagerMixin


class GlobalClass(ManagerMixin):
    def __init__(self, name, value):
        super().__init__(name)
        self.value = value

Note

Subclasses of ManagerMixin must accept a name argument in __init__. The name argument is used to identify the instance, and you can get the instance by get_instance(name).

  2. Instantiate the instance anywhere. Let's take a hook as an example:

from mmengine import Hook

class CustomHook(Hook):
    def before_run(self, runner):
        GlobalClass.get_instance('mmengine', value=50)
        GlobalClass.get_instance(runner.experiment_name, value=100)

GlobalClass.get_instance({name}) will first check whether an instance with the name {name} has been built. If not, it builds a new instance with that name; otherwise, it returns the existing instance. As the above example shows, when we call GlobalClass.get_instance('mmengine') for the first time, it builds a new instance named mmengine. When we then call GlobalClass.get_instance(runner.experiment_name), it builds another new instance with a different name.

Here we build two instances for the convenience of the subsequent introduction of get_current_instance.

  3. Access the instance anywhere

import torch.nn as nn


class CustomModule(nn.Module):
    def forward(self, x):
        value = GlobalClass.get_current_instance().value
        # Since the name of the latest built instance is
        # `runner.experiment_name`, value will be 100.

        value = GlobalClass.get_instance('mmengine').value
        # The value of instance with the name mmengine is 50.

        value = GlobalClass.get_instance('mmengine', value=1000).value
        # The `mmengine` instance has already been built, so an error will
        # be raised if `get_instance` is called with extra construction
        # arguments.

We can get the instance with the specified name by get_instance(name), or get the currently built instance by get_current_instance anywhere.

Warning

If the instance with the specified name has already been built, get_instance will raise an error when it is called with construction parameters again.

Use modules from other libraries

Based on MMEngine's Registry and Config, users can build modules across libraries. For example, use MMClassification's backbones in MMDetection, MMDetection's data transforms in MMRotate, or MMDetection's detectors in MMTracking.

Modules registered in the same registry tree can be called across libraries by adding the package name prefix before the module’s type in the config. Here are some common examples:

Use backbone across libraries

Taking the example of using MMClassification’s ConvNeXt in MMDetection:

First, add the custom_imports field to the config to register the backbones of MMClassification to the registry.

Second, add the package name of MMClassification, mmcls, as a prefix to the backbone's type: mmcls.ConvNeXt

# Use custom_imports to register mmcls models to the registry
custom_imports = dict(imports=['mmcls.models'], allow_failed_imports=False)

model = dict(
  type='MaskRCNN',
  data_preprocessor=dict(...),
  backbone=dict(
      type='mmcls.ConvNeXt', # Add mmcls prefix to enable cross-library mechanism
      arch='tiny',
      out_indices=[0, 1, 2, 3],
      drop_path_rate=0.4,
      layer_scale_init_value=1.0,
      gap_before_final_norm=False,
      init_cfg=dict(
          type='Pretrained',
          checkpoint=
          'https://download.openmmlab.com/mmclassification/v0/convnext/downstream/convnext-tiny_3rdparty_32xb128-noema_in1k_20220301-795e9634.pth',
          prefix='backbone.')),
  neck=dict(...),
  rpn_head=dict(...))

Use data transform across libraries

As with the example of backbone above, cross-library calls can be simply achieved by adding custom_imports and prefix in the config:

# Use custom_imports to register mmdet transforms to the registry
custom_imports = dict(imports=['mmdet.datasets.transforms'], allow_failed_imports=False)

# Add mmdet prefix to enable cross-library mechanism
train_pipeline=[
    dict(type='mmdet.LoadImageFromFile'),
    dict(type='mmdet.LoadAnnotations', with_bbox=True, box_type='qbox'),
    dict(type='ConvertBoxType', box_type_mapping=dict(gt_bboxes='rbox')),
    dict(type='mmdet.Resize', scale=(1024, 1024), keep_ratio=True),
    dict(type='mmdet.RandomFlip', prob=0.5),
    dict(type='mmdet.PackDetInputs')
]

Use detector across libraries

Using an algorithm from another library is a little more complex.

An algorithm contains multiple submodules, and each submodule needs the prefix added to its type. Take using MMDetection's YOLOX in MMTracking as an example:

# Use custom_imports to register mmdet models to the registry
custom_imports = dict(imports=['mmdet.models'], allow_failed_imports=False)

model = dict(
    type='mmdet.YOLOX',
    backbone=dict(type='mmdet.CSPDarknet', deepen_factor=1.33, widen_factor=1.25),
    neck=dict(
        type='mmdet.YOLOXPAFPN',
        in_channels=[320, 640, 1280],
        out_channels=320,
        num_csp_blocks=4),
    bbox_head=dict(
        type='mmdet.YOLOXHead', num_classes=1, in_channels=320, feat_channels=320),
    train_cfg=dict(assigner=dict(type='mmdet.SimOTAAssigner', center_radius=2.5)))

To avoid adding the prefix to every submodule manually, the _scope_ keyword is introduced. When the _scope_ keyword is added to the config of a module, the scope of all its submodules is switched accordingly. Here is an example config:

# Use custom_imports to register mmdet models to the registry
custom_imports = dict(imports=['mmdet.models'], allow_failed_imports=False)

model = dict(
    _scope_='mmdet', # use the _scope_ keyword to avoid adding prefix to all submodules
    type='YOLOX',
    backbone=dict(type='CSPDarknet', deepen_factor=1.33, widen_factor=1.25),
    neck=dict(
        type='YOLOXPAFPN',
        in_channels=[320, 640, 1280],
        out_channels=320,
        num_csp_blocks=4),
    bbox_head=dict(
        type='YOLOXHead', num_classes=1, in_channels=320, feat_channels=320),
    train_cfg=dict(assigner=dict(type='SimOTAAssigner', center_radius=2.5)))

These two examples are equivalent to each other.

If you want to know more about the registry and config, please refer to the Config Tutorial and the Registry Tutorial.

Test time augmentation

Test time augmentation (TTA) is a data augmentation strategy used during the testing phase. It involves applying various augmentations, such as flipping and scaling, to the same image and then merging the predictions of each augmented image to produce a more accurate prediction. To make it easier for users to use TTA, MMEngine provides BaseTTAModel class, which allows users to implement different TTA strategies by simply extending the BaseTTAModel class according to their needs.

The core implementation of TTA is usually divided into two parts:

  1. Data augmentation: This part is implemented in MMCV, see the api docs TestTimeAug for more information.

  2. Merge the predictions: The subclasses of BaseTTAModel will merge the predictions of enhanced data in the test_step method to improve the accuracy of predictions.

Get started

A simple example of TTA is given in examples/test_time_augmentation.py

Prepare test time augmentation pipeline

BaseTTAModel needs to be used with TestTimeAug implemented in MMCV:

tta_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='TestTimeAug',
        transforms=[
            [dict(type='Resize', img_scale=(1333, 800), keep_ratio=True)],
            [dict(type='RandomFlip', flip_ratio=0.),
             dict(type='RandomFlip', flip_ratio=1.)],
            [dict(type='PackXXXInputs', keys=['img'])],
        ])
]

The above data augmentation pipeline first resizes the image and then produces two flip variants (flipped and not flipped). Finally, each resulting image is packed into the final input using PackXXXInputs.

Define the merge strategy

Commonly, users only need to inherit BaseTTAModel and override BaseTTAModel.merge_preds to merge the predictions of the augmented data. merge_preds accepts a list of augmented batch data, where each element of the list holds the augmented versions of a single sample of the batch.

The BaseTTAModel class runs inference on both flipped and unflipped images and then merges the results. The merge_preds method accepts a list where each element represents the results of applying the data augmentations to a single element of the batch. For example, if batch_size is 3 and we flip each image in the batch as an augmentation, merge_preds would accept a parameter like the following:

# `data_{i}_{j}` represents the result of applying the jth data augmentation to
#  the ith image in the batch. So, if batch_size is 3, i can take on values of
# 0, 1, and 2. If there are 2 augmentation methods
# (such as flipping the image), then j can take on values of 0 and 1.
# For example, data_2_1 would represent the result of applying the second
# augmentation method (flipping) to the third image in the batch.

demo_results = [
    [data_0_0, data_0_1],
    [data_1_0, data_1_1],
    [data_2_0, data_2_1],
]

The merge_preds method merges the predictions in demo_results into single batch results. For example, to merge multiple classification results:

from typing import List

# assuming the data structures of MMClassification (mmcls) for this example
from mmcls.structures import ClsDataSample
from mmengine.model import BaseTTAModel


class AverageClsScoreTTA(BaseTTAModel):
    def merge_preds(
        self,
        data_samples_list: List[List[ClsDataSample]],
    ) -> List[ClsDataSample]:

        merged_data_samples = []
        for data_samples in data_samples_list:
            merged_data_sample: ClsDataSample = data_samples[0].new()
            merged_score = sum(data_sample.pred_label.score
                               for data_sample in data_samples) / len(data_samples)
            merged_data_sample.set_pred_score(merged_score)
            merged_data_samples.append(merged_data_sample)
        return merged_data_samples

The configuration file for the above example is as follows:

tta_model = dict(type='AverageClsScoreTTA')

Changes to test script

In the test script, wrap the model config with the TTA model and switch the test dataloader's pipeline to the TTA pipeline before building the Runner:

cfg.model = ConfigDict(**cfg.tta_model, module=cfg.model)
cfg.test_dataloader.dataset.pipeline = cfg.tta_pipeline

Advanced usage

In general, users who inherit the BaseTTAModel class only need to implement the merge_preds method to perform result fusion. However, for more complex cases, such as fusing the results of a multi-stage detector, it may be necessary to override the test_step method. This requires an understanding of the data flow in the BaseTTAModel class and its relationship with other components.

The relationship between BaseTTAModel and other components

The BaseTTAModel class acts as an intermediary between the DDPWrapper and Model classes. When the Runner.test() method is executed, it will first call DDPWrapper.test_step(), followed by TTAModel.test_step(), and finally model.test_step().

The following diagram illustrates this sequence of method calls:

data flow

After data augmentation with TestTimeAug, the resulting data will have the following format:

image1 = dict(
    inputs=[data_1_1, data_1_2],
    data_sample=[data_sample1_1, data_sample1_2])

image2 = dict(
    inputs=[data_2_1, data_2_2],
    data_sample=[data_sample2_1, data_sample2_2])

image3 = dict(
    inputs=[data_3_1, data_3_2],
    data_sample=[data_sample3_1, data_sample3_2])

where data_{i}_{j} is the enhanced data and data_sample_{i}_{j} is the ground truth of the enhanced data. The data will then be processed by the DataLoader, which yields batches in the following format:

data_batch = dict(
    inputs=[
        (data_1_1, data_2_1, data_3_1),
        (data_1_2, data_2_2, data_3_2),
    ],
    data_samples=[
        (data_samples1_1, data_samples2_1, data_samples3_1),
        (data_samples1_2, data_samples2_2, data_samples3_2),
    ]
)

To facilitate model inference, the BaseTTAModel will convert the data into the following format:

data_batch_aug1 = dict(
    inputs = (data_1_1, data_2_1, data_3_1),
    data_samples=(data_samples1_1, data_samples2_1, data_samples3_1)
)

data_batch_aug2 = dict(
    inputs = (data_1_2, data_2_2, data_3_2),
    data_samples=(data_samples1_2, data_samples2_2, data_samples3_2)
)

At this point, each data_batch_aug can be passed directly to the model for inference. After the model has performed inference, the BaseTTAModel will reorganize the predictions as follows for the convenience of merging:

preds = [
    [data_samples1_1, data_samples1_2],
    [data_samples2_1, data_samples2_2],
    [data_samples3_1, data_samples3_2],
]

Now that we understand the data flow in TTA, we can override the BaseTTAModel.test_step() method to implement more complex fusion strategies based on specific requirements.
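
A simplified sketch of what such an override needs to do, mirroring the data flow above (illustrative only, not the verbatim MMEngine implementation):

from mmengine.model import BaseTTAModel


class CustomTTAModel(BaseTTAModel):

    def test_step(self, data):
        # split the batch into one sub-batch per augmentation, i.e.
        # data_batch -> [data_batch_aug1, data_batch_aug2, ...]
        num_augs = len(data['inputs'])
        data_list = [
            dict(inputs=data['inputs'][idx],
                 data_samples=data['data_samples'][idx])
            for idx in range(num_augs)
        ]
        # run the wrapped model on every augmented sub-batch
        predictions = [self.module.test_step(d) for d in data_list]
        # regroup the predictions per image and merge them
        return self.merge_preds([list(pred) for pred in zip(*predictions)])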

Hook

Hook programming is a programming pattern in which a mount point is set in one or more locations of a program. When the program reaches a mount point, all methods registered to it at runtime are called automatically. Hook programming increases the flexibility and extensibility of a program, since users can register custom methods to a mount point without modifying the program's code.

Examples

Here is an example of how it works.

pre_hooks = [(print, 'hello')]
post_hooks = [(print, 'goodbye')]

def main():
    for func, arg in pre_hooks:
        func(arg)
    print('do something here')
    for func, arg in post_hooks:
        func(arg)

main()

Output of the above example:

hello
do something here
goodbye

As we can see, the main function calls print defined in the hooks at two locations without being modified itself.
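
Because main only iterates over the hook lists, new behavior can be attached at runtime without touching main itself:

# register one more pre-hook at runtime
pre_hooks.append((print, 'hello again'))
main()
# hello
# hello again
# do something here
# goodbye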

Hooks are also used everywhere in PyTorch, for example in the neural network module (nn.Module), to get the forward input and output of a module as well as its backward input and output. For example, the register_forward_hook method registers a forward hook with a module, and the hook can access the forward input and output of that module.

The following is an example of the register_forward_hook usage.

import torch
import torch.nn as nn

def forward_hook_fn(
    module,  # the module the hook is registered to
    input,   # forward input of the module
    output,  # forward output of the module
):
    print(f'"forward_hook_fn" is invoked by {module}')
    print('weight:', module.weight.data)
    print('bias:', module.bias.data)
    print('input:', input)
    print('output:', output)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 1)

    def forward(self, x):
        y = self.fc(x)
        return y

model = Model()
# Register forward_hook_fn to each submodule of model
for module in model.children():
    module.register_forward_hook(forward_hook_fn)

x = torch.Tensor([[0.0, 1.0, 2.0]])
y = model(x)

Output of the above example:

"forward_hook_fn" is invoked by Linear(in_features=3, out_features=1, bias=True)
weight: tensor([[-0.4077,  0.0119, -0.3606]])
bias: tensor([-0.2943])
input: (tensor([[0., 1., 2.]]),)
output: tensor([[-1.0036]], grad_fn=<AddmmBackward>)

We can see that the forward_hook_fn hook registered to the nn.Linear module is called, and in that hook the weights, biases, inputs, and outputs of the Linear module are printed. For more information on the use of PyTorch hooks, you can read nn.Module.

Design on MMEngine

Before introducing the design of the Hook in MMEngine, let’s briefly introduce the basic steps of model training using PyTorch (copied from PyTorch Tutorials).

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    pass

class Net(nn.Module):
    pass

def main():
    transform = transforms.ToTensor()
    train_dataset = CustomDataset(transform=transform, ...)
    val_dataset = CustomDataset(transform=transform, ...)
    test_dataset = CustomDataset(transform=transform, ...)
    train_dataloader = DataLoader(train_dataset, ...)
    val_dataloader = DataLoader(val_dataset, ...)
    test_dataloader = DataLoader(test_dataset, ...)

    net = Net()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    for i in range(max_epochs):
        for inputs, labels in train_dataloader:
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        with torch.no_grad():
            for inputs, labels in val_dataloader:
                outputs = net(inputs)
                loss = criterion(outputs, labels)

    with torch.no_grad():
        for inputs, labels in test_dataloader:
            outputs = net(inputs)
            accuracy = ...

The above pseudo-code shows the basic steps to train a model. If we want to add custom operations to it, we need to modify and extend the main function continuously. To increase the flexibility and extensibility of the main function, we can insert mount points into it and implement the logic of calling hooks at the corresponding mount points. In this case, we only need to insert hooks into these locations to implement custom logic, such as loading model weights, updating model parameters, etc.

def main():
    ...
    call_hooks('before_run', hooks)
    call_hooks('after_load_checkpoint', hooks)
    call_hooks('before_train', hooks)
    for i in range(max_epochs):
        call_hooks('before_train_epoch', hooks)
        for inputs, labels in train_dataloader:
            call_hooks('before_train_iter', hooks)
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            call_hooks('after_train_iter', hooks)
            loss.backward()
            optimizer.step()
        call_hooks('after_train_epoch', hooks)

        call_hooks('before_val_epoch', hooks)
        with torch.no_grad():
            for inputs, labels in val_dataloader:
                call_hooks('before_val_iter', hooks)
                outputs = net(inputs)
                loss = criterion(outputs, labels)
                call_hooks('after_val_iter', hooks)
        call_hooks('after_val_epoch', hooks)

        call_hooks('before_save_checkpoint', hooks)
    call_hooks('after_train', hooks)

    call_hooks('before_test_epoch', hooks)
    with torch.no_grad():
        for inputs, labels in test_dataloader:
            call_hooks('before_test_iter', hooks)
            outputs = net(inputs)
            accuracy = ...
            call_hooks('after_test_iter', hooks)
    call_hooks('after_test_epoch', hooks)

    call_hooks('after_run', hooks)

In MMEngine, we encapsulate the training process into an executor (Runner). The Runner calls hooks at specific mount points to execute the customization logic. For more information about the Runner, please read the Runner documentation.

To facilitate management, MMEngine defines mount points as methods and integrates them into the base Hook. We just need to inherit the base hook, implement custom logic at specific locations according to our needs, and then register the hooks with the Runner. They will be called automatically.

There are 22 mount points in the Base Hook.

  • before_run

  • after_run

  • before_train

  • after_train

  • before_train_epoch

  • after_train_epoch

  • before_train_iter

  • after_train_iter

  • before_val

  • after_val

  • before_val_epoch

  • after_val_epoch

  • before_val_iter

  • after_val_iter

  • before_test

  • after_test

  • before_test_epoch

  • after_test_epoch

  • before_test_iter

  • after_test_iter

  • before_save_checkpoint

  • after_load_checkpoint
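
To give a concrete feel for these mount points, here is a minimal sketch of a custom hook (the hook name and the printed messages are illustrative, not part of MMEngine):

from mmengine.hooks import Hook
from mmengine.registry import HOOKS


@HOOKS.register_module()
class PrintEpochHook(Hook):
    """Illustrative hook acting at two of the mount points above."""

    def before_train_epoch(self, runner):
        print(f'epoch {runner.epoch} is about to start')

    def after_train_epoch(self, runner):
        print(f'epoch {runner.epoch} finished')

After registering it through custom_hooks = [dict(type='PrintEpochHook')] in the config, the Runner calls it automatically at the corresponding mount points.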

Further readings: Hook tutorial and Hook API documentation

Runner

Deep learning algorithms usually share similar pipelines for training, validation and testing. Therefore, MMEngine designed Runner to simplify the construction of these pipelines. In most cases, users can use our default Runner directly. If you find it not feasible to implement your ideas, you can also modify it or customize your own runner.

Before introducing the design of Runner, let's walk through some examples to better understand why we should use it. Below are a few lines of pseudo-code for training models in PyTorch:

model = ResNet()
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
train_dataset = ImageNetDataset(...)
train_dataloader = DataLoader(train_dataset, ...)

for i in range(max_epochs):
    for data_batch in train_dataloader:
        optimizer.zero_grad()
        outputs = model(data_batch)
        loss = loss_func(outputs, data_batch)
        loss.backward()
        optimizer.step()

Pseudo-code for model validation in PyTorch:

model = ResNet()
model.load_state_dict(torch.load(CKPT_PATH))
model.eval()

test_dataset = ImageNetDataset(...)
test_dataloader = DataLoader(test_dataset, ...)

for data_batch in test_dataloader:
    outputs = model(data_batch)
    acc = calculate_acc(outputs, data_batch)

Pseudo-code for model inference in PyTorch:

model = ResNet()
model.load_state_dict(torch.load(CKPT_PATH))
model.eval()

for img in imgs:
    prediction = model(img)

The observation from the above three code snippets is that they are similar. They can all be divided into distinct steps, such as model construction, data loading and loop iterations. Although the above examples are based on image classification tasks, the same holds for many other tasks as well, including object detection, image segmentation, etc. Based on the observation above, we propose the Runner, which structures the training, validation and testing pipelines. With the Runner, the only thing you need to do is prepare the necessary components (models, data, etc.) of your pipeline and leave the scheduling and execution to the Runner. You are freed from constructing similar pipelines over and over again, and from annoying details like the differences between distributed and non-distributed training. You can focus on your own awesome ideas. These are all achieved by the Runner and the various practical modules in MMEngine.

Runner

The Runner in MMEngine contains various modules required for training, testing and validation, as well as loop controllers (Loop) and Hook, as shown in the figure above. It provides 3 APIs for users: train, val and test, each corresponding to a specific Loop. You can use Runner either by providing a config file or by providing manually constructed modules. Once activated, the Runner will automatically set up the runtime environment, build/compose your modules, execute the loop iterations in Loop and call the registered hooks during the iterations.

The execution order of Runner is as follows:

runner_flow

A feature of Runner is that it always lazily initializes the modules it manages. To be specific, Runner won't build every module on initialization; it won't build a module until that module is needed in the current Loop. Therefore, if you are running only one of the train, val, or test pipelines, you only need to provide the relevant configs/modules.

Loop

In MMEngine, we abstract the execution process of the task into Loop, based on the observation that most deep learning tasks can be summarized as a model iterating over datasets. We provide 4 built-in loops in MMEngine:

  • EpochBasedTrainLoop

  • IterBasedTrainLoop

  • ValLoop

  • TestLoop

Loop

The built-in runner and loops are capable of handling most deep learning tasks, but surely not all. Some tasks need extra modifications and refactoring. Therefore, we make it possible for users to customize their own pipelines for model training, validation and testing.

You can write your own pipeline by subclassing BaseLoop, which needs 2 arguments for initialization: 1) runner, the Runner instance, and 2) dataloader, the dataloader used in this loop. You are free to add more arguments to your own loop subclass. After defining your loop subclass, you should register it to LOOPS (mmengine.registry.LOOPS) and specify it in config files via the type field in train_cfg, val_cfg or test_cfg. In fact, you can write any execution order and any hook position in your own loop. However, built-in hooks may not work if you change hook positions, which may lead to inconsistent behavior during training. Therefore, we strongly recommend implementing your subclass with an execution order similar to that illustrated in the figure above, and with the same hook positions defined in the hook documentation.

from mmengine.registry import LOOPS, HOOKS
from mmengine.runner import BaseLoop
from mmengine.hooks import Hook


# Customized validation loop
@LOOPS.register_module()
class CustomValLoop(BaseLoop):
    def __init__(self, runner, dataloader, evaluator, dataloader2):
        super().__init__(runner, dataloader)
        self.evaluator = runner.build_evaluator(evaluator)
        self.dataloader2 = runner.build_dataloader(dataloader2)

    def run(self):
        self.runner.call_hook('before_val_epoch')
        for idx, data_batch in enumerate(self.dataloader):
            self.runner.call_hook(
                'before_val_iter', batch_idx=idx, data_batch=data_batch)
            outputs = self.run_iter(idx, data_batch)
            self.runner.call_hook(
                'after_val_iter', batch_idx=idx, data_batch=data_batch, outputs=outputs)
        metric = self.evaluator.evaluate()

        # add an extra loop for validation purposes
        for idx, data_batch in enumerate(self.dataloader2):
            # call hooks at the new hook positions
            self.runner.call_hook(
                'before_valloader2_iter', batch_idx=idx, data_batch=data_batch)
            outputs = self.run_iter(idx, data_batch)
            # call hooks at the new hook positions
            self.runner.call_hook(
                'after_valloader2_iter', batch_idx=idx, data_batch=data_batch, outputs=outputs)
        metric2 = self.evaluator.evaluate()

        ...

        self.runner.call_hook('after_val_epoch')


# Define a hook with extra hook positions
@HOOKS.register_module()
class CustomValHook(Hook):
    def before_valloader2_iter(self, runner, batch_idx, data_batch):
        ...

    def after_valloader2_iter(self, runner, batch_idx, data_batch, outputs):
        ...

The example above shows how to implement a different validation loop. The new loop validates on two different validation datasets. It also defines a new hook position in the second validation. You can easily use it by setting type='CustomValLoop' in val_cfg in your config file.

# Customized validation loop
val_cfg = dict(type='CustomValLoop', dataloader2=dict(dataset=dict(type='ValDataset2'), ...))
# Customized hook with extra hook position
custom_hooks = [dict(type='CustomValHook')]

Customize Runner

Moreover, you can write your own runner by subclassing Runner if the built-in Runner does not meet your needs. The method is similar to writing other modules: write your subclass inherited from Runner, override some functions, register it to RUNNERS, and select it by assigning runner_type in your config file.

from mmengine.registry import RUNNERS
from mmengine.runner import Runner

@RUNNERS.register_module()
class CustomRunner(Runner):

    def setup_env(self):
        ...

The example above shows how to implement a customized runner that overrides the setup_env function and is registered to RUNNERS. Now CustomRunner can be used by setting runner_type='CustomRunner' in your config file.

Further readings: Runner tutorial and Runner API documentation

Evaluation

Coming soon. Please refer to chinese documentation.

Visualization

1 Overall Design

Visualization provides an intuitive explanation of the training and testing process of the deep learning model. In OpenMMLab, we expect the visualization module to meet the following requirements:

  • Provides rich out-of-the-box features that can meet most computer vision visualization tasks.

  • Versatile, expandable, and can be customized easily

  • Able to visualize anywhere in the training and testing process.

  • Unified APIs for all OpenMMLab libraries, which is convenient for users to understand and use.

Based on the above requirements, we proposed the Visualizer and various VisBackend such as LocalVisBackend, WandbVisBackend, and TensorboardVisBackend in OpenMMLab 2.0. The Visualizer can visualize not only image data but also things like configurations, scalars, and model structures.

  • For convenience, the APIs provided by the Visualizer implement the drawing and storage functions. As an internal property of Visualizer, VisBackend will be called by Visualizer to write data to different backends.

  • Considering that you may want to write data to multiple backends after drawing, Visualizer can be configured with multiple backends. When the user calls the storage API of the Visualizer, it will traverse and call all the specified APIs of VisBackend internally.

The UML diagram of the two is as follows.

2 Visualizer

The external interface of Visualizer can be divided into three categories.

  1. Drawing APIs

  • draw_bboxes draws a single or multiple bounding boxes

  • draw_points draws a single or multiple points

  • draw_texts draws a single or multiple text boxes

  • draw_lines draws a single or multiple line segments

  • draw_circles draws a single or multiple circles

  • draw_polygons draws a single or multiple polygons

  • draw_binary_masks draws a single or multiple binary masks

  • draw_featmap draws a feature map (static method)

The above APIs can be called in a chain except for draw_featmap because the image size may change after this method is called. To avoid confusion, draw_featmap is a static method.
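
As a quick sketch of chained drawing (the dummy image and the coordinates below are arbitrary):

import numpy as np
from mmengine.visualization import Visualizer

# a blank 64x64 RGB image to draw on
image = np.zeros((64, 64, 3), dtype=np.uint8)
visualizer = Visualizer(image=image)

# drawing APIs return the visualizer itself, so calls can be chained
visualizer.draw_bboxes(np.array([8, 8, 56, 56])).draw_texts(
    'demo', positions=np.array([10, 10]))
drawn_image = visualizer.get_image()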

  2. Storage APIs

  • add_config writes configuration to a specific storage backend

  • add_graph writes model graph to a specific storage backend

  • add_image writes image to a specific storage backend

  • add_scalar writes scalar to a specific storage backend

  • add_scalars writes multiple scalars to a specific storage backend at once

  • add_datasample the abstract interface for each repository to draw data samples

Interfaces beginning with the add prefix represent storage APIs. datasample is the unified data interface of the downstream repositories in OpenMMLab 2.0, and add_datasample can process the data sample directly.

  3. Other APIs

  • set_image sets the original image data; the default input image format is RGB

  • get_image gets the image data in Numpy format after drawing; the default output format is RGB

  • show for visualization

  • get_backend gets a specific storage backend by name

  • close closes all resources, including VisBackend

For more details, you can refer to Visualizer Tutorial.

3 VisBackend

After drawing, the drawn data can be stored in multiple visualization storage backends. To unify the interfaces, MMEngine provides an abstract class, BaseVisBackend, and some commonly used backends such as LocalVisBackend, WandbVisBackend, and TensorboardVisBackend. The main interfaces and properties of BaseVisBackend are as follows:

  • add_config writes configuration to a specific storage backend

  • add_graph writes model graph to a specific backend

  • add_image writes image to a specific backend

  • add_scalar writes scalar to a specific backend

  • add_scalars writes multiple scalars to a specific backend at once

  • close closes the resource that has been opened

  • experiment returns the underlying backend object, such as a WandB object or a TensorBoard object

BaseVisBackend defines five common data writing interfaces. Some writing backends are very powerful, such as WandB, which can write tables and videos. Users can directly obtain the experiment object for such needs and then call native APIs of the corresponding backend. LocalVisBackend, WandbVisBackend, and TensorboardVisBackend all inherit from BaseVisBackend and implement their storage functions according to their features. Users can also subclass BaseVisBackend to extend the storage backends and meet custom storage requirements.
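
For instance, writing a scalar through a Visualizer configured with a local backend might look like this (a minimal sketch; save_dir is arbitrary):

from mmengine.visualization import Visualizer

visualizer = Visualizer(
    vis_backends=[dict(type='LocalVisBackend')],
    save_dir='temp_dir')

# `add_scalar` traverses all configured backends internally
visualizer.add_scalar('loss', 0.4, step=1)

# the native backend object can be obtained for backend-specific features
local_backend = visualizer.get_backend('LocalVisBackend')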

For more details, you can refer to Storage Backend Tutorial.

Logging

Overview

Runner produces a large number of logs during execution. These logs include dataset information, model initialization, learning rates, losses, etc. In order to make these logs easily accessible to users, MMEngine designs MessageHub, HistoryBuffer, LogProcessor and MMLogger, which enable:

  • Configure statistical methods in config files. For example, losses can be globally averaged or smoothed by a sliding window.

  • Query training states (iterations, epochs, etc.) in any module

  • Configure whether to save multi-process logs during distributed training.

image

Each scalar (losses, learning rates, etc.) during training is encapsulated by HistoryBuffer, managed by MessageHub in key-value pairs, formatted by LogProcessor and then exported to various visualization backends by LoggerHook. In most cases, statistical methods of these scalars can be configured through the LogProcessor without understanding the data flow. Before diving into the design of the logging system, please read through the logging tutorial first to familiarize yourself with basic use cases.

HistoryBuffer

HistoryBuffer records the history of the corresponding scalar such as losses, learning rates, and iteration time in an array. As an internal class, it works with MessageHub, LoggerHook and LogProcessor to make training log configurable. Meanwhile, HistoryBuffer can also be used alone, which enables users to manage their training logs and do various statistics in an easy manner.

We will first introduce the usage of HistoryBuffer in the following section. The association between HistoryBuffer and MessageHub will be introduced later in the MessageHub section.

HistoryBuffer Initialization

HistoryBuffer accepts log_history, count_history and max_length for initialization.

  • log_history records the history of the scalar. For example, if the loss in the previous 3 iterations is 0.3, 0.2, 0.1 respectively, there will be log_history=[0.3, 0.2, 0.1].

  • count_history controls the statistical granularity and will be used when counting the average. Taking the above example, if we count the average loss across iterations, we have count_history=[1, 1, 1]. Instead, if we count the average loss across images with batch_size=8, then we have count_history=[8, 8, 8].

  • max_length controls the maximum length of the history. If the length of log_history and count_history exceeds max_length, the earliest elements will be removed.

Besides, we can access the history of the data through history_buffer.data.

from mmengine.logging import HistoryBuffer

history_buffer = HistoryBuffer()  # Default initialization
log_history, count_history = history_buffer.data
# [] []
history_buffer = HistoryBuffer([1, 2, 3], [1, 2, 3])  # Init with lists
log_history, count_history = history_buffer.data
# [1 2 3] [1 2 3]
history_buffer = HistoryBuffer([1, 2, 3], [1, 2, 3], max_length=2)
# The length of the history buffer (3) exceeds max_length (2), so the earliest element is dropped.
log_history, count_history = history_buffer.data
# [2 3] [2 3]

HistoryBuffer Update

We can update log_history and count_history through HistoryBuffer.update(log_val, count).

history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.update(4)  # count defaults to 1
log_history, count_history = history_buffer.data
# [1, 2, 3, 4] [1, 1, 1, 1]
history_buffer.update(5, 2)
log_history, count_history = history_buffer.data
# [1, 2, 3, 4, 5] [1, 1, 1, 1, 2]

Basic Statistical Methods

HistoryBuffer provides some basic statistical methods:

  • current(): Get the latest data.

  • mean(window_size=None): Compute the mean of the latest window_size values. Defaults to None, meaning the global mean.

  • max(window_size=None): Compute the maximum of the latest window_size values. Defaults to None, meaning the global maximum.

  • min(window_size=None): Compute the minimum of the latest window_size values. Defaults to None, meaning the global minimum.

history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.min(2)
# 2, the minimum in [2, 3]
history_buffer.min()
# 1, the global minimum

history_buffer.max(2)
# 3, the maximum in [2, 3]
history_buffer.max()
# 3, the global maximum
history_buffer.mean(2)
# 2.5, the mean value in [2, 3], (2 + 3) / (1 + 1)
history_buffer.mean()
# 2, the global mean, (1 + 2 + 3) / (1 + 1 + 1)
history_buffer = HistoryBuffer([1, 2, 3], [2, 2, 2])  # Cases when counts are not 1
history_buffer.mean()
# 1, (1 + 2 + 3) / (2 + 2 + 2)
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.update(4, 1)
history_buffer.current()
# 4

Statistical Methods Invoking

Statistical methods can be accessed through HistoryBuffer.statistics with a method name and arguments. The name parameter should be a registered method name (e.g. built-in methods like min and max), while the arguments should be the corresponding method's arguments.

history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.statistics('mean')
# 2, as global mean
history_buffer.statistics('mean', 2)
# 2.5, as the mean of [2, 3]
history_buffer.statistics('mean', 2, 3)
# Error! mismatched arguments given to `mean(window_size)`
history_buffer.statistics('data')
# Error! `data` method not registered

Statistical Methods Registration

Custom statistical methods can be registered through @HistoryBuffer.register_statistics.

from mmengine.logging import HistoryBuffer
import numpy as np


@HistoryBuffer.register_statistics
def weighted_mean(self, window_size, weight):
    assert len(weight) == window_size
    return (self._log_history[-window_size:] * np.array(weight)).sum() / \
            self._count_history[-window_size:].sum()


history_buffer = HistoryBuffer([1, 2], [1, 1])
history_buffer.statistics('weighted_mean', 2, [2, 1])  # get (2 * 1 + 1 * 2) / (1 + 1)

Use Cases

logs = dict(lr=HistoryBuffer(), loss=HistoryBuffer())  # different keys for different logs
max_iter = 10
log_interval = 5
for iter in range(1, max_iter+1):
    lr = iter / max_iter * 0.1  # linear scaling of lr
    loss = 1 / iter  # loss
    logs['lr'].update(lr, 1)
    logs['loss'].update(loss, 1)
    if iter % log_interval == 0:
        latest_lr = logs['lr'].statistics('current')  # select statistical methods by name
        mean_loss = logs['loss'].statistics('mean', log_interval)  # mean loss of the latest `log_interval` iterations
        print(f'lr:   {latest_lr}\n'
              f'loss: {mean_loss}')
# lr:   0.05
# loss: 0.45666666666666667
# lr:   0.1
# loss: 0.12912698412698415

MessageHub

As shown above, HistoryBuffer can easily handle the update and statistics of a single variable. However, multiple variables need to be logged during training, each potentially coming from a different module, which makes collecting and distributing them an issue. To address this issue, MMEngine provides MessageHub. It is derived from ManagerMixin and thus can be accessed globally. It can be used to simplify the sharing of data across modules.

MessageHub stores data in 2 internal dictionaries, each with its own purpose:

  • log_scalars: Scalars including losses, learning rates and iteration times are collected from different modules and stored in a HistoryBuffer under the corresponding key in this dict. Values in this dict are formatted by LogProcessor and then output to the terminal or saved locally. If you want to customize your logging info, you can add new keys to this dict and update them in the subsequent training steps.

  • runtime_info: Some runtime information including epochs and iterations are stored in this dict. This dict makes it easy to share some necessary information across modules.

Note

You may need to use MessageHub only if you want to add extra data to logs or share custom data across modules.

The following examples show the usage of MessageHub, including scalars update, data sharing and log customization.

Update & get training log

HistoryBuffers are stored in MessageHub's log_scalars dictionary as values. You can call the update_scalar method to update the HistoryBuffer under the given key. On the first call with an unseen key, a HistoryBuffer is initialized; subsequent calls with the same key invoke the corresponding HistoryBuffer's update method. You can get values or statistics of a HistoryBuffer by specifying its key in the get_scalar method. You can also get the full logs by directly accessing the log_scalars attribute of a MessageHub.

from mmengine import MessageHub

message_hub = MessageHub.get_instance('task')
message_hub.update_scalar('train/loss', 1, 1)
message_hub.get_scalar('train/loss').current()  # 1, the latest updated train/loss
message_hub.update_scalar('train/loss', 3, 1)
message_hub.get_scalar('train/loss').mean()  # 2, the mean calculated as (1 + 3) / (1 + 1)
message_hub.update_scalar('train/lr', 0.1, 1)

message_hub.update_scalars({'train/time': {'value': 0.1, 'count': 1},
                            'train/data_time': {'value': 0.1, 'count': 1}})

train_time = message_hub.get_scalar('train/time').current()  # 0.1, the latest train/time

log_dict = message_hub.log_scalars  # return the whole dict
lr_buffer, loss_buffer, time_buffer, data_time_buffer = (
    log_dict['train/lr'], log_dict['train/loss'], log_dict['train/time'],
    log_dict['train/data_time'])

Note

Losses, learning rates and iteration time are automatically updated by runner and hooks. You are not supposed to manually update them.

Note

MessageHub has no special requirements for keys in log_scalars. However, MMEngine will only output a scalar to the logs if its key is prefixed with train/, val/ or test/.

Update & get runtime info

Runtime information is stored in the runtime_info dict, which accepts data of any type. Different from HistoryBuffer, the value is overwritten on every update.

message_hub = MessageHub.get_instance('task')
message_hub.update_info('iter', 1)
message_hub.get_info('iter')  # 1
message_hub.update_info('iter', 2)
message_hub.get_info('iter')  # 2, overwritten by the above command

Share MessageHub across modules

During the execution of a runner, different modules receive and post data through MessageHub. Then, RuntimeInfoHook gathers data such as losses and learning rates before exporting them to user-defined backends (TensorBoard, WandB, etc.). The following is an example showing the communication between the logger hook and other modules.

from mmengine import MessageHub

class LogProcessor:
    # gather data from other modules. similar to logger hook
    def __init__(self, name):
        self.message_hub = MessageHub.get_instance(name)  # access MessageHub

    def run(self):
        print(f"Learning rate is {self.message_hub.get_scalar('train/lr').current()}")
        print(f"loss is {self.message_hub.get_scalar('train/loss').current()}")
        print(f"meta is {self.message_hub.get_info('meta')}")


class LrUpdater:
    # update the learning rate
    def __init__(self, name):
        self.message_hub = MessageHub.get_instance(name)  # access MessageHub

    def run(self):
        self.message_hub.update_scalar('train/lr', 0.001)
        # update the learning rate, saved as HistoryBuffer


class MetaUpdater:
    # update meta information
    def __init__(self, name):
        self.message_hub = MessageHub.get_instance(name)

    def run(self):
        self.message_hub.update_info(
            'meta',
            dict(experiment='retinanet_r50_caffe_fpn_1x_coco.py',
                 repo='mmdetection'))    # meta info will be overwritten on every update


class LossUpdater:
    # update losses
    def __init__(self, name):
        self.message_hub = MessageHub.get_instance(name)

    def run(self):
        self.message_hub.update_scalar('train/loss', 0.1)

class ToyRunner:
    # compose of different modules
    def __init__(self, name):
        self.message_hub = MessageHub.get_instance(name)  # this will create a global MessageHub instance
        self.log_processor = LogProcessor(name)
        self.updaters = [LossUpdater(name),
                         MetaUpdater(name),
                         LrUpdater(name)]

    def run(self):
        for updater in self.updaters:
            updater.run()
        self.log_processor.run()

if __name__ == '__main__':
    task = ToyRunner('name')
    task.run()
    # Learning rate is 0.001
    # loss is 0.1
    # meta is {'experiment': 'retinanet_r50_caffe_fpn_1x_coco.py', 'repo': 'mmdetection'}

Add custom logs

Users can update scalars in MessageHub anywhere in any module. All data in log_scalars with valid keys are exported to user-defined backends after the statistical methods are applied.

Note

Only the data in log_scalars with keys prefixed with train/, val/ or test/ are exported.

class CustomModule:
    def __init__(self):
        self.message_hub = MessageHub.get_current_instance()

    def custom_method(self):
        self.message_hub.update_scalar('train/a', 100)
        self.message_hub.update_scalars({'train/b': 1, 'train/c': 2})

By default, the latest values of the custom data (a, b and c) are exported. Users can also configure the LogProcessor to switch between statistical methods.

LogProcessor

Users can configure the LogProcessor to specify the statistical methods and extra arguments. By default, learning rates are displayed by their latest value, while losses and iteration times are smoothed over an iteration-based window.

Minimum example

log_processor = dict(
    window_size=10
)

In this configuration, losses and iteration time will be averaged in the latest 10 iterations. The output might be:

04/15 12:34:24 - mmengine - INFO - Iter [10/12]  , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.13

Custom statistical methods

Users can configure the custom_cfg list to specify the statistical method. Each element in custom_cfg must be a dict consisting of the following keys:

  • data_src: Required argument representing the data source of the log. A data source may have multiple statistical methods. Default sources, which are automatically added to logs, include all keys in the loss dict (i.e. loss), the learning rate (lr) and the iteration time (time & data_time). Besides, all scalars updated by MessageHub's update_scalar/update_scalars methods with valid keys are configurable data sources, but be aware that the prefix ('train/', 'val/', 'test/') should be removed.

  • method_name: Required argument representing the statistical method. It supports both built-in methods and custom methods.

  • log_name: Optional argument representing the output name after statistics. If not specified, the new log will overwrite the old one.

  • Other arguments: Extra arguments needed by your specified method. window_size is a special key, which can be an int, 'epoch' or 'global'. LogProcessor will parse these arguments and return statistical results based on iteration/epoch/global smoothing.

  1. Overwrite the old statistical method

log_processor = dict(
    window_size=10,
    by_epoch=True,
    custom_cfg=[
        dict(data_src='loss',
             method_name='mean',
             window_size=100)])

In this configuration, LogProcessor will override the default window size of 10 with a larger window size of 100 and output the mean value to the 'loss' field in the logs.

04/15 12:34:24 - mmengine - INFO - Iter [10/12]  , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.11

  2. New statistical method without overwriting

log_processor = dict(
    window_size=10,
    by_epoch=True,
    custom_cfg=[
        dict(data_src='loss',
             log_name='loss_min',
             method_name='min',
             window_size=100)])
04/15 12:34:24 - mmengine - INFO - Iter [10/12]  , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.11, loss_min: 0.08

MMLogger

In order to export logs with clear hierarchies, unified formats and less disturbance from third-party logging systems, MMEngine implements an MMLogger class based on logging. It is derived from ManagerMixin. Compared with logging.Logger, it enables accessing the logger in the current runner without knowing the logger name.

Instantiate MMLogger

Users can create a global logger by calling get_instance. The default log format is shown below:

logger = MMLogger.get_instance('mmengine', log_level='INFO')
logger.info("this is a test")
# 04/15 14:01:11 - mmengine - INFO - this is a test

Apart from user defined messages, the logger will also export timestamps, logger name and log level. ERROR messages are treated specially with red highlight and extra information like error locations.

logger = MMLogger.get_instance('mmengine', log_level='INFO')
logger.error('division by zero')
# 04/15 14:01:56 - mmengine - ERROR - /mnt/d/PythonCode/DeepLearning/OpenMMLab/mmengine/a.py - <module> - 4 - division by zero

Export logs

When get_instance is invoked with log_file argument, logs will be additionally exported to local storage in text format.

logger = MMLogger.get_instance('mmengine', log_file='tmp.log', log_level='INFO')
logger.info("this is a test")
# 04/15 14:01:11 - mmengine - INFO - this is a test

tmp/tmp.log:

04/15 14:01:11 - mmengine - INFO - this is a test

Since distributed applications create multiple log files, we place the exported log file in a directory with the same name as the file. Logs from different processes are all saved in this directory. Therefore, the actual log file path in the above example is tmp/tmp.log.

Export logs in distributed training

When training with PyTorch's distributed methods, users can set distributed=True in the config file to export multiple logs from all processes. If not specified, only the master process will export a log file.

logger = MMLogger.get_instance('mmengine', log_file='tmp.log', distributed=True, log_level='INFO')

In the case of multiple processes in a single node, or multiple processes in multiple nodes with shared storage, the exported log files have the following hierarchy:

#  shared storage case
./tmp
├── tmp.log
├── tmp_rank1.log
├── tmp_rank2.log
├── tmp_rank3.log
├── tmp_rank4.log
├── tmp_rank5.log
├── tmp_rank6.log
└── tmp_rank7.log
...
└── tmp_rank63.log

In the case of multiple processes in multiple nodes without shared storage, logs are organized as follows:

# without shared storage
# node 0:
work_dir/
└── exp_name_logs
    ├── exp_name.log
    ├── exp_name_rank1.log
    ├── exp_name_rank2.log
    ├── exp_name_rank3.log
    ...
    └── exp_name_rank7.log

# node 7:
work_dir/
└── exp_name_logs
    ├── exp_name_rank56.log
    ├── exp_name_rank57.log
    ├── exp_name_rank58.log
    ...
    └── exp_name_rank63.log

Migrate Runner from MMCV to MMEngine

Introduction

As MMCV supports more and more deep learning tasks, and users’ needs become much more complicated, we have higher requirements for the flexibility and versatility of the existing Runner of MMCV. Therefore, MMEngine implements a more general and flexible Runner based on MMCV to support more complicated training processes.

The Runner in MMEngine expands its scope and takes on more functions. We abstracted the training loop controllers (EpochBasedTrainLoop/IterBasedTrainLoop), the validation loop controller (ValLoop) and the test loop controller (TestLoop) to make it more convenient for users to customize their training processes.

First, we will introduce how to migrate the entry point of training from MMCV to MMEngine to simplify and unify the training script. Then, we will introduce the differences in the instantiation of Runner between MMCV and MMEngine in detail.

Migrate the entry point

Take MMDet as an example; the differences between the training scripts in MMCV and MMEngine are described below.

Migrate the configuration file

Configuration file based on MMCV Runner:
# default_runtime.py
checkpoint_config = dict(interval=1)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])
custom_hooks = [dict(type='NumClassCheckHook')]

dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]


opencv_num_threads = 0
mp_start_method = 'fork'
auto_scale_lr = dict(enable=False, base_batch_size=16)

Configuration file based on MMEngine Runner:

# default_runtime.py
default_scope = 'mmdet'

default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='DetVisualizationHook'))

env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'),
)

vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='DetLocalVisualizer', vis_backends=vis_backends, name='visualizer')
log_processor = dict(type='LogProcessor', window_size=50, by_epoch=True)

log_level = 'INFO'
load_from = None
resume = False

Configuration file based on MMCV Runner:

# scheduler.py
# optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)

Configuration file based on MMEngine Runner:

# scheduler.py
# training schedule for 1x
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=12, val_interval=1)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

# learning rate
param_scheduler = [
    dict(
        type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=12,
        by_epoch=True,
        milestones=[8, 11],
        gamma=0.1)
]

# optimizer
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001))

# Default setting for scaling LR automatically
#   - `enable` means enable scaling LR automatically
#       or not by default.
#   - `base_batch_size` = (8 GPUs) x (2 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=16)

Configuration file based on MMCV Runner:

# coco_detection.py

# dataset settings
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=2,
    workers_per_gpu=2,
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))
evaluation = dict(interval=1, metric='bbox')

Configuration file based on MMEngine Runner:
# coco_detection.py

# dataset settings
dataset_type = 'CocoDataset'
data_root = 'data/coco/'

file_client_args = dict(backend='disk')

train_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackDetInputs')
]
test_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(type='Resize', scale=(1333, 800), keep_ratio=True),
    # If you don't have a gt annotation, delete the pipeline
    dict(type='LoadAnnotations', with_bbox=True),
    dict(
        type='PackDetInputs',
        meta_keys=('img_id', 'img_path', 'ori_shape', 'img_shape',
                   'scale_factor'))
]
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    batch_sampler=dict(type='AspectRatioBatchSampler'),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=train_pipeline))
val_dataloader = dict(
    batch_size=1,
    num_workers=2,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='annotations/instances_val2017.json',
        data_prefix=dict(img='val2017/'),
        test_mode=True,
        pipeline=test_pipeline))
test_dataloader = val_dataloader

val_evaluator = dict(
    type='CocoMetric',
    ann_file=data_root + 'annotations/instances_val2017.json',
    metric='bbox',
    format_only=False)
test_evaluator = val_evaluator

The Runner in MMEngine provides more customizable components, including the training/validation/testing processes and the DataLoader. Therefore, the configuration file is a bit longer compared to MMCV.

MMEngine follows the WYSIWYG (what you see is what you get) principle and reorganizes the hierarchy of each component in the configuration, so that most first-level fields of the configuration correspond to core components of the Runner, such as the DataLoader, Evaluator, and Hooks. The new-format configuration file helps users read and understand the core components of the Runner while ignoring the relatively unimportant parts.

Migrate the training script

Compared with the Runner in MMCV, the Runner in MMEngine takes on more functions, such as building the DataLoader and the distributed model. Therefore, we no longer need to build these components manually; we can configure them during the instantiation of the Runner, and they will be built in the training/validation/testing process. Take the training script of MMDet as an example:

Training script based on MMCV Runner:
# tools/train.py
args = parse_args()

cfg = Config.fromfile(args.config)

# replace the ${key} with the value of cfg.key
cfg = replace_cfg_vals(cfg)

# update data root according to MMDET_DATASETS
update_data_root(cfg)

if args.cfg_options is not None:
    cfg.merge_from_dict(args.cfg_options)

if args.auto_scale_lr:
    if 'auto_scale_lr' in cfg and \
            'enable' in cfg.auto_scale_lr and \
            'base_batch_size' in cfg.auto_scale_lr:
        cfg.auto_scale_lr.enable = True
    else:
        warnings.warn('Can not find "auto_scale_lr" or '
                        '"auto_scale_lr.enable" or '
                        '"auto_scale_lr.base_batch_size" in your'
                        ' configuration file. Please update all the '
                        'configuration files to mmdet >= 2.24.1.')

# set multi-process settings
setup_multi_processes(cfg)

# set cudnn_benchmark
if cfg.get('cudnn_benchmark', False):
    torch.backends.cudnn.benchmark = True

# work_dir is determined in this priority: CLI > segment in file > filename
if args.work_dir is not None:
    # update configs according to CLI args if args.work_dir is not None
    cfg.work_dir = args.work_dir
elif cfg.get('work_dir', None) is None:
    # use config filename as default work_dir if cfg.work_dir is None
    cfg.work_dir = osp.join('./work_dirs',
                            osp.splitext(osp.basename(args.config))[0])

if args.resume_from is not None:
    cfg.resume_from = args.resume_from
cfg.auto_resume = args.auto_resume
if args.gpus is not None:
    cfg.gpu_ids = range(1)
    warnings.warn('`--gpus` is deprecated because we only support '
                    'single GPU mode in non-distributed training. '
                    'Use `gpus=1` now.')
if args.gpu_ids is not None:
    cfg.gpu_ids = args.gpu_ids[0:1]
    warnings.warn('`--gpu-ids` is deprecated, please use `--gpu-id`. '
                    'Because we only support single GPU mode in '
                    'non-distributed training. Use the first GPU '
                    'in `gpu_ids` now.')
if args.gpus is None and args.gpu_ids is None:
    cfg.gpu_ids = [args.gpu_id]

# init distributed env first, since logger depends on the dist info.
if args.launcher == 'none':
    distributed = False
else:
    distributed = True
    init_dist(args.launcher, **cfg.dist_params)
    # re-set gpu_ids with distributed training mode
    _, world_size = get_dist_info()
    cfg.gpu_ids = range(world_size)

# create work_dir
mmcv.mkdir_or_exist(osp.abspath(cfg.work_dir))
# dump config
cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
# init the logger before other steps
timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime())
log_file = osp.join(cfg.work_dir, f'{timestamp}.log')
logger = get_root_logger(log_file=log_file, log_level=cfg.log_level)

# init the meta dict to record some important information such as
# environment info and seed, which will be logged
meta = dict()
# log env info
env_info_dict = collect_env()
env_info = '\n'.join([(f'{k}: {v}') for k, v in env_info_dict.items()])
dash_line = '-' * 60 + '\n'
logger.info('Environment info:\n' + dash_line + env_info + '\n' +
            dash_line)
meta['env_info'] = env_info
meta['config'] = cfg.pretty_text
# log some basic info
logger.info(f'Distributed training: {distributed}')
logger.info(f'Config:\n{cfg.pretty_text}')

cfg.device = get_device()
# set random seeds
seed = init_random_seed(args.seed, device=cfg.device)
seed = seed + dist.get_rank() if args.diff_seed else seed
logger.info(f'Set random seed to {seed}, '
            f'deterministic: {args.deterministic}')
set_random_seed(seed, deterministic=args.deterministic)
cfg.seed = seed
meta['seed'] = seed
meta['exp_name'] = osp.basename(args.config)

model = build_detector(
    cfg.model,
    train_cfg=cfg.get('train_cfg'),
    test_cfg=cfg.get('test_cfg'))
model.init_weights()

datasets = []
train_detector(
    model,
    datasets,
    cfg,
    distributed=distributed,
    validate=(not args.no_validate),
    timestamp=timestamp,
    meta=meta)

Training script based on MMEngine Runner:
# tools/train.py
args = parse_args()

# register all modules in mmdet into the registries
# do not init the default scope here because it will be init in the runner
register_all_modules(init_default_scope=False)

# load config
cfg = Config.fromfile(args.config)
cfg.launcher = args.launcher
if args.cfg_options is not None:
    cfg.merge_from_dict(args.cfg_options)

# work_dir is determined in this priority: CLI > segment in file > filename
if args.work_dir is not None:
    # update configs according to CLI args if args.work_dir is not None
    cfg.work_dir = args.work_dir
elif cfg.get('work_dir', None) is None:
    # use config filename as default work_dir if cfg.work_dir is None
    cfg.work_dir = osp.join('./work_dirs',
                            osp.splitext(osp.basename(args.config))[0])

# enable automatic-mixed-precision training
if args.amp is True:
    optim_wrapper = cfg.optim_wrapper.type
    if optim_wrapper == 'AmpOptimWrapper':
        print_log(
            'AMP training is already enabled in your config.',
            logger='current',
            level=logging.WARNING)
    else:
        assert optim_wrapper == 'OptimWrapper', (
            '`--amp` is only supported when the optimizer wrapper type is '
            f'`OptimWrapper` but got {optim_wrapper}.')
        cfg.optim_wrapper.type = 'AmpOptimWrapper'
        cfg.optim_wrapper.loss_scale = 'dynamic'

# enable automatically scaling LR
if args.auto_scale_lr:
    if 'auto_scale_lr' in cfg and \
            'enable' in cfg.auto_scale_lr and \
            'base_batch_size' in cfg.auto_scale_lr:
        cfg.auto_scale_lr.enable = True
    else:
        raise RuntimeError('Can not find "auto_scale_lr" or '
                            '"auto_scale_lr.enable" or '
                            '"auto_scale_lr.base_batch_size" in your'
                            ' configuration file.')

cfg.resume = args.resume

# build the runner from config
if 'runner_type' not in cfg:
    # build the default runner
    runner = Runner.from_cfg(cfg)
else:
    # build customized runner from the registry
    # if 'runner_type' is set in the cfg
    runner = RUNNERS.build(cfg)

# start training
runner.train()

Training script based on MMCV Runner:
# apis/train.py
def init_random_seed(...):
    ...

def set_random_seed(...):
    ...

# define function tools.
...


def train_detector(model,
                   dataset,
                   cfg,
                   distributed=False,
                   validate=False,
                   timestamp=None,
                   meta=None):

    cfg = compat_cfg(cfg)
    logger = get_root_logger(log_level=cfg.log_level)

    # put model on gpus
    if distributed:
        find_unused_parameters = cfg.get('find_unused_parameters', False)
        # Sets the `find_unused_parameters` parameter in
        # torch.nn.parallel.DistributedDataParallel
        model = build_ddp(
            model,
            cfg.device,
            device_ids=[int(os.environ['LOCAL_RANK'])],
            broadcast_buffers=False,
            find_unused_parameters=find_unused_parameters)
    else:
        model = build_dp(model, cfg.device, device_ids=cfg.gpu_ids)

    # build optimizer
    auto_scale_lr(cfg, distributed, logger)
    optimizer = build_optimizer(model, cfg.optimizer)

    runner = build_runner(
        cfg.runner,
        default_args=dict(
            model=model,
            optimizer=optimizer,
            work_dir=cfg.work_dir,
            logger=logger,
            meta=meta))

    # an ugly workaround to make .log and .log.json filenames the same
    runner.timestamp = timestamp

    # fp16 setting
    fp16_cfg = cfg.get('fp16', None)
    if fp16_cfg is not None:
        optimizer_config = Fp16OptimizerHook(
            **cfg.optimizer_config, **fp16_cfg, distributed=distributed)
    elif distributed and 'type' not in cfg.optimizer_config:
        optimizer_config = OptimizerHook(**cfg.optimizer_config)
    else:
        optimizer_config = cfg.optimizer_config

    # register hooks
    runner.register_training_hooks(
        cfg.lr_config,
        optimizer_config,
        cfg.checkpoint_config,
        cfg.log_config,
        cfg.get('momentum_config', None),
        custom_hooks_config=cfg.get('custom_hooks', None))

    if distributed:
        if isinstance(runner, EpochBasedRunner):
            runner.register_hook(DistSamplerSeedHook())

    # register eval hooks
    if validate:
        val_dataloader_default_args = dict(
            samples_per_gpu=1,
            workers_per_gpu=2,
            dist=distributed,
            shuffle=False,
            persistent_workers=False)

        val_dataloader_args = {
            **val_dataloader_default_args,
            **cfg.data.get('val_dataloader', {})
        }
        # Support batch_size > 1 in validation

        if val_dataloader_args['samples_per_gpu'] > 1:
            # Replace 'ImageToTensor' to 'DefaultFormatBundle'
            cfg.data.val.pipeline = replace_ImageToTensor(
                cfg.data.val.pipeline)
        val_dataset = build_dataset(cfg.data.val, dict(test_mode=True))

        val_dataloader = build_dataloader(val_dataset, **val_dataloader_args)
        eval_cfg = cfg.get('evaluation', {})
        eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'
        eval_hook = DistEvalHook if distributed else EvalHook
        # In this PR (https://github.com/open-mmlab/mmcv/pull/1193), the
        # priority of IterTimerHook has been modified from 'NORMAL' to 'LOW'.
        runner.register_hook(
            eval_hook(val_dataloader, **eval_cfg), priority='LOW')

    resume_from = None
    if cfg.resume_from is None and cfg.get('auto_resume'):
        resume_from = find_latest_checkpoint(cfg.work_dir)
    if resume_from is not None:
        cfg.resume_from = resume_from

    if cfg.resume_from:
        runner.resume(cfg.resume_from)
    elif cfg.load_from:
        runner.load_checkpoint(cfg.load_from)
    runner.run(data_loaders, cfg.workflow)

Training script based on MMEngine Runner:
# `apis/train.py` is removed in `mmengine`

The table above shows the differences between training scripts based on the MMEngine Runner and the MMCV Runner. Repositories in OpenMMLab 1.x each organize their own process to build the Runner, which leads to a large amount of redundant code. MMEngine unifies and standardizes the building process, such as setting the random seed, initializing the distributed environment, building the DataLoader, and building the Optimizer. This helps downstream repositories simplify the preparation of the Runner: they only need to configure its parameters.

For downstream repositories, a training script based on the MMEngine Runner not only simplifies tools/train.py, but also makes apis/train.py unnecessary. Similarly, we can set the random seed and initialize the distributed environment by configuring the parameters of the Runner, without implementing the corresponding code ourselves.

Migrate Runner

This section describes the differences in the training, validation, and testing processes between the MMCV Runner and the MMEngine Runner, as follows.

  1. Prepare logger

  2. Set random seed

  3. Initialize environment variables

  4. Prepare data

  5. Prepare model

  6. Prepare optimizer

  7. Prepare hooks

  8. Prepare testing/validation components

  9. Build runner

  10. Load checkpoint

  11. Training process, Testing process

  12. Custom training process

The following sections describe the differences above in detail.

Prepare logger

Prepare logger in MMCV

In MMCV, we need to call get_logger to get a formatted logger and use it to output and log the training information.

logger = get_logger(name='custom', log_file=log_file, log_level=cfg.log_level)
env_info_dict = collect_env()
env_info = '\n'.join([(f'{k}: {v}') for k, v in env_info_dict.items()])
dash_line = '-' * 60 + '\n'
logger.info('Environment info:\n' + dash_line + env_info + '\n' +
            dash_line)

The instantiation of the Runner also relies on the logger:

runner = Runner(
    ...
    logger=logger
    ...)

Prepare logger in MMEngine

Configure the log_level for Runner, and it will build the logger automatically.

log_level = 'INFO'
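The log level is passed to the Runner, which builds the logger internally. A minimal sketch, following the style of the MMCV example above with the other arguments elided:

runner = Runner(
    ...
    log_level='INFO',
    ...)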

Set random seed

Set random seed in MMCV

Set random seed manually in training script:

...
seed = init_random_seed(args.seed, device=cfg.device)
seed = seed + dist.get_rank() if args.diff_seed else seed
logger.info(f'Set random seed to {seed}, '
            f'deterministic: {args.deterministic}')
set_random_seed(seed, deterministic=args.deterministic)
...

Set random seed in MMEngine

Configure randomness for the Runner; see more information in Runner.set_randomness.

Configuration changes

Configuration of MMCV:
seed = 1
deterministic=False
diff_seed=False

Configuration of MMEngine:
randomness=dict(seed=1,
                deterministic=False,
                diff_rank_seed=False)
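Correspondingly, the randomness config is passed to the Runner at construction time; a minimal sketch with the other arguments elided:

runner = Runner(
    ...
    randomness=dict(seed=1, deterministic=False, diff_rank_seed=False),
    ...)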

Initialize environment variables

Initialize the environment variables in MMCV

MMCV needs to set up the launcher of distributed training, set the environment variables for multi-process communication, initialize the distributed environment, and wrap the model with the distributed wrapper, like this:

...
setup_multi_processes(cfg)
init_dist(cfg.launcher, **cfg.dist_params)
model = MMDistributedDataParallel(
    model,
    device_ids=[int(os.environ['LOCAL_RANK'])],
    broadcast_buffers=False,
    find_unused_parameters=find_unused_parameters)

As for MMEngine, you can set up the launcher by configuring the launcher argument of the Runner, and configure the other items mentioned above in env_cfg. See more information in the table below:

Configuration changes

MMCV configuration:
launcher = 'pytorch'  # enable distributed training
dist_params = dict(backend='nccl')  # choose communication backend

MMEngine configuration:
launcher = 'pytorch'
env_cfg = dict(dist_cfg=dict(backend='nccl'))

In this tutorial, we set env_cfg to:

env_cfg = dict(dist_cfg=dict(backend='nccl'))

Prepare data

Both the MMEngine Runner and the MMCV Runner can accept a built DataLoader:

import torchvision.transforms as transforms
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

train_dataset = CIFAR10(
    root='data', train=True, download=True, transform=transform)
train_dataloader = DataLoader(
    train_dataset, batch_size=128, shuffle=True, num_workers=2)

val_dataset = CIFAR10(
    root='data', train=False, download=True, transform=transform)
val_dataloader = DataLoader(
    val_dataset, batch_size=128, shuffle=False, num_workers=2)

Configuration changes

Configuration of MMCV:
data = dict(
    samples_per_gpu=2,  # batch_size of single gpu
    workers_per_gpu=2,  # num_workers of DataLoader
    train=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_train2017.json',
        img_prefix=data_root + 'train2017/',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        ann_file=data_root + 'annotations/instances_val2017.json',
        img_prefix=data_root + 'val2017/',
        pipeline=test_pipeline))

Configuration of MMEngine:
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    persistent_workers=True,
    # Configurable sampler
    sampler=dict(type='DefaultSampler', shuffle=True),
    # Configurable batch_sampler
    batch_sampler=dict(type='AspectRatioBatchSampler'),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='annotations/instances_train2017.json',
        data_prefix=dict(img='train2017/'),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=train_pipeline))

val_dataloader = dict(
    batch_size=1, # batch_size of validation process
    num_workers=2,
    persistent_workers=True,
    drop_last=False, # whether drop the last batch
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='annotations/instances_val2017.json',
        data_prefix=dict(img='val2017/'),
        test_mode=True,
        pipeline=test_pipeline))

test_dataloader = val_dataloader

Prepare model

See Migrate model from mmcv for more information

import torch.nn as nn
import torch.nn.functional as F
from mmengine.model import BaseModel


class Model(BaseModel):

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, img, label, mode):
        feat = self.pool(F.relu(self.conv1(img)))
        feat = self.pool(F.relu(self.conv2(feat)))
        feat = feat.view(-1, 16 * 5 * 5)
        feat = F.relu(self.fc1(feat))
        feat = F.relu(self.fc2(feat))
        feat = self.fc3(feat)
        if mode == 'loss':
            loss = self.loss_fn(feat, label)
            return dict(loss=loss)
        else:
            return [feat.argmax(1)]

model = Model()

Prepare optimizer

Prepare optimizer in MMCV

The MMCV Runner can accept a built optimizer:

from torch.optim import SGD

optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)

For complicated optimizer configurations, MMCV needs to build optimizers based on optimizer constructors:


optimizer_cfg = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    paramwise_cfg=dict(norm_decay_mult=0))

def build_optimizer_constructor(cfg):
    constructor_type = cfg.get('type')
    if constructor_type in OPTIMIZER_BUILDERS:
        return build_from_cfg(cfg, OPTIMIZER_BUILDERS)
    elif constructor_type in MMCV_OPTIMIZER_BUILDERS:
        return build_from_cfg(cfg, MMCV_OPTIMIZER_BUILDERS)
    else:
        raise KeyError(f'{constructor_type} is not registered '
                       'in the optimizer builder registry.')


def build_optimizer(model, cfg):
    optimizer_cfg = copy.deepcopy(cfg)
    constructor_type = optimizer_cfg.pop('constructor',
                                         'DefaultOptimizerConstructor')
    paramwise_cfg = optimizer_cfg.pop('paramwise_cfg', None)
    optim_constructor = build_optimizer_constructor(
        dict(
            type=constructor_type,
            optimizer_cfg=optimizer_cfg,
            paramwise_cfg=paramwise_cfg))
    optimizer = optim_constructor(model)
    return optimizer

optimizer = build_optimizer(model, optimizer_cfg)

Prepare optimizer in MMEngine

In MMEngine, we need to configure the optim_wrapper for the Runner. For more complicated cases, you can also configure the optim_wrapper in more detail. See more information in the API documents.

Configuration changes

Configuration in MMCV:
optimizer = dict(
    constructor='CustomConstructor',
    type='AdamW',
    lr=0.0001,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_cfg={  # parameters of constructor
        'decay_rate': 0.95,
        'decay_type': 'layer_wise',
        'num_layers': 6
    })

# MMCV needs to configure `optimizer_config` additionally
optimizer_config = dict(grad_clip=None)

Configuration in MMEngine:
optim_wrapper = dict(
    constructor='CustomConstructor',
    type='OptimWrapper',  # Specify the type of OptimWrapper
    optimizer=dict(  # optimizer configuration
        type='AdamW',
        lr=0.0001,
        betas=(0.9, 0.999),
        weight_decay=0.05),
    paramwise_cfg={
        'decay_rate': 0.95,
        'decay_type': 'layer_wise',
        'num_layers': 6
    })

Note

For high-level tasks like detection and classification, MMCV needs to configure optimizer_config additionally to build the OptimizerHook, while this is not necessary for MMEngine.

The optim_wrapper used in this tutorial is as follows:

from torch.optim import SGD

optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)
optim_wrapper = dict(optimizer=optimizer)

Prepare hooks

Prepare hooks in MMCV

The commonly used hooks configuration in MMCV is as follows:

# learning rate scheduler config
lr_config = dict(policy='step', step=[2, 3])
# configuration of optimizer
optimizer_config = dict(grad_clip=None)
# configuration of saving checkpoints periodically
checkpoint_config = dict(interval=1)
# save log periodically and multiple hooks can be used simultaneously
log_config = dict(interval=100, hooks=[dict(type='TextLoggerHook')])
# register hooks to runner and those hooks will be invoked automatically
runner.register_training_hooks(
    lr_config=lr_config,
    optimizer_config=optimizer_config,
    checkpoint_config=checkpoint_config,
    log_config=log_config)

Among them:

  • lr_config is used for LrUpdaterHook

  • optimizer_config is used for OptimizerHook

  • checkpoint_config is used for CheckPointHook

  • log_config is used for LoggerHook

Besides the hooks mentioned above, the MMCV Runner builds an IterTimerHook automatically. The MMCV Runner registers the training hooks after the runner is instantiated, while the MMEngine Runner initializes the hooks during its own instantiation.

Prepare hooks in MMEngine

MMEngine Runner takes some commonly used hooks in MMCV as the default hooks.

Compared with the MMCV example:

  • LrUpdaterHook corresponds to ParamSchedulerHook; find more details in migrate scheduler

  • MMEngine optimizes the model in train_step, so we no longer need OptimizerHook in MMEngine

  • MMEngine takes CheckpointHook as a default hook

  • MMEngine takes LoggerHook as a default hook

Therefore, we can achieve the same effect as the MMCV example as long as we configure the param_scheduler correctly.

We can also register custom hooks in MMEngine runner, find more details in runner tutorial and migrate hook.

Commonly used hooks in MMCV:
# Configure training hooks
# Configure LrUpdaterHook
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,
    step=[8, 11])

# Configure OptimizerHook
optimizer_config = dict(grad_clip=None)

# Configure LoggerHook
log_config = dict(  # LoggerHook
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
    ])

# Configure CheckPointHook
checkpoint_config = dict(interval=1)  # CheckPointHook

Default hooks in MMEngine:
# Configure parameter scheduler
param_scheduler = [
    dict(
        type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    dict(
        type='MultiStepLR',
        begin=0,
        end=12,
        by_epoch=True,
        milestones=[8, 11],
        gamma=0.1)
]

# Configure default hooks
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=1),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    visualization=dict(type='DetVisualizationHook'))

The parameter scheduler used in this tutorial is as follows:

param_scheduler = dict(type='MultiStepLR', milestones=[2, 3], gamma=0.1)

Prepare testing/validation components

MMCV implements the validation process by EvalHook, and we’ll not talk too much about it here. Given that validation is a common process in training, MMEngine abstracts validation as two independent modules: Evaluator and ValLoop. We can customize the metric or the validation process by defining a new loop or a new metric.

import torch
from mmengine.evaluator import BaseMetric
from mmengine.registry import METRICS

@METRICS.register_module(force=True)
class ToyAccuracyMetric(BaseMetric):

    def process(self, label, pred) -> None:
        self.results.append((label[1], pred, len(label[1])))

    def compute_metrics(self, results: list) -> dict:
        num_sample = 0
        acc = 0
        for label, pred, batch_size in results:
            acc += (label == torch.stack(pred)).sum()
            num_sample += batch_size
        return dict(Accuracy=acc / num_sample)

After defining the metric, we should also configure the evaluator and loop for Runner. The example used in this tutorial is as follows:

val_evaluator = dict(type='ToyAccuracyMetric')
val_cfg = dict(type='ValLoop')

Configure validation in MMCV:
eval_cfg = cfg.get('evaluation', {})
eval_cfg['by_epoch'] = cfg.runner['type'] != 'IterBasedRunner'
eval_hook = DistEvalHook if distributed else EvalHook
runner.register_hook(
    eval_hook(val_dataloader, **eval_cfg), priority='LOW')

Configure validation in MMEngine:
val_dataloader = val_dataloader
val_evaluator = dict(type='ToyAccuracyMetric')
val_cfg = dict(type='ValLoop')

Build Runner

Building Runner in MMCV

runner = EpochBasedRunner(
    model=model,
    optimizer=optimizer,
    work_dir=work_dir,
    logger=logger,
    max_epochs=4
)

Building Runner in MMEngine

The runner type (e.g. EpochBasedRunner) and the max_epochs argument in MMCV are moved into train_cfg in MMEngine. The parameters configurable in train_cfg are listed below:

  • by_epoch: True is equivalent to EpochBasedRunner; False is equivalent to IterBasedRunner

  • max_epochs/max_iters: Equivalent to max_epochs and max_iters in MMCV

  • val_interval: Equivalent to the evaluation interval in MMCV

from mmengine.runner import Runner

runner = Runner(
    model=model,  # model to be optimized
    work_dir='./work_dir',  # working directory
    randomness=randomness,  # random seed
    env_cfg=env_cfg,  # environment config
    launcher='none',  # launcher for distributed training
    optim_wrapper=optim_wrapper,  # configure optimizer wrapper
    param_scheduler=param_scheduler,  # configure parameter scheduler
    train_dataloader=train_dataloader,  # configure train dataloader
    train_cfg=dict(by_epoch=True, max_epochs=4, val_interval=1),  # Configure training loop
    val_dataloader=val_dataloader,  # Configure validation dataloader
    val_evaluator=val_evaluator,  # Configure evaluator and metrics
    val_cfg=val_cfg)  # Configure validation loop

Load checkpoint

Loading checkpoint in MMCV

if cfg.resume_from:
    runner.resume(cfg.resume_from)
elif cfg.load_from:
    runner.load_checkpoint(cfg.load_from)

Loading checkpoint in MMEngine

runner = Runner(
    ...
    load_from='/path/to/checkpoint',
    resume=True
)
Configuration of loading checkpoint in MMCV:
load_from = 'path/to/ckpt'

Configuration of loading checkpoint in MMEngine:
load_from = 'path/to/ckpt'
resume = False

Configuration of resuming from a checkpoint in MMCV:
resume_from = 'path/to/ckpt'

Configuration of resuming from a checkpoint in MMEngine:
load_from = 'path/to/ckpt'
resume = True

Training process

Training process in MMCV

Resume or load checkpoint firstly, and then start training.

if cfg.resume_from:
    runner.resume(cfg.resume_from)
elif cfg.load_from:
    runner.load_checkpoint(cfg.load_from)
runner.run(data_loaders, cfg.workflow)

Training process in MMEngine

The processes mentioned above are completed in Runner.__init__ and Runner.train:

runner.train()

Testing process

Since MMCV Runner does not integrate the test function, we need to implement the test scripts by ourselves.

For MMEngine Runner, as long as we have configured the test_dataloader, test_cfg and test_evaluator for the Runner, we can call Runner.test to start the testing process.

When work_dir is the same as the one used for training:

runner = Runner(
    model=model,
    work_dir='./work_dir',
    randomness=randomness,
    env_cfg=env_cfg,
    launcher='none',  # disable distributed training
    optim_wrapper=optim_wrapper,
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    val_dataloader=val_dataloader,
    val_evaluator=val_evaluator,
    val_cfg=val_cfg,
    test_dataloader=val_dataloader,  # assume testing and validation use the same data and evaluator
    test_evaluator=val_evaluator,
    test_cfg=dict(type='TestLoop'),
)
runner.test()

When work_dir is different from the one used for training, load_from needs to be configured manually:

runner = Runner(
    model=model,
    work_dir='./test_work_dir',
    load_from='./work_dir/epoch_5.pth',  # set load_from additionally
    randomness=randomness,
    env_cfg=env_cfg,
    launcher='none',
    optim_wrapper=optim_wrapper,
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=5, val_interval=1),
    val_dataloader=val_dataloader,
    val_evaluator=val_evaluator,
    val_cfg=val_cfg,
    test_dataloader=val_dataloader,
    test_evaluator=val_evaluator,
    test_cfg=dict(type='TestLoop'),
)
runner.test()

Customize training process

If we want to customize a training/validation process, we need to override Runner.val or Runner.train in a custom Runner. Take overriding Runner.train as an example: suppose we need to train with the same batch twice in each iteration; we can override Runner.train like this:

class CustomRunner(EpochBasedRunner):
    def train(self, data_loader, **kwargs):
        self.model.train()
        self.mode = 'train'
        self.data_loader = data_loader
        self._max_iters = self._max_epochs * len(self.data_loader)
        self.call_hook('before_train_epoch')
        time.sleep(2)  # Prevent possible deadlock during epoch transition
        for i, data_batch in enumerate(self.data_loader):
            self.data_batch = data_batch
            self._inner_iter = i
            for _ in range(2):
                self.call_hook('before_train_iter')
                self.run_iter(data_batch, train_mode=True, **kwargs)
                self.call_hook('after_train_iter')
            del self.data_batch
            self._iter += 1

        self.call_hook('after_train_epoch')
        self._epoch += 1

In MMEngine, we need to customize a train loop.

from mmengine.registry import LOOPS
from mmengine.runner import EpochBasedTrainLoop


@LOOPS.register_module()
class CustomEpochBasedTrainLoop(EpochBasedTrainLoop):
    def run_iter(self, idx, data_batch) -> None:
        for _ in range(2):
            super().run_iter(idx, data_batch)

Then, we need to set the type to CustomEpochBasedTrainLoop in train_cfg. Note that by_epoch and type cannot be configured at the same time: once by_epoch is configured, the type of the training loop will be inferred as EpochBasedTrainLoop.

runner = Runner(
    model=model,
    work_dir='./test_work_dir',
    randomness=randomness,
    env_cfg=env_cfg,
    launcher='none',
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_dataloader=train_dataloader,
    train_cfg=dict(
        type='CustomEpochBasedTrainLoop',
        max_epochs=5,
        val_interval=1),
    val_dataloader=val_dataloader,
    val_evaluator=val_evaluator,
    val_cfg=val_cfg,
    test_dataloader=val_dataloader,
    test_evaluator=val_evaluator,
    test_cfg=dict(type='TestLoop'),
)
runner.train()

For more complicated migration needs of Runner, you can refer to the runner tutorials and runner design.

Migrate Hook from MMCV to MMEngine

Coming soon. Please refer to chinese documentation.

Migrate Model from MMCV to MMEngine

Introduction

The early computer vision tasks supported by MMCV, such as detection and classification, used a general process to optimize the model. It can be summarized as the following four steps:

  1. Calculate the loss

  2. Calculate the gradients

  3. Update the model parameters

  4. Clean the gradients of the last iteration

For most high-level tasks, "where" and "when" to perform the above steps is commonly fixed, so it seems reasonable to use hooks to implement them. MMCV implements a series of hooks, such as OptimizerHook, Fp16OptimizerHook and GradientCumulativeFp16OptimizerHook, to provide a variety of optimization strategies.

On the other hand, tasks like GANs (generative adversarial networks) and self-supervised learning require more flexible training processes, which do not fit the pattern mentioned above and can be hard to implement with hooks. To meet the needs of these tasks, MMCV passes the optimizer to train_step so that users can customize the optimization process as they want. Although this works, it cannot utilize the various OptimizerHooks implemented in MMCV, and downstream repositories have to implement mixed-precision training and gradient accumulation on their own.

To unify the training process of various deep learning tasks, MMEngine designed the OptimWrapper, which integrates mixed-precision training, gradient accumulation, and other optimization strategies into a unified interface.
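As a minimal sketch of this unified interface (the toy model and data below are illustrative), a plain optimizer is wrapped by OptimWrapper, and a single update_params call performs backward, step and zero_grad:

import torch
import torch.nn as nn
from torch.optim import SGD
from mmengine.optim import OptimWrapper

model = nn.Linear(1, 1)
optim_wrapper = OptimWrapper(optimizer=SGD(model.parameters(), lr=0.01))

inputs, labels = torch.ones(2, 1), torch.ones(2, 1)
loss = (model(inputs) - labels).pow(2).sum()
optim_wrapper.update_params(loss)  # backward + step + zero_grad in one call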

Migrate optimization process

Since MMEngine designed the OptimWrapper and deprecated the series of OptimizerHooks, there are some differences between the optimization processes in MMCV and MMEngine.

Commonly used optimization process

Considering tasks like detection and classification, the optimization process is usually the same, so BaseModel integrates the process into train_step.

Model based on MMCV

Before describing how to migrate the model, let's look at a minimal example of training a model based on MMCV.

import torch
import torch.nn as nn
from torch.optim import SGD
from torch.utils.data import DataLoader

from mmcv.runner import Runner
from mmcv.utils.logging import get_logger


train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)


class MMCVToyModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, return_loss=False):
        feat = self.linear(img)
        loss1 = (feat - label).pow(2)
        loss2 = (feat - label).abs()
        loss = (loss1 + loss2).sum()
        return dict(loss=loss,
                    num_samples=len(img),
                    log_vars=dict(
                        loss1=loss1.sum().item(),
                        loss2=loss2.sum().item()))

    def train_step(self, data, optimizer=None):
        return self(*data, return_loss=True)

    def val_step(self, data, optimizer=None):
        return self(*data, return_loss=False)


model = MMCVToyModel()
optimizer = SGD(model.parameters(), lr=0.01)
logger = get_logger('demo')

lr_config = dict(policy='step', step=[2, 3])
optimizer_config = dict(grad_clip=None)
log_config = dict(interval=10, hooks=[dict(type='TextLoggerHook')])


runner = Runner(
    model=model,
    work_dir='tmp_dir',
    optimizer=optimizer,
    logger=logger,
    max_epochs=5)

runner.register_training_hooks(
    lr_config=lr_config,
    optimizer_config=optimizer_config,
    log_config=log_config)
runner.run([train_dataloader], [('train', 1)])

A model based on MMCV must implement train_step and return a dict that contains the following keys:

  • loss: Passed to OptimizerHook to calculate gradient.

  • num_samples: Passed to LogBuffer to count the averaged loss

  • log_vars: Passed to LogBuffer; contains the loss and the other variables to be logged

Model based on MMEngine

The same model based on MMEngine

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from mmengine.runner import Runner
from mmengine.model import BaseModel

train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)


class MMEngineToyModel(BaseModel):

    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        # Called by train_step and return the loss dict
        if mode == 'loss':
            loss1 = (feat - label).pow(2)
            loss2 = (feat - label).abs()
            return dict(loss1=loss1, loss2=loss2)
        # Called by val_step and return the predictions
        elif mode == 'predict':
            return [_feat for _feat in feat]
        # tensor model, find more details in tutorials/model.md
        else:
            pass


runner = Runner(
    model=MMEngineToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=5),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)))
runner.train()

In MMEngine, users can customize their model based on BaseModel, which implements the same logic as OptimizerHook in train_step. For high-level tasks, train_step will be called in train loop with specific arguments, and users do not need to care about the optimization process. For low-level tasks, users can override the train_step to customize the optimization process.

Model in MMCV:
class MMCVToyModel(nn.Module):

    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, return_loss=False):
        feat = self.linear(img)
        loss1 = (feat - label).pow(2)
        loss2 = (feat - label).abs()
        loss = (loss1 + loss2).sum()
        return dict(loss=loss,
                    num_samples=len(img),
                    log_vars=dict(
                        loss1=loss1.sum().item(),
                        loss2=loss2.sum().item()))

    def train_step(self, data, optimizer=None):
        return self(*data, return_loss=True)

    def val_step(self, data, optimizer=None):
        return self(*data, return_loss=False)

Model in MMEngine:
class MMEngineToyModel(BaseModel):

    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        if mode == 'loss':
            loss1 = (feat - label).pow(2)
            loss2 = (feat - label).abs()
            return dict(loss1=loss1, loss2=loss2)
        elif mode == 'predict':
            return [_feat for _feat in feat]
        else:
            pass

    # The equivalent code snippet of `train_step`
    # def train_step(self, data, optim_wrapper):
    #     data = self.data_preprocessor(data)
    #     loss_dict = self(*data, mode='loss')
    #     loss_dict['loss1'] = loss_dict['loss1'].sum()
    #     loss_dict['loss2'] = loss_dict['loss2'].sum()
    #     loss = (loss_dict['loss1'] + loss_dict['loss2']).sum()
    #     Call the optimizer wrapper to update parameters.
    #     optim_wrapper.update_params(loss)
    #     return loss_dict

Note

See more information about data_preprocessor and optim_wrapper in docs optim_wrapper and data_preprocessor.

The main differences of model in MMCV and MMEngine can be summarized as follows:

  • MMCVToyModel inherits from nn.Module, and MMEngineToyModel inherits from BaseModel

  • MMCVToyModel must implement train_step method and return a dict with keys loss, log_vars, and num_samples. MMEngineToyModel only needs to implement forward method for high level tasks, and return a dict with differentiable losses.

  • The forward methods of MMCVToyModel and MMEngineToyModel must match the train_step that calls them. Since MMEngineToyModel does not override train_step, BaseModel.train_step is called directly, which requires forward to accept the mode parameter. Find more details in the tutorials of model

Custom optimization process

Take training a GAN model as an example: the generator and the discriminator need to be optimized in turn, and the optimization strategy may change as training progresses. Therefore, it is hard to meet such requirements with OptimizerHook in MMCV. A GAN model based on MMCV accepts an optimizer in train_step and updates the parameters there. MMEngine borrows this approach and simplifies it by passing an optim_wrapper rather than an optimizer.

Taking training a GAN model as an example, the differences between MMCV and MMEngine are as follows:

Training a GAN in MMCV:
    def train_discriminator(self, inputs, optimizer):
        real_imgs = inputs['inputs']
        z = torch.randn(
            (real_imgs.shape[0], self.noise_size)).type_as(real_imgs)
        with torch.no_grad():
            fake_imgs = self.generator(z)

        disc_pred_fake = self.discriminator(fake_imgs)
        disc_pred_real = self.discriminator(real_imgs)

        parsed_losses, log_vars = self.disc_loss(disc_pred_fake,
                                                 disc_pred_real)
        parsed_losses.backward()
        optimizer.step()
        optimizer.zero_grad()
        return log_vars

    def train_generator(self, inputs, optimizer):
        real_imgs = inputs['inputs']
        z = torch.randn(inputs['inputs'].shape[0], self.noise_size).type_as(
            real_imgs)

        fake_imgs = self.generator(z)

        disc_pred_fake = self.discriminator(fake_imgs)
        parsed_loss, log_vars = self.gen_loss(disc_pred_fake)

        parsed_loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return log_vars

Training a GAN in MMEngine:
    def train_discriminator(self, inputs, optimizer_wrapper):
        real_imgs = inputs['inputs']
        z = torch.randn(
            (real_imgs.shape[0], self.noise_size)).type_as(real_imgs)
        with torch.no_grad():
            fake_imgs = self.generator(z)

        disc_pred_fake = self.discriminator(fake_imgs)
        disc_pred_real = self.discriminator(real_imgs)

        parsed_losses, log_vars = self.disc_loss(disc_pred_fake,
                                                 disc_pred_real)
        optimizer_wrapper.update_params(parsed_losses)
        return log_vars



    def train_generator(self, inputs, optimizer_wrapper):
        real_imgs = inputs['inputs']
        z = torch.randn(real_imgs.shape[0], self.noise_size).type_as(real_imgs)

        fake_imgs = self.generator(z)

        disc_pred_fake = self.discriminator(fake_imgs)
        parsed_loss, log_vars = self.gen_loss(disc_pred_fake)

        optimizer_wrapper.update_params(parsed_loss)
        return log_vars

Apart from the differences mentioned in the previous section, the main difference between the optimization processes in MMCV and MMEngine is that the latter can use optim_wrapper in a simpler way. The convenience of optim_wrapper becomes even more obvious when gradient accumulation and mixed-precision training are applied.
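For instance, assuming the config fields below match your MMEngine version, enabling mixed-precision training and gradient accumulation becomes a pure configuration change, with no modification to train_step:

optim_wrapper = dict(
    type='AmpOptimWrapper',      # mixed-precision training
    loss_scale='dynamic',
    accumulative_counts=2,       # accumulate gradients over 2 iterations
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9))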

Migrate validation/testing process

A model based on MMCV usually does not need to provide test_step or val_step for testing/validation. However, MMEngine performs testing/validation with ValLoop and TestLoop, which call runner.model.val_step and runner.model.test_step. Therefore, a model based on MMEngine needs to implement val_step and test_step, whose input data and output predictions should be compatible with the DataLoader and Evaluator.process respectively. You can find more details in the model tutorial. This is why MMEngineToyModel.forward slices feat and returns the predictions as a list.


class MMEngineToyModel(BaseModel):

    ...
    def forward(self, img, label, mode):
        if mode == 'loss':
            ...
        elif mode == 'predict':
            # Slice the data to a list
            return [_feat for _feat in feat]
        else:
            ...

Migrate the distributed training

MMCV wraps the model with the distributed wrapper before building the runner, while MMEngine wraps the model inside the Runner. Therefore, we need to configure the launcher and model_wrapper_cfg for the Runner. Migrate Runner from MMCV to MMEngine introduces this in detail.

  1. Commonly used training process

    For the high-level tasks mentioned in the introduction, the default distributed model wrapper is enough. Therefore, we only need to configure the launcher for the MMEngine Runner.

    Distributed training in MMCV:
    model = MMDistributedDataParallel(
        model,
        device_ids=[int(os.environ['LOCAL_RANK'])],
        broadcast_buffers=False,
        find_unused_parameters=find_unused_parameters)
    ...
    runner = Runner(model=model, ...)

    Distributed training in MMEngine:
    runner = Runner(
        model=model,
        launcher='pytorch', # enable distributed training
        ...,
    )
    

  2. Optimize modules independently with a custom optimization process

    Again, taking training a GAN model as an example, the generator and the discriminator need to be optimized separately. Therefore, the model needs to be wrapped by MMSeparateDistributedDataParallel, which needs to be specified when building the runner:

    cfg = dict(model_wrapper_cfg=dict(type='MMSeparateDistributedDataParallel'))
    runner = Runner(
        model=model,
        ...,  # other configurations
        launcher='pytorch',
        cfg=cfg)
    

  3. Optimize a model with a custom optimization process

Sometimes we need to optimize the whole model with a custom optimization process, in which case we cannot reuse BaseModel.train_step and need to override it. For example, suppose we want to optimize the model twice with the same batch of images: the first time with batch data augmentation on, and the second time with it off:

class CustomModel(BaseModel):

    def train_step(self, data, optim_wrapper):
        data = self.data_preprocessor(data, training=True)  # Enable batch augmentation
        loss = self(data, mode='loss')
        optim_wrapper.update_params(loss)
        data = self.data_preprocessor(data, training=False)  # Disable batch augmentation
        loss = self(data, mode='loss')
        optim_wrapper.update_params(loss)

In this case, we need to customize a model wrapper that overrides the train_step and performs the same process as CustomModel.train_step.

class CustomDistributedDataParallel(MMSeparateDistributedDataParallel):

    def train_step(self, data, optim_wrapper):
        data = self.data_preprocessor(data, training=True)  # Enable batch augmentation
        loss = self(data, mode='loss')
        optim_wrapper.update_params(loss)
        data = self.data_preprocessor(data, training=False)  # Disable batch augmentation
        loss = self(data, mode='loss')
        optim_wrapper.update_params(loss)

Then we can specify it when building Runner:

cfg = dict(model_wrapper_cfg=dict(type='CustomDistributedDataParallel'))
runner = Runner(
    model=model,
    ...,
    launcher='pytorch',
    cfg=cfg
)

Migrate parameter scheduler from MMCV to MMEngine

MMCV 1.x uses LrUpdaterHook and MomentumUpdaterHook to adjust the learning rate and momentum. However, the design of LrUpdaterHook has become difficult to meet the richer customization requirements arising from the evolution of training strategies. Hence, MMEngine proposes parameter schedulers (ParamScheduler).

The interface of the parameter scheduler is consistent with PyTorch's learning rate scheduler (LRScheduler). In addition, the parameter scheduler provides stronger functionality. For details, please refer to the Parameter Scheduler User Guide.
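As a sketch of this consistency (assuming MultiStepLR is imported from mmengine.optim), a parameter scheduler can be constructed and stepped just like a PyTorch LRScheduler:

import torch.nn as nn
from torch.optim import SGD
from mmengine.optim import MultiStepLR

model = nn.Linear(1, 1)
optimizer = SGD(model.parameters(), lr=0.1)
scheduler = MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1, by_epoch=True)

for epoch in range(12):
    ...  # train one epoch
    scheduler.step()  # the same calling convention as PyTorch's LRScheduler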

Learning rate scheduler (LrUpdater) migration

MMEngine uses LRScheduler instead of LrUpdaterHook. The field in the config file is changed from the original lr_config to param_scheduler. The learning rate config in MMCV corresponds to the parameter scheduler config in MMEngine as follows:

Learning rate warm-up migration

The learning rate warm-up can be achieved through the combination of schedulers by specifying the effective range begin and end. There are 3 learning rate warm-up methods in MMCV, namely 'constant', 'linear', 'exp'. The corresponding config in MMEngine should be modified as follows:

Constant warm-up
MMCV-1.x:
lr_config = dict(
    warmup='constant',
    warmup_ratio=0.1,
    warmup_iters=500,
    warmup_by_epoch=False
)

MMEngine:
param_scheduler = [
    dict(type='ConstantLR',
         factor=0.1,
         begin=0,
         end=500,
         by_epoch=False),
    dict(...) # the main learning rate scheduler
]
Linear warm-up
MMCV-1.x:
lr_config = dict(
    warmup='linear',
    warmup_ratio=0.1,
    warmup_iters=500,
    warmup_by_epoch=False
)

MMEngine:
param_scheduler = [
    dict(type='LinearLR',
         start_factor=0.1,
         begin=0,
         end=500,
         by_epoch=False),
    dict(...) # the main learning rate scheduler
]
Exponential warm-up
MMCV-1.x:
lr_config = dict(
    warmup='exp',
    warmup_ratio=0.1,
    warmup_iters=500,
    warmup_by_epoch=False
)

MMEngine:
param_scheduler = [
    dict(type='ExponentialLR',
         gamma=0.1,
         begin=0,
         end=500,
         by_epoch=False),
    dict(...) # the main learning rate scheduler
]

Fixed learning rate (FixedLrUpdaterHook) migration

MMCV-1.x:
lr_config = dict(policy='fixed')

MMEngine:
param_scheduler = [
    dict(type='ConstantLR', factor=1)
]

Step learning rate (StepLrUpdaterHook) migration

MMCV-1.x:
lr_config = dict(
    policy='step',
    step=[8, 11],
    gamma=0.1,
    by_epoch=True
)

MMEngine:
param_scheduler = [
    dict(type='MultiStepLR',
         milestones=[8, 11],
         gamma=0.1,
         by_epoch=True)
]

Poly learning rate (PolyLrUpdaterHook) migration

MMCV-1.x:
lr_config = dict(
    policy='poly',
    power=0.7,
    min_lr=0.001,
    by_epoch=True
)

MMEngine:
param_scheduler = [
    dict(type='PolyLR',
         power=0.7,
         eta_min=0.001,
         begin=0,
         end=num_epochs,
         by_epoch=True)
]

Exponential learning rate (ExpLrUpdaterHook) migration

MMCV-1.x:
lr_config = dict(
    policy='exp',
    gamma=0.5,
    by_epoch=True
)

MMEngine:
param_scheduler = [
    dict(type='ExponentialLR',
         gamma=0.5,
         begin=0,
         end=num_epochs,
         by_epoch=True)
]

Cosine annealing learning rate (CosineAnnealingLrUpdaterHook) migration

MMCV-1.x:
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0.5,
    by_epoch=True
)

MMEngine:
param_scheduler = [
    dict(type='CosineAnnealingLR',
         eta_min=0.5,
         T_max=num_epochs,
         begin=0,
         end=num_epochs,
         by_epoch=True)
]

FlatCosineAnnealingLrUpdaterHook migration

A learning rate strategy composed of multiple phases, like FlatCosineAnnealing, originally had to be implemented by rewriting a hook. In MMEngine, it can be achieved by combining two parameter scheduler configs:

MMCV-1.x:
lr_config = dict(
    policy='FlatCosineAnnealing',
    start_percent=0.5,
    min_lr=0.005,
    by_epoch=True
)

MMEngine:
param_scheduler = [
    dict(type='ConstantLR', factor=1, begin=0, end=num_epochs * 0.75),
    dict(type='CosineAnnealingLR',
         eta_min=0.005,
         begin=num_epochs * 0.75,
         end=num_epochs,
         T_max=num_epochs * 0.25,
         by_epoch=True)
]

CosineRestartLrUpdaterHook migration

MMCV-1.x:
lr_config = dict(policy='CosineRestart',
                 periods=[5, 10, 15],
                 restart_weights=[1, 0.7, 0.3],
                 min_lr=0.001,
                 by_epoch=True)

MMEngine:
param_scheduler = [
    dict(type='CosineRestartLR',
         periods=[5, 10, 15],
         restart_weights=[1, 0.7, 0.3],
         eta_min=0.001,
         by_epoch=True)
]

OneCycleLrUpdaterHook migration

MMCV-1.x:
lr_config = dict(policy='OneCycle',
                 max_lr=0.02,
                 total_steps=90000,
                 pct_start=0.3,
                 anneal_strategy='cos',
                 div_factor=25,
                 final_div_factor=1e4,
                 three_phase=True,
                 by_epoch=False)

MMEngine:
param_scheduler = [
    dict(type='OneCycleLR',
         eta_max=0.02,
         total_steps=90000,
         pct_start=0.3,
         anneal_strategy='cos',
         div_factor=25,
         final_div_factor=1e4,
         three_phase=True,
         by_epoch=False)
]

Notice: by_epoch defaults to False in MMCV. It now defaults to True in MMEngine.

LinearAnnealingLrUpdaterHook migration

MMCV-1.x:
lr_config = dict(
    policy='LinearAnnealing',
    min_lr_ratio=0.01,
    by_epoch=True
)

MMEngine:
param_scheduler = [
    dict(type='LinearLR',
         start_factor=1,
         end_factor=0.01,
         begin=0,
         end=num_epochs,
         by_epoch=True)
]

MomentumUpdater migration

MMCV uses the momentum_config field and MomentumUpdaterHook to adjust momentum. In MMEngine, momentum is also controlled by a parameter scheduler. Users can simply replace LR in the type of a learning rate scheduler with Momentum to apply the same strategy to momentum. The momentum scheduler shares the same param_scheduler field in the config with the learning rate scheduler:

MMCV-1.x:
lr_config = dict(...)
momentum_config = dict(
    policy='CosineAnnealing',
    min_momentum=0.1,
    by_epoch=True
)

MMEngine:
param_scheduler = [
    # config of learning rate schedulers
    dict(...),
    # config of momentum schedulers
    dict(type='CosineAnnealingMomentum',
         eta_min=0.1,
         T_max=num_epochs,
         begin=0,
         end=num_epochs,
         by_epoch=True)
]

Migrate Data Transform to OpenMMLab 2.0

Introduction

According to the data transform interface convention of TorchVision, all data transform classes need to implement the __call__ method. And in the convention of OpenMMLab 1.0, we require that both the input and the output of the __call__ method be a dictionary.

In OpenMMLab 2.0, to make data transform classes more extensible, we use the transform method instead of the __call__ method to implement data transformation, and all data transform classes should inherit from the mmcv.transforms.BaseTransform class. You can still use these data transform classes by calling them.

A tutorial to implement a data transform class can be found in the Data Transform.
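To illustrate the new convention, here is a minimal custom transform; the class name AddFlag and the added key are hypothetical, not part of MMCV:

from mmcv.transforms import TRANSFORMS, BaseTransform


@TRANSFORMS.register_module()
class AddFlag(BaseTransform):
    """Add a constant flag to the results dict (illustration only)."""

    def transform(self, results: dict) -> dict:
        results['my_flag'] = True  # hypothetical key, for demonstration
        return results


# Instances are still callable, as in OpenMMLab 1.0
results = AddFlag()(dict(img_path='demo.jpg'))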

In addition, some common data transform classes have been moved from the individual repositories to MMCV. This document compares the functionality, usage, and implementation of the original data transform classes (in MMClassification v0.23.2 and MMDetection v2.25.1) with the new ones (in MMCV v2.0.0rc1).

Functionality Differences

LoadImageFromFile

  • MMClassification (original): Join the 'img_prefix' and 'img_info.filename' fields to find the image path and load the image.

  • MMDetection (original): Join the 'img_prefix' and 'img_info.filename' fields to find the image path and load the image. Support specifying the channel order.

  • MMCV (new): Load the image from 'img_path'. Support ignoring failed loading and specifying the decode backend.

LoadAnnotations

  • MMClassification (original): Not available.

  • MMDetection (original): Load bbox, label, mask (including polygon masks) and semantic segmentation. Support converting the bbox coordinate system.

  • MMCV (new): Load bbox, label, mask (excluding polygon masks) and semantic segmentation.

Pad

  • MMClassification (original): Pad all images in the "img_fields" field.

  • MMDetection (original): Pad all images in the "img_fields" field. Support padding to a size that is an integer multiple of a given number.

  • MMCV (new): Pad the image in the "img" field. Support padding to a size that is an integer multiple of a given number.

CenterCrop

  • MMClassification (original): Crop all images in the "img_fields" field. Support EfficientNet-style cropping.

  • MMDetection (original): Not available.

  • MMCV (new): Crop the image in the "img" field, the bboxes in the "gt_bboxes" field, the semantic segmentation in the "gt_seg_map" field and the keypoints in the "gt_keypoints" field. Support padding the margin of the cropped image.

Normalize

  • MMClassification (original): Normalize the image.

  • MMDetection (original): No differences.

  • MMCV (new): No differences, but we recommend using the data preprocessor to normalize the image.

Resize

  • MMClassification (original): Resize all images in the "img_fields" field. Support resizing proportionally according to the specified edge.

  • MMDetection (original): Use Resize with ratio_range=None, a single scale in img_scale, and multiscale_mode="value".

  • MMCV (new): Resize the image in the "img" field, the bboxes in the "gt_bboxes" field, the semantic segmentation in the "gt_seg_map" field and the keypoints in the "gt_keypoints" field. Support specifying the ratio of the new scale to the original scale and support resizing proportionally.

RandomResize

  • MMClassification (original): Not available.

  • MMDetection (original): Use Resize with ratio_range=None and two scales in img_scale with multiscale_mode="range", or with ratio_range not None:

    Resize(
        img_scale=[(640, 480), (960, 720)],
        multiscale_mode="range",
    )

  • MMCV (new): Have the same resize functionality as Resize. Support sampling the scale from a scale range or a scale ratio range:

    RandomResize(scale=[(640, 480), (960, 720)])

RandomChoiceResize

  • MMClassification (original): Not available.

  • MMDetection (original): Use Resize with ratio_range=None, multiple scales in img_scale, and multiscale_mode="value":

    Resize(
        img_scale=[(640, 480), (960, 720)],
        multiscale_mode="value",
    )

  • MMCV (new): Have the same resize functionality as Resize. Support randomly choosing the scale from multiple scales or multiple scale ratios:

    RandomChoiceResize(scales=[(640, 480), (960, 720)])

RandomGrayscale

  • MMClassification (original): Randomly grayscale all images in the "img_fields" field. Support keeping the number of channels after grayscaling.

  • MMDetection (original): Not available.

  • MMCV (new): Randomly grayscale the image in the "img" field. Support specifying the weight of each channel and keeping the number of channels after grayscaling.

RandomFlip

  • MMClassification (original): Randomly flip all images in the "img_fields" field. Support horizontal and vertical flipping.

  • MMDetection (original): Randomly flip all values in the "img_fields", "bbox_fields", "mask_fields" and "seg_fields" fields. Support horizontal, vertical and diagonal flipping, and support specifying the probability of each kind of flip.

  • MMCV (new): Randomly flip the values in the "img", "gt_bboxes", "gt_seg_map" and "gt_keypoints" fields. Support horizontal, vertical and diagonal flipping, and support specifying the probability of each kind of flip.

MultiScaleFlipAug

  • MMClassification (original): Not available.

  • MMDetection (original): Used for test-time augmentation.

  • MMCV (new): Use TestTimeAug instead.

ToTensor

  • MMClassification (original): Convert the values in the specified fields to torch.Tensor.

  • MMDetection (original): No differences.

  • MMCV (new): No differences.

ImageToTensor

  • MMClassification (original): Convert the values in the specified fields to torch.Tensor and transpose the channels to CHW.

  • MMDetection (original): No differences.

  • MMCV (new): No differences.

Implementation Differences

Take RandomFlip as an example: the new RandomFlip in MMCV inherits from BaseTransform and moves the functionality from __call__ to the transform method. In addition, the randomness-related code is factored out into separate methods, which must be wrapped with the cache_randomness decorator.

  • MMDetection (original version)

class RandomFlip:
    def __call__(self, results):
        """Randomly flip images."""
        ...
        # Randomly choose the flip direction
        cur_dir = np.random.choice(direction_list, p=flip_ratio_list)
        ...
        return results
  • MMCV (new version)

class RandomFlip(BaseTransform):
    def transform(self, results):
        """Randomly flip images"""
        ...
        cur_dir = self._random_direction()
        ...
        return results

    @cache_randomness
    def _random_direction(self):
        """Randomly choose the flip direction"""
        ...
        return np.random.choice(direction_list, p=flip_ratio_list)

mmengine.registry

Registry

A registry to map strings to classes or functions.

DefaultScope

Scope of current task used to reset the current registry, which can be accessed globally.

build_from_cfg

Build a module from config dict when it is a class configuration, or call a function from config dict when it is a function configuration.
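For illustration, a minimal sketch of how Registry and build_from_cfg work together (the registry and class names below are illustrative, not part of MMEngine):

from mmengine.registry import Registry, build_from_cfg

TOY_MODELS = Registry('toy_model')

@TOY_MODELS.register_module()
class ToyModel:
    def __init__(self, depth):
        self.depth = depth

# 'type' selects the registered class; the remaining keys become kwargs.
model = build_from_cfg(dict(type='ToyModel', depth=50), TOY_MODELS)
# Equivalently: TOY_MODELS.build(dict(type='ToyModel', depth=50))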

build_model_from_cfg

Build a PyTorch model from config dict(s).

build_runner_from_cfg

Build a Runner object.

build_scheduler_from_cfg

Builds a ParamScheduler instance from config.

count_registered_modules

Scan all modules in MMEngine’s root and child registries and dump to json.

traverse_registry_tree

Traverse the whole registry tree from any given node, and collect information of all registered modules in this registry tree.

init_default_scope

Initialize the given default scope.

mmengine.config

Config

A facility for config and config files.
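A minimal sketch of loading and overriding a config (the file path is hypothetical):

from mmengine.config import Config

cfg = Config.fromfile('configs/demo_config.py')
print(cfg.model)                              # attribute-style access
cfg.merge_from_dict({'optimizer.lr': 0.01})   # dotted keys override nested fields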

ConfigDict

A dictionary for config which has the same interface as Python's built-in dictionary and can be used as a normal dictionary.

DictAction

argparse action to split an argument into KEY=VALUE form on the first = and append to a dictionary.
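A sketch of DictAction in an argument parser; each KEY=VALUE pair is parsed and collected into a dict:

import argparse
from mmengine.config import DictAction

parser = argparse.ArgumentParser()
parser.add_argument('--cfg-options', nargs='+', action=DictAction)
args = parser.parse_args(['--cfg-options', 'optimizer.lr=0.01'])
# args.cfg_options == {'optimizer.lr': 0.01}; Config.merge_from_dict
# later interprets the dotted key as a nested override.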

mmengine.runner

Runner

Runner

A training helper for PyTorch.

Loop

BaseLoop

Base loop class.

EpochBasedTrainLoop

Loop for epoch-based training.

IterBasedTrainLoop

Loop for iter-based training.

ValLoop

Loop for validation.

TestLoop

Loop for test.

Checkpoints

CheckpointLoader

A general checkpoint loader to manage all schemes.

find_latest_checkpoint

Find the latest checkpoint from the given path.

get_deprecated_model_names

get_external_models

get_mmcls_models

get_state_dict

Returns a dictionary containing a whole state of the module.

get_torchvision_models

load_checkpoint

Load checkpoint from a file or URI.

load_state_dict

Load state_dict to a module.

save_checkpoint

Save checkpoint to file.
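For illustration, a minimal sketch of saving and restoring weights with these helpers (the model and path are hypothetical):

from mmengine.runner import load_checkpoint, save_checkpoint

save_checkpoint(model.state_dict(), 'work_dir/epoch_1.pth')
load_checkpoint(model, 'work_dir/epoch_1.pth', map_location='cpu')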

weights_to_cpu

Copy a model state_dict to cpu.

AMP

autocast

A wrapper of torch.autocast and torch.cuda.amp.autocast.
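A sketch of the context manager, which dispatches to the appropriate torch autocast implementation for the current device (model and inputs are hypothetical):

from mmengine.runner import autocast

with autocast(enabled=True):
    outputs = model(inputs)  # forward runs in mixed precision where supported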

Miscellaneous

LogProcessor

A log processor used to format log information collected from runner.message_hub.log_scalars.

Priority

Hook priority levels.

get_priority

Get priority value.

mmengine.hooks

Hook

Base hook class.

CheckpointHook

Save checkpoints periodically.

EMAHook

A hook to apply Exponential Moving Average (EMA) on the model during training.
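Hooks are typically enabled through the runner config rather than instantiated directly; a sketch (the interval and momentum values are illustrative):

default_hooks = dict(
    checkpoint=dict(type='CheckpointHook', interval=1, save_best='auto'),
)
custom_hooks = [
    dict(type='EMAHook', momentum=0.0002),
]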

LoggerHook

Collect logs from different components of Runner and write them to the terminal, JSON file, TensorBoard, wandb, etc.

NaiveVisualizationHook

Show or write the predicted results during the testing process.

ParamSchedulerHook

A hook to update some hyper-parameters in optimizer, e.g., learning rate and momentum.

RuntimeInfoHook

A hook that updates runtime information into message hub.

DistSamplerSeedHook

Data-loading sampler for distributed training.

IterTimerHook

A hook that logs the time spent during iteration.

SyncBuffersHook

Synchronize model buffers such as running_mean and running_var in BN at the end of each epoch.

EmptyCacheHook

Releases all unoccupied cached GPU memory during the process of training.

ProfilerHook

A hook to analyze performance during training and inference.

PrepareTTAHook

Wraps runner.model with subclass of BaseTTAModel in before_test.

mmengine.model

Module

BaseModule

Base module for all modules in OpenMMLab.

ModuleDict

ModuleDict in OpenMMLab.

ModuleList

ModuleList in OpenMMLab.

Sequential

Sequential module in OpenMMLab.

Model

BaseModel

Base class for all algorithmic models.

BaseDataPreprocessor

Base data pre-processor used for copying data to the target device.

ImgDataPreprocessor

Image pre-processor for normalization and bgr to rgb conversion.

BaseTTAModel

Base model for inference with test-time augmentation.

EMA

BaseAveragedModel

A base class for averaging model weights.

ExponentialMovingAverage

Implements the exponential moving average (EMA) of the model.

MomentumAnnealingEMA

Exponential moving average (EMA) with momentum annealing strategy.

StochasticWeightAverage

Implements the stochastic weight averaging (SWA) of the model.

Model Wrapper

MMDistributedDataParallel

A distributed model wrapper used for training, testing and validation in loops.

MMSeparateDistributedDataParallel

A DistributedDataParallel wrapper for models in MMGeneration.

MMFullyShardedDataParallel

A wrapper for sharding Module parameters across data parallel workers.

is_model_wrapper

Check if a module is a model wrapper.

Weight Initialization

BaseInit

Caffe2XavierInit

ConstantInit

Initialize module parameters with constant values.

KaimingInit

Initialize module parameters according to the method described in "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification" (He, K. et al.).

NormalInit

Initialize module parameters with the values drawn from the normal distribution \(\mathcal{N}(\text{mean}, \text{std}^2)\).

PretrainedInit

Initialize module by loading a pretrained model.

TruncNormalInit

Initialize module parameters with values drawn from the normal distribution \(\mathcal{N}(\text{mean}, \text{std}^2)\), with values outside \([a, b]\) redrawn until they fall within the bounds.

UniformInit

Initialize module parameters with values drawn from the uniform distribution \(\mathcal{U}(a, b)\).

XavierInit

Initialize module parameters according to the method described in "Understanding the difficulty of training deep feedforward neural networks" (Glorot, X. & Bengio, Y.).

bias_init_with_prob

Initialize conv/fc bias values according to a given probability value.

caffe2_xavier_init

constant_init

initialize

Initialize a module.
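A minimal sketch of the initialize helper driven by an init_cfg dict:

import torch.nn as nn
from mmengine.model import initialize

conv = nn.Conv2d(3, 8, kernel_size=3)
# Apply Kaiming initialization to all Conv2d layers of the module.
initialize(conv, init_cfg=dict(type='Kaiming', layer='Conv2d'))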

kaiming_init

normal_init

trunc_normal_init

uniform_init

update_init_info

Update the _params_init_info in the module if the values of its parameters have changed.

xavier_init

Utils

detect_anomalous_params

merge_dict

Merge all dictionaries into one dictionary.

stack_batch

Stack multiple tensors to form a batch, padding each tensor to the max shape using right-bottom padding.

revert_sync_batchnorm

Helper function to convert all SyncBatchNorm (SyncBN) and mmcv.ops.sync_bn.SyncBatchNorm (MMSyncBN) layers in the model to BatchNormXd layers.

convert_sync_batchnorm

Helper function to convert all BatchNorm layers in the model to SyncBatchNorm (SyncBN) or mmcv.ops.sync_bn.SyncBatchNorm (MMSyncBN) layers. Adapted from https://pytorch.org/docs/stable/generated/torch.nn.SyncBatchNorm.html#torch.nn.SyncBatchNorm.convert_sync_batchnorm.

mmengine.optim

Optimizer

AmpOptimWrapper

A subclass of OptimWrapper that supports automatic mixed precision training based on torch.cuda.amp.

OptimWrapper

Optimizer wrapper provides a common interface for updating parameters.
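For illustration, a minimal sketch of the common update interface (the model and inputs are hypothetical):

from torch.optim import SGD
from mmengine.optim import OptimWrapper

optim_wrapper = OptimWrapper(optimizer=SGD(model.parameters(), lr=0.01))
loss = model(inputs)                # assume the forward pass returns a loss
optim_wrapper.update_params(loss)   # backward + step + zero_grad in one call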

OptimWrapperDict

A dictionary container of OptimWrapper.

DefaultOptimWrapperConstructor

Default constructor for optimizers.

build_optim_wrapper

Build function of OptimWrapper.

Scheduler

_ParamScheduler

Base class for parameter schedulers.
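Schedulers are usually specified as config dicts in the runner's param_scheduler field; a sketch combining linear warmup with cosine annealing (all numbers are illustrative):

param_scheduler = [
    dict(type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=500),
    dict(type='CosineAnnealingLR', T_max=95, by_epoch=True, begin=5, end=100),
]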

ConstantLR

Decays the learning rate value of each parameter group by a small constant factor until the number of epochs reaches a pre-defined milestone: end.

ConstantMomentum

Decays the momentum value of each parameter group by a small constant factor until the number of epochs reaches a pre-defined milestone: end.

ConstantParamScheduler

Decays the parameter value of each parameter group by a small constant factor until the number of epochs reaches a pre-defined milestone: end.

CosineAnnealingLR

Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial value and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

CosineAnnealingMomentum

Set the momentum of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial value and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

CosineAnnealingParamScheduler

Set the parameter value of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial value and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

ExponentialLR

Decays the learning rate of each parameter group by gamma every epoch.

ExponentialMomentum

Decays the momentum of each parameter group by gamma every epoch.

ExponentialParamScheduler

Decays the parameter value of each parameter group by gamma every epoch.

LinearLR

Decays the learning rate of each parameter group by a linearly changing small multiplicative factor until the number of epochs reaches a pre-defined milestone: end.

LinearMomentum

Decays the momentum of each parameter group by a linearly changing small multiplicative factor until the number of epochs reaches a pre-defined milestone: end.

LinearParamScheduler

Decays the parameter value of each parameter group by a linearly changing small multiplicative factor until the number of epochs reaches a pre-defined milestone: end.

MultiStepLR

Decays the specified learning rate in each parameter group by gamma once the number of epochs reaches one of the milestones.

MultiStepMomentum

Decays the specified momentum in each parameter group by gamma once the number of epochs reaches one of the milestones.

MultiStepParamScheduler

Decays the specified parameter in each parameter group by gamma once the number of epochs reaches one of the milestones.

OneCycleLR

Sets the learning rate of each parameter group according to the 1cycle learning rate policy.

OneCycleParamScheduler

Sets the parameters of each parameter group according to the 1cycle learning rate policy.

PolyLR

Decays the learning rate of each parameter group in a polynomial decay scheme.

PolyMomentum

Decays the momentum of each parameter group in a polynomial decay scheme.

PolyParamScheduler

Decays the parameter value of each parameter group in a polynomial decay scheme.

StepLR

Decays the learning rate of each parameter group by gamma every step_size epochs.

StepMomentum

Decays the momentum of each parameter group by gamma every step_size epochs.

StepParamScheduler

Decays the parameter value of each parameter group by gamma every step_size epochs.

ReduceOnPlateauLR

Reduce the learning rate of each parameter group when a metric has stopped improving.

ReduceOnPlateauMomentum

Reduce the momentum of each parameter group when a metric has stopped improving.

ReduceOnPlateauParamScheduler

Reduce the parameters of each parameter group when a metric has stopped improving.

mmengine.evaluator

Evaluator

Evaluator

Wrapper class to compose multiple BaseMetric instances.

Metric

BaseMetric

Base class for a metric.
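A minimal sketch of a custom metric: process() accumulates per-batch results into self.results, and compute_metrics() reduces them at the end of evaluation (the class and result keys are illustrative):

from mmengine.evaluator import BaseMetric

class ToyAccuracy(BaseMetric):
    def process(self, data_batch, data_samples):
        score, gt = data_samples
        # Collect per-batch statistics; they are gathered across ranks later.
        self.results.append(dict(correct=(score.argmax(dim=1) == gt).sum().item(),
                                 total=len(gt)))

    def compute_metrics(self, results):
        total = sum(r['total'] for r in results)
        return dict(accuracy=100 * sum(r['correct'] for r in results) / total)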

DumpResults

Dump model predictions to a pickle file for offline evaluation.

Utils

get_metric_value

Get the metric value specified by an indicator, which can be either a metric name or a full name with evaluator prefix.

mmengine.structures

BaseDataElement

A base data interface that supports Tensor-like and dict-like operations.

InstanceData

Data structure for instance-level annotations or predictions.
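A sketch of the dict-like and tensor-like behavior shared by BaseDataElement subclasses (all values are illustrative):

import torch
from mmengine.structures import InstanceData

data = InstanceData(metainfo=dict(img_shape=(800, 1333)))
data.bboxes = torch.rand(4, 4)
data.scores = torch.rand(4)
len(data)                       # 4 instances
kept = data[data.scores > 0.5]  # boolean indexing keeps all fields aligned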

LabelData

Data structure for label-level annotations or predictions.

PixelData

Data structure for pixel-level annotations or predictions.

mmengine.dataset

Dataset

BaseDataset

BaseDataset for open source projects in OpenMMLab.

Compose

Compose multiple transforms sequentially.

Dataset Wrapper

ClassBalancedDataset

A wrapper of class balanced dataset.

ConcatDataset

A wrapper of concatenated dataset.

RepeatDataset

A wrapper of repeated dataset.

Sampler

DefaultSampler

The default data sampler for both distributed and non-distributed environment.

InfiniteSampler

It is designed for iteration-based runners and yields a mini-batch of indices each time.

Utils

default_collate

Convert a list of data sampled from the dataset into a batch whose element types are consistent with the corresponding items in data_batch. Unlike pseudo_collate, tensors are stacked, following the behavior of PyTorch's default_collate.

pseudo_collate

Convert a list of data sampled from the dataset into a batch whose element types are consistent with the corresponding items in data_batch. Unlike default_collate, tensors are not stacked; samples are simply collected into lists.

worker_init_fn

This function will be called on each worker subprocess after seeding and before data loading.

mmengine.infer

BaseInferencer

Base inferencer for downstream tasks.

mmengine.device

get_device

Returns the currently existing device type.
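A sketch of device-agnostic code built on these helpers (the model is hypothetical):

from mmengine.device import get_device

device = get_device()     # e.g. 'cuda', 'npu', 'mlu', 'mps' or 'cpu'
model = model.to(device)  # move the (hypothetical) model accordingly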

get_max_cuda_memory

Returns the maximum GPU memory occupied by tensors in megabytes (MB) for a given device.

is_cuda_available

Returns True if cuda devices exist.

is_npu_available

Returns True if Ascend PyTorch and npu devices exist.

is_mlu_available

Returns True if Cambricon PyTorch and mlu devices exist.

is_mps_available

Return True if mps devices exist.

mmengine.hub

get_config

Get config from external package.

get_model

Get built model from external package.
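A sketch of fetching a model from a downstream repository; this requires the corresponding package (e.g. mmdet) to be installed, and the config path below is illustrative:

from mmengine.hub import get_model

model = get_model('mmdet::faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py',
                  pretrained=False)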

mmengine.logging

MMLogger

Formatted logger used to record messages.

MessageHub

Message hub for component interaction.

HistoryBuffer

Unified storage format for different log types.

print_log

Print a log message.
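A sketch of routing a message through the current MMLogger:

import logging
from mmengine.logging import print_log

print_log('Training started.', logger='current', level=logging.INFO)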

mmengine.visualization

Visualizer

Visualizer

MMEngine provides a Visualizer class that uses the Matplotlib library as the backend.
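A minimal sketch of drawing on an image (the image array and box coordinates are synthetic):

import numpy as np
from mmengine.visualization import Visualizer

vis = Visualizer(image=np.zeros((224, 224, 3), dtype=np.uint8))
vis.draw_bboxes(np.array([[10., 10., 100., 120.]]))
drawn = vis.get_image()  # the rendered result as an ndarray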

visualization Backend

BaseVisBackend

Base class for visualization backend.

LocalVisBackend

Local visualization backend class.

TensorboardVisBackend

Tensorboard visualization backend class.

WandbVisBackend

Wandb visualization backend class.

mmengine.fileio

File Backend

BaseStorageBackend

Abstract class of storage backends.

FileClient

A general file client to access files in different backends.

HardDiskBackend

Raw hard disk storage backend.

LocalBackend

Raw local storage backend.

HTTPBackend

HTTP and HTTPS storage backend.

LmdbBackend

Lmdb storage backend.

MemcachedBackend

Memcached storage backend.

PetrelBackend

Petrel storage backend (for internal usage).

register_backend

Register a backend.

File IO

dump

Dump data to json/yaml/pickle strings or files.

load

Load data from json/yaml/pickle files.
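For illustration, both functions infer the serialization format from the file suffix (the file name is hypothetical):

from mmengine.fileio import dump, load

dump(dict(a=1), 'demo.json')
obj = load('demo.json')  # -> {'a': 1}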

copy_if_symlink_fails

Create a symbolic link pointing to src named dst; if creating the symlink fails, copy src to dst instead.

copyfile

Copy a file src to dst and return the destination file.

copyfile_from_local

Copy a local file src to dst and return the destination file.

copyfile_to_local

Copy the file src to local dst and return the destination file.

copytree

Recursively copy an entire directory tree rooted at src to a directory named dst and return the destination directory.

copytree_from_local

Recursively copy an entire directory tree rooted at src to a directory named dst and return the destination directory.

copytree_to_local

Recursively copy an entire directory tree rooted at src to a local directory named dst and return the destination directory.

exists

Check whether a file path exists.

generate_presigned_url

Generate the presigned url of video stream which can be passed to mmcv.VideoReader.

get

Read bytes from a given filepath with ‘rb’ mode.

get_file_backend

Return a file backend based on the prefix of uri or backend_args.

get_local_path

Download data from filepath and write the data to local path.

get_text

Read text from a given filepath with ‘r’ mode.

isdir

Check whether a file path is a directory.

isfile

Check whether a file path is a file.

join_path

Concatenate all file paths.

list_dir_or_file

Scan a directory to find the directories or files of interest, in arbitrary order.

put

Write bytes to a given filepath with ‘wb’ mode.

put_text

Write text to a given filepath with ‘w’ mode.

remove

Remove a file.

rmtree

Recursively delete a directory tree.

Parse File

dict_from_file

Load a text file and parse the content as a dict.

list_from_file

Load a text file and parse the content as a list of strings.

mmengine.dist

dist

gather

Gather data from the whole group to dst process.

gather_object

Gathers picklable objects from the whole group in a single process.

all_gather

Gather data from the whole group in a list.

all_gather_object

Gather picklable objects from the whole group into a list.

all_reduce

Reduces the tensor data across all machines in such a way that all get the final result.

all_reduce_dict

Reduces the dict across all machines in such a way that all get the final result.

all_reduce_params

All-reduce parameters.

broadcast

Broadcast the data from src process to the whole group.

sync_random_seed

Synchronize a random seed to all processes.

broadcast_object_list

Broadcasts picklable objects in object_list to the whole group.

collect_results

Collect results in distributed environments.

collect_results_cpu

Collect results under cpu mode.

collect_results_gpu

Collect results under gpu mode.

utils

get_dist_info

Get distributed information of the given process group.
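A sketch of rank-aware code; outside a distributed environment these helpers fall back to rank 0 and world size 1:

from mmengine.dist import get_dist_info, is_main_process

rank, world_size = get_dist_info()
if is_main_process():
    print(f'rank {rank} of {world_size}')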

init_dist

Initialize distributed environment.

init_local_group

Setup the local process group.

get_backend

Return the backend of the given process group.

get_world_size

Return the number of processes in the given process group.

get_rank

Return the rank of the current process in the given process group.

get_local_size

Return the number of processes on the current node.

get_local_rank

Return the rank of the current process within the current node.

is_main_process

Whether the current rank of the given process group is equal to 0.

master_only

Decorate those methods which should be executed in master process.

barrier

Synchronize all processes from the given process group.

is_distributed

Return True if distributed environment has been initialized.

get_local_group

Return local process group.

get_default_group

Return default process group.

get_data_device

Return the device of data.

get_comm_device

Return the device for communication among groups.

cast_data_device

Recursively convert Tensor in data to device.

mmengine.utils

Manager

ManagerMeta

The metaclass for globally accessible classes.

ManagerMixin

ManagerMixin is the base class for classes that have global access requirements.

Path

check_file_exist

fopen

is_abs

Check if path is an absolute path in different backends.

is_filepath

mkdir_or_exist

scandir

Scan a directory to find the files of interest.

symlink

Package

call_command

install_package

get_installed_path

Get installed path of package.

is_installed

Check whether a package is installed.

Version

digit_version

Convert a version string into a tuple of integers.
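For illustration, converting versions to tuples makes comparisons safe where plain string comparison is not:

from mmengine.utils import digit_version

assert digit_version('1.10.0') > digit_version('1.9.1')
assert '1.10.0' < '1.9.1'  # the naive string comparison gets it wrong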

get_git_hash

Get the git hash of the current repo.

Progress Bar

ProgressBar

A progress bar which can print the progress.

track_iter_progress

Track the progress of task iteration or enumeration with a progress bar.

track_parallel_progress

Track the progress of parallel task execution with a progress bar.

track_progress

Track the progress of task execution with a progress bar.

Miscellaneous

Timer

A flexible Timer class.

TimerError

is_list_of

Check whether it is a list of some type.

is_tuple_of

Check whether it is a tuple of some type.

is_seq_of

Check whether it is a sequence of some type.

is_str

Whether the input is a string instance.

iter_cast

Cast elements of an iterable object into some type.

list_cast

Cast elements of an iterable object into a list of some type.

tuple_cast

Cast elements of an iterable object into a tuple of some type.

concat_list

Concatenate a list of list into a single list.

slice_list

Slice a list into several sub-lists according to a list of given lengths.

to_1tuple

to_2tuple

to_3tuple

to_4tuple

to_ntuple

check_prerequisites

A decorator factory to check if prerequisites are satisfied.

deprecated_api_warning

A decorator to check whether some arguments are deprecated and to try to replace the deprecated src_arg_name with dst_arg_name.

deprecated_function

Marks functions as deprecated.

has_method

Check whether the object has a method.

is_method_overridden

Check if a method of base class is overridden in derived class.

import_modules_from_strings

Import modules from the given list of strings.

requires_executable

A decorator to check if some executable files are installed.

requires_package

A decorator to check if some python packages are installed.

check_time

Add check points in a single line.

mmengine.utils.dl_utils

TimeCounter

A tool that counts the average running time of a function or a method.

collect_env

Collect the information of the running environments.

load_url

Loads the Torch serialized object at the given URL.

has_batch_norm

Detect whether model has a BatchNormalization layer.

is_norm

Check if a layer is a normalization layer.

mmcv_full_available

Check whether mmcv-full is installed.

tensor2imgs

Convert tensor to 3-channel images or 1-channel gray images.

TORCH_VERSION

A string with magic powers to compare to both Version and iterables! Prior to 1.10.0 torch.__version__ was stored as a str and so many did comparisons against torch.__version__ as if it were a str.

set_multi_processing

Set multi-processing related environment.

torch_meshgrid

A wrapper of torch.meshgrid to compat different PyTorch versions.

is_jit_tracing

Changelog of v0.x

v0.5.0 (01/20/2023)

Highlights

  • Add BaseInferencer to provide a general inference interface

  • Provide ReduceOnPlateauParamScheduler to adjust learning rate by metric

  • Deprecate support for Python3.6

New Features & Enhancements

  • Deprecate support for Python3.6 by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/863

  • Support non-scalar type metric value by @mzr1996 in https://github.com/open-mmlab/mmengine/pull/827

  • Remove unnecessary calls and lazily import to speed import performance by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/837

  • Support ReduceOnPlateauParamScheduler by @LEFTeyex in https://github.com/open-mmlab/mmengine/pull/819

  • Disable warning of subprocess launched by dataloader by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/870

  • Add BaseInferencer to provide general interface by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/874

Bug Fixes

  • Fix support for Ascend device by @wangjiangben-hw in https://github.com/open-mmlab/mmengine/pull/847

  • Fix Config cannot parse base config when there is . in the tmp path, e.g. tmp/a.b/c, by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/856

  • Fix unloaded weights will not be initialized when using PretrainedInit by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/764

  • Fix error package name defined in PKG2PROJECT by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/872

Docs

  • Fix typos in advanced_tutorials/logging.md by @RangeKing in https://github.com/open-mmlab/mmengine/pull/861

  • Translate CN train_a_gan to EN by @yaqi0510 in https://github.com/open-mmlab/mmengine/pull/860

  • Update fileio.md by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/869

  • Add Chinese documentation for inferencer. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/884

Contributors

A total of 8 developers contributed to this release. Thanks @LEFTeyex, @RangeKing, @yaqi0510, @Xiangxu-0103, @wangjiangben-hw, @mzr1996, @zhouzaida, @HAOCHENYE.

v0.4.0 (12/28/2022)

Highlights

  • Registry supports importing modules automatically

  • Upgrade the documentation and provide the English documentation

  • Provide ProfilerHook to profile the running process

New Features & Enhancements

  • Add conf_path in PetrelBackend by @sunyc11 in https://github.com/open-mmlab/mmengine/pull/774

  • Support multiple --cfg-options. by @mzr1996 in https://github.com/open-mmlab/mmengine/pull/759

  • Support passing arguments to OptimWrapper.update_params by @twmht in https://github.com/open-mmlab/mmengine/pull/796

  • Make get_torchvision_model compatible with torch 1.13 by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/793

  • Support flat_decay_mult and fix bias_decay_mult of depth-wise-conv in DefaultOptimWrapperConstructor by @RangiLyu in https://github.com/open-mmlab/mmengine/pull/771

  • Registry supports importing modules automatically. by @RangiLyu in https://github.com/open-mmlab/mmengine/pull/643

  • Add profiler hook functionality by @BayMaxBHL in https://github.com/open-mmlab/mmengine/pull/768

  • Make TTAModel compatible with FSDP. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/611

Bug Fixes

  • hub.get_model fails on some MMCls models by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/784

  • Fix BaseModel.to and BaseDataPreprocessor.to to make them consistent with torch.nn.Module by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/783

  • Fix creating a new logger at PretrainedInit by @xiexinch in https://github.com/open-mmlab/mmengine/pull/791

  • Fix ZeroRedundancyOptimizer ambiguous error with param groups when PyTorch < 1.12.0 by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/818

  • Fix MessageHub set resumed key repeatedly by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/839

  • Add progress argument to load_from_http by @austinmw in https://github.com/open-mmlab/mmengine/pull/770

  • Ensure metrics is not empty when saving best checkpoint by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/849

Docs

  • Add contributing.md by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/754

  • Add gif to 15 min tutorial by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/748

  • Refactor documentations and translate them to English by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/786

  • Fix document link by @MambaWong in https://github.com/open-mmlab/mmengine/pull/775

  • Fix typos in EN contributing.md by @RangeKing in https://github.com/open-mmlab/mmengine/pull/792

  • Translate data transform docs. by @mzr1996 in https://github.com/open-mmlab/mmengine/pull/737

  • Replace markdown table with html table by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/800

  • Fix wrong example in Visualizer.draw_polygons by @lyviva in https://github.com/open-mmlab/mmengine/pull/798

  • Fix docstring format and rescale the images by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/802

  • Fix failed link in registry by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/811

  • Fix typos by @shanmo in https://github.com/open-mmlab/mmengine/pull/814

  • Fix wrong links and typos in docs by @shanmo in https://github.com/open-mmlab/mmengine/pull/815

  • Translate save_gpu_memory.md by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/803

  • Translate the documentation of hook design by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/780

  • Fix docstring format by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/816

  • Translate registry.md by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/817

  • Update docstring of BaseDataElement by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/836

  • Fix typo by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/841

  • Update docstring of structures by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/840

  • Translate optim_wrapper.md by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/833

  • Fix link error in initialize tutorial. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/843

  • Fix table in initialized.md by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/844

Contributors

A total of 16 developers contributed to this release. Thanks @BayMaxBHL, @RangeKing, @Xiangxu-0103, @xin-li-67, @twmht, @shanmo, @sunyc11, @lyviva, @austinmw, @xiexinch, @mzr1996, @RangiLyu, @MambaWong, @C1rN09, @zhouzaida, @HAOCHENYE

v0.3.2 (11/24/2022)

New Features & Enhancements

  • Send git errors to subprocess.PIPE by @austinmw in https://github.com/open-mmlab/mmengine/pull/717

  • Add a common TestRunnerTestCase to build a Runner instance. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/631

  • Align the log by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/436

  • Log the called order of hooks during training process by @songyuc in https://github.com/open-mmlab/mmengine/pull/672

  • Support setting eta_min_ratio in CosineAnnealingParamScheduler by @cir7 in https://github.com/open-mmlab/mmengine/pull/725

  • Enhance compatibility of revert_sync_batchnorm by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/695

Bug Fixes

  • Fix distributed_training.py in examples by @PingHGao in https://github.com/open-mmlab/mmengine/pull/700

  • Format the log of CheckpointLoader.load_checkpoint by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/685

  • Fix bug of CosineAnnealingParamScheduler by @fangyixiao18 in https://github.com/open-mmlab/mmengine/pull/735

  • Fix add_graph is not called bug by @shenmishajing in https://github.com/open-mmlab/mmengine/pull/632

  • Fix .pre-commit-config-zh-cn.yaml pyupgrade-repo github->gitee by @BayMaxBHL in https://github.com/open-mmlab/mmengine/pull/756

Docs

  • Add English docs of BaseDataset by @GT9505 in https://github.com/open-mmlab/mmengine/pull/713

  • Fix BaseDataset typo about lazy initialization by @MengzhangLI in https://github.com/open-mmlab/mmengine/pull/733

  • Fix typo by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/734

  • Translate visualization docs by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/692

v0.3.1 (11/09/2022)

Highlights

  • Fix error when saving best checkpoint in ddp-training

New Features & Enhancements

  • Replace print with print_log for those functions called by runner by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/686

Bug Fixes

  • Fix error when saving best checkpoint in ddp-training by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/682

Docs

  • Refine Chinese tutorials by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/694

  • Add MMEval in README by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/669

  • Fix error URL in runner docstring by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/668

  • Fix error evaluator type name in evaluator.md by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/675

  • Fix typo in utils.md by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/702

v0.3.0 (11/02/2022)

New Features & Enhancements

  • Support running on Ascend chip by @wangjiangben-hw in https://github.com/open-mmlab/mmengine/pull/572

  • Support torch ZeroRedundancyOptimizer by @nijkah in https://github.com/open-mmlab/mmengine/pull/551

  • Add non-blocking feature to BaseDataPreprocessor by @shenmishajing in https://github.com/open-mmlab/mmengine/pull/618

  • Add documents for clip_grad, and support clip grad by value. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/513

  • Add ROCm info when collecting env by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/633

  • Add a function to mark the deprecated function. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/609

  • Call register_all_modules in Registry.get() by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/541

  • Deprecate _save_to_state_dict implemented in mmengine by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/610

  • Add ignore_keys in ConcatDataset by @BIGWangYuDong in https://github.com/open-mmlab/mmengine/pull/556

Docs

  • Fix cannot show changelog.md in chinese documents. by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/606

  • Fix Chinese docs whitespaces by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/521

  • Translate installation and 15_min by @xin-li-67 in https://github.com/open-mmlab/mmengine/pull/629

  • Refine chinese doc by @Tau-J in https://github.com/open-mmlab/mmengine/pull/516

  • Add MMYOLO link in README by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/634

  • Add MMEngine logo in docs by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/641

  • Fix docstring of BaseDataset by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/656

  • Fix docstring and documentation used for hub.get_model by @zengyh1900 in https://github.com/open-mmlab/mmengine/pull/659

  • Fix typo in docs/zh_cn/advanced_tutorials/visualization.md by @MambaWong in https://github.com/open-mmlab/mmengine/pull/616

  • Fix typo docstring of DefaultOptimWrapperConstructor by @triple-Mu in https://github.com/open-mmlab/mmengine/pull/644

  • Fix typo in advanced tutorial by @cxiang26 in https://github.com/open-mmlab/mmengine/pull/650

  • Fix typo in Config docstring by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/654

  • Fix typo in docs/zh_cn/tutorials/config.md by @Xiangxu-0103 in https://github.com/open-mmlab/mmengine/pull/596

  • Fix typo in docs/zh_cn/tutorials/model.md by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/598

Bug Fixes

  • Fix error calculation of eta_min in CosineRestartParamScheduler by @Z-Fran in https://github.com/open-mmlab/mmengine/pull/639

  • Fix BaseDataPreprocessor.cast_data could not handle string data by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/602

  • Make autocast compatible with mps by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/587

  • Fix error format of log message by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/508

  • Fix error implementation of is_model_wrapper by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/640

  • Fix VisBackend.add_config is not called by @shenmishajing in https://github.com/open-mmlab/mmengine/pull/613

  • Change strict_load of EMAHook to False by default by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/642

  • Fix open encoding problem of Config in Windows by @sanbuphy in https://github.com/open-mmlab/mmengine/pull/648

  • Fix the total number of iterations in log is a float number. by @jbwang1997 in https://github.com/open-mmlab/mmengine/pull/604

  • Fix pip upgrade CI by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/622

New Contributors

  • @shenmishajing made their first contribution in https://github.com/open-mmlab/mmengine/pull/618

  • @Xiangxu-0103 made their first contribution in https://github.com/open-mmlab/mmengine/pull/596

  • @Tau-J made their first contribution in https://github.com/open-mmlab/mmengine/pull/516

  • @wangjiangben-hw made their first contribution in https://github.com/open-mmlab/mmengine/pull/572

  • @triple-Mu made their first contribution in https://github.com/open-mmlab/mmengine/pull/644

  • @sanbuphy made their first contribution in https://github.com/open-mmlab/mmengine/pull/648

  • @Z-Fran made their first contribution in https://github.com/open-mmlab/mmengine/pull/639

  • @BIGWangYuDong made their first contribution in https://github.com/open-mmlab/mmengine/pull/556

  • @zengyh1900 made their first contribution in https://github.com/open-mmlab/mmengine/pull/659

v0.2.0 (10/11/2022)

New Features & Enhancements

  • Add SMDDP backend and support running on AWS by @austinmw in https://github.com/open-mmlab/mmengine/pull/579

  • Refactor FileIO but without breaking bc by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/533

  • Add test time augmentation base model by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/538

  • Use torch.lerp_() to speed up EMA by @RangiLyu in https://github.com/open-mmlab/mmengine/pull/519

  • Support converting BN to SyncBN by config by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/506

  • Support defining metric name in wandb backend by @okotaku in https://github.com/open-mmlab/mmengine/pull/509

  • Add dockerfile by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/347

Docs

  • Fix API files of English documentation by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/525

  • Fix typo in instance_data.py by @Dai-Wenxun in https://github.com/open-mmlab/mmengine/pull/530

  • Fix the docstring of the model sub-package by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/573

  • Fix a spelling error in docs/zh_cn by @cxiang26 in https://github.com/open-mmlab/mmengine/pull/548

  • Fix typo in docstring by @MengzhangLI in https://github.com/open-mmlab/mmengine/pull/527

  • Update config.md by @Zhengfei-0311 in https://github.com/open-mmlab/mmengine/pull/562

Bug Fixes

  • Fix LogProcessor does not smooth loss if the name of loss doesn’t start with loss by @liuyanyi in https://github.com/open-mmlab/mmengine/pull/539

  • Fix failed to enable detect_anomalous_params in MMSeparateDistributedDataParallel by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/588

  • Fix CheckpointHook behavior unexpected if given filename_tmpl argument by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/518

  • Fix error argument sequence in FSDP by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/520

  • Fix uploading image in wandb backend by @okotaku in https://github.com/open-mmlab/mmengine/pull/510

  • Fix loading state dictionary in EMAHook by @okotaku in https://github.com/open-mmlab/mmengine/pull/507

  • Fix circle import in EMAHook by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/523

  • Fix unit test could fail caused by MultiProcessTestCase by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/535

  • Remove unnecessary “if statement” in Registry by @MambaWong in https://github.com/open-mmlab/mmengine/pull/536

  • Fix _save_to_state_dict by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/542

  • Support comparing NumPy array dataset meta in Runner.resume by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/511

  • Use get instead of pop to dump runner_type in build_runner_from_cfg by @nijkah in https://github.com/open-mmlab/mmengine/pull/549

  • Upgrade pre-commit hooks by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/576

  • Delete the error comment in registry.md by @vansin in https://github.com/open-mmlab/mmengine/pull/514

  • Fix Some out-of-date unit tests by @C1rN09 in https://github.com/open-mmlab/mmengine/pull/586

  • Fix typo in MMFullyShardedDataParallel by @yhna940 in https://github.com/open-mmlab/mmengine/pull/569

  • Update Github Action CI and CircleCI by @zhouzaida in https://github.com/open-mmlab/mmengine/pull/512

  • Fix unit test in windows by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/515

  • Fix merge ci & multiprocessing unit test by @HAOCHENYE in https://github.com/open-mmlab/mmengine/pull/529

New Contributors

  • @okotaku made their first contribution in https://github.com/open-mmlab/mmengine/pull/510

  • @MengzhangLI made their first contribution in https://github.com/open-mmlab/mmengine/pull/527

  • @MambaWong made their first contribution in https://github.com/open-mmlab/mmengine/pull/536

  • @cxiang26 made their first contribution in https://github.com/open-mmlab/mmengine/pull/548

  • @nijkah made their first contribution in https://github.com/open-mmlab/mmengine/pull/549

  • @Zhengfei-0311 made their first contribution in https://github.com/open-mmlab/mmengine/pull/562

  • @austinmw made their first contribution in https://github.com/open-mmlab/mmengine/pull/579

  • @yhna940 made their first contribution in https://github.com/open-mmlab/mmengine/pull/569

  • @liuyanyi made their first contribution in https://github.com/open-mmlab/mmengine/pull/539

Contributing to OpenMMLab

Welcome to the MMEngine community! We are committed to building a cutting-edge computer vision foundational library, and all kinds of contributions are welcome, including but not limited to:

Fix bug

You can directly post a Pull Request to fix typos in code or documents

The steps to fix a bug in the code are as follows.

  1. If the modification involves significant changes, you should create an issue first and describe the error information and how to trigger the bug. Other developers will discuss it with you and propose a proper solution.

  2. Post a pull request after fixing the bug and adding the corresponding unit test.

New Feature or Enhancement

  1. If the modification involves significant changes, you should create an issue to discuss with our developers to propose a proper design.

  2. Post a Pull Request after implementing the new feature or enhancement and add the corresponding unit test.

Document

You can directly post a pull request to fix documents. If you want to add a document, you should first create an issue to check if it is reasonable.

Pull Request Workflow

If you’re not familiar with Pull Request, don’t worry! The following guidance will tell you how to create a Pull Request step by step. If you want to dive into the development mode of Pull Request, you can refer to the official documents.

1. Fork and clone

If you are posting a pull request for the first time, you should fork the OpenMMLab repositories by clicking the Fork button in the top right corner of the GitHub page, and the forked repositories will appear under your GitHub profile.

Then, you can clone the repositories to local:

git clone git@github.com:{username}/mmengine.git

After that, you should add the official repository as the upstream repository.

git remote add upstream git@github.com:open-mmlab/mmengine

Check whether the remote repository has been added successfully by git remote -v.

origin	git@github.com:{username}/mmengine.git (fetch)
origin	git@github.com:{username}/mmengine.git (push)
upstream	git@github.com:open-mmlab/mmengine (fetch)
upstream	git@github.com:open-mmlab/mmengine (push)

Note

Here's a brief introduction to origin and upstream. When we use "git clone", we create an "origin" remote by default, which points to the repository we cloned from. As for "upstream", we add it ourselves to point to the target repository. Of course, if you don't like the name "upstream", you could name it as you wish. Usually, we push the code to "origin". If the pushed code conflicts with the latest code in the official repository ("upstream"), we should pull the latest code from upstream to resolve the conflict and then push to "origin" again. The posted Pull Request will be updated automatically.

2. Configure pre-commit

You should configure pre-commit in the local development environment to make sure the code style matches that of OpenMMLab. Note: The following code should be executed under the mmengine directory.

pip install -U pre-commit
pre-commit install

Check that pre-commit is configured successfully, and install the hooks defined in .pre-commit-config.yaml.

pre-commit run --all-files

If the installation process is interrupted, you can repeatedly run pre-commit run ... to continue the installation.

If the code does not conform to the code style specification, pre-commit will raise a warning and automatically fix some of the errors.

If we want to commit our code bypassing the pre-commit hook, we can use the --no-verify option (only for temporary commits):

git commit -m "xxx" --no-verify

3. Create a development branch

After configuring pre-commit, we should create a branch based on the master branch to develop a new feature or fix a bug. The proposed branch name format is username/pr_name:

git checkout -b yhc/refactor_contributing_doc

In subsequent development, if the master branch of the local repository falls behind the master branch of "upstream", pull the upstream branch to synchronize, and then create the development branch as above:

git pull upstream master

4. Commit the code and pass the unit test

  • MMEngine introduces mypy to do static type checking to increase the robustness of the code. Therefore, we need to add Type Hints to our code and pass the mypy check. If you are not familiar with Type Hints, you can refer to this tutorial.

  • The committed code should pass the unit tests

    # Pass all unit tests
    pytest tests
    
    # Pass the unit test of runner
    pytest tests/test_runner/test_runner.py
    

    If the unit test fails for lack of dependencies, you can install the dependencies referring to the guidance

  • If the documents are modified/added, we should check the rendering result; see the Document rendering guidance below

5. Push the code to remote

We can push the local commits to the remote after passing the unit tests and pre-commit checks. You can associate the local branch with the remote branch by adding the -u option.

git push -u origin {branch_name}

This will allow you to use the git push command to push code directly next time, without having to specify a branch or the remote repository.

6. Create a Pull Request

(1) Create a pull request in GitHub’s Pull request interface

(2) Modify the PR description according to the guidelines so that other developers can better understand your changes

Find more details about Pull Request description in pull request guidelines.

Note

(a) The Pull Request description should contain the reason for the change, the content of the change, and the impact of the change, and be associated with the relevant issue (see documentation).

(b) If it is your first contribution, please sign the CLA

(c) Check whether the Pull Request pass through the CI

MMEngine runs unit tests for the posted Pull Request on different platforms (Linux, Windows, macOS) and with different versions of Python, PyTorch and CUDA to make sure the code is correct. We can see the specific test information by clicking Details on the PR checks so that we can fix the code if needed.

(3) If the Pull Request passes the CI, then you can wait for the review from other developers. You’ll modify the code based on the reviewer’s comments, and repeat the steps 4-5 until all reviewers approve it. Then, we will merge it ASAP.

7. Resolve conflicts

If your local branch conflicts with the latest master branch of "upstream", you'll need to resolve the conflicts. There are two ways to do this:

git fetch --all --prune
git rebase upstream/master

or

git fetch --all --prune
git merge upstream/master

If you are very good at handling conflicts, then you can use rebase to resolve conflicts, as this will keep your commit logs tidy. If you are not familiar with rebase, then you can use merge to resolve conflicts.

Guidance

Unit test

We should also make sure that the committed code does not decrease unit test coverage. We can run the following commands to check coverage:

python -m coverage run -m pytest /path/to/test_file
python -m coverage html
# check file in htmlcov/index.html

Document rendering

If the documents are modified/added, we should check the rendering result. We could install the dependencies and run the following command to render the documents and check the results:

pip install -r requirements/docs.txt
cd docs/zh_cn/
# or docs/en
make html
# check file in ./docs/zh_cn/_build/html/index.html

Python Code style

We adopt PEP8 as the preferred code style.

We use the following tools for linting and formatting:

  • flake8: A wrapper around some linter tools.

  • isort: A Python utility to sort imports.

  • yapf: A formatter for Python files.

  • codespell: A Python utility to fix common misspellings in text files.

  • mdformat: Mdformat is an opinionated Markdown formatter that can be used to enforce a consistent style in Markdown files.

  • docformatter: A formatter to format docstring.

Style configurations of yapf and isort can be found in setup.cfg.

We use a pre-commit hook that, on every commit, checks and formats code with flake8, yapf and isort, fixes trailing whitespace, end-of-file, double-quoted-string, python-encoding-pragma and mixed-line-ending issues, formats Markdown files, and sorts requirements.txt automatically. The pre-commit hook configuration is stored in .pre-commit-config.

PR Specs

  1. Use pre-commit hook to avoid issues of code style

  2. One short-time branch should be matched with only one PR

  3. Accomplish a detailed change in one PR. Avoid large PR

    • Bad: Support Faster R-CNN

    • Acceptable: Add a box head to Faster R-CNN

    • Good: Add a parameter to box head to support custom conv-layer number

  4. Provide clear and significant commit message

  5. Provide clear and meaningful PR description

    • Task name should be clarified in title. The general format is: [Prefix] Short description of the PR (Suffix)

    • Prefix: add new feature [Feature], fix bug [Fix], related to documents [Docs], in developing [WIP] (which will not be reviewed temporarily)

    • Introduce main changes, results and influences on other modules in short description

    • Associate related issues and pull requests with a milestone
