Logging¶
Overview¶
The Runner produces large amounts of logs during execution, including dataset information, model initialization, learning rates, losses, etc. To make these logs easily accessible to users, MMEngine designs MessageHub, HistoryBuffer, LogProcessor and MMLogger, which enable users to:
- Configure statistical methods in config files. For example, losses can be globally averaged or smoothed by a sliding window.
- Query training states (iterations, epochs, etc.) in any module.
- Configure whether to save multi-process logs during distributed training.
Each scalar (losses, learning rates, etc.) during training is encapsulated in a HistoryBuffer, managed by the MessageHub in key-value pairs, formatted by the LogProcessor and then exported to various visualization backends by the LoggerHook. In most cases, the statistical methods for these scalars can be configured through the LogProcessor without understanding the data flow. Before diving into the design of the logging system, please read the logging tutorial first to familiarize yourself with the basic use cases.
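For example, the global averaging and sliding-window smoothing mentioned above are just LogProcessor settings. The snippet below is a minimal sketch of such a configuration (the keys are explained in the LogProcessor section later in this document; the field name `loss_global` is illustrative):

```python
# A minimal sketch: keep the default sliding-window mean of the loss and
# additionally report its global mean, without touching any data flow.
log_processor = dict(
    window_size=10,  # smooth losses and iteration time over the latest 10 iterations
    custom_cfg=[
        dict(data_src='loss',
             log_name='loss_global',   # extra field, so the smoothed 'loss' is kept
             method_name='mean',
             window_size='global')])   # 'global' averages over the whole training run
```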
HistoryBuffer¶
HistoryBuffer records the history of a scalar, such as losses, learning rates, and iteration time, in an array. As an internal class, it works with MessageHub, LoggerHook and LogProcessor to make the training log configurable. Meanwhile, HistoryBuffer can also be used alone, which enables users to manage their training logs and compute various statistics in an easy manner.
We will first introduce the usage of HistoryBuffer in the following section. The association between HistoryBuffer and MessageHub will be introduced later in the MessageHub section.
HistoryBuffer Initialization¶
HistoryBuffer accepts `log_history`, `count_history` and `max_length` for initialization.
- `log_history` records the history of the scalar. For example, if the loss in the previous 3 iterations is 0.3, 0.2, 0.1 respectively, there will be `log_history=[0.3, 0.2, 0.1]`.
- `count_history` controls the statistical granularity and will be used when computing the average. Taking the above example, if we average the loss across iterations, we have `count_history=[1, 1, 1]`. Instead, if we average the loss across images with `batch_size=8`, then we have `count_history=[8, 8, 8]`.
- `max_length` controls the maximum length of the history. If the length of `log_history` and `count_history` exceeds `max_length`, the earliest elements will be removed.
Besides, we can access the history of the data through `history_buffer.data`.
from mmengine.logging import HistoryBuffer
history_buffer = HistoryBuffer() # Default initialization
log_history, count_history = history_buffer.data
# [] []
history_buffer = HistoryBuffer([1, 2, 3], [1, 2, 3]) # Init with lists
log_history, count_history = history_buffer.data
# [1 2 3] [1 2 3]
history_buffer = HistoryBuffer([1, 2, 3], [1, 2, 3], max_length=2)
# The length of the history buffer (3) exceeds max_length (2); the earliest elements are dropped.
log_history, count_history = history_buffer.data
# [2 3] [2 3]
HistoryBuffer Update¶
We can update `log_history` and `count_history` through `HistoryBuffer.update(log_history, count_history)`.
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.update(4) # count default to 1
log_history, count_history = history_buffer.data
# [1, 2, 3, 4] [1, 1, 1, 1]
history_buffer.update(5, 2)
log_history, count_history = history_buffer.data
# [1, 2, 3, 4, 5] [1, 1, 1, 1, 2]
Basic Statistical Methods¶
HistoryBuffer provides some basic statistical methods:
- `current()`: Get the latest data.
- `mean(window_size=None)`: Compute the mean of the previous `window_size` data. Defaults to None, meaning the global mean.
- `max(window_size=None)`: Compute the maximum of the previous `window_size` data. Defaults to None, meaning the global maximum.
- `min(window_size=None)`: Compute the minimum of the previous `window_size` data. Defaults to None, meaning the global minimum.
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.min(2)
# 2, the minimum in [2, 3]
history_buffer.min()
# 1, the global minimum
history_buffer.max(2)
# 3, the maximum in [2, 3]
history_buffer.max()
# 3, the global maximum
history_buffer.mean(2)
# 2.5, the mean value in [2, 3], (2 + 3) / (1 + 1)
history_buffer.mean()
# 2, the global mean, (1 + 2 + 3) / (1 + 1 + 1)
history_buffer = HistoryBuffer([1, 2, 3], [2, 2, 2]) # Cases when counts are not 1
history_buffer.mean()
# 1, (1 + 2 + 3) / (2 + 2 + 2)
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.update(4, 1)
history_buffer.current()
# 4
Statistical Methods Invoking¶
Statistical methods can be invoked through `HistoryBuffer.statistics` with the method name and its arguments. The `name` parameter must be a registered method name (e.g. built-in methods like `min` and `max`), while the remaining arguments must match the corresponding method's signature.
history_buffer = HistoryBuffer([1, 2, 3], [1, 1, 1])
history_buffer.statistics('mean')
# 2, as global mean
history_buffer.statistics('mean', 2)
# 2.5, as the mean of [2, 3]
history_buffer.statistics('mean', 2, 3)
# Error! mismatch arguments given to `mean(window_size)`
history_buffer.statistics('data')
# Error! `data` method not registered
Statistical Methods Registration¶
Custom statistical methods can be registered through the `@HistoryBuffer.register_statistics` decorator.
from mmengine.logging import HistoryBuffer
import numpy as np
@HistoryBuffer.register_statistics
def weighted_mean(self, window_size, weight):
    assert len(weight) == window_size
    return (self._log_history[-window_size:] * np.array(weight)).sum() / \
        self._count_history[-window_size:].sum()
history_buffer = HistoryBuffer([1, 2], [1, 1])
history_buffer.statistics('weighted_mean', 2, [2, 1]) # get (2 * 1 + 1 * 2) / (1 + 1)
Use Cases¶
logs = dict(lr=HistoryBuffer(), loss=HistoryBuffer())  # different keys for different logs
max_iter = 10
log_interval = 5
for iter in range(1, max_iter + 1):
    lr = iter / max_iter * 0.1  # linear scaling of lr
    loss = 1 / iter  # loss
    logs['lr'].update(lr, 1)
    logs['loss'].update(loss, 1)
    if iter % log_interval == 0:
        latest_lr = logs['lr'].statistics('current')  # select statistical methods by name
        mean_loss = logs['loss'].statistics('mean', log_interval)  # mean loss of the latest `log_interval` iterations
        print(f'lr: {latest_lr}\n'
              f'loss: {mean_loss}')
# lr: 0.05
# loss: 0.45666666666666667
# lr: 0.1
# loss: 0.12912698412698415
MessageHub¶
As shown above, HistoryBuffer can easily handle the update and statistics of a single variable. However, there are multiple variables to log during training, each potentially coming from a different module. This makes it an issue to collect and distribute different variables. To address this issue, we provide MessageHub in MMEngine. It is derived from ManagerMixin and thus can be accessed globally. It can be used to simplify the sharing of data across modules.
MessageHub stores data in two internal dictionaries, each with its own purpose:
- `log_scalars`: Scalars such as losses, learning rates and iteration time are collected from different modules and stored in the HistoryBuffer with the corresponding key in this dict. Values in this dict will be formatted by LogProcessor and then output to the terminal or saved locally. If you want to customize your logging info, you can add new keys to this dict and update them in subsequent training steps.
- `runtime_info`: Runtime information such as epochs and iterations is stored in this dict, which makes it easy to share necessary information across modules.
Note
You may need to use MessageHub only if you want to add extra data to logs or share custom data across modules.
The following examples show the usage of MessageHub, including scalars update, data sharing and log customization.
Update & get training log¶
HistoryBuffers are stored as values in MessageHub's `log_scalars` dictionary. You can call the `update_scalar` (or `update_scalars`) method to update the HistoryBuffer under a given key: on the first call with an unseen key, a HistoryBuffer is initialized; subsequent calls with the same key invoke the corresponding HistoryBuffer's `update` method. You can get the value or statistics of a HistoryBuffer by passing its key to the `get_scalar` method, or get all logs by directly accessing the `log_scalars` attribute of the MessageHub.
from mmengine import MessageHub
message_hub = MessageHub.get_instance('task')
message_hub.update_scalar('train/loss', 1, 1)
message_hub.get_scalar('train/loss').current() # 1, the latest updated train/loss
message_hub.update_scalar('train/loss', 3, 1)
message_hub.get_scalar('train/loss').mean() # 2, the mean calculated as (1 + 3) / (1 + 1)
message_hub.update_scalar('train/lr', 0.1, 1)
message_hub.update_scalars({'train/time': {'value': 0.1, 'count': 1},
'train/data_time': {'value': 0.1, 'count': 1}})
train_time = message_hub.get_scalar('train/time').current()  # 0.1, the latest train/time
log_dict = message_hub.log_scalars # return the whole dict
lr_buffer, loss_buffer, time_buffer, data_time_buffer = (
log_dict['train/lr'], log_dict['train/loss'], log_dict['train/time'],
log_dict['train/data_time'])
Note
Losses, learning rates and iteration time are automatically updated by runner and hooks. You are not supposed to manually update them.
Note
MessageHub has no special requirements for keys in `log_scalars`. However, MMEngine will only output a scalar to logs if its key is prefixed with train/, val/ or test/.
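For instance, a scalar updated with an un-prefixed key is still stored in the MessageHub, but it never reaches the exported logs. A minimal sketch (the instance name and keys are illustrative):

```python
from mmengine import MessageHub

message_hub = MessageHub.get_instance('prefix_demo')
message_hub.update_scalar('train/acc', 0.9)  # prefixed with 'train/', shows up in the training log
message_hub.update_scalar('acc', 0.9)        # stored in log_scalars, but not exported to the log
```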
Update & get runtime info¶
Runtime information is stored in the `runtime_info` dict. The dict accepts data of any type. Different from HistoryBuffer, the value will be overwritten on every update.
message_hub = MessageHub.get_instance('task')
message_hub.update_info('iter', 1)
message_hub.get_info('iter') # 1
message_hub.update_info('iter', 2)
message_hub.get_info('iter') # 2, overwritten by the above command
Add custom logs¶
Users can update scalars in the MessageHub anywhere in any module. All data in `log_scalars` with valid keys is exported to user-defined backends after the statistical methods are applied.
Note
Only data in `log_scalars` with keys prefixed with train/, val/ or test/ is exported.
class CustomModule:

    def __init__(self):
        self.message_hub = MessageHub.get_current_instance()

    def custom_method(self):
        self.message_hub.update_scalar('train/a', 100)
        self.message_hub.update_scalars({'train/b': 1, 'train/c': 2})
By default, the latest values of the custom data (a, b and c) are exported. Users can also configure the LogProcessor to apply other statistical methods.
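For example, to report the sliding-window mean of the custom scalar `a` above instead of its latest value, a LogProcessor configuration like the following sketch could be used; note that the 'train/' prefix is dropped in `data_src` (see the LogProcessor section below):

```python
# A sketch: average the custom scalar 'a' over the latest 10 iterations
# instead of exporting its latest value.
log_processor = dict(
    custom_cfg=[
        dict(data_src='a',        # 'train/a' with the 'train/' prefix removed
             method_name='mean',
             window_size=10)])
```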
LogProcessor¶
Users can configure the LogProcessor to specify the statistical methods and their extra arguments. By default, learning rates are displayed with their latest value, while losses and iteration time are smoothed over an iteration-based window.
Minimum example¶
log_processor = dict(
    window_size=10
)
In this configuration, losses and iteration time will be averaged over the latest 10 iterations. The output might look like:
04/15 12:34:24 - mmengine - INFO - Iter [10/12] , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.13
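The dict above mirrors the constructor of the LogProcessor class, so the same behavior can be obtained by instantiating it directly. A minimal sketch, assuming the class is importable from `mmengine.runner` (as in recent MMEngine versions):

```python
from mmengine.runner import LogProcessor

# Equivalent to log_processor = dict(window_size=10): smooth losses and
# iteration time over the latest 10 iterations.
log_processor = LogProcessor(window_size=10)
```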
Custom statistical methods¶
Users can configure the `custom_cfg` list to specify the statistical methods. Each element in `custom_cfg` must be a dict consisting of the following keys:
- `data_src`: Required. The data source of the log. A data source may have multiple statistical methods. Default sources, which are automatically added to logs, include all keys in the loss dict (i.e. `loss`), the learning rate (`lr`) and the iteration time (`time` & `data_time`). Besides, all scalars updated by MessageHub's `update_scalar`/`update_scalars` methods with valid keys are configurable data sources, but be aware that the prefix ('train/', 'val/', 'test/') should be removed.
- `method_name`: Required. The statistical method. Both built-in methods and custom methods are supported.
- `log_name`: Optional. The output name after statistics. If not specified, the new log will overwrite the old one.
- Other arguments: Extra arguments needed by the specified method. `window_size` is a special key, which can be an int, 'epoch' or 'global'. LogProcessor parses these values and returns statistics based on iteration-, epoch- or global-level smoothing.
Overwrite the old statistical method
log_processor = dict(
    window_size=10,
    by_epoch=True,
    custom_cfg=[
        dict(data_src='loss',
             method_name='mean',
             window_size=100)])
In this configuration, LogProcessor overrides the default window size of 10 with the larger window size of 100 and writes the resulting mean to the 'loss' field in logs.
04/15 12:34:24 - mmengine - INFO - Iter [10/12] , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.11
New statistical method without overwriting
log_processor = dict(
    window_size=10,
    by_epoch=True,
    custom_cfg=[
        dict(data_src='loss',
             log_name='loss_min',
             method_name='min',
             window_size=100)])
04/15 12:34:24 - mmengine - INFO - Iter [10/12] , eta: 0:00:00, time: 0.003, data_time: 0.002, loss: 0.11, loss_min: 0.08
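As noted above, `window_size` also accepts 'epoch' and 'global'. A sketch combining both, which smooths the loss over the current epoch and additionally reports a global mean (the field name `loss_global` is illustrative):

```python
# A sketch: epoch-level smoothing of the loss plus an extra global mean.
log_processor = dict(
    window_size=10,
    by_epoch=True,
    custom_cfg=[
        dict(data_src='loss',
             method_name='mean',
             window_size='epoch'),    # average over the current epoch
        dict(data_src='loss',
             log_name='loss_global',
             method_name='mean',
             window_size='global')])  # average over the whole run
```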
MMLogger¶
In order to export logs with a clear hierarchy, a unified format and less interference from third-party logging systems, MMEngine implements an `MMLogger` class based on `logging`. It is derived from ManagerMixin. Compared with `logging.Logger`, it enables accessing the logger of the current runner without knowing the logger name.
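Because MMLogger inherits from ManagerMixin, a logger created once (for example by the runner) can later be fetched anywhere via `get_current_instance` without knowing its name:

```python
from mmengine.logging import MMLogger

# Somewhere early in the program (e.g. done by the runner):
MMLogger.get_instance('mmengine', log_level='INFO')

# In any other module, fetch the most recently created logger without its name:
logger = MMLogger.get_current_instance()
logger.info('accessed without knowing the logger name')
```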
Instantiate MMLogger¶
Users can create a global logger by calling `get_instance`. The default log format is shown below:
logger = MMLogger.get_instance('mmengine', log_level='INFO')
logger.info("this is a test")
# 04/15 14:01:11 - mmengine - INFO - this is a test
Apart from user-defined messages, the logger also exports timestamps, the logger name and the log level. ERROR messages are treated specially: they are highlighted in red and carry extra information such as the error location.
logger = MMLogger.get_instance('mmengine', log_level='INFO')
logger.error('division by zero')
# 04/15 14:01:56 - mmengine - ERROR - /mnt/d/PythonCode/DeepLearning/OpenMMLab/mmengine/a.py - <module> - 4 - division by zero
Export logs¶
When `get_instance` is invoked with the `log_file` argument, logs will additionally be exported to local storage in text format.
logger = MMLogger.get_instance('mmengine', log_file='tmp.log', log_level='INFO')
logger.info("this is a test")
# 04/15 14:01:11 - mmengine - INFO - this is a test
`tmp/tmp.log`:
04/15 14:01:11 - mmengine - INFO - this is a test
Since distributed applications create multiple log files, a directory with the same name as the log file is prepended to the exported log path, and logs from different processes are all saved in this directory. Therefore, the actual log file path in the above example is `tmp/tmp.log`.
Export logs in distributed training¶
When training with PyTorch's distributed methods, users can set `distributed=True` or `log_level='DEBUG'` in the config file to export logs from all processes. If neither is specified, only the master process exports a log file.
logger = MMLogger.get_instance('mmengine', log_file='tmp.log', distributed=True, log_level='INFO')
# or
# logger = MMLogger.get_instance('mmengine', log_file='tmp.log', log_level='DEBUG')
In the case of multiple processes on a single node, or multiple processes across multiple nodes with shared storage, the exported log files have the following hierarchy:
# shared storage case
work_dir/20230228_141908
├── 20230306_183634_${hostname}_device0_rank0.log
├── 20230306_183634_${hostname}_device1_rank1.log
├── 20230306_183634_${hostname}_device2_rank2.log
├── 20230306_183634_${hostname}_device3_rank3.log
├── 20230306_183634_${hostname}_device4_rank4.log
├── 20230306_183634_${hostname}_device5_rank5.log
├── 20230306_183634_${hostname}_device6_rank6.log
├── 20230306_183634_${hostname}_device7_rank7.log
...
├── 20230306_183634_${hostname}_device7_rank63.log
In the case of multiple processes across multiple nodes without shared storage, logs are organized as follows:
# without shared storage
# node 0:
work_dir/20230228_141908
├── 20230306_183634_${hostname}_device0_rank0.log
├── 20230306_183634_${hostname}_device1_rank1.log
├── 20230306_183634_${hostname}_device2_rank2.log
├── 20230306_183634_${hostname}_device3_rank3.log
├── 20230306_183634_${hostname}_device4_rank4.log
├── 20230306_183634_${hostname}_device5_rank5.log
├── 20230306_183634_${hostname}_device6_rank6.log
├── 20230306_183634_${hostname}_device7_rank7.log
# node 7:
work_dir/20230228_141908
├── 20230306_183634_${hostname}_device0_rank56.log
├── 20230306_183634_${hostname}_device1_rank57.log
├── 20230306_183634_${hostname}_device2_rank58.log
├── 20230306_183634_${hostname}_device3_rank59.log
├── 20230306_183634_${hostname}_device4_rank60.log
├── 20230306_183634_${hostname}_device5_rank61.log
├── 20230306_183634_${hostname}_device6_rank62.log
├── 20230306_183634_${hostname}_device7_rank63.log