During neural network training, optimization hyperparameters (e.g. learning rate) are usually adjusted along with the training process. One of the simplest and most common learning rate adjustment strategies is multi-step learning rate decay, which reduces the learning rate to a fraction at regular intervals. PyTorch provides LRScheduler to implement various learning rate adjustment strategies. In MMEngine, we have extended it and implemented a more general ParamScheduler. It can adjust optimization hyperparameters such as learning rate and momentum. It also supports the combination of multiple schedulers to create more complex scheduling strategies.
We first introduce how to use PyTorch’s
torch.optim.lr_scheduler to adjust learning rate.
How to use PyTorch's builtin learning rate scheduler?
Here is an example which refers from PyTorch official documentation:
Initialize an ExponentialLR object, and call the
step method after each training epoch.
import torch from torch.optim import SGD from torch.optim.lr_scheduler import ExponentialLR model = torch.nn.Linear(1, 1) dataset = [torch.randn((1, 1, 1)) for _ in range(20)] optimizer = SGD(model, 0.1) scheduler = ExponentialLR(optimizer, gamma=0.9) for epoch in range(10): for data in dataset: optimizer.zero_grad() output = model(data) loss = 1 - output loss.backward() optimizer.step() scheduler.step()
mmengine.optim.scheduler supports most of PyTorch’s learning rate schedulers such as
MultiStepLR, etc. Please refer to parameter scheduler API documentation for all of the supported schedulers.
MMEngine also supports adjusting momentum with parameter schedulers. To use momentum schedulers, replace
LR in the class name to
Momentum, such as
LinearMomentum. Further, we implement the general parameter scheduler ParamScheduler, which is used to adjust the specified hyperparameters in the optimizer, such as weight_decay, etc. This feature makes it easier to apply some complex hyperparameter tuning strategies.
Different from the above example, MMEngine usually does not need to manually implement the training loop and call
optimizer.step(). The runner will automatically manage the training progress and control the execution of the parameter scheduler through
Use a single LRScheduler¶
If only one scheduler needs to be used for the entire training process, there is no difference with PyTorch’s learning rate scheduler.
# build the scheduler manually from torch.optim import SGD from mmengine.runner import Runner from mmengine.optim.scheduler import MultiStepLR optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9) param_scheduler = MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1) runner = Runner( model=model, optim_wrapper=dict( optimizer=optimizer), param_scheduler=param_scheduler, ... )
If using the runner with the registry and config file, we can specify the scheduler by setting the
param_scheduler field in the config. The runner will automatically build a parameter scheduler based on this field:
# build the scheduler with config file param_scheduler = dict(type='MultiStepLR', by_epoch=True, milestones=[8, 11], gamma=0.1)
Note that the parameter
by_epoch is added here, which controls the frequency of learning rate adjustment. When set to True, it means adjusting by epoch. When set to False, it means adjusting by iteration. The default value is True.
In the above example, it means to adjust according to epochs. At this time, the unit of the parameters is epoch. For example, [8, 11] in
milestones means that the learning rate will be multiplied by 0.1 at the end of the 8 and 11 epoch.
When the frequency is modified, the meaning of the count-related settings of the scheduler will be changed accordingly. When
by_epoch=True, the numbers in milestones indicate at which epoch the learning rate decay is performed, and when
by_epoch=False it indicates at which iteration the learning rate decay is performed.
Here is an example of adjusting by iterations: At the end of the 600th and 800th iterations, the learning rate will be multiplied by 0.1 times.
param_scheduler = dict(type='MultiStepLR', by_epoch=False, milestones=[600, 800], gamma=0.1)
If users want to use the iteration-based frequency while filling the scheduler config settings by epoch, MMEngine’s scheduler also provides an automatic conversion method. Users can call the
build_iter_from_epoch method and provide the number of iterations for each training epoch to construct a scheduler object updated by iterations:
epoch_length = len(train_dataloader) param_scheduler = MultiStepLR.build_iter_from_epoch(optimizer, milestones=[8, 11], gamma=0.1, epoch_length=epoch_length)
If using config to build a scheduler, just add
convert_to_iter_based=True to the field. The runner will automatically call
build_iter_from_epoch to convert the epoch-based config to an iteration-based scheduler object:
param_scheduler = dict(type='MultiStepLR', by_epoch=True, milestones=[8, 11], gamma=0.1, convert_to_iter_based=True)
Below is a Cosine Annealing learning rate scheduler that is updated by epoch, where the learning rate is only modified after each epoch:
param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, T_max=12)
After automatically conversion, the learning rate is updated by iteration. As you can see from the graph below, the learning rate changes more smoothly.
param_scheduler = dict(type='CosineAnnealingLR', by_epoch=True, T_max=12, convert_to_iter_based=True)
Combine multiple LRSchedulers (e.g. learning rate warm-up)¶
In the training process of some algorithms, the learning rate is not adjusted according to a certain scheduling strategy from beginning to end. The most common example is learning rate warm-up.
For example, in the first few iterations, a linear strategy is used to increase the learning rate from a small value to normal, and then another strategy is applied.
MMEngine supports combining multiple schedulers together. Just modify the
param_scheduler field in the config file to a list of scheduler config, and the ParamSchedulerHook can automatically process the scheduler list. The following example implements learning rate warm-up.
param_scheduler = [ # Linear learning rate warm-up scheduler dict(type='LinearLR', start_factor=0.001, by_epoch=False, # Updated by iterations begin=0, end=50), # Warm up for the first 50 iterations # The main LRScheduler dict(type='MultiStepLR', by_epoch=True, # Updated by epochs milestones=[8, 11], gamma=0.1) ]
Note that the
end parameters are added here. These two parameters specify the valid interval of the scheduler. The valid interval usually only needs to be set when multiple schedulers are combined, and can be ignored when using a single scheduler. When the
end parameters are specified, it means that the scheduler only takes effect in the [begin, end) interval, and the unit is determined by the
In the above example, the
LinearLR in the warm-up phase is False, which means that the scheduler only takes effect in the first 50 iterations. After more than 50 iterations, the scheduler will no longer take effect, and the second scheduler, which is
MultiStepLR, will control the learning rate. When combining different schedulers, the
by_epoch parameter does not have to be the same for each scheduler.
Here is another example:
param_scheduler = [ # Use a linear warm-up at [0, 100) iterations dict(type='LinearLR', start_factor=0.001, by_epoch=False, begin=0, end=100), # Use a cosine learning rate at [100, 900) iterations dict(type='CosineAnnealingLR', T_max=800, by_epoch=False, begin=100, end=900) ]
The above example uses a linear learning rate warm-up for the first 100 iterations, and then uses a cosine annealing learning rate scheduler with a period of 800 from the 100th to the 900th iteration.
Users can combine any number of schedulers. If the valid intervals of two schedulers are not connected to each other which leads to an interval that is not covered, the learning rate of this interval remains unchanged. If the valid intervals of the two schedulers overlap, the adjustment of the learning rate will be triggered in the order of the scheduler config (similar with
We recommend using different learning rate scheduling strategies in different stages of training to avoid overlapping of the valid intervals. Be careful If you really need to stack two schedulers overlapped. We recommend using learning rate visualization tool to visualize the learning rate after stacking, to avoid the adjustment not as expected.
How to adjust other hyperparameters¶
Like learning rate, momentum is a schedulable hyperparameter in the optimizer’s parameter group. The momentum scheduler is used in exactly the same way as the learning rate scheduler. Just add the momentum scheduler config to the list in the
param_scheduler = [ # the lr scheduler dict(type='LinearLR', ...), # the momentum scheduler dict(type='LinearMomentum', start_factor=0.001, by_epoch=False, begin=0, end=1000) ]
Generic parameter scheduler¶
MMEngine also provides a set of generic parameter schedulers for scheduling other hyperparameters in the
param_groups of the optimizer. Change
LR in the class name of the learning rate scheduler to
Param, such as
LinearParamScheduler. Users can schedule the specific hyperparameters by setting the
param_name variable of the scheduler.
Here is an example:
param_scheduler = [ dict(type='LinearParamScheduler', param_name='lr', # adjust the 'lr' in `optimizer.param_groups` start_factor=0.001, by_epoch=False, begin=0, end=1000) ]
By setting the
'lr', this parameter scheduler is equivalent to
In addition to learning rate and momentum, users can also schedule other parameters in
optimizer.param_groups. The schedulable parameters depend on the optimizer used. For example, when using the SGD optimizer with
weight_decay can be adjusted as follows:
param_scheduler = [ dict(type='LinearParamScheduler', param_name='weight_decay', # adjust 'weight_decay' in `optimizer.param_groups` start_factor=0.001, by_epoch=False, begin=0, end=1000) ]