DeepSpeedStrategy¶
- class mmengine._strategy.DeepSpeedStrategy(*, config=None, zero_optimization=None, gradient_clipping=None, fp16=None, inputs_to_half=None, bf16=None, amp=None, activation_checkpointing=None, aio=None, train_micro_batch_size_per_gpu=None, gradient_accumulation_steps=None, steps_per_print=10000000000000, exclude_frozen_parameters=None, **kwargs)[source]¶
Support training models with DeepSpeed.
Note
The detailed usage of parameters can be found at https://www.deepspeed.ai/docs/config-json/.
- Parameters:
config (str or dict, optional) – If it is a string, it is the path to a DeepSpeed config file to load. Defaults to None.
zero_optimization (dict, optional) – Enabling and configuring ZeRO memory optimizations. Defaults to None.
gradient_clipping (float, optional) – Enable gradient clipping with value. Defaults to None.
fp16 (dict, optional) – Configuration for using mixed precision/FP16 training that leverages NVIDIA’s Apex package. Defaults to None.
inputs_to_half (list[int or str], optional) – Which inputs are to be converted to half precision. Defaults to None. If fp16 is enabled, it also should be set.
bf16 (dict, optional) – Configuration for using bfloat16 floating-point format as an alternative to FP16. Defaults to None.
amp (dict, optional) – Configuration for using automatic mixed precision (AMP) training that leverages NVIDIA’s Apex AMP package. Defaults to None.
activation_checkpointing (dict, optional) – Reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. Defaults to None.
aio (dict, optional) – Configuring the asynchronous I/O module for offloading parameter and optimizer states to persistent (NVMe) storage. This module uses Linux native asynchronous I/O (libaio). Defaults to None.
train_micro_batch_size_per_gpu (int, optional) – Batch size to be processed by one GPU in one step (without gradient accumulation). Defaults to None.
gradient_accumulation_steps (int, optional) – Number of training steps to accumulate gradients before averaging and applying them. Defaults to None.
exclude_frozen_parameters (bool, optional) – Whether to exclude frozen parameters from the saved checkpoint. Defaults to None.
steps_per_print (int) – Print training progress every N training steps. Defaults to 10000000000000.
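Example
A minimal construction sketch, assuming a typical ZeRO stage-2 + FP16 setup; the numeric values below are illustrative, not recommended defaults:
>>> from mmengine._strategy import DeepSpeedStrategy
>>> strategy = DeepSpeedStrategy(
...     zero_optimization=dict(stage=2),        # ZeRO stage-2 memory optimization
...     fp16=dict(enabled=True, loss_scale=0),  # dynamic loss scaling
...     gradient_clipping=1.0,
...     train_micro_batch_size_per_gpu=8,
...     gradient_accumulation_steps=2)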
- load_checkpoint(filename, *, map_location='cpu', strict=False, revise_keys=[('^module.', '')], callback=None)[source]¶
Load checkpoint from the given filename.
Warning
map_location and callback parameters are not supported yet.
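A minimal usage sketch, assuming a strategy constructed and prepared as in the sketches on this page; the checkpoint path is illustrative (DeepSpeed checkpoints are saved as directories):
>>> checkpoint = strategy.load_checkpoint('work_dirs/iter_1000')  # loads model weights only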
- prepare(model, *, optim_wrapper=None, param_scheduler=None, compile=False, dispatch_kwargs=None)[source]¶
Prepare model and some components.
- Parameters:
model (torch.nn.Module or dict) – The model to be run. It can be a dict used to build the model.
optim_wrapper (BaseOptimWrapper | dict | None) –
param_scheduler (_ParamScheduler | Dict | List | None) –
dispatch_kwargs (dict | None) –
- Keyword Arguments:
optim_wrapper (BaseOptimWrapper or dict, optional) – Computing the gradient of model parameters and updating them. Defaults to None. See build_optim_wrapper() for examples.
param_scheduler (_ParamScheduler or dict or list, optional) – Parameter scheduler for updating optimizer parameters. If specified, optim_wrapper should also be specified. Defaults to None. See build_param_scheduler() for examples.
compile (dict, optional) – Config to compile model. Defaults to False. Requires PyTorch>=2.0.
dispatch_kwargs (dict, optional) – Kwargs to be passed to other methods of Strategy. Defaults to None.
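Example
A hedged usage sketch. The toy model, the DeepSpeedOptimWrapper/AdamW optimizer config, and the dispatch_kwargs keys below are illustrative assumptions, not required values:
>>> import torch.nn as nn
>>> from mmengine._strategy import DeepSpeedStrategy
>>> strategy = DeepSpeedStrategy(
...     fp16=dict(enabled=True),
...     train_micro_batch_size_per_gpu=1)
>>> model = nn.Linear(2, 1)  # stand-in for a real model
>>> prepared = strategy.prepare(
...     model,
...     optim_wrapper=dict(
...         type='DeepSpeedOptimWrapper',
...         optimizer=dict(type='AdamW', lr=1e-3)),
...     dispatch_kwargs=dict(max_iters=1000))  # assumed keys, forwarded to other Strategy methods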
- resume(filename, *, resume_optimizer=True, resume_param_scheduler=True, map_location='default', callback=None)[source]¶
Resume training from the given filename.
Warning
map_location and callback parameters are not supported yet.
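A hedged sketch of resuming training state, assuming a strategy constructed and prepared as in the sketches above; the checkpoint path is illustrative:
>>> checkpoint = strategy.resume(
...     'work_dirs/latest',
...     resume_optimizer=True,
...     resume_param_scheduler=True)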