DeepSpeedStrategy

class mmengine._strategy.DeepSpeedStrategy(*, config=None, zero_optimization=None, gradient_clipping=None, fp16=None, inputs_to_half=None, bf16=None, amp=None, activation_checkpointing=None, aio=None, train_micro_batch_size_per_gpu=None, gradient_accumulation_steps=None, steps_per_print=10000000000000, exclude_frozen_parameters=None, **kwargs)[source]

Support training models with DeepSpeed.

Note

The detailed usage of parameters can be found at https://www.deepspeed.ai/docs/config-json/.

Parameters:
  • config (str or dict, optional) – If a string, it is treated as a path to a DeepSpeed config file; if a dict, it is used directly as the DeepSpeed config. Defaults to None.

  • zero_optimization (dict, optional) – Enabling and configuring ZeRO memory optimizations. Defaults to None.

  • gradient_clipping (float, optional) – Enable gradient clipping with value. Defaults to None.

  • fp16 (dict, optional) – Configuration for using mixed precision/FP16 training that leverages NVIDIA’s Apex package. Defaults to None.

  • inputs_to_half (list[int or str], optional) – Which inputs are to be converted to half precision. Defaults to None. If fp16 is enabled, this should also be set.

  • bf16 (dict, optional) – Configuration for using bfloat16 floating-point format as an alternative to FP16. Defaults to None.

  • amp (dict, optional) – Configuration for using automatic mixed precision (AMP) training that leverages NVIDIA’s Apex AMP package. Defaults to None.

  • activation_checkpointing (dict, optional) – Reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. Defaults to None.

  • aio (dict, optional) – Configuring the asynchronous I/O module for offloading parameter and optimizer states to persistent (NVMe) storage. This module uses Linux native asynchronous I/O (libaio). Defaults to None.

  • train_micro_batch_size_per_gpu (int, optional) – Batch size to be processed by one GPU in one step (without gradient accumulation). Defaults to None.

  • gradient_accumulation_steps (int, optional) – Number of training steps to accumulate gradients before averaging and applying them. Defaults to None.

  • exclude_frozen_parameters (bool, optional) – Whether to exclude frozen parameters from the saved checkpoint. Defaults to None.

  • steps_per_print (int) – Print a DeepSpeed progress report every this many training steps. Defaults to 10000000000000, which effectively disables the report.
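
Example (a minimal construction sketch; the config values below are illustrative, not MMEngine defaults):

    from mmengine._strategy import DeepSpeedStrategy

    # Illustrative settings; tune them for your model and hardware.
    strategy = DeepSpeedStrategy(
        fp16=dict(enabled=True, loss_scale=0, initial_scale_power=16),
        inputs_to_half=[0],  # cast the first positional input to half precision
        zero_optimization=dict(stage=2, overlap_comm=True),
        gradient_clipping=1.0,
        train_micro_batch_size_per_gpu=2,
        gradient_accumulation_steps=4,
    )

In practice the strategy is usually selected through FlexibleRunner, e.g. strategy=dict(type='DeepSpeedStrategy', fp16=dict(enabled=True)).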

load_checkpoint(filename, *, map_location='cpu', strict=False, revise_keys=[('^module.', '')], callback=None)[source]

Load checkpoint from given filename.

Warning

map_location and callback parameters are not supported yet.

Parameters:
  • filename (str) – Accepts a local filepath, a URL, or a path of the form torchvision://xxx or open-mmlab://xxx.

  • map_location (str | Callable) – How to remap storage locations when loading. Defaults to 'cpu'. Not supported yet (see warning above).

  • strict (bool) – Whether to strictly enforce that the keys in the checkpoint match the keys of the model's state_dict. Defaults to False.

  • revise_keys (list) – A list of (pattern, replacement) pairs of regular expressions used to modify the state_dict keys in the checkpoint. Defaults to [('^module.', '')], which strips the 'module.' prefix.

  • callback (Callable | None) – Callback to modify the checkpoint after loading. Defaults to None. Not supported yet (see warning above).

Return type:

dict
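
Example (illustrative path; the strategy should already hold a prepared model to load the weights into):

    ckpt = strategy.load_checkpoint('work_dirs/epoch_1.pth')
    # `ckpt` is the loaded checkpoint dict.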

prepare(model, *, optim_wrapper=None, param_scheduler=None, compile=False, dispatch_kwargs=None)[source]

Prepare the model and associated training components.

Parameters:
  • model (torch.nn.Module or dict) – The model to be run. It can be a dict used for building a model.

Keyword Arguments:
  • optim_wrapper (BaseOptimWrapper or dict, optional) – The optimizer wrapper used to compute gradients of the model parameters and update them. Defaults to None. See build_optim_wrapper() for examples.

  • param_scheduler (_ParamScheduler or dict or list, optional) – Parameter scheduler for updating optimizer parameters. If specified, optim_wrapper should also be specified. Defaults to None. See build_param_scheduler() for examples.

  • compile (dict or bool, optional) – Config for compiling the model. Defaults to False (no compilation). Requires PyTorch >= 2.0.

  • dispatch_kwargs (dict, optional) – Kwargs to be passed to other methods of Strategy. Defaults to None.
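
Example (a minimal sketch; in practice FlexibleRunner calls prepare() and fills dispatch_kwargs for you, so the toy model and the DeepSpeedOptimWrapper config below are illustrative):

    import torch.nn as nn

    model = nn.Linear(8, 2)  # toy model
    prepared = strategy.prepare(
        model,
        optim_wrapper=dict(
            type='DeepSpeedOptimWrapper',
            optimizer=dict(type='AdamW', lr=1e-3)),
    )
    # `prepared` holds the DeepSpeed-wrapped model plus the built optimizer
    # wrapper (and schedulers, if a param_scheduler was given).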

resume(filename, *, resume_optimizer=True, resume_param_scheduler=True, map_location='default', callback=None)[source]

Resume training from given filename.

Warning

map_location and callback parameters are not supported yet.

Parameters:
  • filename (str) – Accepts a local filepath.

Keyword Arguments:
  • resume_optimizer (bool) – Whether to resume optimizer state. Defaults to True.

  • resume_param_scheduler (bool) – Whether to resume param scheduler state. Defaults to True.

  • map_location (str | Callable) – How to remap storage locations when loading. Defaults to 'default'. Not supported yet (see warning above).

  • callback (Callable | None) – Callback to modify the checkpoint after loading. Defaults to None. Not supported yet (see warning above).

Return type:

dict
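
Example (illustrative path; call after prepare() so optimizer and scheduler state can be restored):

    ckpt = strategy.resume('work_dirs/epoch_1.pth')
    # Optimizer and param scheduler state are restored by default; pass
    # resume_optimizer=False / resume_param_scheduler=False to skip them.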

save_checkpoint(filename, *, save_optimizer=True, save_param_scheduler=True, extra_ckpt=None, callback=None)[source]

Save checkpoint to given filename.

Warning

callback parameter is not supported yet.

Parameters:
  • filename (str) – Filename to save checkpoint.

Keyword Arguments:
  • save_optimizer (bool) – Whether to save the optimizer to the checkpoint. Defaults to True.

  • save_param_scheduler (bool) – Whether to save the param_scheduler to the checkpoint. Defaults to True.

  • extra_ckpt (dict, optional) – Extra checkpoint to save. Defaults to None.

  • callback (Callable | None) – Callback to modify the checkpoint before saving. Defaults to None. Not supported yet (see warning above).

Return type:

None
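
Example (illustrative filename; the extra_ckpt contents are an assumption, shown only to illustrate attaching user meta):

    strategy.save_checkpoint(
        'work_dirs/epoch_1.pth',
        extra_ckpt=dict(meta=dict(epoch=1, iter=1000)),
    )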