BaseDataset¶
Introduction¶
The Dataset class in the algorithm toolbox is responsible for providing input data for the model during the training/testing process. The Dataset classes in the algorithm toolboxes under the OpenMMLab projects share some common characteristics and requirements, such as the need for an efficient internal data storage format, support for concatenating different datasets, repeated sampling of a dataset, and so on.
Therefore, MMEngine implements BaseDataset, which provides some basic interfaces, and implements several DatasetWrappers with the same interfaces. Most of the Dataset classes in the OpenMMLab algorithm toolboxes meet the interface defined by the BaseDataset and use the same DatasetWrappers.
The basic function of the BaseDataset is to load the dataset information. Here, we divide the dataset information into two categories. One is meta information, which represents information related to the dataset itself and sometimes needs to be obtained by the model or other external components. For example, in an image classification task the meta information of the dataset generally includes the category information classes, since the classification model usually needs to record the category information of the dataset. The other is data information, which defines the file path and corresponding label information of each specific data sample. In addition, another function of the BaseDataset is to continuously send data into the data pipeline for data preprocessing.
The standard data annotation file¶
In order to unify the dataset interfaces of different tasks and facilitate multi-task training in one model, OpenMMLab formulates the OpenMMLab 2.0 dataset format specification. Dataset annotation files should conform to this specification, and the BaseDataset reads and parses data annotation files based on it. If the data annotation file provided by the user does not conform to the specified format, the user can convert it to the specified format and then use OpenMMLab's algorithm toolboxes to conduct algorithm training and testing based on the converted annotation file.
The OpenMMLab 2.0 dataset format specification states that annotation files must be in json, yaml/yml, or pickle/pkl format. The dictionary stored in the annotation file must contain two fields, metainfo and data_list. The metainfo is a dictionary containing meta information about the dataset. The data_list is a list in which each element is a dictionary that defines a raw data info. Each raw data info contains one or more training/test samples.
Here is an example of a JSON annotation file (where each raw data info contains only one training/test sample):
{
    "metainfo":
        {
            "classes": ["cat", "dog"]
        },
    "data_list":
        [
            {
                "img_path": "xxx/xxx_0.jpg",
                "img_label": 0
            },
            {
                "img_path": "xxx/xxx_1.jpg",
                "img_label": 1
            }
        ]
}
We assume that the data is stored in the following path:
data
├── annotations
│ ├── train.json
├── train
│ ├── xxx/xxx_0.jpg
│ ├── xxx/xxx_1.jpg
│ ├── ...
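For reference, here is a minimal sketch of producing such an annotation file programmatically, for example when converting an existing annotation format into the OpenMMLab 2.0 format. The content and paths below simply mirror the toy example above:
import json

# Build the annotation dictionary with the two required fields,
# `metainfo` and `data_list`, mirroring the JSON example above.
ann = dict(
    metainfo=dict(classes=['cat', 'dog']),
    data_list=[
        dict(img_path='xxx/xxx_0.jpg', img_label=0),
        dict(img_path='xxx/xxx_1.jpg', img_label=1),
    ])

# Write it to the annotation path assumed in this document.
with open('data/annotations/train.json', 'w') as f:
    json.dump(ann, f, indent=4)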
The initialization process of the BaseDataset¶
The initialization process of the BaseDataset is shown as follows:

1. load metainfo: Obtain the meta information of the dataset. The meta information can be obtained from three sources, with priority from high to low:

- The dict of metainfo passed by the user in the __init__() function. The priority is high since the user can pass this argument when the BaseDataset is instantiated;
- The dict of BaseDataset.METAINFO in the class attributes of BaseDataset. The priority is medium since the user can change the class attribute BaseDataset.METAINFO in a custom dataset class;
- The dict of metainfo included in the annotation file. The priority is low since the annotation file is generally not changed.

If the three sources contain the same field, the source with the highest priority determines the value of that field. The priority comparison of these fields is: the fields in the metainfo dictionary passed by the user > the fields in BaseDataset.METAINFO of the BaseDataset > the fields in the metainfo of the annotation file. (A minimal sketch of this priority is shown after this list.)

2. join path: Process the paths of the data information and annotation files;

3. build pipeline: Build the data pipeline for data preprocessing and data preparation;

4. full init: Fully initialize the BaseDataset. This step mainly includes the following operations:

- load data list: Read and parse the annotation files that meet the OpenMMLab 2.0 dataset format specification. In this step, the parse_data_info() method is called, which is responsible for parsing each raw data info in the annotation file;
- filter data (optional): Filter unnecessary data based on filter_cfg, such as data samples that do not contain annotations. By default there is no filtering operation, and downstream subclasses can override it according to their own needs;
- get subset (optional): Sample a subset of the dataset based on a given index or an integer value, such as using only the first 10 samples for training/testing. By default, all data samples are used;
- serialize data (optional): Serialize all data samples to save memory. Please see Save memory for more details. We serialize all data samples by default.
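As a quick illustration of the metainfo priority in step 1, the minimal sketch below defines a hypothetical dataset class with a METAINFO class attribute and then overrides it with the metainfo argument at instantiation time (lazy_init=True is used so that no annotation file is actually read):
from mmengine.dataset import BaseDataset


class ExampleDataset(BaseDataset):
    # Medium priority: class attribute of the custom dataset class
    METAINFO = dict(classes=('cat', 'dog'))


# Highest priority: the `metainfo` argument passed to `__init__()` overrides
# both `ExampleDataset.METAINFO` and the `metainfo` field of the annotation
# file for the same field.
example_dataset = ExampleDataset(
    ann_file='annotations/train.json',
    metainfo=dict(classes=('cat', 'dog', 'bird')),
    lazy_init=True)

print(example_dataset.metainfo)  # the 'classes' field is ('cat', 'dog', 'bird')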
The parse_data_info() method in the BaseDataset is used to process a raw data info in the annotation file into one or more training/test data samples. Users need to implement the parse_data_info() method if they want to customize a dataset class.
The interface of BaseDataset¶
Once the BaseDataset is initialized, it supports the __getitem__ method to index a data sample and the __len__ method to get the length of the dataset, just like torch.utils.data.Dataset. The BaseDataset provides the following interfaces:

- metainfo: Return the meta information as a dictionary.
- get_data_info(idx): Return the full data information of the given idx; the return value is a dictionary.
- __getitem__(idx): Return the result of the data pipeline (the input data of the model) for the given idx; the return value is a dictionary.
- __len__(): Return the length of the dataset; the return value is an integer.
- get_subset_(indices): Modify the original dataset class in place according to indices. If indices is an int, the original dataset class contains only the first few data samples. If indices is a Sequence[int], the original dataset class contains the data samples specified by Sequence[int].
- get_subset(indices): Return a new sub-dataset class according to indices, i.e., re-copy a sub-dataset. If indices is an int, the returned sub-dataset object contains only the first few data samples. If indices is a Sequence[int], the returned sub-dataset object contains the data samples specified by Sequence[int].
Customize dataset class based on BaseDataset¶
After understanding the initialization process and the interfaces of the BaseDataset, we can customize a dataset class based on it.
Annotation files that meet the OpenMMLab 2.0 dataset format specification¶
As mentioned above, users can override parse_data_info() to load annotation files that meet the OpenMMLab 2.0 dataset format specification. Here is an example of using the BaseDataset to implement a specific dataset.
import os.path as osp

from mmengine.dataset import BaseDataset


class ToyDataset(BaseDataset):

    # Take the above annotation file as an example. The raw_data_info
    # represents one dictionary in the data_list list:
    # {
    #     'img_path': "xxx/xxx_0.jpg",
    #     'img_label': 0,
    #     ...
    # }
    def parse_data_info(self, raw_data_info):
        data_info = raw_data_info
        img_prefix = self.data_prefix.get('img_path', None)
        if img_prefix is not None:
            data_info['img_path'] = osp.join(
                img_prefix, data_info['img_path'])
        return data_info
Using Customized dataset class¶
Once ToyDataset has been defined, it can be instantiated with the following configuration:
import cv2


class LoadImage:

    def __call__(self, results):
        results['img'] = cv2.imread(results['img_path'])
        return results


class ParseImage:

    def __call__(self, results):
        results['img_shape'] = results['img'].shape
        return results


pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)
At the same time, the external interface provided by the BaseDataset can be used to access specific data sample information:
toy_dataset.metainfo
# dict(classes=('cat', 'dog'))
toy_dataset.get_data_info(0)
# {
# 'img_path': "data/train/xxx/xxx_0.jpg",
# 'img_label': 0,
# ...
# }
len(toy_dataset)
# 2
toy_dataset[0]
# {
# 'img_path': "data/train/xxx/xxx_0.jpg",
# 'img_label': 0,
# 'img': an ndarray with shape (H, W, 3), which denotes the value of the image,
# 'img_shape': (H, W, 3),
# ...
# }
# The `get_subset` interface does not modify the original dataset class, i.e., it makes a complete copy of it
sub_toy_dataset = toy_dataset.get_subset(1)
len(toy_dataset), len(sub_toy_dataset)
# 2, 1
# The `get_subset_` interface modifies the original dataset class in place
toy_dataset.get_subset_(1)
len(toy_dataset)
# 1
Following the above steps, we can see how to customize a dataset based on the BaseDataset and how to use the customized dataset.
Customize dataset for videos¶
In the above examples, each raw data info in the annotation file contains only one training/test sample (usually in the image domain). If each raw data info contains several training/test samples (usually in the video domain), we only need to ensure that the return value of parse_data_info() is list[dict]:
from mmengine.dataset import BaseDataset


class ToyVideoDataset(BaseDataset):

    # raw_data_info is still a dict, but it contains multiple samples
    def parse_data_info(self, raw_data_info):
        data_list = []
        ...
        for ... :
            data_info = dict()
            ...
            data_list.append(data_info)
        return data_list
The usage of ToyVideoDataset
is similar to that of ToyDataset
, which will not be repeated here.
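For concreteness, here is a minimal sketch under an assumed annotation layout (the video_path and frame_labels keys below are hypothetical, not part of the specification), where each raw data info describes one video and every frame becomes one training/test sample:
import os.path as osp

from mmengine.dataset import BaseDataset


class ToyFrameDataset(BaseDataset):

    # Hypothetical raw data info layout in the annotation file:
    # {
    #     'video_path': 'xxx/xxx_0.mp4',
    #     'frame_labels': [0, 0, 1]   # one label per frame
    # }
    def parse_data_info(self, raw_data_info):
        video_path = raw_data_info['video_path']
        video_prefix = self.data_prefix.get('video_path', None)
        if video_prefix is not None:
            video_path = osp.join(video_prefix, video_path)

        # One raw data info yields one data sample per frame,
        # so the return value is a list of dicts.
        data_list = []
        for frame_idx, frame_label in enumerate(raw_data_info['frame_labels']):
            data_list.append(
                dict(video_path=video_path,
                     frame_idx=frame_idx,
                     frame_label=frame_label))
        return data_list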
Annotation files that do not meet the OpenMMLab 2.0 dataset format specification¶
For annotation files that do not meet the OpenMMLab 2.0 dataset format specification, there are two ways to use them:

1. Convert the annotation files that do not meet the specification into annotation files that do, and then use the BaseDataset in the way described above.
2. Implement a new dataset class that inherits from the BaseDataset and overrides the load_data_list(self) method of the BaseDataset to handle annotation files that do not meet the specification, guaranteeing a return value of list[dict], where each dict represents a data sample (as shown in the sketch below).
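Here is a minimal sketch of the second approach. It assumes a hypothetical plain-text annotation format with one sample per line, "<relative image path> <integer label>"; the actual parsing logic depends on your own annotation format:
import os.path as osp

from mmengine.dataset import BaseDataset


class PlainTextToyDataset(BaseDataset):

    # Override load_data_list() instead of parse_data_info(): the annotation
    # file is not in the OpenMMLab 2.0 format, so we parse it ourselves and
    # return a list of dicts, one dict per data sample.
    def load_data_list(self):
        img_prefix = self.data_prefix.get('img_path', '')
        data_list = []
        # self.ann_file has already been joined with data_root at this point.
        with open(self.ann_file) as f:
            for line in f:
                img_path, img_label = line.strip().split()
                data_list.append(
                    dict(img_path=osp.join(img_prefix, img_path),
                         img_label=int(img_label)))
        return data_list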
Other features of BaseDataset¶
The BaseDataset also contains the following features:
lazy init¶
When the BaseDataset is instantiated, the annotation file needs to be read and parsed, which takes some time. However, in some cases, such as the visualization of predictions, only the meta information of the BaseDataset is required, and reading and parsing the annotation file may not be necessary. To save time on instantiating the BaseDataset in such cases, the BaseDataset supports lazy init:
pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline,
    # Pass the lazy_init argument here
    lazy_init=True)
When lazy_init=True, the initialization of ToyDataset only performs steps 1, 2, and 3 of the BaseDataset initialization process. At this point, toy_dataset is not fully initialized, since it has not read and parsed the annotation file; it has only set the meta information of the dataset (metainfo).
Naturally, if you need to access specific data information later, you can manually call the toy_dataset.full_init() interface to perform the complete initialization process, during which the data annotation file will be read and parsed. Calling the get_data_info(idx), __len__(), __getitem__(idx), get_subset_(indices), and get_subset(indices) interfaces will also automatically trigger full_init() to perform the full initialization process (only on the first call; later calls will not call full_init() repeatedly):
# Full initialization
toy_dataset.full_init()
# After initialization, you can now get the data info
len(toy_dataset)
# 2
toy_dataset[0]
# {
# 'img_path': "data/train/xxx/xxx_0.jpg",
# 'img_label': 0,
# 'img': an ndarray with shape (H, W, 3), which denotes the value of the image,
# 'img_shape': (H, W, 3),
# ...
# }
Notice:
Performing full initialization by calling the __getitem__() interface directly carries some risk: if a dataset object is first created with lazy_init=True (so it is not fully initialized) and then directly sent to the dataloader, different dataloader workers will each read and parse the annotation file during the subsequent data loading. Although this may work normally, it consumes a lot of time and memory. Therefore, it is recommended to manually call the full_init() interface to perform the full initialization process before you need to access specific data.
This pattern of deferring full initialization with lazy_init=True and then completing the initialization on demand is called lazy init.
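The sketch below illustrates the recommended pattern under the assumption that the dataset will be consumed by a torch DataLoader with multiple workers: finish the full initialization in the main process before handing the dataset over, so that each worker does not read and parse the annotation file on its own:
from torch.utils.data import DataLoader

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline,
    lazy_init=True)

# Complete the initialization once in the main process; the workers spawned
# by the dataloader then reuse the already-parsed data_list instead of
# parsing the annotation file again.
toy_dataset.full_init()

dataloader = DataLoader(toy_dataset, batch_size=1, num_workers=2)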
Save memory¶
During data loading, the dataloader usually prefetches data with multiple dataloader workers, and each worker holds a complete copy of the dataset object, so there will be multiple copies of the same data_list in memory. To save this memory, the BaseDataset can serialize data_list into memory in advance, so that multiple workers can share the same copy of data_list.
By default, the BaseDataset stores the serialized data_list in memory. Whether the data is serialized into memory ahead of time can also be controlled via the serialize_data argument (True by default):
pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline,
    # Pass the serialize_data argument here
    serialize_data=False)
The above example does not serialize data_list into memory in advance, so instantiating the dataset class this way is not recommended when the dataloader loads data with multiple dataloader workers.
DatasetWrappers¶
In addition to BaseDataset, MMEngine also provides several DatasetWrappers: ConcatDataset
, RepeatDataset
, ClassBalancedDataset
. These dataset wrappers also support lazy init and have memory-saving features.
ConcatDataset¶
MMEngine provides a ConcatDataset
wrapper to concatenate datasets in the following way:
from mmengine.dataset import ConcatDataset

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset_1 = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)

toy_dataset_2 = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='val/'),
    ann_file='annotations/val.json',
    pipeline=pipeline)

toy_dataset_12 = ConcatDataset(datasets=[toy_dataset_1, toy_dataset_2])
The above example combines the train
set and the val
set of the dataset into one large dataset.
RepeatDataset¶
MMEngine provides RepeatDataset
wrapper to repeat a dataset several times, as follows:
from mmengine.dataset import RepeatDataset

pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)

toy_dataset_repeat = RepeatDataset(dataset=toy_dataset, times=5)
The above example repeats the train set of the dataset five times.
ClassBalancedDataset¶
MMEngine provides ClassBalancedDataset
wrapper to repeatedly sample the corresponding samples based on the frequency of category occurrence in the dataset.
Notice:
The ClassBalancedDataset wrapper assumes that the wrapped dataset class supports the get_cat_ids(idx) method, which returns a list containing the categories of the data_info given by idx. It is used as follows:
import os.path as osp

from mmengine.dataset import BaseDataset, ClassBalancedDataset


class ToyDataset(BaseDataset):

    def parse_data_info(self, raw_data_info):
        data_info = raw_data_info
        img_prefix = self.data_prefix.get('img_path', None)
        if img_prefix is not None:
            data_info['img_path'] = osp.join(
                img_prefix, data_info['img_path'])
        return data_info

    # The necessary method, which returns the categories of a data sample
    def get_cat_ids(self, idx):
        data_info = self.get_data_info(idx)
        return [int(data_info['img_label'])]


pipeline = [
    LoadImage(),
    ParseImage(),
]

toy_dataset = ToyDataset(
    data_root='data/',
    data_prefix=dict(img_path='train/'),
    ann_file='annotations/train.json',
    pipeline=pipeline)

toy_dataset_repeat = ClassBalancedDataset(dataset=toy_dataset, oversample_thr=1e-3)
The above example resamples the train set of the dataset with oversample_thr=1e-3. Specifically, for categories whose frequency in the dataset is lower than 1e-3, the samples containing those categories are sampled repeatedly; otherwise, they are not repeated. Please refer to the API documentation of ClassBalancedDataset for the specific sampling policy.
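As a rough, hedged sketch of the underlying idea (the API documentation remains the authoritative description), the LVIS-style repeat-factor policy that this kind of class-balanced sampling is based on computes a per-category repeat factor max(1, sqrt(oversample_thr / freq)), so only categories rarer than the threshold are repeated:
import math

# Toy numbers for illustration only: category 0 appears in 99.9% of the
# samples, category 1 in only 0.1% of the samples.
oversample_thr = 1e-2
category_freq = {0: 0.999, 1: 0.001}

# Per-category repeat factor: max(1, sqrt(oversample_thr / freq))
repeat_factors = {
    cat_id: max(1.0, math.sqrt(oversample_thr / freq))
    for cat_id, freq in category_freq.items()
}
print(repeat_factors)
# {0: 1.0, 1: ~3.16}: samples containing only category 0 are kept as-is,
# while samples containing the rare category 1 are repeated roughly three times.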
Customize DatasetWrapper¶
Since the BaseDataset supports lazy init, some rules need to be followed when customizing a DatasetWrapper. The following example shows how to customize a DatasetWrapper:
import copy
import warnings

from mmengine.dataset import BaseDataset, force_full_init
from mmengine.registry import DATASETS


@DATASETS.register_module()
class ExampleDatasetWrapper:

    def __init__(self, dataset, lazy_init=False, ...):

        # Build the source dataset (self.dataset)
        if isinstance(dataset, dict):
            self.dataset = DATASETS.build(dataset)
        elif isinstance(dataset, BaseDataset):
            self.dataset = dataset
        else:
            raise TypeError(
                'elements in datasets sequence should be config or '
                f'`BaseDataset` instance, but got {type(dataset)}')
        # Record the meta information of the source dataset
        self._metainfo = self.dataset.metainfo

        '''
        1. Implement some code here to record some of the hyperparameters used to wrap the dataset.
        '''

        self._fully_initialized = False
        if not lazy_init:
            self.full_init()

    def full_init(self):
        if self._fully_initialized:
            return

        # Initialize the source dataset completely
        self.dataset.full_init()

        '''
        2. Implement some code here to wrap the source dataset.
        '''

        self._fully_initialized = True

    @force_full_init
    def _get_ori_dataset_idx(self, idx: int):

        '''
        3. Implement some code here to map the wrapped index `idx` to the index of the source dataset `ori_idx`.
        '''
        ori_idx = ...

        return ori_idx

    # Provide the same external interface as `self.dataset`.
    @force_full_init
    def get_data_info(self, idx):
        sample_idx = self._get_ori_dataset_idx(idx)
        return self.dataset.get_data_info(sample_idx)

    # Provide the same external interface as `self.dataset`.
    def __getitem__(self, idx):
        if not self._fully_initialized:
            warnings.warn('Please call `full_init` method manually to '
                          'accelerate the speed.')
            self.full_init()

        sample_idx = self._get_ori_dataset_idx(idx)
        return self.dataset[sample_idx]

    # Provide the same external interface as `self.dataset`.
    @force_full_init
    def __len__(self):

        '''
        4. Implement some code here to calculate the length of the wrapped dataset.
        '''
        len_wrapper = ...

        return len_wrapper

    # Provide the same external interface as `self.dataset`.
    @property
    def metainfo(self):
        return copy.deepcopy(self._metainfo)