ClassBalancedDataset
- class mmengine.dataset.ClassBalancedDataset(dataset, oversample_thr, lazy_init=False)
A dataset wrapper that performs class-balanced oversampling.
Suitable for training on class-imbalanced datasets like LVIS. Following the sampling strategy in the paper, in each epoch an image may appear multiple times based on its “repeat factor”. The repeat factor for an image is a function of the frequency of the rarest category labeled in that image. The “frequency of category c”, in [0, 1], is defined as the fraction of images in the training set (without repeats) in which category c appears. The wrapped dataset needs to implement get_cat_ids() to support ClassBalancedDataset.
The repeat factor is computed as follows.
For each category c, compute the fraction of images that contain it: \(f(c)\)
For each category c, compute the category-level repeat factor: \(r(c) = \max(1, \sqrt{t / f(c)})\), where \(t\) is oversample_thr
For each image I, compute the image-level repeat factor: \(r(I) = \max_{c \in I} r(c)\)
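The three steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `repeat_factors` and its list-of-lists input are hypothetical, and giving images without any labeled category a factor of 1 is an assumption of this sketch.

```python
import math
from collections import Counter


def repeat_factors(cat_ids_per_image, oversample_thr):
    """Return one repeat factor r(I) per image (hypothetical helper)."""
    num_images = len(cat_ids_per_image)

    # Step 1: f(c) = fraction of images that contain category c.
    image_counts = Counter()
    for cat_ids in cat_ids_per_image:
        image_counts.update(set(cat_ids))  # count each image once per category
    freq = {c: n / num_images for c, n in image_counts.items()}

    # Step 2: r(c) = max(1, sqrt(t / f(c))), with t = oversample_thr.
    cat_repeat = {
        c: max(1.0, math.sqrt(oversample_thr / f)) for c, f in freq.items()
    }

    # Step 3: r(I) = max over the categories in image I.
    # Assumption: images with no categories get a factor of 1.
    return [
        max((cat_repeat[c] for c in set(cat_ids)), default=1.0)
        for cat_ids in cat_ids_per_image
    ]


# Category 0 appears in every image, category 1 in one of four images.
factors = repeat_factors([[0], [0], [0], [0, 1]], oversample_thr=0.5)
```

Here only the last image, which contains the rare category, receives a factor above 1. How a fractional factor is turned into a whole number of repeats (for example, by rounding up) is left out of this sketch.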
Note
ClassBalancedDataset should not inherit from BaseDataset, since get_subset and get_subset_ could produce a sub-dataset with an ambiguous meaning that conflicts with the original dataset. If you want to use a sub-dataset of ClassBalancedDataset, set the indices argument of the wrapped dataset, which inherits from BaseDataset.
- Parameters:
dataset (BaseDataset or dict) – The dataset to be repeated.
oversample_thr (float) – Frequency threshold below which data is repeated. For categories with f_c >= oversample_thr, there is no oversampling. For categories with f_c < oversample_thr, the degree of oversampling follows the square-root inverse frequency heuristic above.
lazy_init (bool, optional) – Whether to load annotation during instantiation. Defaults to False.
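As a rough illustration of how these parameters fit together, the following config-style sketch wraps a dataset and restricts it to a subset through the wrapped dataset's indices argument, as the note above recommends. The dataset type `MyDataset`, the `ann_file` path, and the concrete values are hypothetical placeholders, and the sketch assumes the wrapper is built from a config dict rather than instantiated directly.

```python
# Hypothetical config sketch; names and values are placeholders.
train_dataset = dict(
    type='ClassBalancedDataset',
    oversample_thr=1e-3,  # categories with f(c) below this are oversampled
    lazy_init=False,
    dataset=dict(
        type='MyDataset',                  # hypothetical BaseDataset subclass
        ann_file='annotations/train.json', # hypothetical annotation path
        indices=1000,                      # take the subset on the wrapped dataset
    ),
)
```

Taking the subset on the wrapped dataset keeps its meaning unambiguous: the repeat factors are then computed over exactly the images that remain.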
- get_subset(indices)
Not supported in ClassBalancedDataset because the meaning of a sub-dataset is ambiguous.
- Parameters:
- Return type: