datasets
- trojanzoo.datasets.add_argument(parser, dataset_name=None, dataset=None, config=config, class_dict={})
- Add dataset arguments to argument parser. For the specific arguments, see Dataset.add_argument().
- Parameters:
parser (argparse.ArgumentParser) – The parser to add arguments to.
dataset_name (str) – The dataset name.
dataset (str | Dataset) – Dataset instance or dataset name (as an alias of dataset_name).
config (Config) – The default parameter config, which contains the default dataset name if not provided.
class_dict (dict[str, type[Dataset]]) – Map from dataset name to dataset class. Defaults to {}.
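A usage sketch, assuming a concrete subclass MyDataset (see the sketch under Dataset below) registered under a made-up name:

import argparse

import trojanzoo.datasets

parser = argparse.ArgumentParser()
# 'mydata' and MyDataset are illustrative; real code registers its own classes.
trojanzoo.datasets.add_argument(parser, dataset_name='mydata',
                                class_dict={'mydata': MyDataset})
args = parser.parse_args()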
- trojanzoo.datasets.create(dataset_name=None, dataset=None, config=config, class_dict={}, **kwargs)
- Create a dataset instance. For arguments not included in kwargs, use the default values in config. The default value of folder_path is '{data_dir}/{data_type}/{name}'. For dataset implementation, see Dataset.
- Parameters:
dataset_name (str) – The dataset name.
dataset (str | Dataset) – Dataset instance or dataset name (as an alias of dataset_name).
config (Config) – The default parameter config, which contains the default dataset name if not provided.
class_dict (dict[str, type[Dataset]]) – Map from dataset name to dataset class. Defaults to {}.
**kwargs – Keyword arguments passed to the dataset init method.
- Returns:
Dataset – Dataset instance.
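A creation sketch under the same assumptions (MyDataset and its registry name are illustrative; batch_size is an arbitrary override of the config default):

import trojanzoo.datasets

dataset = trojanzoo.datasets.create(dataset_name='mydata',
                                    class_dict={'mydata': MyDataset},
                                    batch_size=128)
train_loader = dataset.loader['train']   # preset dataloader (see Variables below)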
- class trojanzoo.datasets.Dataset(batch_size=100, valid_batch_size=100, folder_path=None, download=False, split_ratio=0.8, num_workers=4, loss_weights=False, **kwargs)
- An abstract class representing a dataset. It inherits trojanzoo.utils.module.BasicObject.
Note
This is the implementation of dataset. For users, please use create() instead, which is more user-friendly. A minimal subclass sketch follows the variable list below.
- Parameters:
batch_size (int) – Batch size of the training set (a negative number means batch size per GPU). Defaults to 100.
valid_batch_size (int) – Batch size of the validation set. Defaults to 100.
folder_path (str) – Folder path to store the dataset. Defaults to None.
Note
folder_path is usually '{data_dir}/{data_type}/{name}', which is claimed as the default value of create().
download (bool) – Download the dataset if it does not exist. Defaults to False.
split_ratio (float) – Split the training set for training and validation if valid_set is False. The ratio stands for the fraction of the original training set kept for training. Defaults to 0.8.
num_workers (int) – Used in get_dataloader(). Defaults to 4.
loss_weights (bool | np.ndarray | torch.Tensor) – The loss weights w.r.t. each class. If False, loss_weights is set to None.
**kwargs – Any keyword argument (unused).
- Variables:
name (str) – Dataset name. (need overriding)
loader (dict[str, DataLoader]) – Preset dataloader for users at dataset initialization. It contains 'train' and 'valid' loaders.
batch_size (int) – Batch size of the training set (always positive). Defaults to 100.
valid_batch_size (int) – Batch size of the validation set. Defaults to 100.
num_classes (int) – Number of classes. (need overriding)
folder_path (str) – Folder path to store the dataset. Defaults to None.
data_type (str) – Data type (e.g., 'image'). (need overriding)
valid_set (bool) – Whether there is a native validation set. Defaults to True.
split_ratio (float) – Split the training set for training and validation if valid_set is False. The ratio stands for the fraction of the original training set kept for training. Defaults to 0.8.
loss_weights (torch.Tensor | None) – The loss weights w.r.t. each class.
num_workers (int) – Used in get_dataloader(). Defaults to 4.
collate_fn (Callable | None) – Used in get_dataloader(). Defaults to None.
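The members marked "(need overriding)" above are typically filled in by a concrete subclass. A minimal sketch, assuming a torchvision MNIST backend; the class name and attribute values are illustrative, and the override signatures are inferred from get_transform() and get_org_dataset() below:

import torchvision.datasets
import torchvision.transforms as transforms

from trojanzoo.datasets import Dataset


class MyDataset(Dataset):
    # Illustrative values for the members marked "(need overriding)".
    name = 'mydata'
    data_type = 'image'
    num_classes = 10

    def get_transform(self, mode):
        # A single transform for every mode in this sketch.
        return transforms.ToTensor()

    def _get_org_dataset(self, mode, transform=None, **kwargs):
        # MNIST serves as a stand-in storage backend here.
        return torchvision.datasets.MNIST(
            root=self.folder_path, train=(mode == 'train'),
            transform=transform, download=True, **kwargs)

An instance would then be built via create(dataset_name='mydata', class_dict={'mydata': MyDataset}, folder_path='./data'), with the path again only an example.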
- classmethod add_argument(group)
Add dataset arguments to argument parser group. View source to see specific arguments.
Note
This is the implementation of adding arguments. The concrete dataset class may override this method to add more arguments. For users, please use add_argument() instead, which is more user-friendly.
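The override pattern the note mentions might look like this sketch (the extra flag is hypothetical):

import argparse

from trojanzoo.datasets import Dataset


class MyDataset(Dataset):
    @classmethod
    def add_argument(cls, group: argparse._ArgumentGroup):
        super().add_argument(group)   # keep the base dataset arguments
        # Hypothetical dataset-specific option.
        group.add_argument('--my_option', type=int, default=0)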
- check_files(**kwargs)
Check if the dataset files are prepared.
- Parameters:
**kwargs – Keyword arguments passed to get_org_dataset().
- Returns:
bool – Whether the dataset files are prepared.
- static get_class_subset(dataset, class_list)
Get a subset from dataset with certain classes.
- Parameters:
dataset (torch.utils.data.Dataset) – The entire dataset.
class_list (int | list[int]) – The class list to pick.
- Returns:
torch.utils.data.Subset – The subset with labels in class_list.
- Example:
>>> from trojanzoo.utils.data import TensorListDataset
>>> from trojanzoo.utils.data import get_class_subset
>>> import torch
>>>
>>> data = torch.ones(11, 3, 32, 32)
>>> targets = list(range(11))
>>> dataset = TensorListDataset(data, targets)
>>> subset = get_class_subset(dataset, class_list=[2, 3])
>>> len(subset)
2
See also
The implementation is in trojanzoo.utils.data.get_class_subset().
- get_data(data, **kwargs)
Process data. Defaults to directly returning data.
- Parameters:
data (Any) – Unprocessed data.
**kwargs – Keyword arguments to process data.
- Returns:
Any – Processed data.
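A common reason to override get_data, sketched below, is device placement; the (input, label) unpacking is an assumption about the batch layout, not part of the base API:

import torch

from trojanzoo.datasets import Dataset


class MyDataset(Dataset):
    def get_data(self, data, **kwargs):
        _input, _label = data   # assumed batch layout
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        return _input.to(device), _label.to(device)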
- get_dataloader(mode=None, dataset=None, batch_size=None, shuffle=None, num_workers=None, pin_memory=True, drop_last=False, collate_fn=None, **kwargs)
Get dataloader. Call get_dataset() if dataset is not provided.
- Parameters:
mode (str) – Dataset mode (e.g., 'train' or 'valid').
dataset (torch.utils.data.Dataset) – The pytorch dataset.
batch_size (int) – Defaults to self.batch_size for 'train' mode and self.valid_batch_size for 'valid' mode.
shuffle (bool) – Whether to shuffle. Defaults to True for 'train' mode and False for 'valid' mode.
num_workers (int) – Number of workers for dataloader. Defaults to self.num_workers.
pin_memory (bool) – Whether to use pin memory. Defaults to True if there is any GPU available.
drop_last (bool) – Whether to drop the last batch if it is not full size. Defaults to False.
collate_fn (Callable) – Passed to torch.utils.data.DataLoader.
**kwargs – Keyword arguments passed to get_dataset() if dataset is not provided.
- Returns:
torch.utils.data.DataLoader – The pytorch dataloader.
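A usage sketch, with dataset an instance of a concrete subclass and arbitrary example values:

loader = dataset.get_dataloader(mode='valid', batch_size=32, num_workers=0)
for _input, _label in loader:   # assumes (input, label) batches
    print(_input.shape, _label.shape)
    break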
- get_dataset(mode=None, seed=None, class_list=None, **kwargs)
Get dataset. Call split_dataset() to split the training set if valid_set is False.
- Parameters:
mode (str) – Dataset mode (e.g., 'train' or 'valid').
seed (int) – The random seed to split dataset using numpy.random.shuffle. Defaults to env['data_seed'].
class_list (int | list[int]) – The class list to pick. Defaults to None.
**kwargs – Keyword arguments passed to get_org_dataset().
- Returns:
torch.utils.data.Dataset – The dataset.
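For example (the class indices are arbitrary):

# Pick only samples whose labels are 0 or 1 from the validation split.
two_class_set = dataset.get_dataset(mode='valid', class_list=[0, 1])
print(len(two_class_set))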
- get_loss_weights(file_path=None, verbose=True)
Calculate loss_weights as the reciprocal of the data size of each class (to mitigate data imbalance).
- Parameters:
file_path (str) – Optional file path to cache the computed weights. Defaults to None.
verbose (bool) – Whether to print verbose output. Defaults to True.
- Returns:
torch.Tensor – The tensor of loss weights w.r.t. each class.
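The weighting rule reduces to the following standalone sketch (the helper name is illustrative, not part of the API):

import torch

def reciprocal_class_weights(targets: list[int], num_classes: int) -> torch.Tensor:
    # Count samples per class, then weight each class by 1 / count
    # so that under-represented classes contribute more to the loss.
    counts = torch.bincount(torch.tensor(targets), minlength=num_classes).float()
    return counts.reciprocal()

weights = reciprocal_class_weights([0, 0, 0, 1], num_classes=2)
print(weights)   # tensor([0.3333, 1.0000])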
- get_org_dataset(mode, **kwargs)
Get the original dataset that is not split.
Note
This is a wrapper and the specific implementation is in _get_org_dataset(), which needs overriding.
- Parameters:
mode (str) – Dataset mode (e.g., 'train' or 'valid').
transform (Callable) – The transform applied on the dataset. Defaults to get_transform().
**kwargs – Keyword arguments passed to _get_org_dataset().
- Returns:
torch.utils.data.Dataset – The original dataset.
- abstract get_transform(mode)
Get dataset transform for mode.
- Parameters:
mode (str) – Dataset mode (e.g., 'train' or 'valid').
- Returns:
collections.abc.Callable – A callable transform.
- initialize(*args, **kwargs)
Initialize the dataset (download and extract) if it's not prepared yet (need overriding).
- static split_dataset(dataset, length=None, percent=None, shuffle=True, seed=None)
Split a dataset into two subsets.
- Parameters:
dataset (torch.utils.data.Dataset) – The dataset to split.
length (int) – The length of the first subset. This argument cannot be used together with percent. If None, use percent to calculate length instead. Defaults to None.
percent (float) – The split ratio for the first subset (length = percent * len(dataset)). This argument cannot be used together with length. Defaults to None.
shuffle (bool) – Whether to shuffle the dataset. Defaults to True.
seed (int) – The random seed to split dataset using numpy.random.shuffle. Defaults to env['data_seed'].
- Returns:
(torch.utils.data.Subset, torch.utils.data.Subset) – The two split subsets.
- Example:
>>> from trojanzoo.utils.data import TensorListDataset
>>> from trojanzoo.datasets import Dataset
>>> import torch
>>>
>>> data = torch.ones(11, 3, 32, 32)
>>> targets = list(range(11))
>>> dataset = TensorListDataset(data, targets)
>>> set1, set2 = Dataset.split_dataset(dataset, length=3)
>>> len(set1), len(set2)
(3, 8)
>>> set3, set4 = Dataset.split_dataset(dataset, percent=0.5)
>>> len(set3), len(set4)
(5, 6)
See also
The implementation is in trojanzoo.utils.data.split_dataset(). The difference is that this method will set seed as env['data_seed'] when it is None.