datasets

trojanzoo.datasets.add_argument(parser, dataset_name=None, dataset=None, config=config, class_dict={})[source]
Add dataset arguments to argument parser.
For the specific argument implementation, see Dataset.add_argument().
Parameters:
  • parser (argparse.ArgumentParser) – The parser to add arguments to.

  • dataset_name (str) – The dataset name.

  • dataset (str | Dataset) – Dataset instance or dataset name (as the alias of dataset_name).

  • config (Config) – The default parameter config, which contains the default dataset name if not provided.

  • class_dict (dict[str, type[Dataset]]) – Map from dataset name to dataset class. Defaults to {}.
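
A minimal usage sketch; the '--dataset' flag name and the 'cifar10' value are illustrative assumptions, not guaranteed defaults:
>>> import argparse
>>> import trojanzoo.datasets
>>>
>>> parser = argparse.ArgumentParser()
>>> trojanzoo.datasets.add_argument(parser)
>>> args = parser.parse_args(['--dataset', 'cifar10'])   # assumed flag name and value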

trojanzoo.datasets.create(dataset_name=None, dataset=None, config=config, class_dict={}, **kwargs)[source]
Create a dataset instance.
For arguments not included in kwargs, use the default values in config.
The default value of folder_path is '{data_dir}/{data_type}/{name}'.
For dataset implementation, see Dataset.
Parameters:
  • dataset_name (str) – The dataset name.

  • dataset (str) – The alias of dataset_name.

  • config (Config) – The default parameter config.

  • class_dict (dict[str, type[Dataset]]) – Map from dataset name to dataset class. Defaults to {}.

  • **kwargs – Keyword arguments passed to dataset init method.

Returns:

Dataset – Dataset instance.
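
A minimal usage sketch; the dataset name 'cifar10' and the keyword overrides are illustrative assumptions rather than registered defaults:
>>> import trojanzoo.datasets
>>>
>>> dataset = trojanzoo.datasets.create(dataset_name='cifar10',  # assumed registered name
...                                     batch_size=128,          # overrides the config default
...                                     download=True)
>>> train_loader = dataset.loader['train']   # preset dataloaders built at creation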

class trojanzoo.datasets.Dataset(batch_size=100, valid_batch_size=100, folder_path=None, download=False, split_ratio=0.8, num_workers=4, loss_weights=False, **kwargs)[source]
An abstract class representing a dataset.

Note

This is the base dataset implementation. Users should call create() instead, which is more user-friendly.

Parameters:
  • batch_size (int) – Batch size of the training set (a negative value means batch size per GPU). Defaults to 100.

  • valid_batch_size (int) – Batch size of validation set. Defaults to 100.

  • folder_path (str) –

    Folder path to store dataset. Defaults to None.

    Note

    folder_path is usually '{data_dir}/{data_type}/{name}', which is the default value used by create().

  • download (bool) – Whether to download the dataset if it does not exist. Defaults to False.

  • split_ratio (float) –

    Split the training set into training and validation subsets when valid_set is False.
    The ratio stands for \(\frac{\text{\# training subset}}{\text{\# total training set}}\).
    Defaults to 0.8.

  • num_workers (int) – Used in get_dataloader(). Defaults to 4.

  • loss_weights (bool | np.ndarray | torch.Tensor) –

    The loss weights w.r.t. each class.
    If numpy.ndarray or torch.Tensor, it is directly used as loss_weights (as a CPU tensor).
    If True, loss_weights is set by get_loss_weights().
    If False, loss_weights is set to None.

  • **kwargs – Any keyword argument (unused).

Variables:
  • name (str) – Dataset name. (needs overriding)

  • loader (dict[str, DataLoader]) –

    Preset dataloader for users at dataset initialization.
    It contains 'train' and 'valid' loaders.

  • batch_size (int) – Batch size of training set (always positive). Defaults to 100.

  • valid_batch_size (int) – Batch size of validation set. Defaults to 100.

  • num_classes (int) – Number of classes. (needs overriding)

  • folder_path (str) – Folder path to store dataset. Defaults to None.

  • data_type (str) – Data type (e.g., 'image'). (needs overriding)

  • label_names (list[int]) – The label name of each class. (optional)

  • valid_set (bool) – Whether having a native validation set. Defaults to True.

  • split_ratio (float) –

    Split the training set into training and validation subsets when valid_set is False.
    The ratio stands for \(\frac{\text{\# training subset}}{\text{\# total training set}}\).
    Defaults to 0.8.

  • loss_weights (torch.Tensor | None) – The loss weights w.r.t. each class.

  • num_workers (int) – Used in get_dataloader(). Defaults to 4.

  • collate_fn (Callable | None) – Used in get_dataloader(). Defaults to None.
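
As a rough illustration of the attributes listed above, a concrete subclass typically sets name, data_type and num_classes and overrides the transform and raw-dataset hooks. The sketch below is an assumption-based example built around torchvision's MNIST, not code from the library:
>>> import torchvision.datasets
>>> import torchvision.transforms as transforms
>>> from trojanzoo.datasets import Dataset
>>>
>>> class MNISTDataset(Dataset):   # hypothetical subclass for illustration
...     name = 'mnist'
...     data_type = 'image'        # assumed value for an image dataset
...     num_classes = 10
...
...     def get_transform(self, mode):
...         # same transform for 'train' and 'valid' in this sketch
...         return transforms.ToTensor()
...
...     def _get_org_dataset(self, mode, transform=None, **kwargs):
...         # map the 'train' / 'valid' mode onto torchvision's `train` flag
...         return torchvision.datasets.MNIST(
...             root=self.folder_path, train=(mode == 'train'),
...             transform=transform, download=True)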

classmethod add_argument(group)[source]

Add dataset arguments to argument parser group. View source to see specific arguments.

Note

This is the implementation of adding arguments. Concrete dataset classes may override this method to add more arguments. Users are recommended to call add_argument() instead, which is more user-friendly.

check_files(**kwargs)[source]

Check if the dataset files are prepared.

Parameters:

**kwargs – Keyword arguments passed to get_org_dataset().

Returns:

bool – Whether the dataset files are prepared.

static get_class_subset(dataset, class_list)[source]

Get a subset from dataset with certain classes.

Parameters:
  • dataset (torch.utils.data.Dataset) – The dataset to pick from.

  • class_list (int | list[int]) – The classes to keep.

Returns:

torch.utils.data.Subset – The subset with labels in class_list.

Example:
>>> from trojanzoo.utils.data import TensorListDataset
>>> from trojanzoo.utils.data import get_class_subset
>>> import torch
>>>
>>> data = torch.ones(11, 3, 32, 32)
>>> targets = list(range(11))
>>> dataset = TensorListDataset(data, targets)
>>> subset = get_class_subset(dataset, class_list=[2, 3])
>>> len(subset)
2

See also

The implementation is in trojanzoo.utils.data.get_class_subset().

get_data(data, **kwargs)[source]

Process data. Defaults to directly returning data.

Parameters:
  • data (Any) – Unprocessed data.

  • **kwargs – Keyword arguments to process data.

Returns:

Any – Processed data.
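
For illustration, a subclass might override get_data() to move each batch to a target device; the standalone function below only sketches that idea and is not the library's implementation:
>>> import torch
>>>
>>> def get_data(data, device='cpu', **kwargs):   # hypothetical override body
...     _input, _label = data
...     return _input.to(device), _label.to(device)
...
>>> batch = (torch.ones(4, 3, 32, 32), torch.zeros(4, dtype=torch.long))
>>> _input, _label = get_data(batch)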

get_dataloader(mode=None, dataset=None, batch_size=None, shuffle=None, num_workers=None, pin_memory=True, drop_last=False, collate_fn=None, **kwargs)[source]

Get dataloader. Call get_dataset() if dataset is not provided.

Parameters:
  • mode (str) – Dataset mode (e.g., 'train' or 'valid').

  • dataset (torch.utils.data.Dataset) – The pytorch dataset.

  • batch_size (int) – Defaults to self.batch_size for 'train' mode and self.valid_batch_size for 'valid' mode.

  • shuffle (bool) – Whether to shuffle. Defaults to True for 'train' mode and False for 'valid' mode.

  • num_workers (int) – Number of workers for dataloader. Defaults to self.num_workers.

  • pin_memory (bool) – Whether to use pin memory. Defaults to True if there is any GPU available.

  • drop_last (bool) – Whether to drop the last batch if it is not full size. Defaults to False.

  • collate_fn (Callable) – Passed to torch.utils.data.DataLoader.

  • **kwargs – Keyword arguments passed to get_dataset() if dataset is not provided.

Returns:

torch.utils.data.DataLoader – The pytorch dataloader.
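
An illustrative call pattern, assuming dataset is an already-created Dataset instance whose batches unpack into an input tensor and a label tensor:
>>> train_loader = dataset.get_dataloader(mode='train', batch_size=64, shuffle=True)
>>> valid_loader = dataset.get_dataloader(mode='valid')
>>> _input, _label = next(iter(valid_loader))   # fetch one validation batch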

get_dataset(mode=None, seed=None, class_list=None, **kwargs)[source]

Get dataset. Call split_dataset() to split the training set if valid_set is False.

Parameters:
  • mode (str) – Dataset mode (e.g., 'train' or 'valid').

  • seed (int) – The random seed to split dataset using numpy.random.shuffle. Defaults to env['data_seed'].

  • class_list (int | list[int]) – The class list to pick. Defaults to None.

  • **kwargs – Keyword arguments passed to get_org_dataset().

Returns:

torch.utils.data.Dataset – The original dataset.
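
An illustrative call, assuming dataset is an already-created Dataset instance; class_list restricts the returned dataset to the listed labels:
>>> valid_set = dataset.get_dataset(mode='valid', class_list=[0, 1])
>>> valid_loader = dataset.get_dataloader(mode='valid', dataset=valid_set)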

get_loss_weights(file_path=None, verbose=True)[source]

Calculate loss_weights as the reciprocal of the data size of each class (to mitigate data imbalance).

Parameters:
  • file_path (str) –

    The file path of the saved weights file.
    If it exists, load the file and return the weights;
    otherwise, calculate the weights, save them to the file and return.
    Defaults to '{folder_path}/loss_weights.npy'.

  • verbose (bool) – Whether to print verbose information. Defaults to True.

Returns:

torch.Tensor – The tensor of loss weights w.r.t. each class.
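
The reciprocal weighting described above can be sketched as follows; the class counts are assumed values and this is not the library's exact implementation:
>>> import torch
>>>
>>> class_counts = torch.tensor([5000., 500., 50.])   # assumed samples per class
>>> loss_weights = 1.0 / class_counts                 # rarer classes get larger weights
>>> loss_weights
tensor([0.0002, 0.0020, 0.0200])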

get_org_dataset(mode, **kwargs)[source]

Get the original dataset that is not split.

Note

This is a wrapper and the specific implementation is in _get_org_dataset(), which needs overriding.

Parameters:
  • mode (str) – Dataset mode (e.g., 'train' or 'valid').

  • transform (Callable) – The transform applied to the dataset. Defaults to get_transform().

  • **kwargs – Keyword arguments passed to _get_org_dataset().

Returns:

torch.utils.data.Dataset – The original dataset.

See also

get_dataset()

abstract get_transform(mode)[source]

Get dataset transform for mode.

Parameters:

mode (str) – Dataset mode (e.g., 'train' or 'valid').

Returns:

collections.abc.Callable – A callable transform.
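
A rough sketch of what a concrete override might return for an image dataset; the torchvision transforms are an illustrative assumption:
>>> import torchvision.transforms as transforms
>>>
>>> def get_transform(mode):   # hypothetical override body (self omitted)
...     if mode == 'train':
...         return transforms.Compose([transforms.RandomHorizontalFlip(),
...                                    transforms.ToTensor()])
...     return transforms.ToTensor()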

initialize(*args, **kwargs)[source]

Initialize the dataset (download and extract) if it's not prepared yet (needs overriding).

static split_dataset(dataset, length=None, percent=None, shuffle=True, seed=None)[source]

Split a dataset into two subsets.

Parameters:
  • dataset (torch.utils.data.Dataset) – The dataset to split.

  • length (int) – The length of the first subset. This argument cannot be used together with percent. If None, use percent to calculate length instead. Defaults to None.

  • percent (float) – The split ratio for the first subset. This argument cannot be used together with length. length = percent * len(dataset). Defaults to None.

  • shuffle (bool) – Whether to shuffle the dataset. Defaults to True.

  • seed (int) – The random seed to split dataset using numpy.random.shuffle. Defaults to env['data_seed'].

Returns:

(torch.utils.data.Subset, torch.utils.data.Subset) – The two split subsets.

Example:
>>> from trojanzoo.utils.data import TensorListDataset
>>> from trojanzoo.datasets import Dataset
>>> import torch
>>>
>>> data = torch.ones(11, 3, 32, 32)
>>> targets = list(range(11))
>>> dataset = TensorListDataset(data, targets)
>>> set1, set2 = Dataset.split_dataset(dataset, length=3)
>>> len(set1), len(set2)
(3, 8)
>>> set3, set4 = Dataset.split_dataset(dataset, percent=0.5)
>>> len(set3), len(set4)
(5, 6)

See also

The implementation is in trojanzoo.utils.data.split_dataset(). The difference is that this method will set seed as env['data_seed'] when it is None.
