declearn.dataset.Dataset

Abstract class defining an API to access training or testing data.

A 'Dataset' is an interface towards data that exposes methods to query batched data samples and key metadata while remaining agnostic of the way the data is actually being loaded (from a source file, a database, a network reader, another API...).

This is notably done to allow clients to use distinct data storage and loading architectures, even implementing their own subclass if needed, while ensuring that data access is straightforward to specify as part of FL algorithms.
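
To make the subclassing idea concrete, here is a minimal sketch of a custom dataset that serves samples from plain Python lists. It is written against stand-ins so that it runs without declearn installed: the `Dataset` base class below only mirrors the two abstract methods documented on this page, and the `DataSpecs` fields (`n_samples`, `n_features`), the `Batch` alias and the `InMemoryListDataset` name are assumptions made for illustration, not the library's definitions. The sketch only covers plain batching and leaves the `replacement` and `poisson` modes out.

```python
import abc
import random
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple

# Stand-in batch type (assumption): (inputs, targets, weights) built from lists.
Batch = Tuple[List[List[float]], Optional[List[float]], Optional[List[float]]]


@dataclass
class DataSpecs:
    """Hypothetical stand-in for declearn's DataSpecs; the fields are assumed."""

    n_samples: int
    n_features: int


class Dataset(metaclass=abc.ABCMeta):
    """Stand-in mirroring the two abstract methods documented on this page."""

    @abc.abstractmethod
    def get_data_specs(self) -> DataSpecs:
        ...

    @abc.abstractmethod
    def generate_batches(
        self,
        batch_size: int,
        shuffle: bool = False,
        drop_remainder: bool = True,
        replacement: bool = False,
        poisson: bool = False,
    ) -> Iterator[Batch]:
        ...


class InMemoryListDataset(Dataset):
    """Hypothetical subclass serving samples held in Python lists."""

    def __init__(
        self,
        inputs: List[List[float]],
        targets: Optional[List[float]] = None,
    ) -> None:
        self.inputs = inputs
        self.targets = targets

    def get_data_specs(self) -> DataSpecs:
        return DataSpecs(
            n_samples=len(self.inputs), n_features=len(self.inputs[0])
        )

    def generate_batches(
        self,
        batch_size: int,
        shuffle: bool = False,
        drop_remainder: bool = True,
        replacement: bool = False,
        poisson: bool = False,
    ) -> Iterator[Batch]:
        if replacement or poisson:
            raise NotImplementedError("This sketch only covers plain batching.")
        order = list(range(len(self.inputs)))
        if shuffle:
            random.shuffle(order)  # a fresh permutation on every call
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            if drop_remainder and len(idx) < batch_size:
                break  # skip the final, incomplete batch
            x = [self.inputs[i] for i in idx]
            y = None if self.targets is None else [self.targets[i] for i in idx]
            yield x, y, None  # this sketch never emits sample weights


dataset = InMemoryListDataset(
    inputs=[[float(i), float(i) ** 2] for i in range(10)],
    targets=[float(i % 2) for i in range(10)],
)
batches = list(dataset.generate_batches(batch_size=4))
```

Code that consumes such an object only relies on `get_data_specs` and `generate_batches`, so the storage backend can be swapped without touching the training logic.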

Source code in declearn/dataset/_base.py
@create_types_registry
class Dataset(metaclass=abc.ABCMeta):
    """Abstract class defining an API to access training or testing data.

    A 'Dataset' is an interface towards data that exposes methods
    to query batched data samples and key metadata while remaining
    agnostic of the way the data is actually being loaded (from a
    source file, a database, a network reader, another API...).

    This is notably done to allow clients to use distinct data
    storage and loading architectures, even implementing their
    own subclass if needed, while ensuring that data access is
    straightforward to specify as part of FL algorithms.
    """

    @abc.abstractmethod
    def get_data_specs(
        self,
    ) -> DataSpecs:
        """Return a DataSpecs object describing this dataset."""

    @abc.abstractmethod
    def generate_batches(  # pylint: disable=too-many-arguments
        self,
        batch_size: int,
        shuffle: bool = False,
        drop_remainder: bool = True,
        replacement: bool = False,
        poisson: bool = False,
    ) -> Iterator[Batch]:
        """Yield batches of data samples.

        Parameters
        ----------
        batch_size: int
            Number of samples per batch.
        shuffle: bool, default=False
            Whether to shuffle data samples prior to batching.
            Note that the shuffling will differ on each call
            to this method.
        drop_remainder: bool, default=True
            Whether to drop the last batch if it contains fewer
            samples than `batch_size`, or yield it anyway.
            If `poisson=True`, this is used to determine the number
            of returned batches (notwithstanding their actual size).
        replacement: bool, default=False
            Whether to do random sampling with or without replacement.
            Ignored if `shuffle=False` or `poisson=True`.
        poisson: bool, default=False
            Whether to use Poisson sampling, i.e. make up batches by
            drawing samples with replacement, resulting in variable-
            size batches and samples possibly appearing in zero or in
            multiple emitted batches (but at most once per batch).
            Useful to maintain tight Differential Privacy guarantees.

        Yields
        ------
        inputs: (2+)-dimensional data array or list of data arrays
            Input features of that batch.
        targets: data array, list of data arrays or None
            Target labels or values of that batch.
            May be None for unsupervised or semi-supervised tasks.
        weights: 1-d data array or None
            Optional weights associated with the samples, that are
            typically used to balance a model's loss or metrics.
        """

generate_batches(batch_size, shuffle=False, drop_remainder=True, replacement=False, poisson=False) abstractmethod

Yield batches of data samples.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int | Number of samples per batch. | required |
| shuffle | bool | Whether to shuffle data samples prior to batching. Note that the shuffling will differ on each call to this method. | False |
| drop_remainder | bool | Whether to drop the last batch if it contains fewer samples than `batch_size`, or yield it anyway. If `poisson=True`, this is used to determine the number of returned batches (notwithstanding their actual size). | True |
| replacement | bool | Whether to do random sampling with or without replacement. Ignored if `shuffle=False` or `poisson=True`. | False |
| poisson | bool | Whether to use Poisson sampling, i.e. make up batches by drawing samples with replacement, resulting in variable-size batches and samples possibly appearing in zero or in multiple emitted batches (but at most once per batch). Useful to maintain tight Differential Privacy guarantees. | False |
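
The interaction between `poisson` and `drop_remainder` can be illustrated with a small index sampler. This is a sketch of one plausible reading of the parameter descriptions above, not declearn's implementation: the helper name `poisson_batch_indices`, its `seed` argument and the choice of `batch_size / n_samples` as the per-sample inclusion probability are all assumptions. Each batch includes every sample independently, so a sample appears at most once per batch but may show up in zero or several batches, and `drop_remainder` only decides whether the batch count is rounded down or up.

```python
import math
import random
from typing import Iterator, List, Optional


def poisson_batch_indices(
    n_samples: int,
    batch_size: int,
    drop_remainder: bool = True,
    seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield lists of sample indices drawn by Poisson sampling (hypothetical helper)."""
    rng = random.Random(seed)
    # Here `drop_remainder` only sets how many batches are emitted.
    if drop_remainder:
        n_batches = n_samples // batch_size
    else:
        n_batches = math.ceil(n_samples / batch_size)
    # Assumed inclusion probability: batches hold `batch_size` samples on average.
    rate = batch_size / n_samples
    for _ in range(n_batches):
        # One independent draw per sample, hence no duplicates within a batch.
        yield [i for i in range(n_samples) if rng.random() < rate]


for indices in poisson_batch_indices(n_samples=10, batch_size=4, seed=0):
    print(len(indices), indices)
```

Because batch sizes vary from draw to draw, downstream code should not assume that a batch holds exactly `batch_size` samples, or even that it is non-empty.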

Yields:

| Name | Type | Description |
|---|---|---|
| inputs | (2+)-dimensional data array or list of data arrays | Input features of that batch. |
| targets | data array, list of data arrays or None | Target labels or values of that batch. May be None for unsupervised or semi-supervised tasks. |
| weights | 1-d data array or None | Optional weights associated with the samples, that are typically used to balance a model's loss or metrics. |
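
As an illustration of how the yielded triplets might be consumed, the sketch below computes a sample-weighted mean squared error over an iterable of batches. Plain Python lists stand in for data arrays, and the function name `weighted_mean_squared_error`, the `Batch` alias and the `predict` callable are hypothetical. A missing `weights` entry is treated as uniform weights, which is an assumption of this sketch rather than documented behavior.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Stand-in batch type (assumption): (inputs, targets, weights) built from lists.
Batch = Tuple[List[List[float]], Optional[List[float]], Optional[List[float]]]


def weighted_mean_squared_error(
    batches: Iterable[Batch],
    predict: Callable[[List[float]], float],
) -> float:
    """Average squared errors over all samples, weighting each by its weight."""
    total = 0.0
    norm = 0.0
    for inputs, targets, weights in batches:
        if targets is None:
            raise ValueError("This metric requires target values.")
        if weights is None:
            # Assumption: absent weights mean every sample counts equally.
            weights = [1.0] * len(inputs)
        for x, y, w in zip(inputs, targets, weights):
            total += w * (predict(x) - y) ** 2
            norm += w
    return total / norm if norm else 0.0


example_batches = [
    ([[1.0], [2.0]], [1.0, 3.0], None),  # unweighted batch
    ([[3.0]], [3.0], [2.0]),             # one sample with weight 2
]
error = weighted_mean_squared_error(example_batches, predict=lambda x: x[0])
```

Accumulating the weighted sum and the weight total separately keeps the result correct when batches differ in size, as they do with `drop_remainder=False` or `poisson=True`.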

Source code in declearn/dataset/_base.py
@abc.abstractmethod
def generate_batches(  # pylint: disable=too-many-arguments
    self,
    batch_size: int,
    shuffle: bool = False,
    drop_remainder: bool = True,
    replacement: bool = False,
    poisson: bool = False,
) -> Iterator[Batch]:
    """Yield batches of data samples.

    Parameters
    ----------
    batch_size: int
        Number of samples per batch.
    shuffle: bool, default=False
        Whether to shuffle data samples prior to batching.
        Note that the shuffling will differ on each call
        to this method.
    drop_remainder: bool, default=True
        Whether to drop the last batch if it contains fewer
        samples than `batch_size`, or yield it anyway.
        If `poisson=True`, this is used to determine the number
        of returned batches (notwithstanding their actual size).
    replacement: bool, default=False
        Whether to do random sampling with or without replacement.
        Ignored if `shuffle=False` or `poisson=True`.
    poisson: bool, default=False
        Whether to use Poisson sampling, i.e. make up batches by
        drawing samples with replacement, resulting in variable-
        size batches and samples possibly appearing in zero or in
        multiple emitted batches (but at most once per batch).
        Useful to maintain tight Differential Privacy guarantees.

    Yields
    ------
    inputs: (2+)-dimensional data array or list of data arrays
        Input features of that batch.
    targets: data array, list of data arrays or None
        Target labels or values of that batch.
        May be None for unsupervised or semi-supervised tasks.
    weights: 1-d data array or None
        Optional weights associated with the samples, that are
        typically used to balance a model's loss or metrics.
    """

get_data_specs() abstractmethod

Return a DataSpecs object describing this dataset.

Source code in declearn/dataset/_base.py
@abc.abstractmethod
def get_data_specs(
    self,
) -> DataSpecs:
    """Return a DataSpecs object describing this dataset."""