BatchUp

Python library for extracting mini-batches of data from a data source for the purpose of training neural networks.

Quick example:

from batchup import data_source

# Construct an array data source
ds = data_source.ArrayDataSource([train_X, train_y])

# Iterate over samples, drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=True):
    # Process batches here...
    pass

Documentation available at https://batchup.readthedocs.io

Installation

Batch iteration

Processing data in mini-batches:

quick batch iteration; a basic example
iterating over subsets identified by indices
data augmentation
including sample indices in the mini-batches
infinite batch iteration; an iterator that generates batches endlessly
sample weighting to alter likelihood of samples (e.g. to compensate for class imbalance)
iterating over two data sets simultaneously where their sizes differ (e.g. for semi-supervised learning)
iterating over data sets that are NOT stored as NumPy arrays (e.g. on disk or generated on the fly)
parallel processing to speed up iteration where loading/preparing samples could be slow

Gathering results and loss values

removing the for-loop; predict values for samples in one line
computing mean loss/error values

Standard datasets

BatchUp supports some standard machine learning datasets. They will be automatically downloaded if necessary.

MNIST
SVHN
CIFAR-10
CIFAR-100
STL
USPS

Configuring BatchUp

Data paths, etc.

More details further down, but briefly, use either the ~/.batchup.cfg configuration file or the BATCHUP_HOME environment variable.

Installation

You can install BatchUp with:

> pip install batchup

Batch iteration

Quick batch iteration

Assume we have a training set loaded in the variables train_X and train_y:

import numpy as np
from batchup import data_source

# Construct an array data source
ds = data_source.ArrayDataSource([train_X, train_y])

# Iterate over samples, drawing batches of 64 elements in random order,
# using a seeded random number generator so that the order is reproducible
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

Some notes:

the last batch will be short (have fewer samples than the requested batch size) if there isn't enough data to fill it
passing shuffle=True will use NumPy's default random number generator; passing a seeded np.random.RandomState, as above, makes the order reproducible
not specifying shuffle will process the samples in order
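The notes above can be illustrated with a small NumPy-only sketch. It is not BatchUp's implementation: `iterate_batches` is a hypothetical helper written for this illustration, and it only assumes that batches are formed by slicing an index order that is either sequential or a random permutation.

```python
import numpy as np

def iterate_batches(arrays, batch_size, rng=None):
    """Hypothetical helper (not part of BatchUp): yield tuples of mini-batches."""
    n = len(arrays[0])
    # In-order indices by default; a random permutation if an RNG is given
    order = np.arange(n) if rng is None else rng.permutation(n)
    for start in range(0, n, batch_size):
        ndx = order[start:start + batch_size]
        yield tuple(a[ndx] for a in arrays)

X = np.arange(10).reshape(10, 1)
y = np.arange(10)

# 10 samples in batches of 4: the last batch holds only 2 samples
batch_sizes = [len(bx) for bx, by in iterate_batches([X, y], 4)]

# Without an RNG the samples arrive in their original order
in_order = np.concatenate([by for bx, by in iterate_batches([X, y], 4)])

# Two RNGs with the same seed produce the same shuffled order
shuffled_a = np.concatenate(
    [by for bx, by in iterate_batches([X, y], 4, np.random.RandomState(12345))])
shuffled_b = np.concatenate(
    [by for bx, by in iterate_batches([X, y], 4, np.random.RandomState(12345))])
```

Each sample is visited exactly once per pass in both modes; only the order changes.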

Iterating over subsets identified by indices

We can specify the indices of a subset of the samples in a dataset and draw mini-batches from only those samples:

import numpy as np

# Randomly choose a subset of 20,000 samples, by indices
subset_a = np.random.permutation(train_X.shape[0])[:20000]

# Construct an array data source that will only draw samples whose indices are in `subset_a`
ds = data_source.ArrayDataSource([train_X, train_y], indices=subset_a)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
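As a rough picture of what drawing from a subset means, the NumPy-only sketch below shuffles positions within the subset and maps them back to dataset indices. This is an assumption about the general technique, not a description of BatchUp's internals; `data` is a stand-in for train_X, and the sizes are made up for the illustration.

```python
import numpy as np

rng = np.random.RandomState(12345)

# Stand-in for train_X: sample i has the value 10 * i
data = np.arange(100) * 10

# Randomly choose a subset of 20 samples, by indices
subset = rng.permutation(len(data))[:20]

# Shuffle positions *within* the subset, then map them back to dataset indices
order = subset[rng.permutation(len(subset))]

# Slice the shuffled indices into batches of 8 and gather the samples
batches = [data[order[i:i + 8]] for i in range(0, len(order), 8)]
drawn = np.concatenate(batches)
```

Every drawn sample comes from the subset, and samples outside it are never touched.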

Data augmentation

We can define a function that applies data augmentation on the fly. Let's assume that train_X contains image data, has the shape (sample, channel, height, width) and that we wish to horizontally flip some of the images:

import numpy as np

# Define our batch augmentation function.
def augment_batch(batch_X, batch_y):
    # Create an array that selects samples with 50% probability and convert to `bool` dtype
    flip_flags = np.random.binomial(1, 0.5, size=(len(batch_X),)) != 0
    
    # Flip the width dimension in selected samples
    batch_X[flip_flags, ...] = batch_X[flip_flags, :, :, ::-1]
    
    # Return the batch as a tuple
    return batch_X, batch_y
    
# Construct an array data source
ds = data_source.ArrayDataSource([train_X, train_y])

# Apply augmentation
ds = ds.map(augment_batch)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

More complex augmentation may incur significant runtime cost. This can be alleviated by preparing batches in background threads. See the parallel processing section below.
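To see what a horizontal flip of selected samples does, here is a tiny NumPy-only check on a hand-made batch. The array values and the fixed `flip_flags` are made up for the illustration; in real augmentation the flags would be drawn at random.

```python
import numpy as np

# Two tiny single-channel 1x3 "images", shape (sample, channel, height, width)
batch_X = np.array([[[[1, 2, 3]]],
                    [[[4, 5, 6]]]])

# Fixed flags for the illustration: flip the first sample only
flip_flags = np.array([True, False])

# Boolean indexing on the right-hand side returns a copy, so the in-place
# assignment does not read from data it has already overwritten
batch_X[flip_flags, ...] = batch_X[flip_flags, :, :, ::-1]
```

After the assignment the first image reads 3, 2, 1 along its width, while the second is unchanged.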

Including sample indices in the mini-batches

We can ask to be provided with the indices of the samples that were drawn to form the mini-batch:

# Construct an array data source that will provide sample indices
ds = data_source.ArrayDataSource([train_X, train_y], include_indices=True)

# Drawing batches of 64 elements in random order
for (batch_ndx, batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
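One common reason to want the indices is to write per-sample results back to the positions of the original samples, even though the batches arrive in random order. The NumPy-only sketch below imitates this; `batch_ndx` plays the role of the index array that the data source would provide, and the "score" is a made-up stand-in for a model output or loss.

```python
import numpy as np

rng = np.random.RandomState(12345)
train_X = rng.normal(size=(10, 3))

# One slot per training sample, filled in as batches arrive
per_sample_score = np.full(len(train_X), np.nan)

order = rng.permutation(len(train_X))
for start in range(0, len(order), 4):
    # Indices of the samples in this batch, in shuffled order
    batch_ndx = order[start:start + 4]
    batch_X = train_X[batch_ndx]
    # Stand-in for a per-sample model output or loss
    batch_score = batch_X.sum(axis=1)
    # Scatter the results back to the positions of the original samples
    per_sample_score[batch_ndx] = batch_score
```

Once the loop finishes, `per_sample_score[i]` corresponds to `train_X[i]` regardless of the order in which the batches were drawn.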

Infinite batch iteration

Let's say you need an iterator that extracts samples from your dataset and starts again from the beginning when it reaches the end:

ds = data_source.ArrayDataSource([train_X, train_y], repeats=-1)

Now use the batch_iterator method as before.

The repeats parameter accepts either -1 for infinite repetition, or a positive integer for a specified number of repetitions.

This also works if the dataset has fewer samples than the batch size; this is not a common use case, but it can arise in semi-supervised learning, for instance.
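The idea of endless iteration, including the case where the dataset is smaller than the batch size, can be sketched in plain NumPy. `endless_batches` is a hypothetical helper, not BatchUp code; re-shuffling at the start of every pass and filling a batch across pass boundaries are assumptions made for this illustration.

```python
import itertools
import numpy as np

def endless_batches(X, batch_size, rng):
    """Hypothetical helper (not part of BatchUp): yield full batches forever."""
    def endless_indices():
        # Re-shuffle at the start of every pass over the data
        while True:
            for i in rng.permutation(len(X)):
                yield i

    stream = endless_indices()
    while True:
        # Take the next `batch_size` indices, crossing pass boundaries if needed
        ndx = np.fromiter(itertools.islice(stream, batch_size), dtype=int)
        yield X[ndx]

# Only 3 samples, fewer than the batch size of 5
X = np.arange(3)
rng = np.random.RandomState(12345)

# The iterator never ends, so take a fixed number of batches from it
batches = list(itertools.islice(endless_batches(X, 5, rng), 4))
```

Because the iterator never terminates, the consumer decides when to stop, for example with `itertools.islice` or a `break` after a fixed number of steps.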

Sample weighting to alter likelihood of samples

If you want some samples to be drawn more frequently than others, construct a sampling.WeightedSampler and pass it as the sampler argument to the ArrayDataSource constructor. In the example the per-sample weights are stored in train_w.

from batchup import sampling

sampler = sampling.WeightedSampler(weights=train_w)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

Note that in-order iteration is NOT supported when using sampling.WeightedSampler, so shuffle cannot be False or None.
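To get a feel for how per-sample weights change how often samples are drawn, here is a NumPy-only sketch of weighted sampling with replacement. It is an illustration of the general technique, not BatchUp's sampler; the weights in `train_w` are made up.

```python
import numpy as np

rng = np.random.RandomState(12345)

# Hypothetical per-sample weights for a 4-sample dataset
train_w = np.array([0.0, 1.0, 1.0, 2.0])

# Normalise the weights into a probability distribution over samples
p = train_w / train_w.sum()

# Draw sample indices with replacement according to `p`
drawn = rng.choice(len(train_w), size=10000, p=p)

# Count how often each sample index was drawn
counts = np.bincount(drawn, minlength=len(train_w))
```

The sample with weight 0 is never drawn, and the sample with weight 2 is drawn roughly twice as often as those with weight 1. Since indices are drawn at random, there is no meaningful in-order traversal in this scheme.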

To draw from a subset of the dataset, use sampling.WeightedSubsetSampler:

from batchup import sampling

# NOTE that the parameter is called `sub_weights` (rather than `weights`) and that it must have the
# same length as `indices`.
sampler = sampling.WeightedSubsetSampler(sub_weights=train_w[subset_a], indices=subset_a)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

Class balancing helper

An alternate constructor sampling.WeightedSampler.class_balancing_sampler is available to construct a weighted sampler to compensate for class imbalance:

# Construct the sampler; NOTE that the `n_classes` argument is *optional*
sampler = sampling.WeightedSampler.class_balancing_sampler(y=train_y, n_classes=train_y.max() + 1)

ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass

The sampling.WeightedSampler.class_balancing_sample_weights helper method constructs an array of sample weights, in case you wish to modify the weights first:

weights = sampling.WeightedSampler.class_balancing_sample_weights(y=train_y, n_classes=train_y.max() + 1)

# Assume `modify_weights` is defined above
weights = modify_weights(weights)

# Construct the sampler and the data source
sampler = sampling.WeightedSampler(weights=weights)
ds = data_source.ArrayDataSource([train_X, train_y], sampler=sampler)

# Drawing batches of 64 elements in random order
for (batch_X, batch_y) in ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345)):
    # Process batches here...
    pass
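For intuition, class-balancing weights can be computed as inverse class frequencies. The sketch below is an assumption about the general approach; the exact formula and normalisation used by BatchUp may differ. It also assumes that every class in the range 0 to n_classes - 1 appears at least once, since an empty class would cause a division by zero.

```python
import numpy as np

# Imbalanced labels: classes 0, 1 and 2 have 6, 2 and 1 samples respectively
train_y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2])
n_classes = train_y.max() + 1

# Number of samples in each class
class_counts = np.bincount(train_y, minlength=n_classes)

# Inverse-frequency weights: each sample's weight is 1 / (size of its class)
weights = 1.0 / class_counts[train_y]

# Summing the weights class by class shows that every class now carries
# the same total weight
per_class_total = np.bincount(train_y, weights=weights, minlength=n_classes)
```

With these weights, a weighted sampler would draw each class with equal probability overall, so samples from rare classes are drawn more often individually.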

Iterating over two data sources of wildly different sizes for semi-supervised learning

In semi-supervised learning we have a small dataset of labeled samples lab_X with ground truths lab_y and a larger set of unlabeled samples unlab_X. Let's say we want a single epoch to consist of the entire unlabeled dataset while looping over the labeled dataset repeatedly. The CompositeDataSource class can help us here.

Without using CompositeDataSource:

rng = np.random.RandomState(12345)

# Construct the data sources; the labeled data source will repeat infinitely
ds_lab = data_source.ArrayDataSource([lab_X, lab_y], repeats=-1)
ds_unlab = data_source.ArrayDataSource([unlab_X])

# Construct an iterator to get samples from our labeled data source:
lab_iter = ds_lab.batch_iterator(batch_size=64, shuffle=rng)

# Iterate over the unlabeled data set in the for-loop
for (batch_unlab_X,) in ds_unlab.batch_iterator(batch_size=64, shuffle=rng):
    # Extract batches from the labeled iterator ourselves
    batch_lab_X, batch_lab_y = next(lab_iter)
    
    # Process batches here...

Now using CompositeDataSource:

rng = np.random.RandomState(12345)

# Construct the data sources; the labeled data source will repeat infinitely
ds_lab = data_source.ArrayDataSource([lab_X, lab_y], repeats=-1)
ds_unlab = data_source.ArrayDataSource([unlab_X])
ds = data_source.CompositeDataSource([ds_lab, ds_unlab])

# Iterate over both the labeled and unlabeled samples:
for (batch_lab_X, batch_lab_y, batch_unlab_X) in ds.batch_iterator(batch_size=64, shuffle=rng):
    # Process batches here...
    pass

The two component data sources (ds_lab and ds_unlab) will be shuffled independently.
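The pairing of an endlessly repeating labeled iterator with a finite unlabeled one can be sketched with plain NumPy and `zip`. This is an illustration of the idea, not how CompositeDataSource is implemented; `batches` is a hypothetical helper and, as a simplification, it emits a short batch at the end of each pass over the labeled data.

```python
import numpy as np

def batches(X, batch_size, rng, repeat=False):
    """Hypothetical helper (not part of BatchUp): shuffled batches, optionally endless."""
    while True:
        # Draw a fresh shuffled order for each pass
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            yield X[order[start:start + batch_size]]
        if not repeat:
            return

lab_X = np.arange(5)            # small labeled set
unlab_X = np.arange(100, 123)   # larger unlabeled set (23 samples)
rng = np.random.RandomState(12345)

# zip() stops when the finite (unlabeled) iterator is exhausted,
# so one epoch covers the unlabeled set exactly once
epoch = list(zip(batches(lab_X, 4, rng, repeat=True),
                 batches(unlab_X, 4, rng)))
```

Each element of `epoch` is a pair of a labeled batch and an unlabeled batch; the labeled data is revisited several times within the epoch.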

You can also have CompositeDataSource generate structured mini-batches that reflect the structure of the da
