Cords
Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude, using coresets and data selection.
In this README
- What is CORDS?
- Highlights
- Starting with CORDS
- Applications
- Speedups achieved using CORDS
- Tutorials
- Documentation
- Mailing List
- Acknowledgment
- Team
- Resources
- Publications
What is CORDS?
CORDS is a COReset and Data Selection library for making machine learning time-, energy-, cost-, and compute-efficient. CORDS is built on top of PyTorch. Today, deep learning systems are extremely compute-intensive, with significant turnaround times, energy inefficiencies, higher costs, and resource requirements [7, 8]. CORDS is an effort to make deep learning more energy-, cost-, resource-, and time-efficient without sacrificing accuracy. CORDS aims to achieve the following goals:
<p align="center"><i><b>Data Efficiency</b></i></p>
<p align="center"><i><b>Reducing End to End Training Time</b></i></p>
<p align="center"><i><b>Reducing Energy Requirement</b></i></p>
<p align="center"><i><b>Faster Hyper-parameter tuning </b></i></p>
<p align="center"><i><b>Reducing Resource (GPU) Requirement and Costs</b></i></p>

The primary purpose of CORDS is to select suitable, representative data subsets from massive datasets, and it does so iteratively. CORDS uses recent advances in data subset selection, particularly ideas from coresets and submodularity, to select such subsets. CORDS implements several state-of-the-art data subset/coreset selection algorithms for efficient supervised learning (SL) and semi-supervised learning (SSL).
Some of the algorithms currently implemented with CORDS include:
For Efficient and Robust Supervised Learning:
- GLISTER
- GradMatch
- CRAIG
- SubmodularSelection (Facility Location, Feature Based Functions, Coverage, Diversity)
- RandomSelection
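To make the submodular selection idea concrete, here is a minimal pure-Python sketch of greedy facility location selection, one of the function families listed above. It is an illustration only, not the CORDS implementation: the function name `facility_location_greedy`, the similarity `1 / (1 + Euclidean distance)`, and the toy points are assumptions chosen for this example.

```python
import math


def facility_location_greedy(points, budget):
    """Greedily pick `budget` indices that maximize the facility location
    objective f(S) = sum_i max_{j in S} sim(i, j)."""
    n = len(points)
    # Assumed similarity: larger when two points are closer together.
    sim = [[1.0 / (1.0 + math.dist(points[i], points[j])) for j in range(n)]
           for i in range(n)]
    selected = []
    best = [0.0] * n  # best[i] = similarity of point i to its closest selected point
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding j: how much it improves each point's coverage.
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best = [max(best[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two well-separated toy clusters; a budget of 2 should cover both.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
subset = facility_location_greedy(toy_points, budget=2)
print(subset)
```

Because the objective rewards covering every point with a nearby representative, the greedy pass tends to pick one point per cluster before adding a second point from a cluster that is already covered.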
For Efficient and Robust Semi-supervised Learning:
We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:
- Reproducibility of SOTA in Data Selection and Coresets: Enables easy reproduction of the SOTA algorithms described above. We are also working to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
- Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets, including CIFAR-10, CIFAR-100, MNIST, SVHN, and ImageNet.
- Ease of Use: One of the main goals of CORDS is to be easy to use and easy to extend. Feel free to contribute to CORDS!
- Modular design: The data selection algorithms are directly incorporated into data loaders, allowing one to use their own training loop for varied utility scenarios.
- A broad number of use cases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating several additional use cases like Auto-ML, object detection, speech recognition, semi-supervised learning, etc.
Highlights
- 3x to 5x speedups, cost reduction, and energy reductions in the training of deep models in supervised learning
- 3x+ speedups, cost/energy reduction for deep model training in semi-supervised learning
- 3x to 30x speedups and cost/energy reduction for Hyper-parameter tuning using subset selection with SOTA schedulers (Hyperband and ASHA) and algorithms (TPE, Random)
Starting with CORDS
Pip Installation
To install the latest version of the CORDS package using PyPI:
pip install cords
From Git Repository
To install using the source:
git clone https://github.com/decile-team/cords.git
cd cords
pip install -r requirements/requirements.txt
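After either installation route, a quick check such as the following can confirm that the package is importable from the current environment and working directory. This is a small sketch: the helper name `is_installed` is hypothetical, and it assumes the package exposes a top-level module named `cords`, as the import statements in this README suggest.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module named `module_name` is on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("cords importable:", is_installed("cords"))
```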
First Steps
To better understand CORDS's functionality, we have provided example Jupyter notebooks and Python code in the examples folder, which can be easily executed using Google Colab. We also provide simple SL, SSL, and HPO training loops that run experiments using a provided configuration file. To run these loops, see the following code examples:
Using subset selection based data loaders
Create a subset selection based data loader at train time and use the subset selection based data loader with your own training loop.
Because each selection strategy is built into its own subset data loader, using a strategy only requires instantiating the corresponding loader and iterating over it in your training loop.
The example below shows how, in the supervised learning setting, subset selection reduces to constructing and iterating over a data loader:
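Conceptually, a loader of this kind can be pictured as an iterable that refreshes its subset on a schedule. The sketch below is a simplified illustration in plain Python and does not reflect CORDS internals: the class `AdaptiveSubsetLoader`, the `select_fn` callback, and the random strategy are all hypothetical names introduced for this example.

```python
import random


class AdaptiveSubsetLoader:
    """Hypothetical sketch: yields (index, weight) pairs from a subset that is
    re-selected every `select_every` epochs (one epoch = one full iteration)."""

    def __init__(self, num_samples, fraction, select_every, select_fn):
        self.num_samples = num_samples
        self.budget = max(1, int(fraction * num_samples))
        self.select_every = select_every
        self.select_fn = select_fn  # returns (indices, weights)
        self.epoch = 0
        self.indices, self.weights = [], []

    def __iter__(self):
        # Refresh the subset at epochs 0, select_every, 2 * select_every, ...
        if self.epoch % self.select_every == 0:
            self.indices, self.weights = self.select_fn(self.num_samples, self.budget)
        self.epoch += 1
        return iter(zip(self.indices, self.weights))


_rng = random.Random(0)


def random_selection(num_samples, budget):
    """Stand-in strategy: a uniform random subset with unit weights."""
    return _rng.sample(range(num_samples), budget), [1.0] * budget


loader = AdaptiveSubsetLoader(num_samples=20, fraction=0.5,
                              select_every=2, select_fn=random_selection)
for epoch in range(4):
    batch = list(loader)  # the training loop only ever sees the current subset
    print(epoch, len(batch))
```

The training loop stays unchanged when the strategy changes; only the `select_fn` (in CORDS, the choice of data loader class) differs.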
from dotmap import DotMap
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader

# Arguments for GLISTERDataLoader. `model`, `criterion_nored`, `trainloader`,
# `valloader`, and `logger` must already be defined by your training script.
dss_args = dict(model=model,
                loss=criterion_nored,   # per-sample loss (no reduction)
                eta=0.01,
                num_classes=10,
                num_epochs=300,
                device='cuda',
                fraction=0.1,           # fraction of the training set to select
                select_every=20,        # re-select the subset every 20 epochs
                kappa=0,
                linear_layer=False,
                selection_type='SL',
                greedy='Stochastic')
dss_args = DotMap(dss_args)

# Create the GLISTER subset selection dataloader
dataloader = GLISTERDataLoader(trainloader,
                               valloader,
                               dss_args,
                               logger,
                               batch_size=20,
                               shuffle=True,
                               pin_memory=False)

for epoch in range(dss_args.num_epochs):
    for inputs, targets, weights in dataloader:
        """
        Standard PyTorch training loop using weighted loss.
        Our training loop differs from the standard PyTorch training loop in that, along with
        data samples and their associated target labels, we also have additional sample weight
        information from the subset data loader, which can be used to calculate the weighted
        loss for gradient descent. We can calculate the weighted loss by using default PyTorch
        loss functions with no reduction.
        """
In our current version, we deployed subset selection data loaders in supervised learning and semi-supervised learning settings.
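The weighted loss described in the training-loop comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CORDS training code: the helpers `weighted_loss` and `weighted_train_step` are hypothetical, the tiny linear model and random tensors stand in for a real model and a batch from a subset data loader, and normalizing the weights by their sum is one reasonable way to combine them.

```python
import torch
import torch.nn as nn


def weighted_loss(per_sample_losses, weights):
    """Weighted average of per-sample losses (weights normalized to sum to 1)."""
    return torch.dot(per_sample_losses, weights / weights.sum())


def weighted_train_step(model, optimizer, criterion_nored, inputs, targets, weights):
    """One gradient step using the per-sample weights that come with a batch."""
    optimizer.zero_grad()
    outputs = model(inputs)
    # reduction="none" keeps one loss value per sample, shape (batch_size,).
    losses = criterion_nored(outputs, targets)
    loss = weighted_loss(losses, weights)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
model = nn.Linear(4, 3)                                   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion_nored = nn.CrossEntropyLoss(reduction="none")  # loss with no reduction

inputs = torch.randn(8, 4)                # stand-in batch
targets = torch.randint(0, 3, (8,))
weights = torch.ones(8)                   # stand-in for loader-provided weights

loss_value = weighted_train_step(model, optimizer, criterion_nored,
                                 inputs, targets, weights)
print(loss_value)
```

With uniform weights this reduces to the usual mean loss, so a loop written this way also works unchanged with an unweighted strategy such as random selection.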
Using default supervised training loop,
from train_sl import TrainClassifier
from cords.utils.config_utils import load_config_data
co
