&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp&nbsp <img src="https://github.com/decile-team/cords/blob/main/docs/source/imgs/cords_logo.png" width="500"/> COResets and Data Subset selection <a href="https://github.com/decile-team/cords/blob/main/LICENSE.txt"> <img alt="GitHub" src="https://img.shields.io/github/license/decile-team/cords?color=blue"> </a> <a href="https://decile.org/"> <img alt="Decile" src="https://img.shields.io/badge/website-online-green"> </a> <a href="https://cords.readthedocs.io/en/latest/"> <img alt="Documentation" src="https://img.shields.io/badge/docs-passing-brightgreen"> </a> <a href="#"> <img alt="GitHub Stars" src="https://img.shields.io/github/stars/decile-team/cords"> </a> <a href="#"> <img alt="GitHub Forks" src="https://img.shields.io/github/forks/decile-team/cords"> </a> <h3 align="center"> Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection. </h3>

In this README

What is CORDS?
Highlights
Starting with CORDS
Applications
- Efficient Hyper-parameter Optimization (HPO)
Speedups achieved using CORDS
Tutorials
Documentation
Mailing List
Acknowledgment
Team
Resources
Publications

What is CORDS?

CORDS is a COReset and Data Selection library for making machine learning time, energy, cost, and compute efficient. CORDS is built on top of PyTorch. Today, deep learning systems are extremely compute-intensive, with long turnaround times, energy inefficiencies, high costs, and heavy resource requirements [7, 8]. CORDS is an effort to make deep learning more energy, cost, resource, and time-efficient without sacrificing accuracy. CORDS pursues the following goals:

Data Efficiency
Reducing End to End Training Time
Reducing Energy Requirement
Faster Hyper-parameter tuning
Reducing Resource (GPU) Requirement and Costs

The primary purpose of CORDS is to iteratively select suitable representative data subsets from massive datasets. To select such subsets, CORDS uses recent advances in data subset selection, particularly ideas from coresets and submodularity. CORDS implements several state-of-the-art data subset/coreset selection algorithms for efficient supervised learning (SL) and semi-supervised learning (SSL).
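
To give a feel for what submodular subset selection does, here is a minimal, self-contained sketch of greedy selection under a facility location objective, which scores a subset by how well every point in the dataset is covered by its most similar selected point. This is not the CORDS implementation: the function names, the Gaussian similarity, and the toy 2-D points are assumptions made purely for illustration.

```python
import math

def similarity(a, b):
    """Gaussian similarity between two points (an illustrative choice)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2)

def greedy_facility_location(points, budget):
    """Greedily pick `budget` indices to maximize sum_i max_{j in S} sim(i, j)."""
    n = len(points)
    sim = [[similarity(points[i], points[j]) for j in range(n)] for i in range(n)]
    selected = []
    best_cover = [0.0] * n  # best similarity of each point to the selected set
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves each point's coverage.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected

# Two tight clusters: a representative 2-point subset takes one point from each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
subset = greedy_facility_location(points, budget=2)
print(subset)
```

Once one cluster is covered, adding a second point from it yields almost no gain, so the greedy step moves to the other cluster. This diminishing-returns behavior is what makes such objectives useful for picking representative subsets.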

Some of the algorithms currently implemented in CORDS include:

For Efficient and Robust Supervised Learning:

GLISTER
GradMatch
CRAIG
SubmodularSelection (Facility Location, Feature Based Functions, Coverage, Diversity)
RandomSelection

For Efficient and Robust Semi-supervised Learning:

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:

Reproducibility of SOTA in Data Selection and Coresets: Enable easy reproducibility of the SOTA algorithms described above. We are also working to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets, including CIFAR-10, CIFAR-100, MNIST, SVHN, and ImageNet.
Ease of Use: One of the main goals of CORDS is to be easy to use and easy to extend. Feel free to contribute to CORDS!
Modular design: The data selection algorithms are directly incorporated into data loaders, allowing one to use their own training loop for varied utility scenarios.
A broad range of use cases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating several additional use cases like Auto-ML, object detection, speech recognition, semi-supervised learning, etc.
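
The modular design described above, where selection lives inside the data loader, can be pictured with a small pure-Python sketch. It is not CORDS code: the class `ToyAdaptiveSubsetLoader`, its uniform random selection, and its unit weights are hypothetical stand-ins, and `fraction` and `select_every` are assumed to mean "share of the data to keep" and "epochs between re-selections".

```python
import random

class ToyAdaptiveSubsetLoader:
    """Iterates over a periodically re-selected subset of a dataset.

    Hypothetical illustration: selection is uniform random with unit weights,
    where a real strategy would score samples using the model.
    """

    def __init__(self, dataset, fraction, select_every, batch_size, seed=0):
        self.dataset = dataset
        self.budget = max(1, int(fraction * len(dataset)))
        self.select_every = select_every
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        self.epoch = 0
        self.indices = []
        self.weights = []

    def _reselect(self):
        # Stand-in for a real selection strategy.
        self.indices = self.rng.sample(range(len(self.dataset)), self.budget)
        self.weights = [1.0] * self.budget

    def __iter__(self):
        # Refresh the subset on the first epoch and every `select_every` epochs after.
        if self.epoch % self.select_every == 0:
            self._reselect()
        self.epoch += 1
        for start in range(0, self.budget, self.batch_size):
            idx = self.indices[start:start + self.batch_size]
            inputs = [self.dataset[i][0] for i in idx]
            targets = [self.dataset[i][1] for i in idx]
            weights = self.weights[start:start + self.batch_size]
            yield inputs, targets, weights

# Toy dataset of (input, target) pairs.
dataset = [(float(i), i % 2) for i in range(100)]
loader = ToyAdaptiveSubsetLoader(dataset, fraction=0.1, select_every=2, batch_size=4)

seen_per_epoch = []
for epoch in range(4):
    seen = []
    for inputs, targets, weights in loader:
        seen.extend(inputs)
    seen_per_epoch.append(seen)
```

Because the training loop only iterates over the loader, swapping the selection strategy does not require touching the loop itself.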

Highlights

3x to 5x speedups, cost reduction, and energy reductions in the training of deep models in supervised learning
3x+ speedups, cost/energy reduction for deep model training in semi-supervised learning
3x to 30x speedups and cost/energy reduction for Hyper-parameter tuning using subset selection with SOTA schedulers (Hyperband and ASHA) and algorithms (TPE, Random)

Starting with CORDS

Pip Installation

To install the latest version of the CORDS package using PyPI:

pip install cords

From Git Repository

To install using the source:

git clone https://github.com/decile-team/cords.git
cd cords
pip install -r requirements/requirements.txt

First Steps

To better understand CORDS's functionality, we provide example Jupyter notebooks and Python code in the examples folder, which can be run easily in Google Colab. We also provide simple SL, SSL, and HPO training loops that run experiments from a provided configuration file. To run these loops, see the following code examples:

Using subset selection based data loaders

Create a subset selection based data loader at train time and use it with your own training loop.

Because each subset selection strategy is integrated directly into its own data loader, using a strategy only requires instantiating the corresponding subset selection data loader.

The example below shows how, in the supervised learning setting, subset selection reduces to creating a data loader:

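from dotmap import DotMap
from cords.utils.data.dataloader.SL.adaptive import GLISTERDataLoader

# Arguments for GLISTERDataLoader. The names model, criterion_nored (a loss
# function with no reduction), trainloader, valloader, logger, and num_epochs
# are assumed to be defined earlier in your script.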
dss_args = dict(model=model,
                loss=criterion_nored,
                eta=0.01,
                num_classes=10,
                num_epochs=300,
                device='cuda',
                fraction=0.1,
                select_every=20,
                kappa=0,
                linear_layer=False,
                selection_type='SL',
                greedy='Stochastic')
dss_args = DotMap(dss_args)

#Create GLISTER subset selection dataloader
dataloader = GLISTERDataLoader(trainloader, 
                                valloader, 
                                dss_args, 
                                logger, 
                                batch_size=20, 
                                shuffle=True,
                                pin_memory=False)

for epoch in range(num_epochs):
    for inputs, targets, weights in dataloader:
        """
        Standard PyTorch training loop using weighted loss

        Our training loop differs from the standard PyTorch training loop in that, along with
        the data samples and their associated target labels, we also receive per-sample weights
        from the subset data loader, which can be used to calculate the weighted loss for
        gradient descent. We can calculate the weighted loss by using default PyTorch
        loss functions with no reduction.
        """
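
The weighted loss described in the loop above can be sketched as follows. This is an illustrative training step under stated assumptions, not the exact CORDS training code: the tiny linear model, the random batch, the unit weights, and the helper name `weighted_step` are all hypothetical, and normalizing the weights by their sum is one reasonable choice among several.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for a model, optimizer, and one batch from a subset data loader.
model = nn.Linear(4, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion_nored = nn.CrossEntropyLoss(reduction='none')  # keeps per-sample losses

inputs = torch.randn(8, 4)
targets = torch.randint(0, 3, (8,))
weights = torch.ones(8)  # a subset data loader would supply these per sample

def weighted_step(model, optimizer, criterion_nored, inputs, targets, weights):
    """Run one gradient step on the weight-averaged per-sample loss."""
    optimizer.zero_grad()
    outputs = model(inputs)
    losses = criterion_nored(outputs, targets)          # shape: (batch_size,)
    loss = torch.dot(losses, weights / weights.sum())   # weighted average
    loss.backward()
    optimizer.step()
    return loss.item()

loss_value = weighted_step(model, optimizer, criterion_nored, inputs, targets, weights)
print(f"weighted loss: {loss_value:.4f}")
```

With uniform weights this reduces to the usual mean-reduced loss; non-uniform weights let the selected samples stand in for the parts of the full dataset they represent.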

In the current version, subset selection data loaders are available for the supervised learning and semi-supervised learning settings.

Using the default supervised training loop:

from train_sl import TrainClassifier
from cords.utils.config_utils import load_config_data

co
