Opendataval
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
<a name="readme-top" id="readme-top"></a>
<!-- PROJECT LOGO --> <div width="175" align="right"> <a href="https://github.com/opendataval/opendataval"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://github.com/opendataval/opendataval/blob/main/docs/_static/logo-dark-mode.png"> <source media="(prefers-color-scheme: light)" srcset="https://github.com/opendataval/opendataval/blob/main/docs/_static/logo-light-mode.png"> <img alt="Logo toggles light and dark mode" src="https://github.com/opendataval/opendataval/blob/main/docs/_static/logo-light-mode.png" width="300" align="right"> </picture> </a> </div>
OpenDataVal: a Unified Benchmark for Data Valuation
<!-- > A unified library for transparent data valuation benchmarks -->
Assessing the quality of individual data points is critical for improving model performance and mitigating biases. However, there has been no systematic way to benchmark different data valuation algorithms.
OpenDataVal is an open-source initiative that provides a diverse array of datasets/models (image, NLP, and tabular), data valuation algorithms, and evaluation tasks, all usable with just a few lines of code.
OpenDataVal also provides leaderboards for data evaluation tasks. We've curated and added artificial noise to some datasets. Create your own DataEvaluator to top the leaderboards.
OpenDataVal was accepted to the NeurIPS 2023 Datasets and Benchmarks track.
| Overview | |
|----------|-|
|Paper| Paper link |
|Python||
|Dependencies|[![Pytorch][PyTorch-shield]][PyTorch-url] [![scikit-learn][scikit-learn-shield]][scikit-learn-url] [![numpy][numpy-shield]][numpy-url] |
|Documentation| |
|CI/CD|[![Build][test-shield]][test-url] ![Coverage][coverage_badge] |
|Issues| [![Issues][issues-shield]][issues-url] |
|License|[![MIT License][license-shield]][license-url]|
|Releases|[![Releases][release-shield]][release-url]|
|Citation| [Cite Us][citation-url] |
:sparkles: Features
| Feature | Status | Links | Notes |
|---------|--------|-------|-------|
| Datasets | Stable | Docs | Embeddings available for image/NLP datasets |
| Models | Stable | Docs | Support available for scikit-learn models |
| Data Evaluators | Stable | Docs | |
| Experiments | Stable | Docs | |
| Examples | Stable | | |
| CLI | Experimental | opendataval --help | No support for null values |
:hourglass_flowing_sand: Installation options
It is highly recommended to use a virtual environment for opendataval. Check out conda!
- Install with pip
      pip install opendataval
- Clone the repo and install
      git clone https://github.com/opendataval/opendataval.git
      make install
  a. Install optional dependencies if you're contributing
      make install-dev
  b. If you want to pull in Kaggle datasets, I'd recommend looking up how to add a kaggle folder to the current directory. Tutorial here
:zap: Quick Start
The following snippet sets up an experiment on DataEvaluators. Feel free to change the source code as needed for a project.
import opendataval
from opendataval.experiment import ExperimentMediator
from opendataval.dataval import DataOob
from opendataval.experiment import discover_corrupted_sample, noisy_detection
exper_med = ExperimentMediator.model_factory_setup(
dataset_name='iris',
force_download=False,
train_count=50,
valid_count=50,
test_count=50,
model_name='ClassifierMLP',
train_kwargs={'epochs': 5, 'batch_size': 20},
)
list_of_data_evaluators = [DataOob()] # Define evaluators here
eval_med = exper_med.compute_data_values(list_of_data_evaluators)
# Runs the discover_corrupted_sample experiment for each DataEvaluator and plots the results
data, fig = eval_med.plot(discover_corrupted_sample)
# Runs non-plottable experiment
data = eval_med.evaluate(noisy_detection)
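For intuition about what a noise-detection experiment measures, here is a plain-Python sketch: flag the lowest-valued points as suspected noise and score that guess with an F1 score against the known noisy indices. It assumes that low data values indicate corrupted points. `detect_noisy_f1`, `flag_fraction`, and the toy values are hypothetical and are not the library's `noisy_detection` implementation.

```python
def detect_noisy_f1(data_values, noisy_indices, flag_fraction=0.25):
    """Hypothetical helper: F1 score of 'lowest values are noisy' as a detector."""
    num_flagged = max(1, round(flag_fraction * len(data_values)))
    # Rank training points from lowest to highest data value.
    ranked = sorted(range(len(data_values)), key=lambda i: data_values[i])
    flagged = set(ranked[:num_flagged])
    truth = set(noisy_indices)

    true_pos = len(flagged & truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(flagged)
    recall = true_pos / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy data values for 8 training points; points 2 and 4 were corrupted.
data_values = [0.9, 0.8, -0.3, 0.7, -0.1, 0.6, 0.85, 0.75]
noisy_indices = [2, 4]

f1 = detect_noisy_f1(data_values, noisy_indices)
print(f"F1 score: {f1:.2f}")
```

A score near 1 means the evaluator ranks the corrupted points at the bottom; a score near 0 means its values carry little information about which points are noisy.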
:computer: CLI
opendataval comes with a quick CLI tool. The tool is under development, and the template for a csv input is found at cli.csv. Note that for kwarg arguments, the input must be valid JSON.
To use it, run the following command if installed with make install:
opendataval --file cli.csv -n [job_id] -o [path/to/output/]
To run without installing the script:
python opendataval --file cli.csv -n [job_id] -o [path/to/output/]
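Because kwarg inputs must be valid JSON, one way to avoid quoting mistakes is to serialize a Python dict with `json.dumps` instead of typing the string by hand. The sketch below only illustrates that step: the `train_kwargs` values mirror the Quick Start, and it does not reproduce the actual column layout of cli.csv.

```python
import json

# Keyword arguments as they would be written in Python (values from the Quick Start).
train_kwargs = {"epochs": 5, "batch_size": 20}

# json.dumps emits double-quoted keys, which is what a JSON parser expects.
cell = json.dumps(train_kwargs)
print(cell)  # → {"epochs": 5, "batch_size": 20}

# Round-trip check: the string parses back to the same dict.
assert json.loads(cell) == train_kwargs

# A Python-style repr uses single quotes and is NOT valid JSON.
try:
    json.loads(str(train_kwargs))
    repr_is_valid_json = True
except json.JSONDecodeError:
    repr_is_valid_json = False
print(repr_is_valid_json)  # → False
```

Since the resulting string contains commas and double quotes, the cell must be quoted when it is placed in a csv file; Python's `csv.writer` does this automatically.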
<p align="right">(<a href="#readme-top">Back to top</a>)</p>
:control_knobs: API
Here are the 4 interacting parts of opendataval:
- DataFetcher, loads data and holds metadata regarding splits.
- Model, a trainable prediction model.
- DataEvaluator, measures the data values of input data points for a specified model.
- ExperimentMediator, facilitates experiments regarding data values across several DataEvaluators.
DataFetcher
The DataFetcher takes the name of a Register dataset and loads, transforms, splits, and adds noise to the data set.
from opendataval.dataloader import DataFetcher, mix_labels
DataFetcher.datasets_available() # ['dataset_name1', 'dataset_name2']
fetcher = DataFetcher(dataset_name='dataset_name1')
fetcher = fetcher.split_dataset_by_count(70, 20, 10)
fetcher = fetcher.noisify(mix_labels, noise_rate=.1)
x_train, y_train, x_valid, y_valid, x_test, y_test = fetcher.datapoints
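To make the `noise_rate` argument concrete, here is a minimal sketch of label noise in plain Python. It rests on an assumption about what this kind of noise does (reassign a fraction of the training labels to a different class) and is not the library's `mix_labels` implementation; `flip_labels` is a hypothetical helper.

```python
import random


def flip_labels(labels, noise_rate, seed=0):
    """Return a noisy copy of `labels` plus the indices that were changed.

    Hypothetical helper: a fraction `noise_rate` of the labels is reassigned
    to a different class drawn uniformly from the remaining classes.
    """
    rng = random.Random(seed)
    classes = sorted(set(labels))
    num_noisy = round(noise_rate * len(labels))
    noisy_indices = sorted(rng.sample(range(len(labels)), num_noisy))

    noisy = list(labels)
    for i in noisy_indices:
        # Pick any class except the current one so the label really changes.
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy, noisy_indices


labels = [0, 1, 2] * 10  # 30 toy labels across 3 classes
noisy_labels, noisy_indices = flip_labels(labels, noise_rate=0.1)
print(len(noisy_indices), "of", len(labels), "labels were changed")
```

Keeping the list of changed indices is what makes it possible to later check whether a DataEvaluator assigns low values to the corrupted points.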
<p align="right">(<a href="#readme-top">Back to top</a>)</p>
Model
Model is the predictive model used by DataEvaluators.
from opendataval.model import LogisticRegression
model = LogisticRegression(input_dim, output_dim)
model.fit(x, y)
model.predict(x)
>>> torch.Tensor(...)
<p align="right">(<a href="#readme-top">Back to top</a>)</p>
DataEvaluator
We have a catalog of DataEvaluators for running experiments. To train one, pass in the Model, the DataFetcher, and an evaluation metric (such as accuracy).
from opendataval.dataval.ame import AME
dataval = (
    AME(num_models=8000)
    .train(fetcher=fetcher, pred_model=model, metric=metric)
)
data_values = dataval.data_values # Cached values
data_values = dataval.evaluate_data_values() # Recomputed values
>>> np.ndarray([.888, .132, ...])
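As intuition for what a data value is, the sketch below computes leave-one-out values in plain Python: each training point's value is the drop in validation accuracy when that point is removed. This is a generic illustration, not the AME algorithm and not the library's code; the 1-nearest-neighbour model and the toy data are assumptions chosen to keep the example self-contained.

```python
def predict_1nn(x_train, y_train, x):
    """Predict the label of the closest training point (1-D features)."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]


def accuracy(x_train, y_train, x_valid, y_valid):
    """Fraction of validation points that the 1-NN model labels correctly."""
    hits = sum(predict_1nn(x_train, y_train, x) == y for x, y in zip(x_valid, y_valid))
    return hits / len(x_valid)


def leave_one_out_values(x_train, y_train, x_valid, y_valid):
    """Value of point i = accuracy with all points - accuracy without point i."""
    full = accuracy(x_train, y_train, x_valid, y_valid)
    values = []
    for i in range(len(x_train)):
        x_rest = x_train[:i] + x_train[i + 1:]
        y_rest = y_train[:i] + y_train[i + 1:]
        values.append(full - accuracy(x_rest, y_rest, x_valid, y_valid))
    return values


# Toy data: class 0 lives near 0.0, class 1 near 1.0.
# The last training point (0.9, label 0) is deliberately mislabeled.
x_train = [0.0, 0.1, 1.0, 0.9]
y_train = [0, 0, 1, 0]
x_valid = [0.05, 0.2, 0.92, 1.1]
y_valid = [0, 0, 1, 1]

values = leave_one_out_values(x_train, y_train, x_valid, y_valid)
print(values)  # → [0.0, 0.0, 0.25, -0.25]
```

The mislabeled point receives the lowest (negative) value because removing it improves validation accuracy, while the only correctly labeled class-1 point receives the highest value.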
<p align="right">(<a href="#readme-top">Back to top</a>)</p>
ExperimentMediator
ExperimentMediator helps set up a cohesive and controlled experiment. NOTE: Warnings are raised if errors occur in a specific DataEvaluator.
expermed = ExperimentMediator(fetcher, model, train_kwargs, metric_name).compute_data_values(data_evaluators)
Run experiments by passing in an experiment function: (DataEvaluator, DataFetcher, ...) -> dict[str, Any]. There are 5 found in exper_methods.py, three of which are plottable.
df = expermed.evaluate(noisy_detection)
df, figure = expermed.plot(discover_corrupted_sample)
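A custom experiment only needs to follow the experiment function signature described in this section. The sketch below defines a hypothetical experiment, `value_summary`, and exercises it with stand-in objects so that it runs without the library. It assumes only the `data_values` attribute and the `datapoints` tuple shown earlier in this README; the stand-ins are not the real DataEvaluator or DataFetcher classes.

```python
from types import SimpleNamespace
from typing import Any


def value_summary(evaluator, fetcher, k: int = 2) -> dict[str, Any]:
    """Hypothetical experiment: mean data value and the k lowest-valued points."""
    values = list(evaluator.data_values)
    x_train, y_train, *_ = fetcher.datapoints
    # Indices of the k training points with the lowest data values.
    lowest = sorted(range(len(values)), key=values.__getitem__)[:k]
    return {
        "mean_value": sum(values) / len(values),
        "lowest_indices": lowest,
        "lowest_labels": [y_train[i] for i in lowest],
    }


# Stand-ins for a trained DataEvaluator and a DataFetcher (not the real classes).
evaluator = SimpleNamespace(data_values=[0.0, 0.0, 0.25, -0.25])
fetcher = SimpleNamespace(
    datapoints=([0.0, 0.1, 1.0, 0.9], [0, 0, 1, 0], [], [], [], [])
)

result = value_summary(evaluator, fetcher)
print(result)
```

Returning a flat dictionary of named results keeps the output easy to tabulate across several DataEvaluators.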
For more examples, please refer to the Documentation
<p align="right">(<a href="#readme-top">Back to top</a>)</p>
:medal_sports: opendataval Leaderboards
For datasets that start with the prefix challenge, we provide leaderboards. Compute the data values with an ExperimentMediator and use the save_dataval function to save a csv. Upload it here! Uploading allows us to systematically compare your DataEvaluator against others in the field.
The available challenges are currently:
- challenge-iris
exper_med = ExperimentMediator.model_factory_setup(
dataset_name='challenge-...', model_name=model_name, train_kwar
