FLamby
Cross-silo Federated Learning playground in Python. Discover 7 real-world federated datasets to test your new FL strategies and try to beat the leaderboard.
Updates
- (January 2024) :warning: Please check out FLamby's recent 0.1.0 release, which introduces changes to the Fed-Camelyon16 benchmark and fixes some reproducibility issues across datasets :warning:
- (May 2024) :blue_heart: Check out this super cool tutorial in Fed-BioMed done with Zama's concrete-ML to add another layer of privacy to FL. :blue_heart:
- (June 2024) :fire: :fire: :fire: FLamby now has a proof-of-concept integration with BenchOpt, which should allow easier benchmarking of new federated strategies on FLamby's datasets. Come and fork the benchopt code! :fire: :fire: :fire:
Overview
:arrow_right: The API doc is available here :arrow_left:
FLamby is a benchmark for cross-silo Federated Learning with natural partitioning, currently focused on healthcare applications. It spans multiple data modalities and should allow easy interfacing with most Federated Learning frameworks (including Fed-BioMed, FedML, Substra...). It contains implementations of different standard federated learning strategies. A companion paper describing it was published at NeurIPS 2022 in the Datasets & Benchmarks track.
The FLamby package contains:
- Data loaders that automatically handle data preprocessing and partitions of distributed datasets.
- Evaluation functions to evaluate trained models on the different tracks as defined in the companion paper.
- Benchmark code using the utilities above to obtain the performance of baselines with different strategies.
It does not contain datasets, which have to be downloaded separately (see the section below).
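To illustrate the kind of strategy FLamby benchmarks, here is a minimal, dependency-free sketch of the aggregation step of FedAvg, one of the standard federated strategies. Plain Python lists stand in for model weights; this is an illustration only, not FLamby's implementation.

```python
# Sketch of FedAvg's aggregation step: the server averages client model
# weights, weighting each client by its number of local samples.
# Illustration only, not FLamby's implementation.

def fed_avg(client_weights, client_sizes):
    """Return the sample-weighted average of per-client weight vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two clients with 2-parameter models; client 0 holds 3 samples, client 1 holds 1.
avg = fed_avg([[1.0, 2.0], [5.0, 6.0]], [3, 1])
print(avg)  # [2.0, 3.0]
```

In a cross-silo setting such as FLamby's, each "client" corresponds to one hospital or data-owning center.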
FLamby was tested on Ubuntu and macOS environments. If you face any problem installing or executing FLamby code, please help us improve it by filing an issue on the FLamby GitHub page, describing the problem in detail.
Dataset suite
FLamby is a dataset suite rather than a dataset repository: we provide code to easily access existing datasets stored in other repositories. In particular, we do not distribute datasets in this repository, and we do not own copyrights on any of the datasets.
The use of any of the datasets included in FLamby requires accepting its corresponding license on the original website. We refer to each dataset's README for more details.
For any problem or question regarding license-related matters, please open a GitHub issue on this repository.
Installation
We recommend using Anaconda and pip. You can install Anaconda by downloading and executing the appropriate installer from the Anaconda website; pip usually comes bundled with Python, otherwise check the pip installation instructions. We support all Python versions starting from 3.8.
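As a quick sanity check before installing, you can verify that the active interpreter meets that minimum version (a suggested one-liner, not part of FLamby's install scripts):

```python
# Fails with an AssertionError if the interpreter is older than Python 3.8.
import sys

assert sys.version_info >= (3, 8), sys.version
print("Python OK")
```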
You may also want make to simplify installation. The following command installs all packages used by all datasets within FLamby. If you already know you will only need a fraction of the datasets in the suite, you can do a partial installation and update it along the way using the options described below.
Create and launch the environment using:
git clone https://github.com/owkin/FLamby.git
cd FLamby
make install
conda activate flamby
To limit the number of installed packages, you can use the enable argument to specify which dataset(s)
you want to build the required dependencies for, and whether you need to execute the tests (tests) or build the documentation (docs):
git clone https://github.com/owkin/FLamby.git
cd FLamby
make enable=option_name install
conda activate flamby
where option_name can be one of the following:
cam16, heart, isic2019, ixi, kits19, lidc, tcga, docs, tests
If you want to use more than one option, separate them with commas
(WARNING: there should be no space after the comma), e.g.:
git clone https://github.com/owkin/FLamby.git
cd FLamby
make enable=cam16,kits19,tests install
conda activate flamby
Be careful: each command tries to create a conda environment named flamby, so make install will fail if executed
multiple times because the flamby environment will already exist. Use make update, as explained in the next section, if you decide to
use more datasets than originally intended.
Update environment
Use the following command if new dependencies have been added and you want to update the environment for additional datasets:
make update
or you can use the enable option:
make enable=cam16 update
In case you don't have the make command (e.g. Windows users)
You can install the environment by running:
git clone https://github.com/owkin/FLamby.git
cd FLamby
conda env create -f environment.yml
conda activate flamby
pip install -e .[all_extra]
or, if you wish to install the environment for only some datasets, the tests, or the documentation:
git clone https://github.com/owkin/FLamby.git
cd FLamby
conda env create -f environment.yml
conda activate flamby
pip install -e .[option_name]
where option_name can be one of the following:
cam16, heart, isic2019, ixi, kits19, lidc, tcga, docs, tests. If you want to use more than one option,
separate them with commas (no space after the comma), e.g.:
pip install -e .[cam16,ixi]
Accepting data licensing
Read and accept the different licenses, then download the data for each dataset you are interested in by following the instructions provided in its folder.
Quickstart
Follow the quickstart section to learn how to get started with FLamby.
Reproduce benchmark and figures from the companion article
Benchmarks
The results are stored in flamby/results, in one subfolder results_benchmark_fed_dataset per dataset.
These results can be plotted using:
python plot_results.py
which produces the plot at the end of the main article.
To re-run each benchmark on your machine, first download the dataset you are interested in, then run the following commands, replacing config_dataset.json with one of the listed config files (config_camelyon16.json, config_heart_disease.json, config_isic2019.json, config_ixi.json, config_kits19.json, config_lidc_idri.json, config_tcga_brca.json):
cd flamby/benchmarks
python fed_benchmark.py --seed 42 -cfp ../config_dataset.json
python fed_benchmark.py --seed 43 -cfp ../config_dataset.json
python fed_benchmark.py --seed 44 -cfp ../config_dataset.json
python fed_benchmark.py --seed 45 -cfp ../config_dataset.json
python fed_benchmark.py --seed 46 -cfp ../config_dataset.json
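Assuming a POSIX shell, the five runs above can be generated with a loop. This sketch only echoes the commands so you can review them; pipe its output to sh from the flamby/benchmarks directory to actually execute them:

```shell
# Echo the five benchmark invocations (seeds 42 through 46).
# Pipe to sh from flamby/benchmarks to run them for real.
for seed in 42 43 44 45 46; do
  echo "python fed_benchmark.py --seed $seed -cfp ../config_dataset.json"
done
```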
We have observed that results vary from machine to machine and are sensitive to GPU randomness. However, you should be able to reproduce the results up to some variance, and results on the same machine should be perfectly reproducible. Please open an issue if that is not the case.
The script extract_config.py makes it possible to generate a config file from a results file.
See the quickstart section to change parameters.
Containerized execution
A good step towards float-perfect reproducibility in your future benchmarks is to use Docker. We provide a base Docker image and examples covering dataset download and benchmarking.
For Fed-Heart-Disease, cd into FLamby's dockers folder, replace myusername and mypassword with your Git credentials (OAuth token) in the command below, and run:
docker build -t flamby-heart -f Dockerfile.base --build-arg DATASET_PREFIX="heart" --build-arg GIT_USER="myusername" --build-arg GIT_PWD="mypassword" .
docker build -t flamby-heart-benchmark -f Dockerfile.heart .
docker run -it flamby-heart-benchmark
If you are convinced you will use many datasets with Docker, build the base image using the all_extra option for FLamby's install; you will then be able to reuse it for all datasets with multi-stage builds:
docker build -t flamby-all -f Dockerfile.base --build-arg DATASET_PREFIX="all_extra" --build-arg GIT_USER="myusername" --build-arg GIT_PWD="mypassword" .
# Modify line 1 of Dockerfile.* to FROM flamby-all, replacing * with the name
# of the dataset you are interested in, then run the following, replacing * similarly:
# docker build -t flamby-* -f Dockerfile.* .
# docker run -it flamby-*-benchmark
Check out Dockerfile.tcga.
Similar Dockerfiles can easily be built for the other datasets by
replicating the instructions found in each dataset folder, following the model of Dockerfile.heart.
Note that for bigger datasets execution can be prohibitively slow and Docker can run out of time/memory.
Using FLamby with FL-frameworks
FLamby can easily be adapted to different frameworks.
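Since each FLamby dataset is naturally split per center, an external FL framework can typically treat each center as one client. Here is a dependency-free sketch of that mapping; all names below (Client, make_clients) are hypothetical illustrations, not FLamby's or any framework's API:

```python
# Hypothetical sketch: mapping per-center datasets to framework "clients".
# Names are illustrative; consult FLamby's docs for the real dataset API.

class Client:
    """A minimal client wrapper an FL framework might expect."""

    def __init__(self, center_id, samples):
        self.center_id = center_id
        self.samples = samples

    def num_samples(self):
        return len(self.samples)


def make_clients(per_center_samples):
    """Wrap each center's local data in a Client object."""
    return [Client(cid, s) for cid, s in enumerate(per_center_samples)]


# Three centers with toy data of different sizes.
clients = make_clients([[0, 1, 2], [3, 4], [5]])
print([c.num_samples() for c in clients])  # [3, 2, 1]
```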
