Computational Biology Datasets Suitable For Machine Learning
This is a curated list of computational biology datasets that have been pre-processed for machine learning. The list is a work in progress; please submit a pull request for any dataset you would like to advertise!
Genotyping
| Name | Description | Comments |
|:-:|---|---|
| The Cancer Genome Atlas | Variety of Cancer Data | most cancer types have 100-1000 samples |
| NIH GDC | Cancer, many types of genomic data | |
| UK Biobank | | |
| European Genome-Phenome Archive | | |
| METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
| HapMap | | |
| 23andMe | 2280 Public Domain Curated Genotypes | |
| Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data. |
| Arabidopsis | SNPs, 100+ phenotypes | |
Promoter-Enhancer Pairs
| Name | Description | Comments |
|:-:|---|---|
| TargetFinder | ~100,000 DNA-DNA interaction pairs | |
Gene/Protein Expression
| Name | Description | Comments |
|:-:|---|---|
| GEO | Main place for NCBI data | |
| ENCODE | Variety of assays to identify functional elements | |
| ArrayExpress | DNA sequencing, gene/protein expression, epigenetics | |
| Cytometry Continuous | Flow cytometry data of 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
| Transcription factor binding | ChIP-Seq data on 12 TFs | |
| GTEx | Landmark study for eQTL analysis | |
| PharmacoGenomics DB | | |
| ProteomeXchange | | |
| BeatAML | Whole-exome sequencing, RNA sequencing, and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |
Single-cell Data
| Name | Description | Comments |
|:-:|---|---|
| Single-cell expression atlas | | |
| scPerturb | single-cell perturbation-response datasets | harmonized and preprocessed across 44 original datasets |
Regulatory Networks
| Name | Description | Comments |
|:-:|---|---|
| TRRUST | Manually curated database of the human transcriptional regulatory network | |
| Yeast Network | 23 million yeast 2-hybrid experiments to investigate genetic interactions | |
| Perturb-Seq | Integrated model of perturbations, single-cell phenotypes, and epistatic interactions | |
| KEGG Metabolic Regulatory Network (Undirected) | 65,554 instances, 29 attributes each | |
| KEGG Metabolic Regulatory Network (Directed) | 53,414 instances, 24 attributes each | |
Images
| Name | Description | Comments |
|:-:|---|---|
| The Cancer Imaging Archive | Extracts the images from the TCGA data | |
| Multiple Myeloma DREAM Challenge | Challenge to identify multiple myeloma patients | |
| Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant | |
| DDSM | Mammogram database | |
| Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study "Soft Tissue Sarcoma" | Segmentation task |
| Kaggle Cervical Cancer Screening | Classify cervix type from images | |
| CAMELYON17 | Pathology challenge: automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
| Grand Challenges | Datasets from biomedical image analysis competitions | |
| Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data, plus MRI images | |
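
As a quick way to try one of the entries above, the sketch below loads the copy of the Breast Cancer Wisconsin (Diagnostic) data that ships with scikit-learn and fits a simple baseline for the benign-vs-malignant task. Using scikit-learn's bundled copy, the logistic-regression baseline, and the 75/25 split are assumptions made for illustration; none of them comes from this list or from the dataset's maintainers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn bundles a copy of the Breast Cancer Wisconsin (Diagnostic) data.
# X holds numeric features computed from cell-nucleus images;
# y uses scikit-learn's encoding: 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a stratified test set (split ratio and seed are illustrative choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline only: standardize the features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"samples={X.shape[0]}, features={X.shape[1]}, held-out accuracy={accuracy:.3f}")
```

A single held-out split like this gives only a rough estimate; cross-validation would be a more careful way to compare models on a dataset of this size.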
fMRI
| Name | Description | Comments |
|:-:|---|---|
| ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
| Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure) | |
Electronic Medical Records
| Name | Description | Comments |
|:-:|---|---|
| MIMIC | 59,000 EHRs | |
| UCI Diabetes | Data from 130 US hospitals, 1999-2008 | |
| i2b2 | Clinical notes only, designed for NLP tasks | |
| PhysioNet | | |
| Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
| eICU | 200k EHRs | |
| All of Us | >250k EHRs, some genomic data | |
| PMC-Patients | 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations | |
Radiographs
| Name | Description | Comments |
|:-:|---|---|
| CheXpert | 200k chest radiographs | Has an associated competition and leaderboard |
| MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet |
| PadChest | 160k chest x-rays, 174 different findings | |
Protein-Protein Interactions
| Name | Description | Comments |
|:-:|---|---|
| HINT (High-quality INTeractomes) | curated compilation of high-quality protein-protein interactions from 8 interactome resources | |
Longitudinal Studies
| Name | Description | Comments |
|:-:|---|---|
| National Population Health Survey | Longitudinal survey that collects health information every two years | |
Protein Structure
| Name | Description | Comments |
|:-:|---|---|
| ProteinNet | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. | |
Natural Language Data
| Name | Description | Comments |
|:-:|---|---|
| BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts | Tasks: multi-label classification (MLC), question answering (QA) |
| Cases | Articles from medical case studies | |
| UPMC Pathology | UPMC Pathology case studies | |
Therapeutics
| Name | Description | Comments |
|:-:|---|---|
| Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing | Available as Python modules |
| Cancer Omics Drug Experiment Response Dataset | Molecular datasets paired with corresponding drug sensitivity data | Seeks to harmonize cancer drug response datasets into a standard schema |