CompBioDatasetsForMachineLearning

A Curated List of Computational Biology Datasets Suitable for Machine Learning

Computational Biology Datasets Suitable For Machine Learning

This is a curated list of computational biology datasets that have been pre-processed for machine learning. This list is a work in progress; please submit a pull request for any dataset you would like to advertise!

Genotyping

| Name | Description | Comments |
|:-:|---|---|
| The Cancer Genome Atlas | Variety of cancer data | Most cancer types have 100-1000 samples |
| NIH GDC | Cancer, many types of genomic data | |
| UK Biobank | | |
| European Genome-Phenome Archive | | |
| METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
| HapMap | | |
| 23andMe | 2280 public domain curated genotypes | |
| Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data. |
| Arabidopsis | SNPs, 100+ phenotypes | |

Promoter-Enhancer Pairs

| Name | Description | Comments |
|:-:|---|---|
| TargetFinder | ~100,000 DNA-DNA interaction pairs | |

Gene/Protein Expression

| Name | Description | Comments |
|:-:|---|---|
| GEO | Main place for NCBI data | |
| ENCODE | Variety of assays to identify functional elements | |
| ArrayExpress | DNA sequencing, gene/protein expression, epigenetics | |
| Cytometry Continuous | Flow cytometry data of 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
| Transcription factor binding | ChIP-Seq data on 12 TFs | |
| GTEx | Landmark study for eQTL analysis | |
| PharmacoGenomics DB | | |
| ProteomeXChange | | |
| BeatAML | Whole-exome sequencing, RNA sequencing, and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |

Single-cell Data

| Name | Description | Comments |
|:-:|---|---|
| Single-cell expression atlas | | |
| scPerturb | Single-cell perturbation-response datasets | Harmonized and preprocessed across 44 original datasets |

Regulatory Networks

| Name | Description | Comments |
|:-:|---|---|
| TRRUST | Manually curated database of human transcriptional regulatory networks | |
| Yeast Network | 23 million yeast 2-hybrid experiments to investigate genetic interactions | |
| Perturb-Seq | Integrated model of perturbations, single-cell phenotypes, and epistatic interactions | |
| KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each | |
| KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each | |

Images

| Name | Description | Comments |
|:-:|---|---|
| The Cancer Imaging Archive | Extracts the images from the TCGA data | |
| Multiple Myeloma DREAM Challenge | Challenge to identify multiple myeloma patients | |
| Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant | |
| DDSM | Mammogram database | |
| Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study "Soft Tissue Sarcoma" | Segmentation task |
| Kaggle Cervical Cancer Screening | Classify cervix type from images | |
| CAMELYON17 | Pathology challenge: automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
| Grand Challenges | Datasets from biomedical image analysis competitions | |
| Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data, plus MRI images | |

fMRI

| Name | Description | Comments |
|:-:|---|---|
| ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
| Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure) | |

Electronic Medical Records

| Name | Description | Comments |
|:-:|---|---|
| MIMIC | 59,000 EHRs | |
| UCI Diabetes | Data from 130 US hospitals, 1999-2008 | |
| i2b2 | Clinical notes only, designed for NLP tasks | |
| PhysioNet | | |
| Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
| eICU | 200k EHRs | |
| All of Us | >250k EHRs, some genomic data | |
| PMC-Patients | 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations | |

Radiographs

| Name | Description | Comments |
|:-:|---|---|
| CheXpert | 200k chest radiographs | Competition and leaderboard associated |
| MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet |
| PadChest | 160k chest x-rays, 174 different findings | |

Protein-Protein Interactions

| Name | Description | Comments |
|:-:|---|---|
| HINT (High-quality INTeractomes) | Curated compilation of high-quality protein-protein interactions from 8 interactome resources | |

Longitudinal Studies

| Name | Description | Comments |
|:-:|---|---|
| National Population Health Survey | Longitudinal survey that collects health information every two years. | |

Protein Structure

| Name | Description | Comments |
|:-:|---|---|
| ProteinNet | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. | |

Natural Language Data

| Name | Description | Comments |
|:-:|---|---|
| BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: MLC, QA. |
| Cases | Articles from medical case studies. | |
| UPMC Pathology | UPMC Pathology case studies. | |

Therapeutics

| Name | Description | Comments |
|:-:|---|---|
| Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. |
| Cancer Omics Drug Experiment Response Dataset | Molecular datasets paired with corresponding drug sensitivity data | Seeks to bring cancer drug response datasets into a standard schema |
