Genoml2
GenoML (genoml2) is an open-source Python package. It is an automated machine learning (autoML) platform for genomics data.
GenoML
<p align="center"> <img width="300" height="300" src="https://github.com/GenoML/genoml2/blob/master/logo.png"> </p>
Updated 17 June 2025: Latest Release on pip! v1.5.4
How to Get Started with GenoML
Introduction
GenoML (Genomics + Machine Learning) is an automated machine learning (autoML) platform for genomics data. For best results, use Linux or macOS with Python 3.9-3.12. This repository and pip package are under active development!
This README is a brief look into how to structure arguments and what arguments are available at each phase for the GenoML CLI.
If you are using GenoML for your own work, please cite the following papers:
- Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Saffo, D., Sargent, L., ... & Faghri, F. (2021). GenoML: Automated Machine Learning for Genomics. arXiv preprint arXiv:2103.03221
- Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Sargent, L., Dadu, A., ... & Nalls, M. A. (2021). Multi-Modality Machine Learning Predicting Parkinson’s Disease. bioRxiv.
Installing + Downloading Example Data
- Install this repository directly from GitHub (from source; master branch)
git clone https://github.com/GenoML/genoml2.git
- Install using pip or upgrade using pip
pip install genoml2
OR
pip install genoml2 --upgrade
- To install the examples/ directory (~315 KB), you can use SVN (pre-installed on most Macs)
svn export https://github.com/GenoML/genoml2.git/trunk/examples
Note: When you pip install this package, the examples/ folder is also downloaded! However, if you still want to download the directory separately and SVN is not pre-installed, you can install SVN via Homebrew (if you have Homebrew installed) using
brew install svn
CHANGELOG
- 16-JUN-2025: Addition of multiclass prediction functionality using the same base models that are used for the discrete module. We have additionally restructured the munging functionality to allow users to process training and testing data all at once to ensure they are munged under the same conditions, as well as to include multiple GWAS summary stats files for SNP filtering. We also upgraded from plink1.9 to plink2 for genomic data processing. Finally, we have added a log file in the output directory to facilitate full reproducibility of results. README updated to reflect these changes.
- 8-OCT-2024: Big changes to output file structure, so now output files go in subdirectories named for each step, and prefixes are not required. README updated to reflect these changes.
Table of Contents
0. (OPTIONAL) How to Set Up a Virtual Environment via Conda
1. Munging with GenoML
1b. Harmonizing with GenoML
2. Training with GenoML
3. Tuning with GenoML
4. Testing/Validating with GenoML
5. Full pipeline example
6. Experimental Features
<a id="0"></a>
0. [OPTIONAL] How to Set Up a Virtual Environment via Conda
You can create a virtual environment to run GenoML, if you prefer. If you already have the Anaconda Distribution, this is fairly simple.
To create and activate a virtual environment:
# To create a virtual environment
conda create -n GenoML python=3.12
# To activate a virtual environment
conda activate GenoML
# To install requirements via pip
pip install -r requirements.txt
# If issues installing xgboost from requirements - (3 options)
# Option 1: use Homebrew to
# xcode-select --install
# brew install gcc@7
# or Option 2: conda install -c conda-forge xgboost
# or Option 3: pip install xgboost==2.0.3
# If issues installing umap
# pip install umap-learn
# If issues installing pytables/dependency issue
# conda install -c conda-forge pytables
# If issues with blosc
# conda install -c conda-forge tables blosc
## MISC
# To deactivate the virtual environment
# conda deactivate
# To delete your virtual environment
# conda env remove -n GenoML
To install GenoML on the user's path in a virtual environment, you can do the following:
# Install the package at this path
pip install .
# MISC
# To save out the environment requirements to a .txt file
# pip freeze > requirements.txt
# Removing a conda virtualenv
# conda remove --name GenoML --all
Note: The following examples are for discrete data, but if you substitute continuous or multiclass for discrete in the following commands, you can munge, harmonize, train, tune, and test your continuous/multiclass data!
<a id="1"></a>
1. Munging with GenoML
Munging with GenoML will, at minimum, do the following:
- Prune your genotypes using PLINK v2 (if the --geno flag is used)
- Impute per column using median or mean (can be changed with the --impute flag)
- Z-scale features and remove columns with a std dev = 0
Required arguments for GenoML munging are --prefix and --pheno.
- data: Are the data continuous, discrete, or multiclass?
- method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
- mode: Would you like to munge, harmonize, train, tune, or test your model? Here, you will use munge.
- --prefix: Where would you like your outputs to be saved?
- --pheno: Where is your phenotype file? This file only has 2 columns, ID in one, and PHENO in the other (0 for controls and 1 for cases when using the discrete module, 0, ..., n-1 when using the multiclass module for n distinct phenotypes, or numeric values when using the continuous module).
Be sure to have your files formatted the same as the examples, key points being:
- Your phenotype file consisting only of the "ID" and "PHENO" columns
- Your sample IDs matching across all files
- Your sample IDs not consisting only of integers (if they do, add a prefix or suffix to all sample IDs so that they are alphanumeric before running GenoML)
- Please avoid the use of characters like commas, semi-colons, etc. in the column headers (it is Python after all!)
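The formatting points above can be checked before munging. The following is a minimal sketch, not part of GenoML: the helper name check_pheno_file is hypothetical, and it assumes a comma-separated phenotype file with exactly the ID and PHENO columns described above.

```python
import csv
import io


def check_pheno_file(handle, data_type="discrete"):
    """Return a list of problems found in a two-column ID/PHENO CSV.

    Hypothetical helper for illustration; GenoML performs its own checks.
    """
    problems = []
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["ID", "PHENO"]:
        problems.append(
            f"expected columns ['ID', 'PHENO'], found {reader.fieldnames}"
        )
        return problems
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample_id = row["ID"] or ""
        pheno = row["PHENO"] or ""
        if sample_id.isdigit():
            problems.append(f"line {line_no}: ID '{sample_id}' is integer-only")
        if data_type == "discrete" and pheno not in {"0", "1"}:
            problems.append(f"line {line_no}: PHENO '{pheno}' is not 0 or 1")
    return problems


# Small in-memory example; replace with open("your_pheno.csv", newline="").
example = io.StringIO("ID,PHENO\nsample1,0\n12345,1\nsample3,2\n")
for problem in check_pheno_file(example):
    print(problem)
```

Running this on the in-memory example reports the integer-only ID and the out-of-range phenotype value. It does not check that sample IDs match across files.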
If you would like to munge just with genotypes (in PLINK binary format), the simplest command is the following:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv
If you would like a more detailed log printed to your console, you may use the --verbose flag as follows:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file with a detailed log printed to the console
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--verbose
Note: The --verbose flag may be used like this for any GenoML command, not just munging.
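If you drive GenoML from a script rather than a shell, the same command can be assembled as an argument list. This is a sketch under assumptions: build_munge_command is a hypothetical helper, and it only uses the positional arguments and flags shown above (--prefix, --pheno, --geno, --verbose).

```python
import shlex


def build_munge_command(data_type, prefix, pheno, geno=None, verbose=False):
    """Assemble the argument list for a supervised munge run.

    Hypothetical helper; it only mirrors the CLI calls shown in this README.
    """
    if data_type not in {"discrete", "continuous", "multiclass"}:
        raise ValueError(f"unknown data type: {data_type}")
    command = [
        "genoml", data_type, "supervised", "munge",
        "--prefix", prefix,
        "--pheno", pheno,
    ]
    if geno is not None:
        command += ["--geno", geno]
    if verbose:
        command.append("--verbose")
    return command


command = build_munge_command(
    "discrete",
    prefix="outputs",
    pheno="examples/discrete/training_pheno.csv",
    geno="examples/discrete/training",
    verbose=True,
)
print(shlex.join(command))
# To actually run it (requires genoml2 to be installed):
# import subprocess; subprocess.run(command, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues with paths that contain spaces.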
To properly evaluate your model, it must be applied to a dataset it's never seen before (testing data). If you have both training and testing data, GenoML allows you to munge them together upfront. To do this with your training and testing phenotype/genotype data, the simplest command is the following:
# Running GenoML munging on discrete data using PLINK genotype binary files and phenotype files for both the training and testing datasets.
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--geno_test examples/discrete/validation \
--pheno_test examples/discrete/validation_pheno.csv
If you would like to control the pruning stringency in genotypes:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file with a custom --r2_cutoff for pruning
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--r2_cutoff 0.3 \
--pheno examples/discrete/training_pheno.csv
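To build intuition for what a cutoff like 0.3 means, here is a simplified sketch of correlation-based pruning. It is not what PLINK does internally: PLINK works in sliding windows along the genome, whereas this toy version compares every variant against all kept variants. It also assumes the cutoff is an r^2 threshold, as the flag name suggests. The function names and genotype values are made up for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant has no defined correlation
    return cov * cov / (var_x * var_y)


def greedy_prune(genotypes, r2_cutoff):
    """Keep a variant only if its r^2 with every kept variant is <= cutoff."""
    kept = []
    for name, dosages in genotypes.items():
        if all(r_squared(dosages, genotypes[k]) <= r2_cutoff for k in kept):
            kept.append(name)
    return kept


# Toy dosages (0, 1, or 2 copies of an allele) for six samples.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 1],  # nearly identical to snp1
    "snp3": [2, 0, 1, 1, 0, 2],  # uncorrelated with snp1
}
print(greedy_prune(genotypes, r2_cutoff=0.3))
```

Under this reading, a lower cutoff prunes more aggressively, because more variant pairs exceed it.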
You can choose to skip pruning your SNPs at this stage by including the --skip_prune flag:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while skipping SNP pruning
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--skip_prune \
--pheno examples/discrete/training_pheno.csv
You can choose to impute on mean or median by modifying the --impute flag, like so (default is median):
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file and specifying impute
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--impute mean
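To make the imputation choice and the z-scaling step concrete, here is a minimal sketch of per-column imputation followed by z-scaling, with zero-variance columns dropped. It illustrates the idea only and is not GenoML's implementation: impute_and_scale is a hypothetical name, missing values are represented as None, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, median, pstdev


def impute_and_scale(columns, impute="median"):
    """Fill missing values per column, drop constant columns, z-scale the rest."""
    if impute not in {"median", "mean"}:
        raise ValueError("impute must be 'median' or 'mean'")
    fill = median if impute == "median" else mean
    result = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        fill_value = fill(observed)
        filled = [fill_value if v is None else v for v in values]
        std = pstdev(filled)  # population std dev (an assumption)
        if std == 0:
            continue  # a constant column cannot be z-scaled
        center = mean(filled)
        result[name] = [(v - center) / std for v in filled]
    return result


columns = {
    "age": [50, 60, None, 70],   # one missing value
    "constant": [1, 1, 1, 1],    # std dev = 0, so it is dropped
}
scaled = impute_and_scale(columns, impute="mean")
print(sorted(scaled))
print(scaled["age"])
```

In this example the missing age is filled with the column mean (60), so it lands at exactly 0 after scaling, and the constant column is removed.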
If you suspect collinear variables, and think this will be a problem for training the model moving forward, you can use variance inflation factor (VIF) filtering:
# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while using VIF to remove multicollinearity
genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--vif 5 \
--vif_iter 1
- The --vif flag specifies the VIF threshold you would like to use (5 is recommended)
- The number of iterations you'd like to run can be modified with the --vif_iter flag (if you have or anticipate many collinear variables, it's a good idea to increase the iterations)
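For readers unfamiliar with VIF, the sketch below shows one way threshold-based filtering can work. It relies on the standard identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix. How GenoML applies the threshold within each iteration is not described here, so dropping the single worst feature per iteration is an assumption, and all function names and data values are hypothetical. Columns are assumed to have non-zero variance, which munging already enforces.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix for a dict of equal-length numeric columns."""
    unit = []
    for values in columns.values():
        m = sum(values) / len(values)
        dev = [v - m for v in values]
        norm = sum(d * d for d in dev) ** 0.5  # assumes non-zero variance
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfectly collinear features")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_scores(columns):
    """VIF per feature: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return {name: inverse[i][i] for i, name in enumerate(columns)}


def vif_filter(columns, threshold=5.0, iterations=1):
    """Per iteration, drop the feature with the highest VIF above threshold."""
    columns = dict(columns)
    for _ in range(iterations):
        if len(columns) < 2:
            break
        scores = vif_scores(columns)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del columns[worst]
    return list(columns)


columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],  # roughly 2 * x1
    "x3": [2, 9, 4, 7, 1, 6],                # unrelated to x1
}
print(vif_scores(columns))
print(vif_filter(columns, threshold=5, iterations=1))
```

Here x1 and x2 have very large VIFs because each is almost a multiple of the other, so one of them is dropped in the first iteration while x3 is kept.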
Well,
