GenoML

Updated 17 June 2025: Latest Release on pip! v1.5.4

How to Get Started with GenoML

Introduction

GenoML (Genomics + Machine Learning) is an automated machine learning (AutoML) package for genomics data. For best results, use Linux or macOS with Python 3.9-3.12. This repository and pip package are under active development!

This README is a brief look into how to structure arguments and what arguments are available at each phase for the GenoML CLI.

If you are using GenoML for your own work, please cite the following papers:

Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Saffo, D., Sargent, L., ... & Faghri, F. (2021). GenoML: Automated Machine Learning for Genomics. arXiv preprint arXiv:2103.03221
Makarious, M. B., Leonard, H. L., Vitale, D., Iwaki, H., Sargent, L., Dadu, A., ... & Nalls, M. A. (2021). Multi-Modality Machine Learning Predicting Parkinson’s Disease. bioRxiv.

Installing + Downloading Example Data

Install this repository directly from GitHub (from source; master branch)

git clone https://github.com/GenoML/genoml2.git

Install using pip or upgrade using pip

pip install genoml2

pip install genoml2 --upgrade

To install the examples/ directory (~315 KB), you can use SVN (pre-installed on most Macs)

svn export https://github.com/GenoML/genoml2.git/trunk/examples

Note: When you pip install this package, the examples/ folder is also downloaded! If you still want to download the directory separately and SVN is not pre-installed, you can install it with Homebrew (if you have Homebrew) using brew install svn

CHANGELOG

16-JUN-2025: Addition of multiclass prediction functionality using the same base models that are used for the discrete module. We have additionally restructured the munging functionality to allow users to process training and testing data all at once to ensure they are munged under the same conditions, as well as to include multiple GWAS summary stats files for SNP filtering. We also upgraded from plink1.9 to plink2 for genomic data processing. Finally, we have added a log file in the output directory to facilitate full reproducibility of results. README updated to reflect these changes.
8-OCT-2024: Big changes to output file structure, so now output files go in subdirectories named for each step, and prefixes are not required. README updated to reflect these changes.

Table of Contents

0. (OPTIONAL) How to Set Up a Virtual Environment via Conda

1. Munging with GenoML

1b. Harmonizing with GenoML

2. Training with GenoML

3. Tuning with GenoML

4. Testing/Validating with GenoML

5. Full pipeline example

6. Experimental Features

0. [OPTIONAL] How to Set Up a Virtual Environment via Conda

You can create a virtual environment to run GenoML, if you prefer. If you already have the Anaconda Distribution, this is fairly simple.

To create and activate a virtual environment:

# To create a virtual environment
conda create -n GenoML python=3.12

# To activate a virtual environment
conda activate GenoML

# To install requirements via pip 
pip install -r requirements.txt
    # If issues installing xgboost from requirements - (3 options)
        # Option 1: use Homebrew to install gcc
            # xcode-select --install
            # brew install gcc@7
        # or Option 2: conda install -c conda-forge xgboost 
        # or Option 3: pip install xgboost==2.0.3
    # If issues installing umap 
        # pip install umap-learn
    # If issues installing pytables/dependency issue 
        # conda install -c conda-forge pytables
    # If issues with blosc 
        # conda install -c conda-forge tables blosc

## MISC
# To deactivate the virtual environment
# conda deactivate

# To delete your virtual environment
# conda env remove -n GenoML

To install GenoML from a local copy of the repository into your virtual environment, you can do the following:

# Install the package at this path
pip install .

# MISC
# To save out the environment requirements to a .txt file
# pip freeze > requirements.txt

# To remove a conda virtual environment
# conda remove --name GenoML --all

Note: The following examples are for discrete data, but if you substitute continuous or multiclass for discrete in the commands below, you can munge, harmonize, train, tune, and test your continuous/multiclass data!

1. Munging with GenoML

Munging with GenoML will, at minimum, do the following:

Prune your genotypes using PLINK v2 (if --geno flag is used)
Impute per column using median or mean (can be changed with the --impute flag)
Z-scaling of features and removing columns with a std dev = 0
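
The imputation and scaling steps above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not GenoML's actual implementation: the function name munge_columns, the dict-of-lists layout, and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, median, pstdev


def munge_columns(columns, impute="median"):
    """Impute missing values per column, drop constant columns, then z-scale.

    `columns` maps a feature name to a list of floats, with None marking a
    missing value. This layout is hypothetical and chosen for illustration.
    """
    fill = median if impute == "median" else mean
    scaled = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # nothing to impute from, so the column is dropped
        fill_value = fill(observed)
        imputed = [fill_value if v is None else v for v in values]
        sd = pstdev(imputed)  # population std dev (an assumption)
        if sd == 0:
            continue  # constant column carries no information
        mu = mean(imputed)
        scaled[name] = [(v - mu) / sd for v in imputed]
    return scaled


features = {
    "snp_a": [0.0, 1.0, 2.0, None],
    "snp_b": [1.0, 1.0, 1.0, 1.0],  # std dev = 0, so it is removed
    "age": [50.0, 60.0, None, 70.0],
}
result = munge_columns(features, impute="median")
print(sorted(result))
```

After this step every surviving column has mean 0 and standard deviation 1, which keeps features on a comparable scale for the models trained later.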

Required arguments for GenoML munging are --prefix and --pheno

data: Are the data continuous, discrete, or multiclass?
method: Do you want to use supervised or unsupervised machine learning? (unsupervised currently under development)
mode: Would you like to munge, harmonize, train, tune, or test your model? Here, you will use munge.
--prefix: Where would you like your outputs to be saved?
--pheno: Where is your phenotype file? This file has only 2 columns, ID and PHENO. PHENO is 0 for controls and 1 for cases when using the discrete module; 0, ..., n-1 when using the multiclass module for n distinct phenotypes; or numeric values when using the continuous module.
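
If you drive GenoML from a script, the argument order described above (data, method, mode, then flags) can be assembled programmatically. The sketch below is only an illustration: build_munge_command is a hypothetical helper, not part of GenoML, and it uses only the positional arguments and flags named in this README.

```python
import shlex

DATA_TYPES = ("discrete", "continuous", "multiclass")


def build_munge_command(data, prefix, pheno, geno=None):
    """Return the argv list for a supervised `genoml <data> supervised munge` call.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    if data not in DATA_TYPES:
        raise ValueError(f"data must be one of {DATA_TYPES}, got {data!r}")
    argv = ["genoml", data, "supervised", "munge",
            "--prefix", prefix, "--pheno", pheno]
    if geno is not None:
        argv += ["--geno", geno]
    return argv


cmd = build_munge_command(
    "discrete",
    prefix="outputs",
    pheno="examples/discrete/training_pheno.csv",
    geno="examples/discrete/training",
)
print(shlex.join(cmd))

# To actually run the command (requires GenoML to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.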

Be sure to have your files formatted the same as the examples, key points being:

Your phenotype file consisting only of the "ID" and "PHENO" columns
Your sample IDs matching across all files
Your sample IDs not consisting only of integers (if they do, add a prefix or suffix to all sample IDs so they are alphanumeric before running GenoML)
Please avoid the use of characters like commas, semi-colons, etc. in the column headers (it is Python after all!)
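
A quick pre-flight check of the phenotype file can catch these formatting problems before munging. The following is a minimal sketch using only the Python standard library; check_pheno is a hypothetical helper written for this README, and it tests only the points listed above, not everything GenoML itself may validate.

```python
import csv
import io


def check_pheno(handle):
    """Return a list of formatting problems in a phenotype CSV (empty if none)."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    if sorted(columns) != ["ID", "PHENO"]:
        return [f"expected only the columns ID and PHENO, found {columns}"]
    problems = []
    ids = [(row["ID"] or "").strip() for row in reader]
    integer_only = [sample_id for sample_id in ids if sample_id.isdigit()]
    if integer_only:
        problems.append(
            f"{len(integer_only)} sample IDs are integers only; "
            "add a prefix or suffix so they are alphanumeric"
        )
    return problems


good = io.StringIO("ID,PHENO\nsample1,0\nsample2,1\n")
bad = io.StringIO("ID,PHENO\n101,0\n102,1\n")
print(check_pheno(good))
print(check_pheno(bad))

# For a real file:
# with open("examples/discrete/training_pheno.csv", newline="") as handle:
#     print(check_pheno(handle))
```

A stray comma in a column header shows up here as an unexpected extra column, so the column check also covers that case.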

If you would like to munge just with genotypes (in PLINK binary format), the simplest command is the following:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv

If you would like a more detailed log printed to your console, you may use the --verbose flag as follows:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file with a detailed log printed to the console

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--verbose

Note: The --verbose flag may be used like this for any GenoML command, not just munging.

To properly evaluate your model, it must be applied to a dataset it's never seen before (testing data). If you have both training and testing data, GenoML allows you to munge them together upfront. To do this with your training and testing phenotype/genotype data, the simplest command is the following:

# Running GenoML munging on discrete data using PLINK genotype binary files and phenotype files for both the training and testing datasets.

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--geno_test examples/discrete/validation \
--pheno_test examples/discrete/validation_pheno.csv

If you would like to control the pruning stringency in genotypes:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while setting the pruning r2 cutoff

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--r2_cutoff 0.3 \
--pheno examples/discrete/training_pheno.csv

You can choose to skip pruning your SNPs at this stage by including the --skip_prune flag:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while skipping SNP pruning

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--skip_prune \
--pheno examples/discrete/training_pheno.csv

You can choose to impute on mean or median by modifying the --impute flag, like so (default is median):

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file and specifying impute

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--impute mean

If you suspect collinear variables, and think this will be a problem for training the model moving forward, you can use variance inflation factor (VIF) filtering:

# Running GenoML munging on discrete data using PLINK genotype binary files and a phenotype file while using VIF to remove multicollinearity 

genoml discrete supervised munge \
--prefix outputs \
--geno examples/discrete/training \
--pheno examples/discrete/training_pheno.csv \
--vif 5 \
--vif_iter 1

The --vif flag specifies the VIF threshold you would like to use (5 is recommended)
The number of iterations you'd like to run can be modified with the --vif_iter flag (if you have or anticipate many collinear variables, it's a good idea to increase the iterations)
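
To make the threshold and iteration settings concrete, here is a minimal pure-Python sketch of iterative VIF filtering. It is a generic illustration and not GenoML's implementation: it relies on the standard identity that the VIF of feature j is the j-th diagonal entry of the inverse correlation matrix, and it assumes (for the example only) that each iteration drops the single feature with the highest VIF above the threshold. The names vif_filter, correlation_matrix, and invert are hypothetical.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix for equal-length, non-constant numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mu = sum(col) / n
        centered.append([v - mu for v in col])
    norms = [sqrt(sum(v * v for v in col)) for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear features; VIF is infinite")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_filter(features, threshold=5.0, iterations=1):
    """Drop the highest-VIF feature per iteration while it exceeds the threshold."""
    kept = dict(features)
    for _ in range(iterations):
        if len(kept) < 2:
            break
        names = list(kept)
        inverse = invert(correlation_matrix([kept[name] for name in names]))
        vifs = {name: inverse[i][i] for i, name in enumerate(names)}
        worst = max(vifs, key=vifs.get)
        if vifs[worst] <= threshold:
            break  # nothing left above the threshold
        del kept[worst]
    return kept


features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.0, 5.9],  # nearly a copy of x1
    "x3": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
kept = vif_filter(features, threshold=5.0, iterations=1)
print(sorted(kept))
```

In this toy data x1 and x2 are almost identical, so one of them is removed in the first iteration and the weakly correlated x3 is kept. With many collinear features, one pass removes only part of the redundancy under this scheme, which is why more iterations help.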

Well,
