Genoloader
genoloader: derived allele polarization and genetic load estimates with high and low coverage data
Install / Use
/learn @emitruc/GenoloaderREADME
GenoLoader
GenoLoader is a Python package for derived allele polarization from VCF files and genetic load estimation, designed to work with both standard- and low-coverage genomic sequencing data.
Given a Freebayes- or GATK-generated VCF annotated with functional effects (e.g. via SnpEff), GenoLoader main function polarizes each SNP as ancestral or derived and recodes individual genotypes as derived-allele dosage values (0, 1, 2) into a custom format text file. It supports multiple polarization strategies — outgroup-based, major allele in one or multiple populations — and implements a read-resampling approach for low-coverage (i.e. ancient) DNA samples. All samples present in the VCF are converted regardless of their population assignment.
Alongside the Python package provided as a Jupyter Lab/Notebook, GenoLoader main function (write_gt in the Python package) is also provided as an AI-translated c++ scritp. While the Pyhton code allows for easier understanding of the script, implementation within existing pipelines and customization, the c++ version provides the exact same features with a 10X faster execution speed. The latter is recommended for processing very large files.
Genetic load estimates and plotting options are provided as additional functions in the Python package only as they do not require long execution time.
Table of Contents
- Features
- Requirements
- Installation
- Input files
- Usage
- Parameters
- Polarization strategies
- Polarization flags
- Output files
- Low-coverage resampling
- Example
- Citation
Features
- Polarizes ancestral/derived alleles at biallelic SNP positions using flexible, user-defined strategies
- Supports outgroup-based unfolded polarization (
POP_OUT) as well as ingroup-only (POP1,POP2, orPOP1POP2) and reference-based options - Recodes genotypes as derived-allele dosage:
0(homozygous ancestral),1(heterozygous),2(homozygous derived),nan(missing) - Parses SnpEff
ANN=annotations to classify variants by functional category (missense, synonymous, intergenic, intronic, upstream, downstream) and putative fitness impact (HIGH, MODERATE, LOW, MODIFIER) - Assigns a diagnostic polarization flag to every locus for full traceability of the supporting data at that site
- Implements a one-read-per-genotype random resampling mode for low-coverage (ancient) DNA data
Requirements
Install dependencies via conda or pip:
conda install scipy numpy pandas tqdm matplotlib seaborn
# or
pip install scipy numpy pandas tqdm matplotlib seaborn
Installation
Clone this repository:
git clone https://github.com/emitruc/genoloader.git
cd genoloader
Python script
No compilation or package installation is required. The script GenoLoader.v3.2.ipynb can be run interactively in a Jupyter Lab/Notebook or imported as a module.
C++ script (only for the main function write_gt fromt the python package)
In the bash terminal, compile the cpp file using g++:
g++ -O2 -std=c++17 -o genoloader GenoLoader.v3.2.cpp
Change permissions:
chmod u+x genoloader
Run the script
./genoloader sample.ann.vcf --p1 pop1.txt --p2 pop2.txt --p0 outgroup.txt \
--m1 5 --m2 5 --m0 2 --polX POP_OUT --low_cov YES
Disclaimer: While the content of the c++ script has not been checked, its output (gt file) was identical with the gt file produced by the Python script on a 2,296,974-variant input vcf as checked with bash sha256sum.
Input files
1. VCF file (required)
A biallelic SNP VCF file generated by Freebayes or GATK, annotated with functional effects using SnpEff (the ANN= tag must be present in the INFO field). Both phased and unphased genotype notation are accepted.
The FORMAT field must include:
GT— genotype (required for all modes)AD— allelic depth (required only whenlow_cov='YES')
Note: Only loci carrying a SnpEff
ANN=annotation are written to the output. Unannotated loci are silently skipped.
2. Population list files (optional, plain text)
Each population is defined by a plain-text file containing one sample name per line, matching the sample names in the VCF header exactly. For example:
sample_A
sample_B
sample_C
- Pop1 (
p1): target population 1 (ingroup 1) - Pop2 (
p2): target population 2 (ingroup 2), or closely related outgroup, used in combined polarization modes - Pop_out (
p0): the outgroup population/species, used for unfolded polarization
Population files are used only to determine the ancestral allele. All individuals present in the VCF are recoded and written to the output, including those not assigned to any population group. IMPORTANT: The same individual may appear in more than one population file. You may want to play with individual assignment to refine your polarization.
Usage
Import and call the write_gt function directly in Python:
from GiNOLOADER_v3_2 import write_gt
write_gt(
vcf_file = 'my_samples.ann.vcf',
p1 = 'pop1_samples.txt',
p2 = 'pop2_samples.txt',
p0 = 'outgroup_samples.txt',
m1 = 'min_ind_p1',
m2 = 'min_ind_p2',
m0 = 'min_ind_p0',
polX = 'polarization strategy',
low_cov = 'one_read_resampling'
)
Parameters
| Parameter | Type | Default | Description |
|------------|---------|----------|-------------|
| vcf_file | str | required | Path to the SnpEff-annotated VCF file |
| p1 | str | None | Path to Pop1 sample list file |
| p2 | str | None | Path to Pop2 sample list file |
| p0 | str | None | Path to outgroup sample list file |
| m1 | int | 0 | Minimum number of non-missing individuals required in Pop1 to retain a locus |
| m2 | int | 0 | Minimum number of non-missing individuals required in Pop2 to retain a locus |
| m0 | int | 0 | Minimum number of non-missing individuals required in the outgroup to retain a locus |
| polX | str | 'REF' | Polarization strategy (see Polarization strategies) |
| low_cov | str | 'NO' | Enable low-coverage read resampling: 'YES' or 'NO' |
Tip: For genetic load estimation, set
m1andm2(or onlym1depending on your polarization strategy), equal to the total number of individuals in the respective population files to exclude any locus with missing data. No missing data is good practice when estimating genetic load.
Polarization strategies
The polX parameter controls how the ancestral allele is determined at each locus.
| polX value | Description |
|--------------|-------------|
| 'REF' | No repolarization. The VCF reference allele is treated as ancestral. |
| 'POP_OUT' | Outgroup-based unfolded polarization. The allele present and fixed in the outgroup (p0) is inferred as ancestral. If not fixed, alternative behaviour is recorded |
| 'POP1' | The major allele in Pop1 is used as the ancestral state. |
| 'POP2' | The major allele in Pop2 is used as the ancestral state. |
| 'POP1POP2' | The major allele across Pop1 and Pop2 combined is used as the ancestral state. |
Note: If a population file is not provided but the strategy requires it, the function falls back gracefully to using whatever information is available, as described in Polarization flags.
Polarization flags
Every locus in the output is assigned a diagnostic flag describing the allelic configuration used to determine the ancestral allele. These flags are written to the flag column of the .gt output file and allow full traceability of polarization decisions.

| Flag | Description |
|------|-------------|
| reference | No repolarization; VCF reference allele used as ancestral (polX='REF') |
| allFix | All groups (outgroup and ingroups) monomorphic for the same allele |
| allMiss | All alleles missing across all groups; locus assigned missing |
| inMiss | Both ingroup populations missing; ancestral allele inferred from outgroup modal allele |
| InvarOutMiss | Outgroup missing; both ingroups monomorphic for the same allele |
| altFix | Outgroup missing; Pop1 and Pop2 fixed for different alleles; Pop1 allele used (conservative) |
| in1FoldAnc | Outgroup and Pop2 missing; Pop1 monomorphic; Pop1 allele used as ancestral |
| in1FoldSeg | Outgroup and Pop2 missing; Pop1 polymorphic; major Pop1 allele used |
| in2FoldAnc | Outgroup and Pop1 missing; Pop2 monomorphic; Pop2 allele used as ancestral |
| in2FoldSeg | Outgroup and Pop1 missing; Pop2 polymorphic; major Pop2 allele used |
| unfoldFix1outMiss | Outgroup missing; Pop1 monomorphic, Pop2 polymorphic; Pop1 allele used |
| unfoldFix2outMiss | Outgroup missing; Pop2 monomorphic, Pop1 polymorphic; Pop2 allele used |
| inFold | Outgroup missing; both ingroups polymorphic; major ingroup allele used (folded fallback) |
| unfolded2Miss1 | Outgroup monomorphic; Pop1 missing, Pop2 present; outgroup allele used |
| unfolded1Miss2 | Outgroup monomorphic; Pop2 missing, Pop1 present; outgroup allele used |
| unfoldedAltFix | Outgroup monomorphic; both ingroups monomorphic for di
