GenoLoader

GenoLoader is a Python package for derived allele polarization from VCF files and genetic load estimation, designed to work with both standard- and low-coverage genomic sequencing data.

Given a Freebayes- or GATK-generated VCF annotated with functional effects (e.g. via SnpEff), GenoLoader's main function polarizes each SNP as ancestral or derived and recodes individual genotypes as derived-allele dosage values (0, 1, 2) in a custom-format text file. It supports multiple polarization strategies (outgroup-based, or major allele in one or multiple populations) and implements a read-resampling approach for low-coverage (e.g. ancient) DNA samples. All samples present in the VCF are converted regardless of their population assignment.

Alongside the Python package, which is provided as a Jupyter Lab/Notebook, GenoLoader's main function (write_gt in the Python package) is also provided as an AI-translated C++ script. The Python code is easier to read, customize, and integrate into existing pipelines, while the C++ version provides exactly the same features and runs 10x faster. The C++ version is recommended for processing very large files.

Genetic load estimates and plotting options are provided as additional functions in the Python package only, as they do not require long execution times.

Features
Requirements
Installation
Input files
Usage
Parameters
Polarization strategies
Polarization flags
Output files
Low-coverage resampling
Example
Citation

Features

Polarizes ancestral/derived alleles at biallelic SNP positions using flexible, user-defined strategies
Supports outgroup-based unfolded polarization (POP_OUT) as well as ingroup-only (POP1, POP2, or POP1POP2) and reference-based options
Recodes genotypes as derived-allele dosage: 0 (homozygous ancestral), 1 (heterozygous), 2 (homozygous derived), nan (missing)
Parses SnpEff ANN= annotations to classify variants by functional category (missense, synonymous, intergenic, intronic, upstream, downstream) and putative fitness impact (HIGH, MODERATE, LOW, MODIFIER)
Assigns a diagnostic polarization flag to every locus for full traceability of the supporting data at that site
Implements a one-read-per-genotype random resampling mode for low-coverage (ancient) DNA data
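The dosage recoding listed above can be illustrated with a minimal sketch. The function name `derived_dosage` and its arguments are hypothetical and are not part of the GenoLoader API; the sketch only assumes diploid, biallelic GT strings and that any genotype containing a missing allele is reported as nan.

```python
import math
import re

def derived_dosage(gt, derived_allele):
    """Recode a VCF GT string as derived-allele dosage (hypothetical helper).

    gt: diploid genotype such as '0/1', '1|1' or './.'
    derived_allele: '0' if REF is the derived allele, '1' if ALT is.
    """
    # Accept both unphased ('/') and phased ('|') separators.
    alleles = re.split(r"[/|]", gt)
    # Assumption: a genotype with any missing allele is treated as missing.
    if len(alleles) != 2 or "." in alleles:
        return math.nan
    # Count how many of the two alleles match the derived allele.
    return sum(allele == derived_allele for allele in alleles)

print([derived_dosage(g, "1") for g in ["0/0", "0|1", "1/1", "./."]])
# [0, 1, 2, nan]
```

When the reference allele is the derived one, passing `derived_allele="0"` flips the coding, so `1/1` becomes 0 and `0/0` becomes 2.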

Requirements

Python ≥ 3.7
scipy
numpy
pandas
tqdm
matplotlib
seaborn

Install dependencies via conda or pip:

conda install scipy numpy pandas tqdm matplotlib seaborn
# or
pip install scipy numpy pandas tqdm matplotlib seaborn

Installation

Clone this repository:

git clone https://github.com/emitruc/genoloader.git
cd genoloader

Python script

No compilation or package installation is required. The script GenoLoader.v3.2.ipynb can be run interactively in a Jupyter Lab/Notebook or imported as a module.

C++ script (only for the main function write_gt from the Python package)

In the bash terminal, compile the cpp file using g++:

g++ -O2 -std=c++17 -o genoloader GenoLoader.v3.2.cpp

Change permissions:

chmod u+x genoloader

Run the script:

./genoloader sample.ann.vcf --p1 pop1.txt --p2 pop2.txt --p0 outgroup.txt \
             --m1 5 --m2 5 --m0 2 --polX POP_OUT --low_cov YES

Disclaimer: The C++ source code itself has not been reviewed. However, its output (gt file) was identical to the gt file produced by the Python script on a 2,296,974-variant input VCF, as verified with sha256sum.
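If you want to repeat this kind of check on your own data without leaving Python, the comparison can be sketched with the standard-library hashlib module. The helper name `sha256_of` and the two file names in the usage comment are hypothetical; substitute the gt files produced by your own runs.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so very large gt files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage (hypothetical file names):
# same = sha256_of("python_run.gt") == sha256_of("cpp_run.gt")
# print("identical" if same else "different")
```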

Input files

1. VCF file (required)

A biallelic SNP VCF file generated by Freebayes or GATK, annotated with functional effects using SnpEff (the ANN= tag must be present in the INFO field). Both phased and unphased genotype notation are accepted.

The FORMAT field must include:

GT — genotype (required for all modes)
AD — allelic depth (required only when low_cov='YES')
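To show why AD is needed in low-coverage mode, here is a minimal sketch of drawing a single read from the allelic depths. It is an illustration under stated assumptions, not GenoLoader's implementation: the function name `sample_one_read` and the `rng` argument are hypothetical, AD is assumed to hold two comma-separated counts (REF, ALT), and how the sampled allele is encoded in the output file is not covered here.

```python
import random

def sample_one_read(ad, rng=random):
    """Draw one read at random from an AD string such as '3,1'.

    Returns '0' (REF read), '1' (ALT read), or None when no reads
    are available. Hypothetical helper for illustration only.
    """
    try:
        ref_depth, alt_depth = (int(x) for x in ad.split(","))
    except ValueError:
        return None  # '.', malformed, or non-biallelic AD
    total = ref_depth + alt_depth
    if total == 0:
        return None  # no reads covering the site
    # Each read is equally likely, so REF is drawn with probability ref/total.
    return "0" if rng.randrange(total) < ref_depth else "1"

rng = random.Random(42)  # seeded generator for reproducible draws
print([sample_one_read(ad, rng) for ad in ["5,0", "0,4", "2,2", "."]])
```

Passing a seeded `random.Random` instance makes the draws reproducible across runs, which is useful when comparing outputs.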

Note: Only loci carrying a SnpEff ANN= annotation are written to the output. Unannotated loci are silently skipped.
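For orientation, the ANN= tag written by SnpEff holds one or more comma-separated annotations, each made of pipe-separated fields in which the second field is the effect and the third is the putative impact. The sketch below extracts these two fields from an INFO string. The function name `first_ann` is hypothetical, the INFO string is made up, and taking the first annotation is an assumption rather than documented GenoLoader behaviour.

```python
def first_ann(info):
    """Return (effect, impact) from the first ANN entry, or None if absent."""
    for field in info.split(";"):
        if field.startswith("ANN="):
            # Annotations are comma-separated; fields within one are pipe-separated.
            first = field[4:].split(",")[0].split("|")
            if len(first) >= 3:
                return first[1], first[2]
    return None  # no ANN= tag: such loci are skipped by GenoLoader

# Made-up INFO string with two annotations for the same variant.
info = ("DP=20;ANN=T|missense_variant|MODERATE|GENE1|GENE1|transcript|T1|"
        "protein_coding|2/5|c.11A>T|p.Lys4Met|11/300|11/300|4/99||,"
        "T|upstream_gene_variant|MODIFIER|GENE2|GENE2|transcript|T2|"
        "protein_coding||c.-100A>T|||||100|")

print(first_ann(info))     # ('missense_variant', 'MODERATE')
print(first_ann("DP=20"))  # None
```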

2. Population list files (optional, plain text)

Each population is defined by a plain-text file containing one sample name per line, matching the sample names in the VCF header exactly. For example:

sample_A
sample_B
sample_C

Pop1 (p1): target population 1 (ingroup 1)
Pop2 (p2): target population 2 (ingroup 2), or closely related outgroup, used in combined polarization modes
Pop_out (p0): the outgroup population/species, used for unfolded polarization

Population files are used only to determine the ancestral allele. All individuals present in the VCF are recoded and written to the output, including those not assigned to any population group. IMPORTANT: The same individual may appear in more than one population file. You may want to experiment with individual assignments to refine your polarization.
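Because sample names must match the VCF header exactly, it can help to check a population file before a long run. The following is a minimal sketch; the helper names (`read_population`, `vcf_samples`, `missing_from_vcf`) are hypothetical and not part of GenoLoader. It relies only on the standard VCF layout, where sample names start at the tenth tab-separated column of the #CHROM header line.

```python
def read_population(path):
    """Read one sample name per line, ignoring blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

def vcf_samples(header_line):
    """Sample names from the '#CHROM' header line (columns 10 onward)."""
    return header_line.rstrip("\n").split("\t")[9:]

def missing_from_vcf(pop_samples, samples):
    """Population entries that do not match any VCF sample name exactly."""
    known = set(samples)
    return [name for name in pop_samples if name not in known]

header = ("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t"
          "sample_A\tsample_B\tsample_C\n")
pop1 = ["sample_A", "sample_B", "sample_X"]  # stands in for read_population('pop1.txt')

print(missing_from_vcf(pop1, vcf_samples(header)))  # ['sample_X']
```

An empty list means every name in the population file was found in the VCF header.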

Usage

Import and call the write_gt function directly in Python:

from GiNOLOADER_v3_2 import write_gt

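write_gt(
    vcf_file = 'my_samples.ann.vcf',    # SnpEff-annotated VCF
    p1       = 'pop1_samples.txt',      # Pop1 sample list
    p2       = 'pop2_samples.txt',      # Pop2 sample list
    p0       = 'outgroup_samples.txt',  # outgroup sample list
    m1       = 5,                       # min. non-missing individuals in Pop1
    m2       = 5,                       # min. non-missing individuals in Pop2
    m0       = 2,                       # min. non-missing individuals in the outgroup
    polX     = 'POP_OUT',               # polarization strategy
    low_cov  = 'YES'                    # one-read resampling: 'YES' or 'NO'
)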

Parameters

| Parameter | Type | Default | Description |
|------------|---------|----------|-------------|
| vcf_file | str | required | Path to the SnpEff-annotated VCF file |
| p1 | str | None | Path to Pop1 sample list file |
| p2 | str | None | Path to Pop2 sample list file |
| p0 | str | None | Path to outgroup sample list file |
| m1 | int | 0 | Minimum number of non-missing individuals required in Pop1 to retain a locus |
| m2 | int | 0 | Minimum number of non-missing individuals required in Pop2 to retain a locus |
| m0 | int | 0 | Minimum number of non-missing individuals required in the outgroup to retain a locus |
| polX | str | 'REF' | Polarization strategy (see Polarization strategies) |
| low_cov | str | 'NO' | Enable low-coverage read resampling: 'YES' or 'NO' |

Tip: For genetic load estimation, set m1 and m2 (or only m1, depending on your polarization strategy) equal to the total number of individuals in the respective population files. This excludes any locus with missing data, which is good practice when estimating genetic load.
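The effect of the minimum-individual thresholds can be sketched as a simple per-locus check. The function name `passes_min_individuals` is hypothetical, and treating a partially missing genotype (e.g. './1') as missing is an assumption rather than documented behaviour.

```python
def passes_min_individuals(genotypes, minimum):
    """True if at least `minimum` genotypes in the population are non-missing."""
    # Assumption: a genotype with any '.' allele counts as missing.
    non_missing = sum(1 for gt in genotypes if "." not in gt)
    return non_missing >= minimum

pop1_locus = ["0/0", "0/1", "./.", "1/1", "0/0"]  # five Pop1 individuals at one locus

print(passes_min_individuals(pop1_locus, 3))                # True: 4 non-missing
print(passes_min_individuals(pop1_locus, len(pop1_locus)))  # False: one is missing
```

Setting the threshold to the population size, as in the second call, retains only loci genotyped in every individual.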

Polarization strategies

The polX parameter controls how the ancestral allele is determined at each locus.

| polX value | Description |
|--------------|-------------|
| 'REF' | No repolarization. The VCF reference allele is treated as ancestral. |
| 'POP_OUT' | Outgroup-based unfolded polarization. The allele present and fixed in the outgroup (p0) is inferred as ancestral. If not fixed, alternative behaviour is recorded. |
| 'POP1' | The major allele in Pop1 is used as the ancestral state. |
| 'POP2' | The major allele in Pop2 is used as the ancestral state. |
| 'POP1POP2' | The major allele across Pop1 and Pop2 combined is used as the ancestral state. |
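The two building blocks behind these strategies, finding the major allele in a group and testing whether a group is fixed for one allele, can be sketched as follows. All function names are hypothetical, and the tie-breaking rule for equally frequent alleles is an arbitrary choice in this sketch, not GenoLoader's documented behaviour.

```python
from collections import Counter

def allele_counts(genotypes):
    """Count non-missing alleles across a group's GT strings."""
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[allele] += 1
    return counts

def major_allele(genotypes):
    """Most frequent allele in the group, or None if all are missing."""
    counts = allele_counts(genotypes)
    if not counts:
        return None
    # Ties go to the allele seen first (an arbitrary choice for this sketch).
    return counts.most_common(1)[0][0]

def fixed_allele(genotypes):
    """The allele the group is monomorphic for, or None if polymorphic/missing."""
    counts = allele_counts(genotypes)
    return next(iter(counts)) if len(counts) == 1 else None

pop1 = ["0/0", "0/1", "1/1", "0/0"]
outgroup = ["1/1", "1|1"]

print(major_allele(pop1))      # '0' (5 of 8 alleles), as used by polX='POP1'
print(fixed_allele(outgroup))  # '1', the ancestral allele under polX='POP_OUT'
print(fixed_allele(pop1))      # None: Pop1 is polymorphic
```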

Note: If a population file is not provided but the strategy requires it, the function falls back gracefully to using whatever information is available, as described in Polarization flags.

Polarization flags

Every locus in the output is assigned a diagnostic flag describing the allelic configuration used to determine the ancestral allele. These flags are written to the flag column of the .gt output file and allow full traceability of polarization decisions.

Polarization flags based on data configuration in target and outgroup samples

| Flag | Description |
|------|-------------|
| reference | No repolarization; VCF reference allele used as ancestral (polX='REF') |
| allFix | All groups (outgroup and ingroups) monomorphic for the same allele |
| allMiss | All alleles missing across all groups; locus assigned missing |
| inMiss | Both ingroup populations missing; ancestral allele inferred from outgroup modal allele |
| InvarOutMiss | Outgroup missing; both ingroups monomorphic for the same allele |
| altFix | Outgroup missing; Pop1 and Pop2 fixed for different alleles; Pop1 allele used (conservative) |
| in1FoldAnc | Outgroup and Pop2 missing; Pop1 monomorphic; Pop1 allele used as ancestral |
| in1FoldSeg | Outgroup and Pop2 missing; Pop1 polymorphic; major Pop1 allele used |
| in2FoldAnc | Outgroup and Pop1 missing; Pop2 monomorphic; Pop2 allele used as ancestral |
| in2FoldSeg | Outgroup and Pop1 missing; Pop2 polymorphic; major Pop2 allele used |
| unfoldFix1outMiss | Outgroup missing; Pop1 monomorphic, Pop2 polymorphic; Pop1 allele used |
| unfoldFix2outMiss | Outgroup missing; Pop2 monomorphic, Pop1 polymorphic; Pop2 allele used |
| inFold | Outgroup missing; both ingroups polymorphic; major ingroup allele used (folded fallback) |
| unfolded2Miss1 | Outgroup monomorphic; Pop1 missing, Pop2 present; outgroup allele used |
| unfolded1Miss2 | Outgroup monomorphic; Pop2 missing, Pop1 present; outgroup allele used |
| unfoldedAltFix | Outgroup monomorphic; both ingroups monomorphic for di
