HapNe
Haplotype-based inference of recent effective population size in modern and ancient DNA samples
Install / Use
/learn @PalamaraLab/HapNeREADME
HapNe
Haplotype-based inference of recent effective population size in modern and ancient DNA samples.
- System Requirements
- Installation Guide
- Demo
- Instructions for use
- Understanding the output
- FAQ
- Citation
- Acknowledgements
1. System Requirements
The software dependencies are listed in the conda_environment.yml and setup.cfg files. The software works on Unix systems (Linux and macOS). The software has been tested on macOS (Ventura 13.3.1, M1 and Intel) and Linux (Ubuntu 22.04.3 LTS). It does not require non-standard hardware.
2. Installation Guide
If plink and plink2 are already installed on your system, you can install HapNe using pip:
pip install hapne
However, we strongly recommend to install HapNe within a conda environment by running:
conda env create --file conda_environment.yml
using the following conda_environment.yml file (~ 5 minutes):
name: HapNe
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python
- pytest
- numpy
- pandas
- plink
- plink2
- flake8
- numba
You can then install HapNe using pip (1 minute):
pip install hapne
If you wish to modify the code, you can install HapNe in editable mode by running pip install . within this repository.
3. Demo
The tests folder contains a demo for all functionalities of HapNe, and the expected output of each functionality.
Running pytest tests is typically done in less than 20 minutes.
4. Instructions for use
4.1 HapNe-LD
HapNe-LD can be run by adapting the following config file:
[CONFIG]
vcf_file=data
keep=data.keep
map=genetic_map_chr@_combined_b37.txt
pseudo_diploid=False
output_folder=HapNe/data
population_name=POP
genome_build=grch37
- vcf_file: path to the vcf file (without the .vcf.gz extension)
- keep (optional): samples to keep, useful to filter out relatives. The file should contain one id per row
- map: path to SHAPEIT-format recombination map files. The first column is the physical position (in bp), the second one is the rate (in cM/Mb) and the third one is the genetic position (in cM).
- pseudo_diploid: False for modern, true for ancient data
- output_folder: folder where the results will be saved
- population_name: name to be used for the analysis
- genome_build: genome build used (grch37 (default) or grch38)
HapNe can be run using the following Pyton script:
from configparser import ConfigParser
import argparse
from hapne.convert.tools import split_convert_vcf
from hapne.ld import compute_ld, compute_ccld
from hapne import hapne_ld
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='HapNe-LD pipeline')
parser.add_argument('--config_file',
help='configfile')
args = parser.parse_args()
config = ConfigParser()
config.read(args.config_file)
print("Starting stage 1")
split_convert_vcf(config)
print("Starting stage 2")
compute_ld(config)
compute_ccld(config)
print("Starting stage 3")
hapne_ld(config)
4.2 HapNe-IBD
Running HapNe-IBD requires the use of an IBD detection software as a first step. Experiments in the HapNe paper used HapIBD with the postprocessing tool provided here, following the instructions and recommendations from the publications and website.
As input, HapNe-IBD considers IBD length histograms that are split by chromosome arm. We provide a naive pipeline that can make it easier to get the desired outcome. Starting from a vcf file with phased samples containing all chromosomes, the pipeline allows splitting the vcf file by chromosome arm:
from hapne.convert.tools import split_vcf
split_vcf(vcf_file: str, save_in: str, keep=None, genome_build='grch37')
where:
- vcf_file: path to the vcf file (without the .vcf.gz extension)
- save_in: folder where the results will be saved
- keep (optional): samples to keep in plink format, useful to filter out relatives
- genome_build: genome build used (grch37 (default) or grch38)
The method will create 39 new vcfs files in the save_in folder. You can then run HapIBD on each of these files separately, and create *.ibd.gz files following the same naming convention.
For example, you can use a script like:
for file in ./data/*.vcf.gz
do
CHR=$(echo $file | awk -F'chr' '{print $2}' | awk -F'.' '{print $1}')
PREFIX=$(echo $file | awk -F'.vcf.gz' '{print $1}' | awk -F'/' '{print $NF}')
java -jar hap-ibd.jar gt=$file map=plink.chr$CHR.GRCh38.map out=IBD/$PREFIX
echo "Hap-IBD done for $file"
done
Do not forget to merge ibd and hbd files if you use this script. It is also recommended to merge adjacent segments.
HapNe-IBD requires a config file to set the following options:
[CONFIG]
output_folder=output_folder
nb_samples=nb_diploid_samples_in_analysis
population_name=pop_name
ibd_files=output_folder_of_ibd_files
column_cm_length=8
genome_build=grch38
Where:
- column_cm_length is the index of the column containing the length (in centimorgans) for each IBD segment in the ibd.gz file.
- nb_samples is the number of diploid samples, aka the number of individuals, used in the analysis.
Using this config file, HapNe-IBD can be run using the following script:
from configparser import ConfigParser
from hapne.ibd import build_hist
import argparse
from hapne import hapne_ibd
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='HapNe-IBD preprocessing pipeline')
parser.add_argument('--config_file',
help='configfile')
args = parser.parse_args()
config = ConfigParser()
config.read(args.config_file)
build_hist(config)
hapne_ibd(config)
For example:
python hapne_ibd.py --config_file /path/to/my/config/file/config.txt
4.3 Analyses of ancient samples
The HapNe package contains a pipeline to easily analyze samples from the "Allen Ancient DNA Resource" data set. After downloading the data, HapNe can take a file with the indices of samples to be analyzed as input (Caribbean_Ceramic_recent.keep in the following example).
Note that it is assumed that samples present in the keep file are unrelated. Kinship information is usually present in the anno file and can be used to filter out related individuals.
To perform the analysis, create the following configuration file:
[CONFIG]
eigen_root=DATA/v50.0_1240k_public
anno_file=DATA/v50.0_1240k_public.anno
keep=CONFIG/Caribbean_Ceramic_recent.keep
pseudo_diploid=True
output_folder=RESULTS/Caribbean_Ceramic_recent
population_name=Caribbean_Ceramic_recent
eigen_root describes the location to the main data set, anno_file points to the annotation file, keep refers to a file containing the indices of the individuals to be analyzed (one index per row).
The output will be written in a new output_folder folder. pseudo_diploid must be set to true when studying ancient data. Finally, population_name will be used to name the output files.
Next, the following pipeline.py script can be run using python pipeline.py --config_file config.ini
from configparser import ConfigParser
import pandas as pd
import argparse
from hapne.convert.tools import eigenstrat2vcf
from hapne.convert.tools import split_convert_vcf
from hapne.ld import compute_ld, compute_ccld, create_cc_file
from hapne.utils import get_age_from_anno
from hapne import hapne_ld
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='HapNe-LD pipeline')
parser.add_argument('--config_file',
help='configfile')
args = parser.parse_args()
config = ConfigParser()
config.read(args.config_file)
print("Starting stage 1")
eigenstrat2vcf(config)
print("Starting stage 2")
split_convert_vcf(config)
print("Starting stage 3")
compute_ld(config)
compute_ccld(config)
print("Starting stage 4")
get_age_from_anno(config)
hapne_ld(config)
5. Understanding the output
HapNe creates different folders in the output folder provided in the config file. The main output is written in the HapNe folder. It contains the following files:
config.ini: copy of the config file usedsummary.txt: summary of the analysis, with warnings cautioning about potential biases.hapne.csv: The haploid effective population size at each generation. The csv files contains the Maximum-Likelihood estimate (MLE), as well as the estimated quantiles (0.025, 0.25, 0.5, 0.75, 0.975) obtained from the bootstrap procedure.assessment.png: visualization of the Pearson residuals of each chromosome arm. The residuals should be normally distributed, and centered around 0.hapne_results.png: A visual representation of the results. The blue line represents the MLE, and the light-shaded area represents the 95% confidence interval.
Note: the population size produced in output by HapNe is haploid (divide by 2 to obtain diploid size estimates).
6. FAQ
General Questions
- Can HapNe be used to analyze non-human data?
Our testing has been limited to human-like evolutionary models. HapNe may also work in non-human data, but we recommend using simulations to test the accuracy under other evolutionary settings. We are also updating the software to allow using non-human genome builds and welcome suggestions on how to facilitate these analyses. Currocam implemented a snakemake to run HapNe in non-human organisms. The code can b
