Sex.DetERRmine
A python script to calculate the relative coverage of X and Y chromosomes, and their associated error bars, from the depth of coverage at specified SNPs.
Install / Use
/learn @TCLamnidis/Sex.DetERRmineREADME
Sex.DetERRmine
A python script to calculate the relative coverage of X and Y chromosomes, and their associated error bars, from the depth of coverage at specified SNPs.
<br></br> Mathematical equations added to README using this tool.
Instructions
The python script takes a modified output from samtools depth as input, via stdin. The samtools depth file should be manually modified to include a header that begins with a # and is including the sample names (generic or specific) as column headers, like below:
#Chr Pos Sample1 Sample2 Sample3 Sample4 Sample5
1 752566 1 0 1 0 1
1 776546 0 0 0 0 0
1 832918 0 1 0 0 0
1 842013 0 1 0 3 1
...
Alternatively, a Sample/bam list can be provided using the -f option. This list should include 1 name per line, and can be the same list used for the samtools depth command.
For instructions on the options available you can try running the script with the -h flag:
$Sex.DetERRmine.py -h
usage: Sex.DetERRmine.py [-h] [-I <INPUT FILE>] [-f SAMPLELIST]
Calculate the relative X- and Y-chromosome coverage of data, as well as the
associated error bars for each.
optional arguments:
-h, --help show this help message and exit
-I <INPUT FILE>, --Input <INPUT FILE>
The input samtools depth file. Omit to read from
stdin.
-f SAMPLELIST, --SampleList SAMPLELIST
A list of samples/bams that were in the depth file.
One per line. Should be in the order of the samtools
depth output.
The script will print out the number of SNPs and the number of reads found on each of Autosomes/X/Y, as well as the relative X/Y coverage and their associated errors.
It is possible to pipe the samtools depth output directly to this script:
samtools depth -a -q30 -Q30 -b <BED File> -f <BAM file list> | Sex.DetERRmine.py -f <BAM file list>
Citation
If you use Sex.DetERRmine in your analysis, please cite:
Lamnidis, T.C. et al., 2018. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nature communications, 9(1), p.5018. Available at: http://dx.doi.org/10.1038/s41467-018-07483-5.
Mathematical explanation
We assume that sequenced reads are distributed along the genome randomly and independently from each other. The "genome" here is made up only of positions in the input depth file.
<p align="center"><img src="https://latex.codecogs.com/gif.latex?\bg_white&space;\LARGE&space;\textsl{N}=\sum_{i}&space;\textsl{N}_{i}" title="\textsl{N}=\sum_{i} \textsl{N}_{i}" /></p>N<sub>i</sub> is the number of sequenced reads in a a chunk of the genome i, the sum of which is the total number of reads on target, N.
We can then calculate:
<p align="center"><img src="https://latex.codecogs.com/gif.latex?\bg_white&space;\LARGE&space;\textsl{p}_{i}=\frac{\textsl{N}_{i}}{\textsl{N}}" title="\textsl{p}_{i}=\frac{\textsl{N}_{i}}{\textsl{N}}" /></p> <p align="center"><img src="https://latex.codecogs.com/gif.latex?\bg_white&space;\LARGE&space;\textsl{Err}\left&space;(&space;N_i&space;\right&space;)=\sqrt{N\times&space;p_i\times&space;(1-p_i)}" title="\LARGE \textsl{Err}\left ( N_i \right )=\sqrt{N\times p_i\times (1-p_i)}" /></p>Where p<sub>i</sub> is the proportion of all sequenced reads that map to SNPs in i, estimated from the input depths. The error around N<sub>i</sub> is the error of the binomial distribution. Then:
<p align="center"><img src="https://latex.codecogs.com/gif.latex?\bg_white&space;\LARGE&space;\textsl{d}_{i}=\frac{\textsl{N}_{i}}{\textsl{S}_{i}}" title="\textsl{d}_{i}=\frac{\textsl{N}_{i}}{\textsl{S}_{i}}" /></p> <p align="center"><img src="https://latex.codecogs.com/gif.latex?\bg_white&space;\LARGE&space;\textsl{Err(d}_{i})=\frac{\textsl{Err(N}_{i})}{S_{i}}" title="\LARGE \textsl{Err(d}_{i})=\frac{\textsl{Err(N}_{i})}{S_{i}}" /></p>Where d<sub>i</sub> is the average depth on SNPs within i, and S<sub>i</sub> is the number of SNPs in i.
The relative coverage on the X and Y chromosomes can then be calculated as:
<p align="center"><img src="https://latex.codecogs.com/gif.latex?\bg_white&space;\LARGE&space;\textsl{rate}=\frac{\textsl{d}_{i}}{\textsl{d}_{Aut}}" title="\textsl{rate}=\frac{\textsl{d}_{X/Y}}{\textsl{d}_{Aut}}" /></p>We can then use error propagation to calculate the errors around the relative X and Y coverages:
<p align="center"><img src="https://latex.codecogs.com/gif.latex?\bg_white&space;\LARGE&space;\textsl{Err(x/y&space;rate)}=\sqrt{\left&space;(&space;Err(d_{x/y})\times&space;\frac{1}{d_{aut}}\right&space;)^{2}+&space;\left&space;(Err(d_{aut})\times&space;\frac{d_{x/y}}{{d_{aut}}^{2}}&space;\right&space;)^{2}}" title="\LARGE \textsl{Err(x/y rate)}=\sqrt{\left ( Err(d_{x/y})\times \frac{1}{d_{aut}}\right )^{2}+ \left (Err(d_{aut})\times \frac{d_{x/y}}{{d_{aut}}^{2}} \right )^{2}}" /></p>