Sex.DetERRmine

A Python script to calculate the relative coverage of the X and Y chromosomes, and their associated error bars, from the depth of coverage at specified SNPs.

Mathematical equations added to README using this tool.

Instructions

The Python script takes a modified output from samtools depth as input, via stdin. The samtools depth file should be manually modified to include a header line that begins with a # and lists the sample names (generic or specific) as column headers, as shown below:

#Chr	Pos	Sample1	Sample2	Sample3	Sample4	Sample5
1	752566	1	0	1	0	1
1	776546	0	0	0	0	0
1	832918	0	1	0	0	0
1	842013	0	1	0	3	1
 ...

Alternatively, a sample/BAM list can be provided using the -f option. This list should contain one name per line, and can be the same list passed to the samtools depth command.
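To make the expected input layout concrete, here is a minimal sketch of how such a depth table could be read and tallied per sample. It is not the script's own code. The function names `region_of` and `tally_depths` are hypothetical, and the sketch assumes that the sex chromosomes are named X and Y (optionally with a chr prefix), that every other chromosome name counts as autosomal, and that a supplied sample list takes precedence over the header line.

```python
def region_of(chrom):
    """Classify a chromosome name as 'X', 'Y' or 'Aut' (naming is an assumption)."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    name = name.upper()
    return name if name in ("X", "Y") else "Aut"


def tally_depths(lines, sample_names=None):
    """Count SNPs and sum read depths per region for every sample.

    lines: iterable of text lines in the modified samtools depth layout.
    sample_names: optional list of names, one per depth column.
    Returns (snps, reads), where snps maps region -> number of SNPs and
    reads maps sample name -> {region: summed depth}.
    """
    names = list(sample_names) if sample_names else None
    snps = {"Aut": 0, "X": 0, "Y": 0}
    reads = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0].startswith("#"):
            # Header line: columns after Chr and Pos are the sample names.
            if names is None:
                names = fields[2:]
            continue
        if names is None:
            raise ValueError("no header line and no sample list given")
        region = region_of(fields[0])
        snps[region] += 1
        for name, depth in zip(names, fields[2:]):
            counts = reads.setdefault(name, {"Aut": 0, "X": 0, "Y": 0})
            counts[region] += int(depth)
    return snps, reads
```

With this sketch, `tally_depths(sys.stdin)` would read a table that carries its own header, while `tally_depths(sys.stdin, sample_names)` would cover the case where the names come from a separate list.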

For instructions on the options available you can try running the script with the -h flag:

$ Sex.DetERRmine.py -h

usage: Sex.DetERRmine.py [-h] [-I <INPUT FILE>] [-f SAMPLELIST]

Calculate the relative X- and Y-chromosome coverage of data, as well as the
associated error bars for each.

optional arguments:
  -h, --help            show this help message and exit
  -I <INPUT FILE>, --Input <INPUT FILE>
                        The input samtools depth file. Omit to read from
                        stdin.
  -f SAMPLELIST, --SampleList SAMPLELIST
                        A list of samples/bams that were in the depth file.
                        One per line. Should be in the order of the samtools
                        depth output.

The script prints the number of SNPs and the number of reads found on the autosomes, X, and Y, as well as the relative X and Y coverage and their associated errors.
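As an illustration of how the reported values relate to each other, the following sketch derives relative coverage and errors from per-region SNP and read counts. It is not the script's implementation. The function name `relative_coverage` and the dictionary keys are hypothetical, and the sketch assumes a binomial error on the read counts, normalisation by the autosomal average depth, and uncorrelated errors when propagating through the ratio.

```python
from math import sqrt


def relative_coverage(snps, reads):
    """Return {'X': (rate, error), 'Y': (rate, error)} for one sample.

    snps and reads are dicts keyed by 'Aut', 'X' and 'Y', holding the
    number of SNPs and the number of reads in each region.
    """
    total = sum(reads.values())
    if total == 0 or reads["Aut"] == 0:
        raise ValueError("no autosomal reads to normalise against")

    depth, depth_err = {}, {}
    for region in ("Aut", "X", "Y"):
        p = reads[region] / total                    # share of reads in region
        count_err = sqrt(total * p * (1 - p))        # binomial standard deviation
        depth[region] = reads[region] / snps[region]  # average depth per SNP
        depth_err[region] = count_err / snps[region]

    result = {}
    for region in ("X", "Y"):
        rate = depth[region] / depth["Aut"]
        # First-order propagation for a ratio, written in absolute form so
        # that a region with zero reads does not cause a division by zero.
        err = sqrt(
            (depth_err[region] / depth["Aut"]) ** 2
            + (depth[region] * depth_err["Aut"] / depth["Aut"] ** 2) ** 2
        )
        result[region] = (rate, err)
    return result
```

The absolute form of the propagation is used here only so that samples without any Y reads still yield a rate of zero with a zero error rather than failing.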

It is possible to pipe the samtools depth output directly to this script:

samtools depth -a -q30 -Q30 -b <BED File> -f <BAM file list> | Sex.DetERRmine.py -f <BAM file list>

Citation

If you use Sex.DetERRmine in your analysis, please cite:

Lamnidis, T.C. et al., 2018. Ancient Fennoscandian genomes reveal origin and spread of Siberian ancestry in Europe. Nature communications, 9(1), p.5018. Available at: http://dx.doi.org/10.1038/s41467-018-07483-5.

Mathematical explanation

We assume that sequenced reads are distributed along the genome randomly and independently from each other. The "genome" here is made up only of positions in the input depth file.

N_i is the number of sequenced reads in a chunk i of the genome. The N_i sum to the total number of reads on target, N.

We can then calculate:
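Treating each of the N reads as an independent draw that falls in chunk i with probability p_i, a form consistent with the definitions in this section is given below. It is a reconstruction, and the sigma notation for the error is assumed here.

```latex
N_i = N\,p_i, \qquad \sigma_{N_i} = \sqrt{N\,p_i\,(1 - p_i)}
```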

Here p_i is the proportion of all sequenced reads that map to SNPs in i, estimated from the input depths. The error around N_i is the standard deviation of the corresponding binomial distribution. Then:
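Dividing the read count by the number of SNPs gives the average depth. The form below is a sketch consistent with the definitions in this section. It treats S_i as exact, and the sigma notation for the error on d_i is assumed here.

```latex
d_i = \frac{N_i}{S_i}, \qquad
\sigma_{d_i} = \frac{\sqrt{N\,p_i\,(1 - p_i)}}{S_i}
```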

Here d_i is the average depth on SNPs within i, and S_i is the number of SNPs in i.

The relative coverage on the X and Y chromosomes can then be calculated as:
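A sketch of these ratios follows, assuming the autosomal average depth is the baseline against which X and Y are compared. The symbols R_X and R_Y and the subscript A for the autosomes are chosen here for illustration.

```latex
R_X = \frac{d_X}{d_A}, \qquad R_Y = \frac{d_Y}{d_A}
```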

We can then use error propagation to calculate the errors around the relative X and Y coverages:
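For a ratio of two quantities, standard first-order error propagation gives the form below. It is a sketch that assumes the errors on numerator and denominator are uncorrelated. The notation is assumed here: R_X = d_X / d_A and R_Y = d_Y / d_A are the X and Y average depths relative to the autosomal average depth d_A, and sigma denotes the error on the subscripted quantity.

```latex
\sigma_{R_X} = R_X \sqrt{\left(\frac{\sigma_{d_X}}{d_X}\right)^2 + \left(\frac{\sigma_{d_A}}{d_A}\right)^2},
\qquad
\sigma_{R_Y} = R_Y \sqrt{\left(\frac{\sigma_{d_Y}}{d_Y}\right)^2 + \left(\frac{\sigma_{d_A}}{d_A}\right)^2}
```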
