Cypiripi
A tool for finding out the CYP2D6/CYP2D7 genotypes in the HTS reads
Install / Use
/learn @sfu-compbio/CypiripiREADME
Cypiripi has been discontinued and deprecated in favor of Aldy.
Aldy is much more capable than Cypiripi and covers all aspect of the original algorithm. This repository is purely for archival purposes. No support requests will be handled anymore.
I am finally releasing source code for this project--- mainly for
archival purposes. It's licensed under
CRAPL license, which best
describes its development process and the resulting source code quality (please
don't ask me how to compile it or how to use those scripts in util folder).
As for the original simulations used in the paper, I lost original FASTQ and SAM files (they probably disappeared when somebody asked me to clean up my multi-TB home directory on the shared cluster, and only later I realized that the purge was too successful). Sorry for that...
Anyways, if you really need this, good luck (really)! I highly recommend anybody who needs to get the job done to use Aldy, which basically started as an attempt to "fix" Cypiripi (hopeless attempt, as it turned out), and which can do anything Cypiripi currently does in a way more reproducible and stable manner.
Cypiripi
Cypiripi is a computational tool for CYP2D6 genotyping.
Introduction
Software manual is located at http://sfu-compbio.github.io/cypiripi/.
Algorithm details are published at ISMB 2015. Publication can be accessed at http://www.ncbi.nlm.nih.gov/pubmed/26072492.
F.A.Q.
What kind of data does Cypiripi currently support?
Cypiripi is designed to work with Solexa/Illumina whole-genome sequencing technologies (e.g. Genome Analyzer, HiSeq and similar sequencers).
We have tested Cypiripi on HiSeq simulations and real data (CEPH trio). Cypiripi can work with other technologies as long as the following conditions are met:
-
Coverage is uniform
This is the most important requirement. Uneven coverage will break copy-number detection, and predictions will suffer.
-
Coverage is greater or equal than 40x
You can get some results on lower coverage, but Cypiripi is not designed to handle such cases.
-
Read lengths are greater or equal than 75bp
-
Read lengths are consistent
Read lengths should be of the same length within the sample.
If they are not, you can add
--crop Xparameter to the mrsfast and mrfast, where X is the length of smallest read in the sample.Be warned: cropping will resize all long reads! This means that if you have thousand 100bp reads and one 30bp read, all reads will be considered as 30bp reads. This should be fine if you have reads of size 100 and 101bp, but if you have large variations between read lengths, cropping will not produce expected results.
Warning: Use original FASTQ files of the sequencing experiment if possible. Do not use FASTQ files created from post-processed SAM/BAM files. These FASTQ files usually omit some reads due to the mapping misses and might even contain hard-clipped reads.
How to run Cypiripi on large samples
Current versions of mrfast (2.6.1.0 as the time of writing) will load the whole FASTQ file into the memory. This is impractical for large FASTQ files, unless you have 1 TB of RAM.
If you happen to have a large FASTQ file, it is recommended to split it into several smaller files and invoke the mapping manually. Here is a quick step-by-step guide on how to do that.
-
Split the FASTQ files
split -d -l 20000000 input.fq split/input.fqThis will generate files small FASTQ files in the directory
split. Each small file will have 5,000,000 records (totaling 20,000,000 lines). -
Run mrfast and mrsfast for every small file
Before starting, make sure that you have indexed the reference. If you do not have reference indexes, generate them by running:
mrsfast --index reference.cyp2d6.fa mrfast --index reference.cyp2d7.faFirst, generate single-end mapping (assuming that you have reference files provided with Cypiripi in the same directory):
for i in split/input.fq*; do mrsfast --search reference.cyp2d6.fa -e 2 --seqcomp --seq ${i} -o ${i}.cyp2D6.sam --disable-sam-header --disable-nohits --threads 4 ; mrfast --search reference.cyp2d7.fa -e 5 --seqcomp --seq ${i} -o ${i}.cyp2D7.sam -u ${i}.cyp2D6.sam.unmap ; doneAfterwards, generate paired-end mapping:
for i in split/input_1.fq.*; do mrsfast --search reference.cyp2d6.fa -e 2 --seqcomp --pe --seq ${i} -o ${i}.cyp2D6.sam.paired --disable-sam-header --disable-nohits --min 250 --max 650 --threads 4 ; mrfast --search reference.cyp2d7.fa -e 5 --seqcomp --pe --seq ${i} -o ${i}.cyp2D7.sam.paired -u {1}.cyp2D6.sam.unmap --min 250 --max 650 ; doneIn this example, it is assumed that the distance between the pairs is in the range from 250 to 650. If your sequencing experiment has different insert sizes, feel free to adjust the
--minand--maxparameters as necessary.These commands can be ran in parallel (e.g. on cluster or via GNU Parallel).
-
Concatenate the output
In this step, just concatenate the small
.samfiles into a large one by invoking:cat split/input.fq*.cyp2D6.sam split/input.fq*.cyp2D7.sam > input.sam cat split/input.fq*.cyp2D6.sam.paired split/input.fq*.cyp2D7.sam.paired > input.sam.paired -
Run the Cypiripi
Assuming that you have
input.samandinput.sam.pairedready, you can now invoke Cypiripi by running:python ./cypiripi.py --fasta reference --sam input.sam --cov Xwhere
Xis the coverage of the data.
