Tracespipe
Reconstruction and analysis of viral and host genomes at multi-organ level
1. About
TRACESPipe is a next-generation sequencing pipeline for the identification, assembly, and analysis of viral and human-host genomes at the multi-organ level. The identification and assembly of viral genomes rely on cooperation between three modalities:
<ul> <li>compression-based predictors;</li> <li>sequence alignments;</li> <li><i>de-novo</i> assembly.</li> </ul> The compression-based prediction applies FALCON-meta technology with ultra-fast comparative quantification to find, in a large viral database, the reference genome with the highest similarity to the sequenced reads. After identification, the reads are aligned to that best reference with Bowtie2. A consensus sequence is then produced with specific filters using Bcftools. Next, <i>de-novo</i> assembly (metaSPAdes) builds scaffolds. The high-coverage scaffolds that totally or partially overlap the consensus sequence (aligned by bwa) are used to validate or augment the new genome. The final analysis of the assembly is interactively supervised with IGV, with the goal of drafting the final sequence. For human-host variant calling, the same procedure is followed but starts directly at the second step, since the same reference (revised Cambridge Reference) is used in all cases.
<br> <p align="center"> <img src="imgs/pipeline.png" alt="TRACESPipe architecture" height="500" border="0" /> </p> <br>The previous image shows the architecture of TRACESPipe, where the green line represents the human mitochondrial line. This pipeline has been tested on Illumina HiSeq and NovaSeq platforms. The operating system required to run it is Linux. On Windows, use Cygwin (https://www.cygwin.com/) and make sure that the installation includes cmake, make, zcat, unzip, wget, tr, and grep (and any dependencies). Installing the complete Cygwin package set provides all of these. After that, all steps are the same as on Linux.
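Before installing, it can help to confirm that the tools named above are available on the PATH. The following is a minimal sketch, not part of TRACESPipe: the function name `check_tools` is hypothetical, and it relies only on the standard `command -v` shell builtin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TRACESPipe): report which of the
# given commands are missing from the PATH.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=$((missing + 1))
    fi
  done
  # Succeed only when every requested tool was found.
  [ "$missing" -eq 0 ]
}

# Tools named in the Cygwin note above.
if check_tools cmake make zcat unzip wget tr grep; then
  echo "all required tools found"
else
  echo "install the missing tools before continuing" >&2
fi
```

Any tool reported as missing can then be added through the Cygwin installer (or the Linux package manager) before running the installation commands.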
TRACESPipe includes methods for ancient DNA authentication, namely the quantification of damage (at the tips of the reads) relative to a reference. Another feature is the quantification of Y-chromosome presence through compression-based predictors.
Additionally, TRACESPipe includes read trimming and filtering, PhiX removal, and redundancy controls (at the database level and for each candidate reference genome) to improve the consistency and quality of the data.
2. Installation, Structure and Configuration
2.1 Installation
<img src="imgs/conda_logo.png" alt="CONDA" height="14" border="0" /> is needed for installation.<br> To install Conda use the following steps:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Additional instructions can be found here:
https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
To install TRACESPipe, run the following commands in a Linux OS:
git clone https://github.com/viromelab/tracespipe.git
cd tracespipe/src/
chmod +x TRACES*.sh
./TRACESPipe.sh --install
./TRACESPipe.sh --get-all-aux
Development note
The Install, Update, Version, and Check scripts, as well as this README, have sections that are automatically generated from the dependencies using:
make all
As a user you should not need to run this command, but if you have yq (a CLI YAML parser), you may.
As a developer, run it whenever there are changes to the system_files/dependencies.yml file,
the generator scripts, or the relevant files. A suggestion is to add this to a pre-commit Git hook.
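One way to set up such a hook is sketched below. This is an illustration rather than something shipped with TRACESPipe: the function name `install_hook` is hypothetical, and the generated hook assumes that `make all` is meant to be run from the repository root.

```shell
#!/usr/bin/env bash
# Hypothetical installer (not part of TRACESPipe): writes a pre-commit
# hook into the given hooks directory and makes it executable.
install_hook() {
  local hooks_dir="$1"
  mkdir -p "$hooks_dir"
  cat > "$hooks_dir/pre-commit" <<'EOF'
#!/usr/bin/env bash
# Regenerate the dependency-derived sections before each commit.
# Assumption: the Makefile lives at the repository root.
set -euo pipefail
cd "$(git rev-parse --show-toplevel)"
make all
EOF
  chmod +x "$hooks_dir/pre-commit"
}
```

From the top of the clone, `install_hook .git/hooks` would place the hook. Git runs it before every commit and aborts the commit if `make all` fails.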
2.2 Structure
In the tracespipe/ folder the following structure exists:
tracespipe/
│
├── meta_data/ # information about the filenames in input_data/ and organ names
│ └── meta_info.txt # see Configuration section for this file.
│
├── input_data/ # where the NGS reads must be placed (and compressed with gzip)
│
├── output_data/ # where the results will appear using the following subfolders:
│ │
│ │
│ ├── TRACES_preprocessed_reads/ # trimmed and adapter removed fastq files
│ ├── TRACES_results/ # where the files regarding the metagenomic
│ │ # analysis, redundancy (complexity) and control will appear
│ ├── TRACES_results/profiles/ # where the redundancy (complexity) profiles appear
│ │
│ ├── TRACES_viral_alignments/ # where viral alignments and index will appear
│ ├── TRACES_viral_consensus/ # where viral consensus (FASTA) will appear
│ ├── TRACES_viral_bed/ # where viral BED files will appear (SNPs and Coverage)
│ ├── TRACES_viral_statistics/ # where viral statistics appear (depth/wide coverage)
│ │
│ ├── TRACES_mtdna_alignments/ # where mtdna alignments and index will appear
│ ├── TRACES_mtdna_consensus/ # where mtdna consensus (FASTA) will appear
│ ├── TRACES_mtdna_bed/ # where mtdna BED files will appear (SNPs and Coverage)
│ ├── TRACES_mtdna_statistics/ # where mtdna statistics appear (depth/wide coverage)
│ ├── TRACES_mtdna_authentication/ # where mtdna species and population authentication appears
│ │
│ ├── TRACES_cy_alignments/ # where cy alignments and index will appear
│ ├── TRACES_cy_consensus/ # where cy consensus (FASTA) will appear
│ ├── TRACES_cy_bed/ # where cy BED files will appear (SNPs and Coverage)
│ ├── TRACES_cy_statistics/ # where cy statistics appear (depth/wide coverage)
│ │
│ ├── TRACES_specific_alignments/ # where specific alignments and index will appear
│ ├── TRACES_specific_consensus/ # where specific consensus (FASTA) will appear
│ ├── TRACES_specific_bed/ # where specific BED files will appear
│ ├── TRACES_specific_statistics/ # where specific statistics appear (depth/wide coverage)
│ │
│ ├── TRACES_mtdna_damage_<ORGAN>/ # where the mtdna damage estimation files will appear
│ │
│ ├── TRACES_denovo_<ORGAN>/ # where the output of de-novo assembly appears
│ │
│ ├── TRACES_hybrid_alignments/ # where the hybrid data appears
│ ├── TRACES_hybrid_consensus/ # where the hybrid data appears
│ ├── TRACES_hybrid_bed/ # where the hybrid data appears
│ │
│ ├── TRACES_hybrid_R2_alignments/ # where the second round hybrid data appears
│ ├── TRACES_hybrid_R2_consensus/ # where the second round hybrid data appears
│ ├── TRACES_hybrid_R2_bed/ # where the second round hybrid data appears
│ │
│ ├── TRACES_hybrid_R3_alignments/ # where the third round hybrid data appears
│ ├── TRACES_hybrid_R3_consensus/ # where the third round hybrid data appears
│ ├── TRACES_hybrid_R3_bed/ # where the third round hybrid data appears
│ │
│ ├── TRACES_hybrid_R4_alignments/ # where the fourth round hybrid data appears
│ ├── TRACES_hybrid_R4_consensus/ # where the fourth round hybrid data appears
│ ├── TRACES_hybrid_R4_bed/ # where the fourth round hybrid data appears
│ │
│ ├── TRACES_hybrid_R5_consensus/ # where the automatically chosen hybrid consensus
│ │ # appears (diff will be made using this data)
│ │
│ ├── TRACES_multiorgan_alignments/ # where the multi-organ alignments data appears
│ ├── TRACES_multiorgan_consensus/ # where the multi-organ consensus data appears
│ │
│ ├── TRACES_diff/ # where the dnadiff results appear (identity & SNPs)
│ ├── TRACES_specific_diff/ # where the dnadiff results appear for specific
│ │
│ └── TRACES_blasts/ # where the specific blasted results appear
│
├── to_encrypt_data/ # where the NGS files to encrypt must be before encryption
├── encrypted_data/ # where the encrypted data will appear
├── decrypted_data/ # where the decrypted data will appear
│
├── logs/ # where the logs (stdout, stderr, and system) will appear
│
├── src/ # where the bash code is and where the commands must be called
│
└── imgs/ # images related with the pipeline
2.3 Configuration
To configure TRACESPipe, add your <b>gzipped FASTQ files</b> to the folder
input_data/
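If the reads are not yet compressed, they can be gzipped while being copied into place. The following is a minimal sketch using standard `gzip`; the function name `stage_fastq` is hypothetical and not part of TRACESPipe.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TRACESPipe): place FASTQ files in a
# target folder, gzipping any that are not already compressed.
# The original files are left untouched.
stage_fastq() {
  local dest="$1" f
  shift
  mkdir -p "$dest"
  for f in "$@"; do
    case "$f" in
      *.gz) cp "$f" "$dest/" ;;   # already compressed: copy as-is
      *)    gzip -c "$f" > "$dest/$(basename "$f").gz" ;;
    esac
  done
}
```

For example, when working from src/, `stage_fastq ../input_data /path/to/reads/*.fastq` would fill input_data/ with `.fastq.gz` files (the source path here is a placeholder).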
Then, add a file named exactly <b>meta_info.txt</b> to the folder
meta_data/
This file needs to specify the organ type (as a single-word name) and the filenames of the paired-end reads. An example of the content of meta_info.txt is the following:
skin:V1_S44_R1_001.fastq.gz:V1_S44_R2_001.fastq.gz
brain:V2_S29_R1_001.
