CfDNApipe
cfDNApipe: A comprehensive quality control and analysis pipeline for cell-free DNA high-throughput sequencing data
cfDNApipe
- Introduction
- Section 1: Installation Tutorial
- Section 2: cfDNApipe Highlights
- Section 3: A Quick Tutorial for Analysis WGBS data
- Section 4: Perform Case-Control Analysis for WGBS data
- Section 5: How to Build Customized Pipeline using cfDNApipe
- Section 6: A Basic Quality Control: Fragment Length Distribution
- Section 7: Nucleosome Positioning
- Section 8: Inferring Tissue-Of-Origin based on deconvolution
- Section 9: Additional Function: WGS SNV/InDel Analysis
- Section 10: Additional Function: Virus Detection
- Section 11: Other Functions
- Section 12: How to use cfDNApipe results in Bioconductor/R
- FAQ
Links:
- cfDNApipe documentation
- codes for pipeline test
- codes for functional test
- demo report
- cfDNA test data (google drive)
Introduction
cfDNApipe (<u>c</u>ell-<u>f</u>ree <u>DNA</u> <u>Pipe</u>line) is an integrated pipeline for analyzing cell-free DNA WGBS/WGS data. It contains many cfDNA quality control and statistical algorithms. We have also collected some useful cell-free DNA references and provide them here. Users can access the cfDNApipe documentation here.
The whole pipeline is built on the processing graph principle. Users can run the preset pipelines for WGBS/WGS data or build their own analysis pipeline from any intermediate data, such as BAM files. The main functions are shown in the following figure.
<center> <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="./pics/pipeline.png"> <br> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;">cfDNApipe Functions</div> </center>
Section 1: Installation Tutorial
Section 1.1: System requirement
Popular WGBS/WGS analysis tools, such as FastQC and Bowtie2, are released for Unix/Linux systems and are written in different programming languages, so it is very difficult to rewrite all of them in one language. Fortunately, the conda/bioconda project collects many widely used Python modules and bioinformatics software, so we can install all the dependencies through conda/bioconda and arrange the pipelines in Python.
We recommend using conda/Anaconda and creating a virtual environment to manage all the dependencies. If you have not installed conda before, please follow this tutorial to install conda first.
After installation, you can create a new virtual environment for cfDNA analysis. With a virtual environment, all the dependencies are installed in one place and can be deleted easily by removing the environment.
Section 1.2: Create environment and Install Dependencies
We tested our pipeline with different versions of the software and provide an environment yml file for users. Users can download this file and create the environment with a single command.
First, please download the yml file.
wget https://xwanglabthu.github.io/cfDNApipe/environment.yml
Then, run the following command. The environment will be created and all the dependencies as well as the latest cfDNApipe will be installed.
# clean unused packages before installation
conda clean -y --all
# install environment
conda env create -n cfDNApipe -f environment.yml
<font color=red>Note:</font> The environment name can be changed by replacing "-n cfDNApipe" with "-n environment_name".
<font color=red>Note:</font> If errors about an <font color=blue>unavailable or invalid channel</font> occur, please check whether the .condarc file in your ~ directory has been modified. A modified .condarc file may cause a wrong channel error. In this case, just rename or back up your .condarc file. Once the installation has finished, this file can be restored. Of course, you can delete the .condarc file if necessary.
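The rename-and-restore workaround for .condarc can also be scripted. The following is only a minimal sketch: the function names and the ".bak" suffix are our own illustrative choices and are not part of cfDNApipe or conda.

```python
from pathlib import Path


def backup_condarc(home=None, suffix=".bak"):
    """Rename ~/.condarc to ~/.condarc<suffix>; return the backup path or None."""
    home = Path(home) if home else Path.home()
    condarc = home / ".condarc"
    backup = home / (".condarc" + suffix)
    if not condarc.exists():
        return None  # nothing to back up
    if backup.exists():
        raise FileExistsError(f"backup already exists: {backup}")
    condarc.rename(backup)
    return backup


def restore_condarc(home=None, suffix=".bak"):
    """Move the backup back to ~/.condarc; return the restored path or None."""
    home = Path(home) if home else Path.home()
    condarc = home / ".condarc"
    backup = home / (".condarc" + suffix)
    if not backup.exists():
        return None  # no backup was made
    if condarc.exists():
        raise FileExistsError(f"refusing to overwrite: {condarc}")
    backup.rename(condarc)
    return condarc
```

Call backup_condarc() before creating the environment and restore_condarc() after the installation has finished. Both functions refuse to overwrite an existing file, so a second accidental call does not destroy your configuration.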
Section 1.3: Activate Environment and Use cfDNApipe
Once the environment is created, users can enter it using the following command.
conda activate cfDNApipe
Now, just open Python and process **cell-free DNA WGBS/WGS paired/single-end** data. For a more detailed explanation of each function and its parameters, please see the cfDNApipe documentation.
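Before starting an analysis, it can help to confirm that the environment is really active and that the external tools are reachable. The sketch below is not part of cfDNApipe. It relies on the CONDA_DEFAULT_ENV variable that conda sets on activation, and the executable names in the example call ("bowtie2", "fastqc", "bismark") are assumptions that you should adjust to the tools you need.

```python
import os
import shutil


def check_environment(tools, expected_env="cfDNApipe"):
    """Report the active conda environment and which tools are missing from PATH."""
    active_env = os.environ.get("CONDA_DEFAULT_ENV")  # set by `conda activate`
    found = {tool: shutil.which(tool) for tool in tools}
    return {
        "active_env": active_env,
        "env_matches": active_env == expected_env,
        "missing": [tool for tool, path in found.items() if path is None],
    }


if __name__ == "__main__":
    # Executable names are assumptions; change them to match your setup.
    print(check_environment(["bowtie2", "fastqc", "bismark"]))
```

If "env_matches" is False or "missing" is not empty, activate the environment again or re-create it from the yml file. If you chose a different environment name, pass it as expected_env.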
Section 2: cfDNApipe Highlights
cfDNApipe is a highly integrated cfDNA WGS/WGBS data processing pipeline. We designed many useful built-in mechanisms. Here, we introduce some important features.
Section 2.1: Dataflow Graph for WGS and WGBS Data Processing
cfDNApipe is organized by a built-in dataflow with a strictly defined up- and down-stream data interface. The following figure shows how WGS and WGBS data are processed.
<br/> <center> <img style="border-radius: 0.3125em; box-shadow: 0 2px 4px 0 rgba(34,36,38,.12),0 2px 10px 0 rgba(34,36,38,.08);" src="./pics/cfDNApipe_flowchart.png"> <br> <div style="color:orange; border-bottom: 1px solid #d9d9d9; display: inline-block; color: #999; padding: 2px;">cfDNApipe Flowchart Overview</div> </center> <br/>
For detailed data flow diagrams, please see the cfDNApipe documentation, which gives the full up- and down-stream relationships for every step.
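To illustrate what an up- and down-stream dataflow means in practice, here is a toy sketch that orders steps so that each one runs after the steps it depends on. It is not cfDNApipe's implementation, and the step names are hypothetical placeholders, not cfDNApipe class names.

```python
def run_order(upstream):
    """Return steps ordered so that every step comes after its upstream steps."""
    order, done = [], set()

    def visit(step, stack=()):
        if step in done:
            return
        if step in stack:
            raise ValueError(f"cycle detected at {step!r}")
        for parent in upstream.get(step, []):
            visit(parent, stack + (step,))
        done.add(step)
        order.append(step)

    for step in upstream:
        visit(step)
    return order


# Hypothetical steps: each key lists the steps whose output it consumes.
steps = {
    "fragment_length": ["duplicate_removal"],
    "duplicate_removal": ["alignment"],
    "alignment": ["adapter_trimming"],
    "adapter_trimming": [],
}

if __name__ == "__main__":
    print(run_order(steps))
```

Because each step only declares what it consumes, a pipeline that starts from intermediate data can simply omit the earlier steps from the dictionary.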
Section 2.2: Reference Auto Download and Building
For any HTS data analysis, the initial step is to set up reference files such as the genome sequence and annotation files. cfDNApipe can download references and build reference indexes automatically. If the reference and index files already exist, cfDNApipe will use them instead of downloading or rebuilding.
<font color=green>What reference files does cfDNApipe need?</font>
- For analyzing WGS data (taking hg19 as an example):
  - genome sequence file and indexes: hg19.fa, hg19.chrom.sizes, hg19.dict, hg19.fa.fai
  - bowtie2 related files: hg19.1.bt2 ~ hg19.4.bt2, hg19.rev.1.bt2 ~ hg19.rev.2.bt2
  - other reference files: such as the blacklist file and cytoBand file; we provide them here.
- For analyzing WGBS data (taking hg19 as an example):
  - genome sequence file and indexes: hg19.fa, hg19.chrom.sizes, hg19.dict, hg19.fa.fai
  - bismark related files: Bisulfite_Genome folder with CT_conversion and GA_conversion
  - other reference files: such as the CpG island file and cytoBand file; we provide them here.
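The file lists above can be turned into a quick pre-flight check that reports which reference files are still missing from a folder. This is a minimal sketch and not a cfDNApipe function. The helper names are hypothetical, CT_conversion and GA_conversion are assumed to be subfolders of Bisulfite_Genome, and the "other reference files" are left out because their file names are not listed here.

```python
from pathlib import Path


def expected_reference_files(genome="hg19", data="WGS"):
    """List the genome, index and aligner files named above for one data type."""
    common = [
        f"{genome}.fa",
        f"{genome}.chrom.sizes",
        f"{genome}.dict",
        f"{genome}.fa.fai",
    ]
    if data == "WGS":
        # bowtie2 index: <genome>.1.bt2 ... .4.bt2 and <genome>.rev.1.bt2, .rev.2.bt2
        bowtie2 = [f"{genome}.{i}.bt2" for i in range(1, 5)]
        bowtie2 += [f"{genome}.rev.{i}.bt2" for i in range(1, 3)]
        return common + bowtie2
    if data == "WGBS":
        # Assumption: both conversion folders sit inside Bisulfite_Genome.
        bismark = ["Bisulfite_Genome/CT_conversion", "Bisulfite_Genome/GA_conversion"]
        return common + bismark
    raise ValueError("data must be 'WGS' or 'WGBS'")


def missing_reference_files(refdir, genome="hg19", data="WGS"):
    """Return the expected files or folders that do not exist under refdir."""
    refdir = Path(refdir)
    return [
        name
        for name in expected_reference_files(genome, data)
        if not (refdir / name).exists()
    ]
```

An empty list from missing_reference_files means that all the listed files are already in place. Any other result shows what is still absent from the folder.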
Here, we introduce the global reference configuration functions in cfDNApipe, which download and build reference files automatically.
cfDNApipe contains two global reference configuration functions, pipeConfigure and pipeConfigure2. pipeConfigure is for single-group data analysis (without a control group), and pipeConfigure2 is for case-control analysis. Either function checks the reference files, such as the bowtie2 and bismark references. If they are not detected, the references will be downloaded and built. This step is <font color=red>necessary</font> and sets everything up once and for all.
*<font c
