Circompara2
Improved bioinformatic pipeline to identify and quantify circRNA expression from RNA-seq data by combining multiple circRNA detection methods
Title: CirComPara2
Subtitle: CircRNA detection from RNA-seq data using multiple methods
Project: CirComPara2
Author: Enrico Gaffo
Affiliation: Compgen - University of Padova
Web: http://compgen.bio.unipd.it
Date: January 20, 2021
output:
  html_document:
    toc: yes
    number_sections: no
Circompara2
CirComPara2 is a computational pipeline to detect, quantify, and correlate expression of linear and circular RNAs from RNA-seq data that combines multiple circRNA-detection methods.
Quick install
Run the following commands to download and locally install the scripts and tools required to run circompara2. If the installation fails, try installing each of the software packages listed below manually.
Required software before installation
Some libraries and software must be installed on your system before starting the circompara2 installation. On a fresh Ubuntu 20.04 (Focal) system, install the required packages by running:
sudo apt install git python2.7 wget unzip pkg-config default-jre r-base-core libcurl4-openssl-dev libxml2-dev libssl-dev curl pigz python-is-python2 python-dev-is-python2
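Before moving on, it can help to confirm that the command-line tools from the packages above are actually reachable. The following is a minimal Python 3 sketch, not part of circompara2: the executable names are assumptions derived from the package list (for instance, default-jre is assumed to provide `java` and r-base-core to provide `Rscript`), and the development libraries (libcurl4-openssl-dev, libxml2-dev, libssl-dev) are not checked.

```python
import shutil

# Executables assumed to be provided by the packages listed above.
# The development libraries (libcurl, libxml2, libssl) are not covered here.
REQUIRED_TOOLS = ["git", "python2.7", "wget", "unzip", "pkg-config",
                  "java", "Rscript", "curl", "pigz"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing executables: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```

An empty result only means the executables are on the PATH; it does not guarantee that the versions are the ones the installer expects.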
Virtual environment
Because not all software integrated in circompara2 runs on Python3, circompara2 still uses python2.7. If your system default is Python3, consider installing and running circompara2 in a virtual environment, such as one generated with virtualenv:
virtualenv -p /usr/bin/python2.7 p2.7venv
## activate the virtual environment
source p2.7venv/bin/activate
Now you can proceed with the installation (or launch circompara2 if you have already installed it).
Installation commands
Download and extract the latest release of CirComPara2, or clone the Git repository, then enter the circompara2 directory and run the automatic installer script:
git clone http://github.com/egaffo/circompara2
cd circompara2
./src/utils/bash/install_circompara
## make a link to the circompara2 main script into the main directory
ln -s src/utils/bash/circompara circompara2
Test your installation
cd test_circompara/analysis
../../circompara2
If you plan to use single-end reads, test with:
cd test_circompara/analysis_se
../../circompara2
Add circompara2 to your environment
Once the installation is complete, you can update your PATH environment variable so that you do not have to type the whole path to the circompara2 executable each time. From the terminal, type the following command (replace the /path/to/circompara2/install/dir string with circompara2's actual path):
export PATH=/path/to/circompara2/install/dir:$PATH
Alternatively, link circompara2's main script in your local bin directory:
cd /home/user/bin
ln -s /path/to/circompara2/install/dir/circompara2
Alternative installation: the circompara2 Docker image
If you are struggling with the installation, a Docker image of CirComPara2 is available from DockerHub. The image spares you the installation steps; just pull it:
docker pull egaffo/circompara2:v0.1.2.1
How to use
Set your analysis project
This section shows how to set up your project directory and run the analysis. To run an analysis, you usually need to specify your data (the sequenced reads in FASTQ format) and a reference genome in FASTA format.
Compose META file
You have to specify read files and sample names in a metadata table file. The file format is a comma separated text file with the following header line:
file,sample
Each subsequent row corresponds to one read file. For paired-end samples, write one line per file and give both lines the same sample name.
An example of the metadata table:
| file                   | sample |
|------------------------|--------|
| /path/to/reads_S1_1.fq | S1     |
| /path/to/reads_S1_2.fq | S1     |
| /path/to/reads_S2_1.fq | S2     |
| /path/to/reads_S2_2.fq | S2     |

and metadata file content:
file,sample
/path/to/reads_S1_1.fq,S1
/path/to/reads_S1_2.fq,S1
/path/to/reads_S2_1.fq,S2
/path/to/reads_S2_2.fq,S2
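For projects with many samples, the metadata table can be generated rather than typed by hand. The sketch below is a Python 3 helper that is not part of circompara2. It assumes a hypothetical naming scheme in which mate files end in `_1.fq` and `_2.fq`, and it uses everything before that suffix as the sample name; the function names and the `/path/to/fastq_dir` placeholder are illustrative only.

```python
import csv
import os


def meta_rows(file_names, fastq_dir, mate_suffixes=("_1.fq", "_2.fq")):
    """Return (file, sample) rows for the read files in file_names.

    Assumption: the sample name is the file name minus its mate suffix;
    files that match no suffix are ignored.
    """
    rows = []
    for name in sorted(file_names):
        for suffix in mate_suffixes:
            if name.endswith(suffix):
                rows.append((os.path.join(fastq_dir, name), name[:-len(suffix)]))
                break
    return rows


def write_meta(rows, meta_path="meta.csv"):
    """Write the rows as a comma separated file with the file,sample header."""
    with open(meta_path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["file", "sample"])
        writer.writerows(rows)


if __name__ == "__main__":
    fastq_dir = "/path/to/fastq_dir"  # placeholder: replace with your directory
    if os.path.isdir(fastq_dir):
        write_meta(meta_rows(os.listdir(fastq_dir), fastq_dir))
```

Adjust `mate_suffixes` if your files use a different extension, and review the generated table before running the analysis.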
In the meta file you can also specify the adapter sequences used to preprocess the reads: just add an adapter column containing the path to the adapter file.
| file                   | sample | adapter             |
|------------------------|--------|---------------------|
| /path/to/reads_S1_1.fq | S1     | /path/to/adapter.fa |
| /path/to/reads_S1_2.fq | S1     | /path/to/adapter.fa |
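Whether or not the adapter column is used, a quick sanity check of the metadata table can catch copy-paste mistakes, such as the same read file listed twice for one sample. The following Python 3 sketch is a hypothetical helper, not a circompara2 feature; it checks only for the required header columns and for repeated files within a sample.

```python
import csv
import io
from collections import defaultdict


def check_meta(meta_text):
    """Return a list of problems found in the metadata table text."""
    reader = csv.DictReader(io.StringIO(meta_text))
    fields = reader.fieldnames or []
    problems = ["missing column: " + column
                for column in ("file", "sample") if column not in fields]
    if problems:
        return problems

    # Collect the read files declared for each sample.
    files_per_sample = defaultdict(list)
    for row in reader:
        files_per_sample[row["sample"]].append(row["file"])

    for sample, files in sorted(files_per_sample.items()):
        if len(files) != len(set(files)):
            problems.append("sample %s lists the same file more than once" % sample)
    return problems


if __name__ == "__main__":
    with open("meta.csv") as handle:
        for problem in check_meta(handle.read()):
            print(problem)
```

An empty list means that no problem of these two kinds was found; the check does not verify that the files exist.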
Specify the reference genome file
A required parameter is the reference genome. You can either pass the reference genome from the command line
./circompara2 "GENOME_FASTA='/home/user/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa'"
or set the GENOME_FASTA parameter in the vars.py file, e.g.:
GENOME_FASTA = '/home/user/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
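When the pipeline is launched from another script, the command-line form shown above can be assembled programmatically. The sketch below is a Python 3 illustration with a hypothetical helper name. It builds one NAME='value' argument per parameter, which is what the shell passes to circompara2 once it has removed the outer double quotes in the example above.

```python
def circompara2_command(params, executable="./circompara2"):
    """Build the argument list for a circompara2 call.

    Each parameter becomes a single NAME='value' argument, mirroring the
    command-line example above. Parameters are sorted by name so that the
    resulting command is reproducible.
    """
    arguments = ["%s='%s'" % (name, value) for name, value in sorted(params.items())]
    return [executable] + arguments


if __name__ == "__main__":
    command = circompara2_command({
        "GENOME_FASTA": "/home/user/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa",
    })
    print(command)
    # To actually launch the pipeline from the project directory:
    # import subprocess
    # subprocess.check_call(command)
```

Passing the list directly to `subprocess` avoids a second layer of shell quoting. Values that themselves contain single quotes are not handled by this sketch.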
Specify options in vars.py
Although parameters can be set from the command line (surrounded by quotes), you can also set them in the vars.py file, which must be placed in the directory from which circompara2 is called.
The full list of parameters follows.
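As an illustration, a small vars.py might look like the sketch below. It uses only parameter names from the list in this document; the annotation path is a hypothetical placeholder, and writing every value as a string (including CPUS) is an assumption rather than a documented requirement.

```python
# vars.py: illustrative example only; replace the placeholder paths with your own.
# Assumption: all values are given as strings, following the GENOME_FASTA example.

META = 'meta.csv'
GENOME_FASTA = '/home/user/genomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa'
ANNOTATION = '/home/user/annotation/genes.gtf'  # hypothetical path
CPUS = '4'
CIRCRNA_METHODS = 'ciri,find_circ,circexplorer2_star'
PREPROCESSOR = 'trimmomatic'
```

Parameters left out of the file keep their default values.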
Parameters
META: The metadata table file where you specify the project samples, etc.
default: meta.csv
ANNOTATION: Gene annotation file (like Ensembl GTF/GFF)
default:
GENOME_FASTA: The FASTA file with the reference genome
default:
CIRCRNA_METHODS: Comma separated list of circRNA detection methods to use. Repeated values will be collapsed into unique values. Currently supported: ciri, dcc, circrna_finder, find_circ, circexplorer2_star, circexplorer2_bwa, circexplorer2_tophat, circexplorer2_segemehl, testrealign (a.k.a. Segemehl). Set an empty string to use all methods available (including deprecated methods).
default: ciri,find_circ,circexplorer2_star,circexplorer2_bwa,circexplorer2_segemehl,circexplorer2_tophat,dcc
CPUS: Set number of CPUs
default: 1
GENEPRED: The genome annotation in GenePred format
default:
GENOME_INDEX: The index of the reference genome for HISAT2
default:
SEGEMEHL_INDEX: The .idx index for segemehl
default:
BWA_INDEX: The index of the reference genome for BWA
default:
BOWTIE2_INDEX: The index of the reference genome for BOWTIE2
default:
STAR_INDEX: The directory path where to find Star genome index
default:
BOWTIE_INDEX: The index of the reference genome for BOWTIE when using CIRCexplorer2_tophat
default:
HISAT2_EXTRA_PARAMS: Extra parameters to add to the HISAT2 aligner fixed parameters '--dta --dta-cufflinks --rg-id <SAMPLE> --no-discordant --no-mixed --no-overlap'. For instance, '--rna-strandness FR' if stranded reads are used.
default: --seed 123
BWA_PARAMS: Extra parameters for BWA
default: -T 19
SEGEMEHL_PARAMS: SEGEMEHL extra parameters
default: -D 0
TOPHAT_PARAMS: Extra parameters to pass to TopHat
default:
STAR_PARAMS: Extra parameters to pass to STAR
default: --runRNGseed 123 --outSJfilterOverhangMin 15 15 15 15 --alignSJoverhangMin 15 --alignSJDBoverhangMin 15 --seedSearchStartLmax 30 --outFilterScoreMin 1 --outFilterMatchNmin 1 --outFilterMismatchNmax 2 --chimSegmentMin 15 --chimScoreMin 15 --chimScoreSeparation 10 --chimJunctionOverhangMin 15
BOWTIE2_PARAMS: Extra parameters to pass to Bowtie2 in addition to -p $CPUS --reorder --score-min=C,-15,0 -q
default: --seed 123
STRINGTIE_PARAMS: StringTie extra parameters. For instance, '--rf' assumes a stranded fr-firststrand library and should be used if dUTP stranded libraries were sequenced
default:
CIRI_EXTRA_PARAMS: CIRI additional parameters
default:
DCC_EXTRA_PARAMS: DCC additional parameters
default: -fg -M -F -Nr 1 1 -N
CE2_PARAMS: Parameters to pass to CIRCexplorer2 annotate
default:
TESTREALIGN_PARAMS: Segemehl/testrealign filtering parameters. -q indicates the minimum median quality of backsplice ends (like the Haarz parameter)
default: -q median_1
FINDCIRC_EXTRA_PARAMS: Parameters for find_circ.py. Additional parameters: --best-qual INT filters find_circ results, keeping those whose best_qual_left and best_qual_right fields are >= INT (default: INT = 40). --filter-tags TAG filters the lines of the find_circ.py output (sites.bed) by tag. Repeat it if multiple consecutive filter tags have to be applied.
default: --best-qual 40 --filter-tags UNAMBIGUOUS_BP --filter-tags ANCHOR_UNIQUE
CFINDER_EXTRA_PARAMS: Parameters for CircRNA_finder
default:
PREPROCESSOR: The read preprocessing tool to use. Currently, only "trimmomatic" is supported. Leave empty for no read preprocessing.
default:
PREPROCESSOR_PARAMS: Read preprocessor extra parameters. For instance, with Trimmomatic an empty string defaults to MAXINFO:40:0.5 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:30 MINLEN:50 AVGQUAL:30
default:
LINEAR_EXPRESSION_METHODS: The method to be used for the linear expression estimates/transcriptome reconstruction. To run more methods use a comma separated list. However, only the first method in the list will be used in downstream processing. Currently supported methods: stringtie,cufflinks,htseq.
default: stringtie
TOGGLE_TRANSCRIPTOME_RECONSTRUCTION: Set True to enable transcriptome reconstruction. By default, only genes and transcripts from the given annotation GTF file are quantified
default: False
READSTAT_METHODS: Comma separated list of methods to use for read statistics. Currently supported: fastqc
default: fastqc
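The CIRCRNA_METHODS description above states that repeated values are collapsed and that an empty string selects all available methods. The Python 3 sketch below illustrates that documented behavior; it is not the pipeline's own code. Stripping whitespace and raising an error on unknown names are assumptions added for the example.

```python
# Method names copied from the CIRCRNA_METHODS description above.
SUPPORTED_METHODS = (
    "ciri", "dcc", "circrna_finder", "find_circ",
    "circexplorer2_star", "circexplorer2_bwa", "circexplorer2_tophat",
    "circexplorer2_segemehl", "testrealign",
)


def normalize_methods(value):
    """Split a CIRCRNA_METHODS string, drop repeats, and reject unknown names."""
    methods = []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue
        if name not in SUPPORTED_METHODS:
            raise ValueError("unsupported circRNA method: " + name)
        if name not in methods:
            methods.append(name)
    # An empty string selects every available method, as documented above.
    return methods or list(SUPPORTED_METHODS)


if __name__ == "__main__":
    print(normalize_methods("ciri,dcc,ciri"))
```

Running a check like this on the value you plan to put in vars.py can catch a misspelled method name before a long analysis starts.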
