ChimeraTE
A pipeline to detect chimeric transcripts derived from genes and transposable elements.
The pipeline
ChimeraTE is a pipeline to detect chimeric transcripts derived from genes and transposable elements (TEs). It has two running Modes:
- Mode 1: detection of chimeric transcripts based on the positions of exons and TE copies in the genome sequence;
- Mode 2: detection of chimeric transcripts regardless of genomic position, which allows the detection of chimeras from TEs that are not present in the reference genome, but with lower sensitivity.
Install <a name="installation"></a>
Conda <a name="conda"></a>
The installation can be done with conda. If you don't have conda installed on your machine, please follow this tutorial.
Once you have installed conda, you need to enable the Bioconda channel with:
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
Then, all dependencies to run ChimeraTE can be easily installed in a new conda environment by using the chimeraTE.yml file:
Download the repository from GitHub:<br />git clone https://github.com/OliveiraDS-hub/ChimeraTE.git
Change to the ChimeraTE folder:<br />cd ChimeraTE
Create chimeraTE environment with all dependencies:<br />conda env create -f chimeraTE.yml
Activate the new environment:<br />conda activate chimeraTE
Note: We advise you to return your condarc config to the default with:
conda config --remove channels bioconda
conda config --remove channels conda-forge
conda config --set channel_priority false
Singularity <a name="singularity"></a>
As an alternative to conda, you can use Singularity v3.10.0+ to build a container with all dependencies for ChimeraTE.
If you don't have sudo permissions:
singularity build --fakeroot chimeraTE.simg singularity.def
If you have sudo:
sudo singularity build chimeraTE.simg singularity.def
Then, to run ChimeraTE:
singularity exec chimeraTE.simg python3 chimTE_mode1.py --help
singularity exec chimeraTE.simg python3 chimTE_mode2.py --help
Requirements <a name="requirements"></a>
If you don't have conda or singularity, you can install all dependencies as an old school bioinformatician. It's important to highlight that all of them must be available in your PATH.
- Python dependencies
- Software
Required data <a name="req_data"></a>
In order to run ChimeraTE, the following files are required according to the running Mode:
| Data | Mode 1 | Mode 2 | Mode 2 --assembly |
| -------- | -------- | -------- | -------- |
| Stranded paired-end RNA-seq - Fastq files | X | X | X |
| Assembled genome - Fasta file with chromosomes/scaffolds/contigs sequences | X | | |
| Gene annotation - GTF file with gene annotations (UTRs,exons,CDS) | X | | |
| TE annotation - GTF file with TE insertions | X | | |
| Reference transcripts - Fasta file with reference transcripts | | X | X |
| Reference TEs - Fasta with ref. TE insertions | | X | |
| Dfam taxonomy OR fasta with ref. TE consensuses | | | X |
ChimeraTE genome-guided - Mode1 <a name="mode1"></a>
In Mode 1, chimeric transcripts are detected based on the genomic location of TE insertions and exons. Chimeras from this Mode can be classified as TE-initiated, TE-exonized, and TE-terminated transcripts. Mode 1 does not detect chimeric transcripts derived from TE insertions that are absent from the provided reference genome.
cd ChimeraTE/
python3 chimTE_mode1.py --help
ChimeraTE Mode 1: The genome-guided approach to detect chimeric transcripts with RNA-seq data.
Required arguments:
--genome Genome in fasta
--input Paired-end files and their respective group/replicate
--project Directory name with output data
--te GTF file containing TE information
--gene GTF file containing gene information
--strand Define the strandness direction of the RNA-seq. Two options:
"rf-stranded" OR "fwd-stranded"
Optional arguments:
--chimera Identify specific type of chimera: "TE-initiated" OR "TE-
exonized" OR "TE-terminated"
--window Upstream and downstream window size (default = 3000)
--replicate Minimum recurrency of chimeric transcripts between RNA-seq
replicates (default 2)
--coverage Minimum coverage (mean between replicates default 2 for
chimeric transcripts detection)
--fpkm Minimum fpkm to consider a gene as expressed (default 1)
--threads Number of threads (default 6)
--overlap Minimum overlap between chimeric reads and TE insertions (default 0.50)
--index Absolute path to pre-existing STAR index
Prepare your data for Mode 1! <a name="prep_data"></a>
Input table
The input tab-delimited table provided with --input must have a specific format:
First column: Mate 1 from the paired-end data
Second column: Mate 2 from the paired-end data
Third column: Replicate/group name
| mate1 | mate2 | rep |
| -------- | -------- | -------- |
| /home/user/ChimeraTE/mate1_control1.fastq.gz | /home/user/ChimeraTE/mate2_control1.fastq.gz | rep1 |
| /home/user/ChimeraTE/mate1_control2.fastq.gz | /home/user/ChimeraTE/mate2_control2.fastq.gz | rep2 |
| /home/user/ChimeraTE/mate1_control3.fastq.gz | /home/user/ChimeraTE/mate2_control3.fastq.gz | rep3 |
The table must not contain a header row, as in the example --input table at example_data/mode1/input_example.tsv
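Before launching a long run, it can help to check the table format described above. The following is a minimal sketch, not part of ChimeraTE: the function name `check_input_table` and the header heuristic (a first field equal to "mate1") are assumptions made for illustration only.

```python
import csv
from pathlib import Path


def check_input_table(table_path, require_existing_files=False):
    """Return a list of problems found in a three-column, tab-delimited --input table."""
    problems = []
    seen_rows = 0
    with open(table_path, newline="") as handle:
        for number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if not row:
                continue  # ignore completely blank lines
            seen_rows += 1
            if len(row) != 3:
                problems.append(
                    f"line {number}: expected 3 tab-separated columns, found {len(row)}"
                )
                continue
            mate1, mate2, replicate = (field.strip() for field in row)
            # Heuristic (assumption): a first field literally named "mate1" is a header.
            if seen_rows == 1 and mate1.lower() == "mate1":
                problems.append(f"line {number}: looks like a header row")
            if not replicate:
                problems.append(f"line {number}: empty replicate/group name")
            if mate1 == mate2:
                problems.append(f"line {number}: mate 1 and mate 2 are the same file")
            if require_existing_files:
                for path in (mate1, mate2):
                    if not Path(path).is_file():
                        problems.append(f"line {number}: file not found: {path}")
    if seen_rows == 0:
        problems.append("table is empty")
    return problems
```

Calling `check_input_table("my_input.tsv", require_existing_files=True)` returns an empty list when the table looks fine, or one message per problem otherwise.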
GTF for TEs
In many databases, the coordinates of TE insertions are provided as the .out file from RepeatMasker. If you already have a .out file from RepeatMasker, you can convert it to .gtf on Linux with:
tail -n +4 RMfile.out | egrep -v 'Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT' | awk -v OFS='\t' '{Sense=$9;sub(/C/,"-",Sense);$9=Sense;print $5,"RepeatMasker","similarity",$6,$7,$2,$9,".",$10}' > RMfile.gtf
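For readers who prefer to see what the one-liner does step by step, here is a rough Python equivalent. It is a sketch under the assumption that the .out file has the standard three header lines and whitespace-separated columns; the function name `rm_out_to_gtf_lines` is hypothetical and not part of ChimeraTE.

```python
import re

# Same patterns that the egrep -v filter in the command above discards.
EXCLUDED = re.compile(r"Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT")


def rm_out_to_gtf_lines(out_lines):
    """Yield tab-separated lines with the same nine columns as the awk command."""
    for index, line in enumerate(out_lines):
        if index < 3:
            continue  # tail -n +4: skip the three header lines
        if EXCLUDED.search(line):
            continue  # egrep -v: drop lines matching any excluded pattern
        fields = line.split()
        if len(fields) < 10:
            continue  # unlike awk, this sketch skips short or blank lines
        # RepeatMasker marks the complement strand as "C"; GTF uses "-".
        strand = fields[8].replace("C", "-", 1)
        yield "\t".join([
            fields[4],        # $5: query sequence (chromosome/scaffold)
            "RepeatMasker",   # source
            "similarity",     # feature
            fields[5],        # $6: start
            fields[6],        # $7: end
            fields[1],        # $2: percent divergence, used as score
            strand,           # $9: strand
            ".",              # frame
            fields[9],        # $10: matching repeat name
        ])


def convert_file(out_path, gtf_path):
    """Read a RepeatMasker .out file and write the converted lines to gtf_path."""
    with open(out_path) as source, open(gtf_path, "w") as target:
        for gtf_line in rm_out_to_gtf_lines(source):
            target.write(gtf_line + "\n")
```

For example, `convert_file("RMfile.out", "RMfile.gtf")` would mirror the file names used in the shell command.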
If you don't have the .out file for your genome assembly, see the util section.
Example Data Mode 1 <a name="example_m1"></a>
After installation, you can run ChimeraTE with the example data, a sample of the D. melanogaster RNA-seq data used in our paper.
#Do not forget to activate your conda environment:
conda activate chimeraTE
#One-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa --input example_data/mode1/input_mode1.tsv --project example_mode1 --te example_data/mode1/dmel_TEs_sample.gtf --gene example_data/mode1/dmel_genes_sample.gtf --strand rf-stranded
#Multi-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa \
--input example_data/mode1/input_mode1.tsv \
--project example_mode1 \
--te example_data/mode1/dmel_5TEs_sample.gtf \
--gene example_data/mode1/dmel_5genes_sample.gtf \
--strand rf-stranded
If you have more than 6 threads available on your machine, you can use --threads to speed up the process.
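If you launch ChimeraTE from a Python script, the thread count can be set from the machine's CPU count. The snippet below is only a sketch: `build_mode1_command` is a hypothetical helper, the paths in the usage lines are placeholders, and it only uses the flags listed in the help output above.

```python
import os
import shlex


def build_mode1_command(genome, input_table, project, te_gtf, gene_gtf,
                        strand="rf-stranded", threads=None):
    """Return the argument list for a Mode 1 run with an explicit --threads value."""
    if threads is None:
        # os.cpu_count() may return None; fall back to the documented default of 6.
        threads = os.cpu_count() or 6
    return [
        "python3", "chimTE_mode1.py",
        "--genome", genome,
        "--input", input_table,
        "--project", project,
        "--te", te_gtf,
        "--gene", gene_gtf,
        "--strand", strand,
        "--threads", str(threads),
    ]


# Placeholder paths (assumptions); replace them with your own files.
command = build_mode1_command("genome.fa", "input.tsv", "my_project",
                              "TEs.gtf", "genes.gtf")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the ChimeraTE folder. Note that `os.cpu_count()` reports logical CPUs, which may be more than you are allowed to use on a shared cluster, so passing `threads` explicitly is often safer there.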
Output Mode 1 <a name="output_m1"></a>
The output files can be found at ChimeraTE/projects/$your_project_name. For instance, for the example data, you can find the output at ChimeraTE/projects/example_mode1. Inside this directory, you may find three tables:
- TE-initiated_final.ct
- TE-exonized_final.ct
- TE-terminated_final.ct
These tables list the chimeric transcripts, with the location of the genes and TE insertions generating chimeras, as well as the corresponding coverage of chimeric reads (support). In the 7th column of TE-exonized_final.ct, you can find the position of the TE within the gene region (Embedded, Intronic, or Overlapped). An example is shown below:
=========================> TE-initiated_final.ct <=========================
| gene_id | gene_strand | gene_p
