The pipeline

ChimeraTE is a pipeline to detect chimeric transcripts derived from genes and transposable elements (TEs). It has two running Modes:

Mode 1: chimeric transcript detection based on the positions of exons and TE copies in the genome sequence;
Mode 2: chimeric transcript detection regardless of genomic position, which allows the detection of chimeras from TEs that are not present in the reference genome, but with less sensitivity.

Install
Required data
ChimeraTE Mode 1
ChimeraTE Mode 2

Install <a name="installation"></a>

Conda <a name="conda"></a>

ChimeraTE can be installed easily with conda. If you don't have conda installed on your machine, please follow this tutorial.

Once you have installed conda, enable the Bioconda channel with:

conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

Then, all dependencies to run ChimeraTE can be easily installed in a new conda environment by using the chimeraTE.yml file:

Download repository from github: git clone https://github.com/OliveiraDS-hub/ChimeraTE.git

Change to the ChimeraTE's folder: cd ChimeraTE

Create chimeraTE environment with all dependencies: conda env create -f chimeraTE.yml

Activate the new environment: conda activate chimeraTE

Note: We advise you to return your condarc config to the default with:

conda config --remove channels bioconda
conda config --remove channels conda-forge
conda config --set channel_priority false

Singularity <a name="singularity"></a>

As an alternative to conda, you can use singularity v3.10.0+ to build a container with all dependencies for ChimeraTE.

If you don't have sudo permissions:

singularity build --fakeroot chimeraTE.simg singularity.def

If you have sudo:

sudo singularity build chimeraTE.simg singularity.def

Then, to run ChimeraTE:

singularity exec chimeraTE.simg python3 chimTE_mode1.py --help
singularity exec chimeraTE.simg python3 chimTE_mode2.py --help

Requirements <a name="requirements"></a>

If you don't have conda or singularity, you can install all dependencies by hand, as an old school bioinformatician. Note that all of them must be available in your PATH.

Python dependencies
Software

Required data <a name="req_data"></a>

In order to run ChimeraTE, the following files are required according to the running Mode:

| Data | Mode 1 | Mode 2 | Mode 2 --assembly |
| -------- | -------- | -------- | -------- |
| Stranded paired-end RNA-seq - Fastq files | X | X | X |
| Assembled genome - Fasta file with chromosomes/scaffolds/contigs sequences | X | | |
| Gene annotation - GTF file with gene annotations (UTRs,exons,CDS) | X | | |
| TE annotation - GTF file with TE insertions | X | | |
| Reference transcripts - Fasta file with reference transcripts | | X | X |
| Reference TEs - Fasta with ref. TE insertions | | X | |
| Dfam taxonomy OR fasta with ref. TE consensuses | | | X |

ChimeraTE genome-guided - Mode1 <a name="mode1"></a>

In Mode 1, chimeric transcripts are detected based on the genomic location of TE insertions and exons. Chimeras from this Mode can be classified as TE-initiated, TE-exonized, and TE-terminated transcripts. Mode 1 does not detect chimeric transcripts derived from TE insertions that are absent from the provided reference genome.

cd ChimeraTE/
python3 chimTE_mode1.py --help

ChimeraTE Mode 1: The genome-guided approach to detect chimeric transcripts with RNA-seq data.

Required arguments:
  --genome      Genome in fasta
  --input       Paired-end files and their respective group/replicate
  --project     Directory name with output data
  --te          GTF file containing TE information
  --gene        GTF file containing gene information
  --strand      Define the strandness direction of the RNA-seq. Two options:
                "rf-stranded" OR "fwd-stranded"

Optional arguments:
  --chimera     Identify specific type of chimera: "TE-initiated" OR "TE-
                exonized" OR "TE-terminated"
  --window      Upstream and downstream window size (default = 3000)
  --replicate   Minimum recurrency of chimeric transcripts between RNA-seq
                replicates (default 2)
  --coverage    Minimum coverage (mean between replicates default 2 for
                chimeric transcripts detection)
  --fpkm        Minimum fpkm to consider a gene as expressed (default 1)
  --threads     Number of threads (default 6)
  --overlap     Minimum overlap between chimeric reads and TE insertions (default 0.50)
  --index       Absolute path to pre-existing STAR index

Prepare your data for Mode 1! <a name="prep_data"></a>

Input table

The input tab-delimited table provided with --input must have a specific format:

First column: Mate 1 from the paired-end data
Second column: Mate 2 from the paired-end data
Third column: Replicate/group name

| mate1 | mate2 | rep |
| -------- | -------- | -------- |
| /home/user/ChimeraTE/mate1_control1.fastq.gz | /home/user/ChimeraTE/mate2_control1.fastq.gz | rep1 |
| /home/user/ChimeraTE/mate1_control2.fastq.gz | /home/user/ChimeraTE/mate2_control2.fastq.gz | rep2 |
| /home/user/ChimeraTE/mate1_control3.fastq.gz | /home/user/ChimeraTE/mate2_control3.fastq.gz | rep3 |

The file itself must not contain a header row, as in the example --input table at example_data/mode1/input_example.tsv
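Before launching a run, it can help to sanity-check the --input table against the format described above. The following is a minimal sketch, not part of ChimeraTE: the function name check_input_table is hypothetical, the header test is only a heuristic (it looks for a literal "mate1" in the first field), and skipping blank lines is an assumption about what you want checked rather than a statement about how ChimeraTE parses the file.

```python
import csv
import io


def check_input_table(handle):
    """Return a list of problems found in a three-column, header-less TSV.

    Hypothetical helper for illustration; it only checks the layout
    described in this README (mate 1, mate 2, replicate/group name).
    """
    problems = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # assumption: blank lines are ignored by this check
        if len(row) != 3:
            problems.append(
                f"line {line_no}: expected 3 tab-separated columns, found {len(row)}"
            )
            continue
        mate1, mate2, group = (field.strip() for field in row)
        if line_no == 1 and mate1.lower() == "mate1":
            # Heuristic only: a first field named "mate1" suggests a header row.
            problems.append("line 1: looks like a header row; remove it")
        if not (mate1 and mate2 and group):
            problems.append(f"line {line_no}: empty field")
        elif mate1 == mate2:
            problems.append(f"line {line_no}: mate 1 and mate 2 are the same file")
    return problems


if __name__ == "__main__":
    example = (
        "/home/user/ChimeraTE/mate1_control1.fastq.gz\t"
        "/home/user/ChimeraTE/mate2_control1.fastq.gz\trep1\n"
    )
    print(check_input_table(io.StringIO(example)))
```

To check a file on disk, pass an open handle instead, for example `check_input_table(open("my_input.tsv", newline=""))`, where my_input.tsv stands for your own table.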

GTF for TEs

In many databases, the coordinates of TE insertions are provided as the .out file from RepeatMasker. If you already have a .out file from RepeatMasker, you can convert it to .gtf on Linux with:

tail -n +4 RMfile.out | egrep -v 'Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT' | awk -v OFS='\t' '{Sense=$9;sub(/C/,"-",Sense);$9=Sense;print $5,"RepeatMasker","similarity",$6,$7,$2,$9,".",$10}' > RMfile.gtf

If you don't have the .out file for your genome assembly, check out the util section.
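For readers who prefer Python to awk, the conversion command above can be approximated as follows. This is a sketch under stated assumptions, not the authors' code: the function name rm_out_to_gtf_lines is hypothetical, it assumes the standard whitespace-separated RepeatMasker .out layout with three header lines, and unlike the awk command it skips lines with fewer than 10 fields instead of printing them.

```python
import re

# Same pattern as the egrep -v filter: any line containing one of these is dropped.
EXCLUDED = re.compile(r"Satellite|Simple_repeat|rRNA|Low_complexity|RNA|ARTEFACT")


def rm_out_to_gtf_lines(lines):
    """Yield tab-separated records in the column order printed by the awk command.

    Hypothetical helper for illustration only.
    """
    for index, line in enumerate(lines):
        if index < 3:
            continue  # tail -n +4: skip the three header lines
        if EXCLUDED.search(line):
            continue  # egrep -v: drop satellites, simple repeats, RNAs, ...
        fields = line.split()
        if len(fields) < 10:
            continue  # difference from awk: short or blank lines are skipped
        strand = fields[8].replace("C", "-", 1)  # "C" (complement) becomes "-"
        yield "\t".join(
            [
                fields[4],       # $5: query sequence (chromosome/scaffold)
                "RepeatMasker",
                "similarity",
                fields[5],       # $6: start in query
                fields[6],       # $7: end in query
                fields[1],       # $2: percent divergence
                strand,          # $9: strand
                ".",
                fields[9],       # $10: repeat name
            ]
        )


if __name__ == "__main__":
    # RMfile.out and RMfile.gtf are the same placeholder names used in the command above.
    with open("RMfile.out") as source, open("RMfile.gtf", "w") as target:
        for record in rm_out_to_gtf_lines(source):
            target.write(record + "\n")
```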

Example Data Mode 1 <a name="example_m1"></a>

After installation, you can run ChimeraTE with the example data from the sampled RNA-seq from D. melanogaster used in our paper.

#Do not forget to activate your conda environment:
conda activate chimeraTE

#One-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa --input example_data/mode1/input_mode1.tsv --project example_mode1 --te example_data/mode1/dmel_TEs_sample.gtf --gene example_data/mode1/dmel_genes_sample.gtf --strand rf-stranded

#Multi-line
python3 chimTE_mode1.py --genome example_data/mode1/dmel_genome_sample.fa \
--input example_data/mode1/input_mode1.tsv \
--project example_mode1 \
--te example_data/mode1/dmel_5TEs_sample.gtf \
--gene example_data/mode1/dmel_5genes_sample.gtf \
--strand rf-stranded

If you have more than 6 threads available on your machine, you can use --threads to speed up the process.
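For scripted runs, the invocation shown above can also be assembled in Python. This is a minimal sketch, not part of ChimeraTE: the helper name build_mode1_command is hypothetical, it uses only flags listed in the --help output, and it assumes that the number of CPUs reported by os.cpu_count() is a sensible value for --threads on your machine.

```python
import os
import shlex


def build_mode1_command(genome, input_table, project, te_gtf, gene_gtf,
                        strand="rf-stranded", threads=None):
    """Return the chimTE_mode1.py command as an argument list.

    Hypothetical helper for illustration; it does not run anything.
    """
    if strand not in ("rf-stranded", "fwd-stranded"):
        raise ValueError('strand must be "rf-stranded" or "fwd-stranded"')
    if threads is None:
        # Assumption: use every CPU the OS reports; fall back to the documented default of 6.
        threads = os.cpu_count() or 6
    return [
        "python3", "chimTE_mode1.py",
        "--genome", genome,
        "--input", input_table,
        "--project", project,
        "--te", te_gtf,
        "--gene", gene_gtf,
        "--strand", strand,
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    command = build_mode1_command(
        genome="example_data/mode1/dmel_genome_sample.fa",
        input_table="example_data/mode1/input_mode1.tsv",
        project="example_mode1",
        te_gtf="example_data/mode1/dmel_TEs_sample.gtf",
        gene_gtf="example_data/mode1/dmel_genes_sample.gtf",
    )
    # Print a copy-pasteable shell line; run it with subprocess.run(command, check=True) if desired.
    print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.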

Output Mode 1 <a name="output_m1"></a>

The output files can be found at ChimeraTE/projects/$your_project_name. For instance, for the example data, you can find the output at ChimeraTE/projects/example_mode1. Inside this directory, you may find three tables:

TE-initiated_final.ct
TE-exonized_final.ct
TE-terminated_final.ct

These tables list the chimeric transcripts, with the location of the genes and TE insertions generating chimeras, as well as their corresponding coverage of chimeric reads (support). The 7th column of TE-exonized_final.ct gives the position of the TE within the gene region (Embedded, Intronic, or Overlapped), as in the example below:

=========================> TE-initiated_final.ct <=========================

| gene_id | gene_strand | gene_p
