ACFS

Accurate CircRNA Finder Suite. Discovering circRNAs from RNA-Seq data.

Overview

CircRNAs are generated through splicing, or to be precise, back-splicing where the downstream splice donor attacks an upstream splice acceptor. Identifying the exact site of back-splice lies in the heart of circRNA discovery.

ACFS first examines and pinpoints the back-splice site from RNA-Seq alignment using a maximal entropy model. The expression of circRNAs is estimated from a second round of alignment to the inferred pseudo circular sequences.

No prior knowledge of gene annotation is needed for circRNA prediection, but reading the coordinates is far less interesting than reading gene names, so circRNAs are annotated using the gene annotation if provided. Also, given annotation, circRNA sequences can be better estimated (especially those contains short exons) and enhance the accuracy of circRNA expression quantification.

ACFS is designed for Single-end RNA-Seq reads. Seeing is believing, we would like to see read directly spanning the back-splice sites. Paired-end data is also supported, albeit with lower sensitivity (if neither of the read ends crosses the back-splice, but the read-pair does).

Change Log

Update on 2016-05-06 : Added extension for fining trans-splicing evidence, which can be used to identify fusion-circRNAs
Update on 2016-01-27 : Performance improvement and add scripts for simulation
Update on 2015-09-17 : Added a small real-world example
Update on 2015-08-20 : Added support for paired-end reads
Update on 2015-08-11 : Corrected the Tutorial section in README, thanks to Zol
Update on 2015-03-09 : Now ACF can include pre-defined circRNA annotations from a bed6 or bed12 file (and their authenticity will be checked, so please adjust (minJump, maxJump, minSplicingScore) accordingly ). This way, you can both predict novel circRNAs in your data and estimate the abundance of annotated circRNAs.
First release on 2013-11-01

Installation

Simply unpack the ACFS package.

Requirement

bwa-0.7.3a (included in the package, but you need to do "make")
perl
blat (not necessary, sometime it helps to rule out false positive fusion-circRNAs when there is a gene-duplication or gene-pseudogene dilemma)

Pipeline scheme

Map all Tophat2-unmapped-reads to genome using BWA, seperate :
- 1-part
- 2-part-same-chromosome-same-strand (contain circRNAs)
- 2-part-same-chromosome-diff-strand (possibly PolII backwalk)
- >2-part-same-chromosome (contain circRNAs, if they are originated from short exons)
- >=2-part-diff-chromosome (contain trans-splicing, or even fusion-circRNAs)
Estimate splicing strength using Christoph Berge's method for "2-part-same-chromosome-same-strand". report:
- forward-splice (not interesting)
- back-splice and canonical splice-motif {[GT-AT],[GC-AG],[AT-AC]}, calculate strength using MaxEnt
- back-splice and non-canonical splice-motif, find the maximal score using MaxEnt and the corresponding back-splice site
- try to rescue circles using ">2-part-same-chromosome"
Check if both splice sites of the cirRNAs are on known exon-borders if an annotation is provided; otherwise all are CBR. report:
- both known exon-border (termed : MEA . M atch with E xisting A nnotation, which is somewhat more reliable)
- at least one splice site sits on unknown exon-border (termed : CBR . C hris B erge R escued)
Build gtf and pseudo-transcript for results from step3
Map all unmapped-reads to pseudo-transcipts and estimate the abundance of circRNAs
Generate bed track for visulization
Optional search of trans-splicing and fusion-circRNA events

Before running ACFS, a few pre-process

Map the RNA-Seq reads to genome and transcriptom, and extract the unmapped reads. This is recommended as those mapped reads will NOT span the back-splice sites, and therefore do NOT contribute to circRNA discovery.
Change fasta/fastq header format to allow processing multiple samples in one run. This is IMPORTANT ! ACFS expects a special header format so that multiple samples can be processed in one run. Do change the default header such as ">HWUSI-EAS100R:6:73:941:1973" into ">Truseq_sample1_HWUSI-EAS100R:6:73:941:1973", where the "sample1" is the name of your choice describing the sample. Do the conversion as:
```
perl change_fastq_header.pl SRR650317_1.fasta SRR650317_1.fa Truseq_SRR650317left
perl change_fastq_header.pl SRR650317_2.fasta SRR650317_2.fa Truseq_SRR650317right
```
Make sure there is No underline within the sample name. e.g. ">Truseq_ctrl.1_Default_header" and ">Truseq_AGO2KO.1_Default_header" are OK; ">Truseq_ctrl_1_Default_header" is BAD because ACF will register "ctrl" as the sample name instead of "ctrl_1".
Merge sequences from multiple fasta/fastq files into one fasta file, which saves time for mapping.
```
perl Truseq_merge_unique_fa.pl UNMAP SRR650317_1.fa SRR650317_2.fa
```
Alternatively, if there are many files to merge, generate a file (named filelist for example) contains the full-path of each file
```
perl Truseq_merge_unique_filelist.pl <UNMAP> <filelist>
```
UNMAP is the collasped fasta file which will be processed by ACFS, and UNMAP_expr contains the readcount per sequence in all the samples. If you change the name of UNMAP to SomethingElse, then the readcount file will be automatically named as SomethingElse_expr

However, one can bypass the previous and this step to run ACFS sample by sample. This way, no fasta header reformatting and reads collapsing is needed. For each sample, set the value of UNMAP to the name of fasta/fastq in the SPEC_example.txt file, and set the value of UNMAP_expr to "no".
Build BWA index, using verion 0.73a (currently not support for other versions as the output format changes between versions)
```
/bin/bwa073a/bwa index /data/iGenome/human/Ensembl/GRCh37/Sequence/BWAIndex/genome.fa
```
Prepare for annotation (recommended) Download the gtf file from iGenome package or download ensembl gtf here : ftp://ftp.ensembl.org/pub/current_gtf/ Then run:
```
perl get_split_exon_border_biotype_genename.pl </data/iGenome/human/Ensembl/GRCh37/Annotation/Genes/genes.gtf> </data/iGenome/human/Ensembl/GRCh37/Annotation/Genes/Homo_sapiens.GRCh37.71_split_exon.gtf>
```
The first argument is the input gtf file, the second argument is the output
Strandedness is assumed as from the Truseq Stranded RNA-Seq, so the reads are reverse-complementary to mRNAs.
- If the reads are actually sense (the same 5'->3' direction as mRNA), please reverse-complement all reads.
- If the reads are actually stransless or pair-ended, please run in parallel original reads and reverse-complemented reads.
- No pair-end information is used, as the exact junction-site must be supported by a single read. Seeing is believing.
For Paired-end data, it is highly recommended to align the reads to genome+transcriptome first (e.g. using Tophat2), and extract the unmapped read (read pairs) using the following script (the fasta header line is also modified)
```
perl convert_unmapped_SAM_to_fa_for_acfs.pl <output_file_name> <unmapped.sam> sample_id
```

Parameters

There are nine mandatory parameters to run ACFS in a basic mode. Searching for fusion-circRNAs is disabled by default. Please modify the config file "SPEC_example.txt" according to your specific organism, experimental design and sequence specs. The config file "SPEC_example.txt" is a two-column tab-delimited file : <name>\t<value>

Mandatory paramters:

| Parameter | value | Note | | --------- | ----- | ---- | | BWA_folder | /home/bin/bwa037a/ | path of the folder of bwa | | BWA_genome_Index | /data/.../BWAIndex/genome.fa | full path to the index files | | BWA_genome_folder | /data/.../Chromosomes/ | full path to the folder containing separeted chromosome files | | ACF_folder | /home/bin/ACFS/ | path of the folder of ACFS | | CBR_folder | /home/bin/ACFS/CB_splice/ | path of the folder for MaxEnt | | Agtf | /data/.../Homo_sapiens.GRCh37.71_split_exon.gtf | processed annotation, see the previous section point-4 | | UNMAP | UNMAP | the collapsed fasta file | | UNMAP_expr | UNMAP_expr | the expression of the collasped reads | | Seq_len | 150 | length of sequencing reads |

Optional parameters, the values in below are set as default:

| Parameter | value | Note | | --------- | ----- | ---- | | Thread | 16 | number of threads used in bwa | | BWA_seed_length | 16 | bwa seed length | | BWA_min_score | 20 | bwa min score to trigger report. For shorter reads, e.g. 50nt, set to 10 or lower could report more circRNAs at risk of higher FDR | | minJump | 100 | the minimum distance of a back-splice. The smaller, the more likely you can find circles from short exons | | maxJump | 2500000 | the maximum distance of a back-splice. The larger, the more likely you can find circles from long genes. The longest human gene is CNTNAP2 which spans 2.3M bp | | minSplicingScore | 10 | the minimum score for the sum of splicing strength at both splice site, 10 corresponds to 95% of all human/mouse splice site pairs. One could also set it to a lower value, e.g. zero, and do a post-filtering after running acfs | | minSampleCnt | 1 | the minimum number of samples that detect any given circle | | minReadCnt | 1 | the minimum number of reads (from all samples) that detect any given circle | | minMappingQuality | 20 | the minimum mapping quality of any given sequence | | minSpanJunc | 6 | the minimum number of bases reach beyond the back-splice-site. The larger the less likely of false-positive | | Coverage | 0.9 | the minimum percentage of any given read is ali

Acfs

Install / Use

README