
psirc

Psirc (<ins>ps</ins>eudo-alignment identification of c<ins>irc</ins>ular RNAs) is for back-splicing junction detection, full-length linear and circular transcript isoform inference and quantification from RNA-seq data.

The whole psirc pipeline has two main parts: 1. Detecting <ins>b</ins>ack-<ins>s</ins>plicing <ins>j</ins>unctions (BSJs) and inferring <ins>f</ins>ull-<ins>l</ins>ength <ins>i</ins>soforms (FLIs); and 2. Quantification of FLIs (both linear and circular FLIs at the same time). Each part works well stand-alone, but it is recommended to use them together.

<p align="center" width="100%"> <img src="./figs/psirc_pipeline.png" width="500" alt="psirc pipeline"/> </p>

If you use psirc in your study, please cite:
Ken Hung-On Yu*, Christina Huan Shi*, Bo Wang, Savio Ho-Chit Chow, Grace Tin-Yun Chung, Ke-En Tan, Yat-Yuen Lim, Anna Chi-Man Tsang, Kwok-Wai Lo, Kevin Y. Yip. Quantifying full-length circular RNAs in cancer. Genome Research 31.12 (2021): 2340-2353. Available from: https://genome.cshlp.org/content/31/12/2340.short


<a name="library"></a>External libraries

  • zlib
  • HDF5 C libraries

<a name="install"></a>Installation

The first part of psirc is implemented as a Perl script, which can be run directly. The second part of psirc (psirc-quant) is implemented in C and C++ and needs to be compiled:

    git clone https://github.com/Christina-hshi/psirc.git
    cd psirc
    cd psirc-quant
    #you may need to compile htslib under "ext/htslib" by following the README there ("make install" is optional)
    mkdir release
    cd release
    cmake ..
    make psirc-quant
    #the psirc-quant program can be found at "src/psirc-quant"
    make install  # optional

<a name="require"></a>Requirements

  • Input: paired-end RNA-seq reads
  • [custom_transcriptome_fa][custom_transcriptome_fa]
  • Forked kallisto<sup>1</sup>: [Linux] or [Mac] executable

<a name="synop_require"></a>Synopsis of requirements

The psirc pipeline has two parts. The first part is the [psirc script (for detecting BSJs and inferring FLIs), currently psirc_v1.0.pl], a Perl script that fully automates the production of the BSJ detection and FLI inference outputs from the requirements above. The second part is psirc-quant, which quantifies the abundances of FLIs from RNA-seq data.

<ins>Input: paired-end RNA-seq reads</ins>
The RNA-seq reads should come from a library preparation strategy that retains circular RNAs (circRNAs), such as ribosomal RNA (rRNA) depletion or exome-capture RNA-seq. Only paired-end reads are accepted, because single-end reads have inherent read density biases and false positive alignments and are therefore not recommended for circRNA detection<sup>2</sup>. The paired-end reads need to be in FASTQ format and can be gzipped.

<ins>custom_transcriptome_fa</ins>
A FASTA file that contains all annotated transcript sequences of a reference sequence, with each sequence having a custom header (indicating various positional, name, and strand information). Ready-to-use custom_transcriptome_fas are available for [human] and for [Epstein-Barr virus (EBV)], since these were used in our study. It is also easy to generate one for any well-annotated transcriptome by following the Generation of custom_transcriptome_fa section. The one we generated for human is called "gencode.v29.annotation.custom_transcriptome.fa" and is used in the General usages section.

<ins>Forked kallisto</ins>
A forked version of kallisto v0.43.1, which was modified to allow multi-threading. This version is used because it is the last one that outputs a SAM-formatted pseudo-alignment to stdout, which allows the pseudo-alignment to be processed while it is still being generated. The forked kallisto executable is available for [Linux] or [Mac], or can be [compiled from the source code]. "kallisto" in the General usages section refers to this version of kallisto.

<a name="gen_usages"></a>General usages

Index the custom_transcriptome_fa (needs to be performed only once):

    perl psirc_v1.0.pl -i gencode.v29.annotation.custom_transcriptome.fa kallisto

Produce both BSJ detection and FLI inference outputs in a single run (recommended):

    perl psirc_v1.0.pl -f -o output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq

Produce BSJ detection output only:

    perl psirc_v1.0.pl -o output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq

Produce FLI inference output only, from the result of a previous BSJ detection run:

    perl psirc_v1.0.pl -s output_directory gencode.v29.annotation.custom_transcriptome.fa kallisto R1.fastq R2.fastq

where gencode.v29.annotation.custom_transcriptome.fa is the custom_transcriptome_fa, kallisto is the forked kallisto, output_directory is the user-specified directory to place the outputs, and R1.fastq R2.fastq are the input paired-end RNA-seq reads.
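The four invocations above differ only in their leading flags, so they can be assembled programmatically. The following is a minimal sketch, assuming the argument order shown in the usages above; the function name `psirc_command` and the mode labels (`"index"`, `"both"`, `"bsj"`, `"fli"`) are hypothetical helpers for illustration and are not part of psirc.

```python
from typing import List, Optional, Sequence

PSIRC_SCRIPT = "psirc_v1.0.pl"  # script name as used in the usages above

# Hypothetical mode labels mapped to the flags shown in the usages above.
MODE_FLAGS = {
    "index": ["-i"],       # index the custom_transcriptome_fa
    "both": ["-f", "-o"],  # BSJ detection and FLI inference in one run
    "bsj": ["-o"],         # BSJ detection only
    "fli": ["-s"],         # FLI inference from an existing BSJ detection output
}


def psirc_command(mode: str, transcriptome: str, kallisto: str,
                  output_dir: Optional[str] = None,
                  reads: Sequence[str] = ()) -> List[str]:
    """Build the argument list for one psirc run without executing it."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["perl", PSIRC_SCRIPT] + MODE_FLAGS[mode]
    if mode == "index":
        return cmd + [transcriptome, kallisto]
    if output_dir is None or len(reads) != 2:
        raise ValueError("an output directory and two FASTQ files are required")
    return cmd + [output_dir, transcriptome, kallisto, *reads]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building the list separately makes it easy to log or inspect the command first.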

Index the inferred FLIs:

In the FASTA file, the header line of every circular transcript must end with "\tC" so that the program recognizes it as circular. Header lines of linear transcripts must not end with "\tC". The outputs produced by psirc_v1.0.pl already meet this requirement, but the outputs of other FLI inference tools may not.

    psirc-quant index [arguments] <FLI sequences>

    Required argument:
      -i, --index=STRING          Filename for the index to be constructed

    Optional arguments:
      -k, --kmer-size=INT         k-mer (odd) length (default: 31, max value: 31)
          --make-unique           Replace repeated target names with unique names
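When the FLI sequences come from another inference tool, the header requirement described above can be checked before indexing. The sketch below assumes that "\tC" denotes a literal tab character followed by the letter C; the function names `split_headers` and `mark_circular` are hypothetical and not part of psirc.

```python
from typing import Iterable, List, Tuple

CIRCULAR_SUFFIX = "\tC"  # assumption: a tab character followed by "C"


def split_headers(fasta_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (circular_headers, linear_headers) found in FASTA lines."""
    circular: List[str] = []
    linear: List[str] = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line.rstrip("\r\n")  # strip only the line ending
        if header.endswith(CIRCULAR_SUFFIX):
            circular.append(header)
        else:
            linear.append(header)
    return circular, linear


def mark_circular(header: str) -> str:
    """Append the circular marker to a header unless it is already present."""
    header = header.rstrip("\r\n")
    if header.endswith(CIRCULAR_SUFFIX):
        return header
    return header + CIRCULAR_SUFFIX
```

For a file on disk, `split_headers(open(path))` lists which transcripts would be treated as circular, which is a quick way to confirm the marker survived any intermediate processing.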

Quantify FLIs:

    psirc-quant quant [arguments] R1.fastq R2.fastq

    Required arguments:
      -i, --index=STRING            Filename for the index to be used for
                                    quantification
      -o, --output-dir=STRING       Directory to write output to

    Optional arguments:
          --bias                    Perform sequence based bias correction
      -b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
          --seed=INT                Seed for the bootstrap sampling (default: 42)
          --plaintext               Output plaintext instead of HDF5
          --fusion                  Search for fusions for Pizzly
          --single                  Quantify single-end reads
          --single-overhang         Include reads where unobserved rest of fragment is
                                    predicted to lie outside a transcript
          --fr-stranded             Strand specific reads, first read forward
          --rf-stranded             Strand specific reads, first read reverse
      -l, --fragment-length=DOUBLE  Estimated average fragment length
      -s, --sd=DOUBLE               Estimated standard deviation of fragment length
      -x, --min-fragment-length     Minimum length of a valid fragment
      -X, --max-fragment-length     Maximum length of a valid fragment
                                    (default: -l, -s values are estimated from paired
                                     end data, but are required when using --single)
      -t, --threads=INT             Number of threads to use (default: 1)
          --pseudobam               Save pseudoalignments to transcriptome to BAM file
          --genomebam               Project pseudoalignments to genome sorted BAM file
      -g, --gtf                     GTF file for transcriptome information
                                    (required for --genomebam)
      -c, --chromosomes             Tab separated file with chromosome names and lengths
                                    (optional for --genomebam, but recommended)

Please note that the --min-fragment-length and --max-fragment-length options are crucial when identifying fragments that support back-splicing junctions. As a starting point, we suggest setting them to the average fragment length minus and plus three standard deviations, i.e. (fragment-length - 3 × sd) and (fragment-length + 3 × sd).
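The suggested starting bounds are simple arithmetic on the estimated mean and standard deviation of the fragment length. The sketch below computes them and assembles a `psirc-quant quant` argument list using only flags from the listing above. The helper names, the rounding to whole numbers, and the clamping of the lower bound to 1 are assumptions made for illustration; the source does not state whether -x and -X expect integers.

```python
from typing import List, Tuple


def fragment_length_bounds(mean: float, sd: float,
                           n_sd: float = 3.0) -> Tuple[int, int]:
    """Return (min, max) fragment length as mean -/+ n_sd * sd."""
    low = max(1, int(round(mean - n_sd * sd)))  # assumption: never below 1
    high = int(round(mean + n_sd * sd))
    return low, high


def quant_command(index: str, output_dir: str, r1: str, r2: str,
                  mean: float, sd: float, threads: int = 1) -> List[str]:
    """Build (but do not run) a psirc-quant quant command line."""
    low, high = fragment_length_bounds(mean, sd)
    return ["psirc-quant", "quant",
            "-i", index, "-o", output_dir,
            "-x", str(low), "-X", str(high),
            "-t", str(threads), r1, r2]
```

For example, a library with an estimated mean fragment length of 200 and a standard deviation of 30 gives bounds of 110 and 290.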

<a name="synop_outs"></a>Synopsis of outputs

BSJ detection

<ins>BSJ transcript list</ins> (candidate_circ_junctions.bed)
A list of all the detected BSJ loci and their supporting transcripts (including which exons are back-spliced) in BED format. The first three columns (chr, start, end) give the BSJ locus. The fourth column begins with the BSJ supporting read count, followed by a ":" separator and then the BSJ transcripts of the locus. The read count is the sum of the reads crossing the BSJ and the last-first exons reads of the BSJ transcript. The fifth column indicates how the reads split between these two categories: it is a value between 0 and 1, computed as crossing BSJ reads / (crossing BSJ reads + last-first exons reads), where 1 means all reads are crossing BSJ reads and 0 means all reads are last-first exons reads.
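The column layout described above can be read with a few lines of code. This is a minimal sketch, assuming tab-separated columns and an integer read count before the first ":". The layout of the transcript list after the ":" is not described here, so it is kept as an unparsed string. `BsjRecord` and `parse_bsj_line` are hypothetical names, and the example line in the usage note is made up.

```python
from dataclasses import dataclass


@dataclass
class BsjRecord:
    chrom: str
    start: int
    end: int
    read_count: int           # crossing BSJ reads + last-first exons reads
    transcripts: str          # text after the first ":" in column 4, unparsed
    crossing_fraction: float  # column 5, between 0 and 1

    @property
    def crossing_reads(self) -> float:
        """Reads crossing the BSJ, recovered from the count and the fraction."""
        return self.read_count * self.crossing_fraction


def parse_bsj_line(line: str) -> BsjRecord:
    """Parse one line of candidate_circ_junctions.bed (assumed tab-separated)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least 5 tab-separated columns")
    count, _, transcripts = fields[3].partition(":")
    return BsjRecord(fields[0], int(fields[1]), int(fields[2]),
                     int(count), transcripts, float(fields[4]))
```

With a made-up line such as `chr1<TAB>100<TAB>200<TAB>8:TX_A<TAB>0.75`, the record has a read count of 8, of which 8 × 0.75 = 6 are crossing BSJ reads and the remaining 2 are last-first exons reads.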

<ins>BSJ transcript sequences</ins> (candidate_circ_junctions.fa)
A FASTA file containing the sequences of each detected BSJ transcript.

<ins>BSJ transcript supporting reads SAM file</ins> (candidate_circ_supporting_reads.sam)
All detected BSJ supporting reads mapped to their BSJ transcripts in SAM format. These are pseudo-alignments output directly by kallisto. This information is useful for certain analyses, such as visualizing the supporting reads mapped to the BSJ transcript on a genome browser.
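Because the file is in SAM format, a quick per-transcript summary is possible with standard SAM column positions (reference name in column 3). The sketch below counts alignment records per reference name; it relies only on the general SAM layout and makes no assumption about psirc-specific tags. Note that it counts records rather than read pairs, so a pair with both mates aligned contributes two records. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable


def records_per_reference(sam_lines: Iterable[str]) -> Counter:
    """Count SAM alignment records per reference name (RNAME, column 3)."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        rname = fields[2]
        if rname != "*":  # "*" means no reference is assigned
            counts[rname] += 1
    return counts
```

For a file on disk, `records_per_reference(open("candidate_circ_supporting_reads.sam"))` returns a `Counter` whose `most_common()` method lists the BSJ transcripts with the most supporting records.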

FLI inference

<ins>FLI list</ins> (full_length_isoforms.tsv)
A list of all inferred full-length linear and circular isoforms. For circular isoforms, the information in th
