CRSSANT: Cross-linked RNA Secondary Structure Analysis using Network Techniques

RNA crosslinking, proximity ligation and high-throughput sequencing produce non-continuous reads that indicate base pairing and higher-order interactions, either in RNA secondary structures or in intermolecular complexes. CRSSANT (pronounced 'croissant') is a computational pipeline for analyzing non-continuous/gapped reads from a variety of methods that employ the crosslink-ligation principle, including PARIS, LIGR, SPLASH, COMRADES, hiCLIP, etc. CRSSANT optimizes short-read mapping, automates alignment processing, and clusters gap1 and trans alignments into duplex groups (DGs) and non-overlapping groups (NGs). More complex arrangements are assembled into higher-level structures. In particular, gapm alignments with 2 gaps or 3 segments are assembled into tri-segment groups (TGs). Overlapping alignments are used to discover homotypic interactions (RNA homodimers).

Briefly, the CRSSANT pipeline operates as follows. First, sequencing reads that have been processed to remove adapters are mapped to references with STAR using a new set of optimized options. Second, alignments are filtered, rearranged and classified into different types (gaptypes.py and gapfilter.py). Third, we use network analysis methods to cluster non-continuous alignments into DGs and calculate a confidence for each DG. The DGs serve as the foundation for the assembly of TGs.
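As a rough illustration of the network-clustering idea in the third step (a simplified sketch, not CRSSANT's actual implementation: the alignment tuples, overlap test, and grouping rule below are assumptions made for demonstration), gapped alignments whose two arms both overlap can be linked in a graph, and connected components then play the role of duplex groups:

```python
# Illustrative sketch only; CRSSANT's real DG assembly is more sophisticated.
import networkx as nx

# Hypothetical gapped alignments: (left_start, left_end, right_start, right_end).
alignments = {
    "read1": (100, 120, 300, 320),
    "read2": (105, 125, 305, 325),
    "read3": (500, 520, 700, 720),
}

def arms_overlap(a, b):
    """True if both the left and right arms of two alignments overlap."""
    return (a[0] < b[1] and b[0] < a[1]) and (a[2] < b[3] and b[2] < a[3])

# One node per alignment; an edge whenever both arms overlap.
g = nx.Graph()
g.add_nodes_from(alignments)
names = list(alignments)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        if arms_overlap(alignments[x], alignments[y]):
            g.add_edge(x, y)

# Connected components act as duplex groups: read1/read2 cluster together,
# read3 stands alone.
duplex_groups = [sorted(c) for c in nx.connected_components(g)]
print(duplex_groups)
```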

CRSSANT is written in Python and available as source code that you can download and run directly on your own machine (no compilation needed). An earlier version of the DG assembly method is available at https://github.com/ihwang/CRSSANT. For more about the CRSSANT pipeline, please refer to Zhang et al. 2022, Genome Research.

Step 0: Download and prepare environment

Download the scripts and save them to a known path/location. No special installation is needed, but the Python package dependencies must be properly resolved before use. You will need Python version 3.6+ and the following Python packages. We recommend installing the latest versions of these packages using the Anaconda/Bioconda package manager. Note that the required NetworkX version currently works only with Python 3.6, not higher versions.
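For example, the environment can be created with conda; the environment name is arbitrary, the Python version is pinned per the NetworkX constraint above, and any package beyond networkx listed here is an assumption — take the definitive list from the repository:

```shell
# Isolated environment pinned to Python 3.6 (required by the NetworkX
# version used here); package names other than networkx are assumptions.
conda create -n crssant python=3.6 networkx numpy
conda activate crssant
```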

Additional tools are used for mapping and general processing of high-throughput sequencing data, including STAR, samtools and bedtools. STAR should be run with the optimized parameters shown below.

For visualization of the results, we recommend IGV, which has features for grouping alignments based on tags, such as the DG and NG tags that we implemented here. IGV can also directly visualize DG summary information and RNA secondary structures; see Step 4: Cluster alignments to groups for details. VARNA is recommended for visualizing RNA secondary structures in a variety of formats.

System requirements and tests

The programs generally run on x86-64 compatible processors under 64-bit Linux or Mac OS X, given sufficient memory. Read mapping against mammalian genomes using STAR requires at least 30 GB of memory, and alignment classification typically requires 100 GB. As a result, these two steps should be run on a cluster with large memory.
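On a SLURM cluster, for instance, the memory-hungry steps can be submitted with an explicit memory request; the job name, resource numbers, and script path below are placeholders, not part of CRSSANT:

```shell
#!/bin/bash
#SBATCH --job-name=crssant_classify
#SBATCH --mem=120G            # alignment classification can need ~100 GB
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00

# Placeholder path; substitute your own mapping/classification commands.
bash /path/to/run_pipeline_step.sh
```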

Test datasets and example output files are provided for all steps except STAR mapping, which is a well-maintained package. Test files are located in the tests folder. Furthermore, we provide source data for all figures in this paper to help readers reproduce the figures and troubleshoot potential problems (sourcedata). The analysis pipeline is preferably run as separate steps to allow maximal control and quality assurance. In addition, shell scripts for a typical pipeline are provided as examples.

Step 1: Preprocessing fastq input files

Sequencing data can be processed using various published tools to demultiplex samples and remove barcodes. Since each library preparation uses a different approach, we cannot recommend a single method for all of them. Here is an example of preprocessing PARIS data based on our recently published library preparation protocol.
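As one possibility, adapter trimming for single-end reads can be done with a general-purpose trimmer such as cutadapt; the adapter sequence and file names below are placeholders, and the actual PARIS preprocessing protocol may use different tools and additional steps:

```shell
# Trim a 3' adapter (-a) and discard reads shorter than 18 nt after
# trimming (-m). Adapter sequence and file names are placeholders.
cutadapt -a AGATCGGAAGAGC -m 18 -o trimmed.fastq input.fastq
```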

Step 2: Map reads to the genome

It is assumed that the reads have been demultiplexed and adapters removed. Before mapping the reads, genome indices should be generated with the same STAR version. Reads in fastq format are mapped to the genome using STAR with a set of optimized parameters as follows. runThreadN and genomeLoad should be adjusted based on available resources and the running environment.

```shell
STAR --runMode alignReads --genomeDir /path/to/index \
     --readFilesIn /path/to/reads/files \
     --outFileNamePrefix /path/to/output/prefix \
     --runThreadN 1 --genomeLoad NoSharedMemory \
     --outReadsUnmapped Fastx --outFilterMultimapNmax 10 \
     --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 \
     --outSAMattributes All --outSAMtype BAM Unsorted SortedByCoordinate \
     --alignIntronMin 1 --scoreGap 0 --scoreGapNoncan 0 --scoreGapGCAG 0 \
     --scoreGapATAC 0 --scoreGenomicLengthLog2scale -1 \
     --chimFilter None --chimOutType WithinBAM HardClip \
     --chimSegmentMin 5 --chimJunctionOverhangMin 5 \
     --chimScoreJunctionNonGTAG 0 --chimScoreDropMax 80 \
     --chimNonchimScoreDropMin 20
```

Successful STAR mapping generates the following 7 files: Aligned.out.bam, Aligned.sortedByCoord.out.bam, Log.final.out, Log.out, Log.progress.out, SJ.out.tab, and Unmapped.out.mate1. The BAM file is converted back to SAM for the next processing step, keeping the header lines (samtools view -h). Sorting is not necessary for the next alignment classification step.
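The BAM-to-SAM conversion mentioned above can be done with samtools; the output file name is arbitrary:

```shell
# Convert BAM back to SAM, keeping the header lines (-h),
# for the alignment classification step.
samtools view -h -o Aligned.out.sam Aligned.out.bam
```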

Optimized STAR parameters

Here is a brief explanation of the optimized parameters for non-continuous alignments. See Zhang et al. 2022 for a more detailed discussion of the optimization. The output contains very short segments, down to 7 nt, and the unreliable ones are removed later using an alignment-span-based penalty and segment-connection-dependent filtering (gaptypes.py).

  • --outFilterScoreMinOverLread 0 and --outFilterMatchNminOverLread 0 allow mapping of short segments
  • --outSAMattributes All includes the chimeric tags needed for alignment processing
  • --outSAMtype BAM Unsorted SortedByCoordinate simplifies subsequent SAM processing
  • --alignIntronMin 1 shifts deletions (D) to gaps (N) to equalize penalties
  • --scoreGap* 0 removes all gap-open penalties
  • --scoreGenomicLengthLog2scale -1 increases the alignment-span-based penalty
  • --chimFilter None enables detection of chimeric alignments (primarily homotypic) near the 5' and 3' ends of references
  • --chimOutType WithinBAM HardClip outputs alignments in one file and removes hard clips
  • --chimSegmentMin 5 and --chimJunctionOverhangMin 5 map chimeras more permissively
  • --chimScoreJunctionNonGTAG 0 removes the penalty for splice junctions in chimeras
  • --chimScoreDropMax 80 (a higher value) and --chimNonchimScoreDropMin 20 ensure that chimeras are not produced when normal gapped alignments are possible
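To see why --scoreGenomicLengthLog2scale -1 matters: the STAR manual describes this option as a score contribution scaled logarithmically with genomic length. The sketch below assumes the form scale × log2(genomicLength), capped at zero, and is only an approximation of STAR's internal scoring, not its exact formula:

```python
from math import log2

def span_penalty(genomic_length, scale=-1.0):
    """Approximate span-based score contribution
    (scoreGenomicLengthLog2scale * log2(genomicLength), capped at 0)."""
    return min(0.0, scale * log2(genomic_length))

# A gap spanning 1,024 nt costs 10 points at scale -1, versus 2.5 at the
# STAR default of -0.25, so spurious long-span gapped alignments are
# penalized much harder under the optimized settings.
print(span_penalty(1024, -1.0))   # -10.0
print(span_penalty(1024, -0.25))  # -2.5
```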

Other relevant parameters

In addition to the parameters listed above, the following ones need to be adjusted according to the data. Larger datasets may produce more gapped alignments (similar to splice junctions) and therefore require higher limits. Even higher numbers can be tried if the following recommended numbers are still insufficient.

  • --limitOutSJcollapsed: recommended 10,000,000 (default 1 million)
  • --limitIObufferSize: recommended 1,500,000,000 (default 150 million)

Step 3: Rearrange softclipped alignments and remap

The STAR mapper, even with the optimized parameters, leaves some read segments softclipped rather than mapped; these softclipped alignments are rearranged and remapped in this step.
