CRSSANT
Package for analyzing RNA crosslinking and proximity ligation data (PARIS and beyond)
CRSSANT: Cross-linked RNA Secondary Structure Analysis using Network Techniques
RNA crosslinking, proximity ligation and high throughput sequencing produce non-continuous reads that indicate base pairing and higher order interactions, either in RNA secondary structures or intermolecular complexes. CRSSANT (pronounced 'croissant') is a computational pipeline for analyzing non-continuous/gapped reads from a variety of methods that employ the crosslink-ligation principle, including PARIS, LIGR, SPLASH, COMRADES and hiCLIP. CRSSANT optimizes short-read mapping, automates alignment processing, and clusters gap1 and trans alignments into duplex groups (DGs) and non-overlapping groups (NGs). More complex arrangements are assembled into higher level structures. In particular, gapm alignments with 2 gaps, i.e. 3 segments, are assembled into tri-segment groups (TGs). Overlapping alignments are used to discover homotypic interactions (RNA homodimers).
Briefly, the CRSSANT pipeline operates as follows. First, sequencing reads that have been processed to remove adapters are mapped to references with STAR using a new set of optimized options. Second, alignments are filtered, rearranged and classified into different types (gaptypes.py and gapfilter.py). Third, we use network analysis methods to cluster non-continuous alignments into DGs and calculate the confidence for each DG. The DGs are used as the foundation for the assembly of TGs.
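As a rough illustration of what "gap1" and "gapm" mean at the alignment level, the sketch below counts N (skipped region) operations in a SAM CIGAR string. It is a simplification and not the logic of gaptypes.py: the function names and the "continuous" label are hypothetical, and treating every alignment with two or more gaps as gapm is an assumption.

```python
import re

# CIGAR operations as defined in the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_gaps(cigar):
    """Return the number of N (skipped region) operations in a CIGAR string."""
    return sum(1 for _, op in CIGAR_OP.findall(cigar) if op == "N")


def count_segments(cigar):
    """Segments are the aligned stretches separated by N gaps."""
    return count_gaps(cigar) + 1


def rough_gap_label(cigar):
    """Hypothetical label based only on the number of gaps (assumption)."""
    gaps = count_gaps(cigar)
    if gaps == 0:
        return "continuous"
    return "gap1" if gaps == 1 else "gapm"


print(rough_gap_label("20M150N25M"))        # gap1
print(rough_gap_label("15M80N12M300N18M"))  # gapm
```

The real classification also considers trans (inter-molecular) and overlapping alignments, which a single CIGAR string cannot capture.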
CRSSANT is written in Python and available as source code that you can download and run directly on your own machine (no compiling needed). An earlier version of the DG assembly method is available here: https://github.com/ihwang/CRSSANT. For more about the CRSSANT pipeline, please refer to Zhang et al. 2022, Genome Research.
Table of contents
- Step 0: Download and prepare environment
- Step 1: Preprocessing fastq input files
- Step 2: Map reads to the genome
- Step 3: Rearrange softclipped alignments and remap
- Step 4: Classify alignments
- Step 5: Segment and gap statistics
- Step 6: Filter spliced and short gaps
- Step 7: Cluster gap1 and trans alignments to DGs
- Step 8: Cluster gapm alignments to TGs
- Step 9: Analysis of RNA homodimers
- Running CRSSANT as a pipeline
Step 0: Download and prepare environment
Download the scripts and save them to a known path/location. No special installation is needed, but the Python package dependencies need to be properly resolved before use. You will need Python version 3.6+ and the following Python packages. We recommend downloading the latest versions of these packages using the Anaconda/Bioconda package manager. Currently, the NetworkX version only works with Python 3.6, but not higher versions.
- NetworkX v2.1+ (Anaconda link)
- NumPy (Anaconda link)
- SciPy (Anaconda link)
- scikit-learn (Anaconda link)
Additional tools are used for mapping and general processing of high throughput sequencing data, including STAR, samtools and bedtools. STAR should be used with the optimized parameters shown below.
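Before running the pipeline, it can help to confirm that the dependencies listed above are visible from the current environment. The following is a minimal sketch, not part of CRSSANT. The helper names are hypothetical, and the executable names (STAR, samtools, bedtools) are assumed to match how the tools are installed on your system.

```python
import importlib.util
import shutil
import sys

# Import names of the required Python packages (scikit-learn imports as "sklearn").
PYTHON_PACKAGES = ["networkx", "numpy", "scipy", "sklearn"]
# Executable names assumed for the external tools; adjust if yours differ.
EXTERNAL_TOOLS = ["STAR", "samtools", "bedtools"]


def python_ok(min_version=(3, 6)):
    """True if the running interpreter is at least min_version."""
    return sys.version_info >= min_version


def missing_packages(names=PYTHON_PACKAGES):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_tools(names=EXTERNAL_TOOLS):
    """Return the executables that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


print("Python version OK:", python_ok())
print("Missing Python packages:", missing_packages())
print("Missing external tools:", missing_tools())
```

An empty list for both checks means the packages can be imported and the tools are on PATH. It does not verify versions.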
For visualization of the results, we recommend IGV, which has features for grouping alignments based on tags, such as the DG and NG tags that we implemented here. IGV can also directly visualize DG summary information and RNA secondary structures; see Step 7: Cluster gap1 and trans alignments to DGs for details. VARNA is recommended for visualizing RNA secondary structures in a variety of formats.
System requirements and tests
The programs generally run on x86-64 compatible processors under 64-bit Linux or Mac OS X, provided there is enough memory. Read mapping against mammalian genomes using STAR requires at least 30 GB of memory. Alignment classification typically requires 100 GB of memory. As a result, these two steps should be run on a cluster with large memory.
Test datasets and example output files are provided for all steps except STAR mapping, which is a well maintained package. Test files are located in the tests folder. Furthermore, we provided source data for all figures in this paper to help readers reproduce the figures and troubleshoot potential problems (sourcedata). The analysis pipeline is preferably run as separate steps to allow maximal control and quality assurance. In addition, shell scripts for a typical pipeline are also provided as examples.
Step 1: Preprocessing fastq input files
Sequencing data can be processed using various published tools to demultiplex samples, and remove barcodes. Since each library preparation uses a different approach, we cannot recommend the same method for all of them. Here is an example of preprocessing PARIS data based on our recently published library preparation protocol.
Step 2: Map reads to the genome
It is assumed that the reads have been demultiplexed and adapters removed. Before mapping the reads, genome indices should be generated with the same STAR version. Reads in the fastq format are mapped to the genome using STAR and a set of optimized parameters as follows. runThreadN and genomeLoad should be adjusted based on available resources and running environment.
STAR --runMode alignReads --genomeDir /path/to/index --readFilesIn /path/to/reads/files --outFileNamePrefix /path/to/output/prefix --runThreadN 1 --genomeLoad NoSharedMemory --outReadsUnmapped Fastx --outFilterMultimapNmax 10 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outSAMattributes All --outSAMtype BAM Unsorted SortedByCoordinate --alignIntronMin 1 --scoreGap 0 --scoreGapNoncan 0 --scoreGapGCAG 0 --scoreGapATAC 0 --scoreGenomicLengthLog2scale -1 --chimFilter None --chimOutType WithinBAM HardClip --chimSegmentMin 5 --chimJunctionOverhangMin 5 --chimScoreJunctionNonGTAG 0 --chimScoreDropMax 80 --chimNonchimScoreDropMin 20
Successful STAR mapping generates the following 7 files: Aligned.out.bam, Aligned.sortedByCoord.out.bam, Log.final.out, Log.out, Log.progress.out, SJ.out.tab, and Unmapped.out.mate1. The bam file is converted back to sam for the next step of processing, keeping the header lines (samtools view -h). Sorting is not necessary for the next alignment classification step.
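The output check and the BAM-to-SAM conversion described above can be sketched as follows. This is an illustrative helper, not part of CRSSANT. The function names and the `sample_` prefix are hypothetical, and the choice of which BAM file to convert is an assumption left to the user.

```python
import os

# File names that STAR appends to --outFileNamePrefix, as listed above.
EXPECTED_STAR_OUTPUTS = [
    "Aligned.out.bam",
    "Aligned.sortedByCoord.out.bam",
    "Log.final.out",
    "Log.out",
    "Log.progress.out",
    "SJ.out.tab",
    "Unmapped.out.mate1",
]


def missing_star_outputs(prefix):
    """Return the expected output files that do not exist for this prefix."""
    return [name for name in EXPECTED_STAR_OUTPUTS
            if not os.path.exists(prefix + name)]


def bam_to_sam_command(bam_path, sam_path):
    """Build a samtools command that converts BAM to SAM, keeping header lines (-h)."""
    return ["samtools", "view", "-h", "-o", sam_path, bam_path]


prefix = "sample_"  # hypothetical value passed to --outFileNamePrefix
print("Missing STAR outputs:", missing_star_outputs(prefix))
# Assumption: the coordinate-sorted BAM is converted; either BAM could be used.
print(" ".join(bam_to_sam_command(prefix + "Aligned.sortedByCoord.out.bam",
                                  prefix + "Aligned.sam")))
```

The command list can be passed to `subprocess.run(..., check=True)` once samtools is on PATH.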
--Optimized STAR parameters
Here is a brief explanation of the optimized parameters for non-continuous alignments. See the bioRxiv preprint referenced at the top of this README for more detailed discussion of the optimization. The output contains very short segments, down to 7nt, and the unreliable ones are removed later using alignment-span-based penalty and segment connection dependent filtering (gaptypes.py).
- --outFilterScoreMinOverLread 0 and --outFilterMatchNminOverLread 0 allow mapping of short segments.
- --outSAMattributes All includes chimeric tags needed for alignment processing.
- --outSAMtype BAM Unsorted SortedByCoordinate simplifies subsequent SAM processing.
- --alignIntronMin 1 shifts deletions (D) to gaps (N) to equalize the penalty.
- --scoreGap* 0 removes all gap open penalties.
- --scoreGenomicLengthLog2scale -1 increases the alignment span-based penalty.
- --chimFilter None enables detection of chimeric alignments (primarily homotypic) near the 5' and 3' ends of references.
- --chimOutType WithinBAM HardClip outputs alignments in one file and removes hardclips.
- --chimSegmentMin 5 and --chimJunctionOverhangMin 5 map chimeras more permissively.
- --chimScoreJunctionNonGTAG 0 removes the penalty for splicing junctions in chimeras.
- --chimScoreDropMax 80, a higher value, and --chimNonchimScoreDropMin 20 ensure that chimeras are not produced when normal gapped alignments are possible.
--Other relevant parameters
In addition to the parameters listed above, the following ones need to be adjusted according to the data. Larger datasets may produce more gapped alignments (similar to splice junctions) and therefore requires higher limits. Even higher numbers can be tried if the following recommended numbers are still insufficient.
- --limitOutSJcollapsed: recommended 10,000,000 (default 1 million)
- --limitIObufferSize: recommended 1,500,000,000 (default 150 million)
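Because the STAR command is long and a few values change per run, one option is to assemble it programmatically. The sketch below collects the optimized options from this README into a Python list and appends the two limit parameters only when requested. The function and argument names are hypothetical, and the paths in the example are placeholders.

```python
import shlex

# Optimized options copied from the STAR command in this README.
OPTIMIZED_STAR_OPTIONS = [
    "--outReadsUnmapped", "Fastx",
    "--outFilterMultimapNmax", "10",
    "--outFilterScoreMinOverLread", "0",
    "--outFilterMatchNminOverLread", "0",
    "--outSAMattributes", "All",
    "--outSAMtype", "BAM", "Unsorted", "SortedByCoordinate",
    "--alignIntronMin", "1",
    "--scoreGap", "0",
    "--scoreGapNoncan", "0",
    "--scoreGapGCAG", "0",
    "--scoreGapATAC", "0",
    "--scoreGenomicLengthLog2scale", "-1",
    "--chimFilter", "None",
    "--chimOutType", "WithinBAM", "HardClip",
    "--chimSegmentMin", "5",
    "--chimJunctionOverhangMin", "5",
    "--chimScoreJunctionNonGTAG", "0",
    "--chimScoreDropMax", "80",
    "--chimNonchimScoreDropMin", "20",
]


def build_star_command(index_dir, reads, out_prefix, threads=1,
                       genome_load="NoSharedMemory",
                       limit_out_sj_collapsed=None,
                       limit_io_buffer_size=None):
    """Return the STAR mapping command as an argument list (hypothetical helper)."""
    cmd = ["STAR", "--runMode", "alignReads",
           "--genomeDir", index_dir,
           "--readFilesIn", reads,
           "--outFileNamePrefix", out_prefix,
           "--runThreadN", str(threads),
           "--genomeLoad", genome_load]
    cmd += OPTIMIZED_STAR_OPTIONS
    # Raise the limits only when asked, e.g. for larger datasets.
    if limit_out_sj_collapsed is not None:
        cmd += ["--limitOutSJcollapsed", str(limit_out_sj_collapsed)]
    if limit_io_buffer_size is not None:
        cmd += ["--limitIObufferSize", str(limit_io_buffer_size)]
    return cmd


cmd = build_star_command("/path/to/index", "/path/to/reads/files",
                         "/path/to/output/prefix", threads=8,
                         limit_out_sj_collapsed=10000000,
                         limit_io_buffer_size=1500000000)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list avoids shell quoting problems when it is passed to `subprocess.run`, and the printed form can be pasted into a job script.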
Step 3: Rearrange softclipped alignments and remap
The STAR mapper, even with the optimized parameters discrima