Sicelore
SiCeLoRe (Single Cell Long Read) is a suite of tools dedicated to cell barcode / UMI (unique molecular identifier) assignment and bioinformatics analysis of highly multiplexed single cell Nanopore or PacBio long read sequencing data.
Typically starting with a single cell short read bam file and Nanopore or PacBio long reads, the workflow integrates several sequential steps: cell barcode and UMI assignment to long reads (guided by short read data), transcript isoform identification, generation of molecule consensus sequences (UMI-guided error correction) and production of [isoforms / junctions / SNPs x cells] count matrices, so that these new modalities can be integrated into standard single cell RNA-seq statistical analysis.
New release for short-read free analysis compatible with 10x Genomics Visium and single-cell 3' and 5' protocols: <a href="https://github.com/ucagenomix/sicelore-2.1">SiCeLoRe v2.1</a>
Installation
Just copy the files.
requires:
Workflow
<img src="flow.png">
Features
1) Parsing of Illumina Data
Parses short read data and retrieves information on the cell barcodes and UMIs used.
2) Nanopore poly(A) scan - stranding of reads
Pre-scan of Nanopore reads for poly(A) tails -> stranded reads.
3) Mapping of Nanopore reads to the reference genome with minimap2
4) Tag Nanopore SAM records with gene names, read sequence and quality values
Adds gene names, read sequence and QV values. Required for barcode and UMI assignment.
5) Barcode and UMI assignment to Nanopore SAM records
6) Consensus sequence calculation for RNA molecules (UMIs)
Generates a consensus sequence for each molecule from the multiple reads that share its UMI.
7) Mapping of molecules consensus sequences to the reference genome with minimap2
Consensus sequences are mapped to the reference genome.
8) Tag molecule SAM records with gene names, cell barcodes and UMI sequence
Adds gene names, cell barcode and UMI sequence. Required for generating the [cell x genes/isoforms/junctions] matrices.
9) Transcript isoform expression quantification
Identifies matching Gencode transcript isoforms and generates [cell x genes/isoforms/junctions] matrices.
Calling nucleotide polymorphisms cell by cell
Detecting fusion transcripts cell by cell
12) Novel transcript isoform detection
Identifying novel transcript isoforms
Authors
- Kevin Lebrigand <lebrigand@ipmc.cnrs.fr>
- Rainer Waldmann <waldmann@ipmc.cnrs.fr>
<a id="parsing-illumina-data"></a>
Quick run analysis
We provide test data as a subsampling of reads for the Mus musculus Clta locus from the 190 cells dataset. The quick run requires Java 1.8 (JAVA_HOME), minimap2 and samtools in your PATH, as well as racon and poa (with blosum80.mat in the same folder) for the consensus calling part. The test script should take under 5 minutes to run; output files are written to the ./output_dir directory.
git clone https://github.com/ucagenomix/sicelore.git
cd sicelore
chmod +x quickrun.sh
dos2unix quickrun.sh
export JAVA_HOME=<path to Java 1.8>
export PATH=$PATH:<minimap2path>:<samtoolspath>:<raconpath>:<poapath>
./quickrun.sh
Sicelore v2 uses spoa and provides the molecule consensus sequences as .fastq:
git clone https://github.com/ucagenomix/sicelore.git
cd sicelore
chmod +x quickrun.v2.sh
dos2unix quickrun.v2.sh
export JAVA_HOME=<path to Java 1.8>
export PATH=$PATH:<minimap2path>:<samtoolspath>:<raconpath>:<spoapath>
./quickrun.v2.sh
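Before launching either script, it can help to confirm that the external tools are actually reachable from PATH. The following is a minimal sketch and not part of sicelore: `missing_tools` is a hypothetical helper name, and the executable names you pass to it are assumed to match how the tools are installed on your system.

```sh
# Hypothetical helper (not shipped with sicelore): print every executable
# from the argument list that cannot be found in PATH, one per line.
# Returns 0 when all were found, 1 otherwise.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}
```

For instance, `missing_tools minimap2 samtools racon spoa` prints nothing and returns 0 when all four are found; any name it prints still needs to be added to PATH.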
1) Parsing of Illumina Data
Genome mapped short read data generated by the 10xGenomics CellRanger software (typically "possorted_genome_bam.bam" file) are parsed and info on cell barcodes and UMIs associated with each gene or genomic region are saved in a serialized nested Java Hashtable which is required for barcode and UMI assignment to Nanopore reads.
Required files
- IlluminaParser.jar
- Libraries in the ./lib folder
- Genome matched, barcode and UMI assigned Illumina short read data (e.g. bam file generated by 10x Genomics Cell Ranger).
- The 10x Genomics Cell Ranger tsv file that contains the list of cell associated barcodes.
Usage
java -Xmx15000m -jar IlluminaParser.jar --inFileIllumina possorted_genome_bam.bam \
--tsv barcodes.tsv --outFile parsedForNanopore_v0.2.obj --cellBCflag CB --umiFlag UB --geneFlag GN
Parameters
-b,--cellBCflag (required)
SAM tag for the cell barcode in the Illumina bam file. This is the assigned cell barcode and not the read sequence of the cell barcode; in Cell Ranger bam files it is the CB tag. Cell barcodes in Cell Ranger bam files have a "-1" at the end. If another single cell sequencing system was used, the "-1" at the end of the barcode is not required. The "-" and any following characters are ignored.
-u,--umiFlag (required)
SAM tag for the UMI in the Illumina bam file. This is the assigned UMI and not the read sequence of the UMI. In Cell Ranger bam files this is the sequence in the UB tag.
-g,--geneFlag (required)
SAM tag for Gene name in Illumina bam file.
-i,--inFileIllumina (required)
path of bam file with genome matched Illumina data generated by 10xGenomics CellRanger
-t,--tsv (required)
The 10x Genomics tsv file that lists the barcodes associated with a cell, one cell barcode per line. 10x Genomics barcode tsv files have a "-1" appended to the barcode sequence. The "-1" is not required and can be omitted if a non-10x Genomics system is used.
-o,--outFile (required)
Full path of the output file where the Illumina barcode/UMI data are saved. This file is required for barcode and UMI assignment to Nanopore SAM records.
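The suffix rule described for --cellBCflag (the "-" and everything after it is ignored) can be pictured with a small shell sketch. It only mimics the documented behaviour and is not the parser's own code; `normalize_barcode` and `normalize_barcode_file` are hypothetical names introduced for illustration.

```sh
# Illustration only: drop the first "-" and everything after it,
# so "AAACCTGAGAAACCAT-1" and "AAACCTGAGAAACCAT" compare as equal.
normalize_barcode() {
    printf '%s\n' "${1%%-*}"
}

# Apply the same rule to a barcode list read from stdin
# (one barcode per line, empty lines skipped).
normalize_barcode_file() {
    while IFS= read -r bc || [ -n "$bc" ]; do
        if [ -n "$bc" ]; then
            normalize_barcode "$bc"
        fi
    done
}
```

Running `normalize_barcode_file < barcodes.tsv` on a Cell Ranger barcode list would then print the bare barcode sequences.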
<a id="nanopore-scan"></a>
2) Scan for poly(A) and adapters in Nanopore reads.
Scans the Nanopore fastq reads for poly(A/T) and adapter sequences and generates stranded (forward) reads for the reads in which both poly(A) and adapter are found.
By default, scans for a poly(A) (or poly(T)) of >= 15 nt with >= 75% As (or Ts) within 100 nt of either end of the read. If a poly(A) is found, searches for the 10x Genomics adapter sequence "CTTCCGATCT" downstream of the poly(A).
When poly(A) and adapter are found at one end, the read is written stranded (forward) into a "pass" folder.
Failed reads are written unstranded into a "failed" folder.
This is an optional step. Cell barcode and UMI assignment also works with non-stranded records.
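The default poly(A) criterion above can be sketched as a sliding window test. This is an interpretation rather than the scanner's actual code: it assumes that "a poly(A) of >= 15 nt with >= 75% As" means any 15 nt window containing at least 75% A, looks for A-rich windows in the last 100 nt and T-rich windows in the first 100 nt, and leaves out the adapter search and the strand flipping entirely. `has_polya_signal` is a hypothetical name.

```sh
# has_polya_signal SEQ [MIN_LEN] [MIN_FRACTION] [WINDOW]
# Returns 0 if an A-rich window lies in the last WINDOW nt of SEQ or a
# T-rich window lies in the first WINDOW nt; returns 1 otherwise.
# Sketch under stated assumptions, not the NanoporeReadScanner algorithm.
has_polya_signal() {
    printf '%s\n' "$1" |
    awk -v n="${2:-15}" -v frac="${3:-0.75}" -v w="${4:-100}" '
    # 1 if some n-nt window of s holds at least frac*n copies of base.
    function rich(s, base,    i, win) {
        for (i = 1; i + n - 1 <= length(s); i++) {
            win = substr(s, i, n)
            # gsub returns the number of matches, i.e. the base count
            if (gsub(base, base, win) >= frac * n) return 1
        }
        return 0
    }
    {
        s = toupper($0)
        head = substr(s, 1, w)            # first w nt, poly(T) side
        start = length(s) - w + 1
        if (start < 1) start = 1
        tail = substr(s, start)           # last w nt, poly(A) side
        found = rich(tail, "A") || rich(head, "T")
    }
    END { exit(found ? 0 : 1) }'
}
```

A read passes this test only when the A- or T-rich stretch sits near an extremity; the same stretch in the middle of a long read is ignored, which is the point of the 100 nt window.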
Required files
- NanoporeReadScanner.jar
- Libraries in the ./lib folder
- Config file: ReadScannerConfig.xml (most default settings can be changed there).
If no ReadScannerConfig.xml file is found in the current path (working directory), the software takes the default config file from the directory where the applications (jars) are installed.
Usage
java -jar <path>/NanoporeReadScanner.jar -d <directory to start recursive search for fastq files>
Parameters
-d,--inDir (either this or --fastqFiles required)
Directory to start the file search. Starting at this directory, recursively takes the fastq files that match the RegEx pattern given in --pattern.
-i,--fastqFiles (either this or --inDir required)
","-separated list of fastq files.
-v,--pattern (optional)
fastq file name pattern to search for when parsing folders recursively; defaults to ".{1,}.fastq".
-f,--fractionAT (optional)
Minimum fraction of A/T; defaults to the value in ReadScannerConfig.xml: 0.75.
-p,--polyAlength (optional)
Minimum length of the poly(A); defaults to the value in ReadScannerConfig.xml: 15.
-w,--windowAT (optional)
Window in which to search for A/T, measured from the extremities of the read; defaults to the value in ReadScannerConfig.xml: 100 nt.
-o,--outDir
Output directory. Creates a "failed" and a "passed" sub-folder there, holding the failed reads and the stranded passed reads (poly(A) and adapter found) respectively.
If "null" is given as the output directory, nothing is written and only some stats are generated.
<a id="minimap2-mapping"></a>
3) Mapping of Nanopore reads to the reference genome with minimap2
fastq splitting into chunks for parallelization
Can be omitted for small runs (< 5 million reads).
Prior to mapping, the fastq files are split into chunks.
Uses fastp
fastp -i nanopore_reads.fastq -Q -A --thread 1 --split_prefix_digits=4 --out1=sub.fastq --split=8
parallel minimap2 mapping
Command shown for fastq batch "0001.sub.fastq"
minimap2 -ax splice -uf --MD --sam-hit-only -t 20 --junc-bed junctions.bed $BUILD.mmi 0001.sub.fastq > 0001.sub.sam
samtools view -Sb 0001.sub.sam -o 0001.sub.unsorted.bam
samtools sort 0001.sub.unsorted.bam -o 0001.sub.bam
samtools index 0001.sub.bam
--junc-bed (required)
BED file consisting of annotated introns and their strands. With this option, minimap2 prefers splicing in annotations.
Can be generated with `paftools.js gff2bed -j ann.gtf` (paftools is part of the minimap2 distribution).
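Since the commands above are shown for a single batch, one way to cover every chunk is to generate the same command line per file and hand the lines to whatever runs jobs in parallel on your system. The sketch below only prints commands and runs nothing; `print_mapping_commands` is a hypothetical helper, the chunk directory layout is an assumption, and paths containing spaces are not handled.

```sh
# print_mapping_commands DIR INDEX JUNCTIONS
# Prints one command line per DIR/*.sub.fastq chunk, chaining the
# minimap2 and samtools steps shown above. Nothing is executed here.
print_mapping_commands() {
    dir=$1
    index=$2
    junctions=$3
    for fq in "$dir"/*.sub.fastq; do
        [ -e "$fq" ] || continue      # glob matched nothing
        p=${fq%.fastq}                # e.g. chunks/0001.sub
        echo "minimap2 -ax splice -uf --MD --sam-hit-only -t 20 --junc-bed $junctions $index $fq > $p.sam && samtools view -Sb $p.sam -o $p.unsorted.bam && samtools sort $p.unsorted.bam -o $p.bam && samtools index $p.bam"
    done
}
```

For example, `print_mapping_commands chunks "$BUILD.mmi" junctions.bed > jobs.txt` writes one job per chunk; piping the output to `sh` instead would run the chunks one after another.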
<a id="nanopore-tagging"></a>
4) Tag Nanopore SAM records with gene names, read sequence and quality values
add gene names to Nanopore SAM records
uses AddGeneNameTag (Sicelore-2.0.jar)
Add gene names to Nanopore SAMrecords GE tag using **A
