Sicelore

SiCeLoRe (Single Cell Long Read) is a suite of tools dedicated to cell barcode / UMI (unique molecular identifier) assignment and bioinformatics analysis of highly multiplexed single cell Nanopore or PacBio long read sequencing data.

Typically starting from a single cell short read BAM file and Nanopore or PacBio long reads, the workflow chains several steps: cell barcode and UMI assignment to long reads (guided by short read data), transcript isoform identification, generation of molecule consensus sequences (UMI-guided error correction) and production of [isoforms / junctions / SNPs x cells] count matrices, so that these new modalities can be integrated into standard single cell RNA-seq statistical analysis.

New release for short-read-free analysis, compatible with 10x Genomics Visium and single-cell 3' and 5' protocols: SiCeLoRe v2.1 (https://github.com/ucagenomix/sicelore-2.1)

Installation

Just copy the files.

Requires:

Java 8
Minimap2
poa (for Sicelore-1.0.jar)
spoa (for Sicelore-2.0.jar)
racon
fastp
samtools
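
Before running the pipeline it can help to confirm that the external tools listed above are reachable from your PATH. The following is a minimal sketch and is not part of SiCeLoRe: the function name `missing_tools` is hypothetical, and the executable names are assumed to match your installation (here the set for a Sicelore-2.0.jar run).

```shell
#!/bin/sh
# Hypothetical helper (not shipped with SiCeLoRe): print every tool
# from the argument list that cannot be found in PATH.
# Returns 0 when all tools are found, 1 otherwise.
missing_tools() {
  mt_status=0
  for mt_tool in "$@"; do
    if ! command -v "$mt_tool" >/dev/null 2>&1; then
      echo "$mt_tool"
      mt_status=1
    fi
  done
  return "$mt_status"
}

# Assumed executable names; adjust them to your installation.
if ! missing=$(missing_tools java minimap2 spoa racon fastp samtools); then
  echo "Missing from PATH: $(echo "$missing" | tr '\n' ' ')" >&2
fi
```

For a Sicelore-1.0.jar run, `spoa` would be swapped for `poa` in the argument list.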

Workflow

Features

1) Parsing of Illumina Data

Parses short read data and retrieves information on the cell barcodes and UMIs used.

2) Nanopore poly(A) scan - stranding of reads

Pre-scan of Nanopore reads for poly(A) tails -> stranded reads.

3) Mapping of Nanopore reads to the reference genome with minimap2

4) Tag Nanopore SAM records with gene names, read sequence and quality values

Adds gene names, read sequences and quality values. Required for barcode and UMI assignment.

5) Barcode and UMI assignment to Nanopore SAM records

6) Consensus sequence calculation for RNA molecules (UMIs)

Generates a consensus sequence for each molecule from the multiple reads that share its UMI.

7) Mapping of molecules consensus sequences to the reference genome with minimap2

Consensus sequences are mapped to the reference genome.

8) Tag molecule SAM records with gene names, cell barcodes and UMI sequence

Adds gene names, cell barcodes and UMI sequences. Required for generating the [cell x genes/isoforms/junctions] matrices.

9) Transcript isoform expression quantification

Identifies matching Gencode transcript isoforms and generates [cell x genes/isoforms/junctions] matrices.

10) SNP calling

Calls single nucleotide polymorphisms cell by cell.

11) Fusion gene calling

Detects fusion transcripts cell by cell.

12) Novel transcript isoform detection

Identifies novel transcript isoforms.

Authors

Kevin Lebrigand <lebrigand@ipmc.cnrs.fr>

Rainer Waldmann <waldmann@ipmc.cnrs.fr>

Quick run analysis

We provide test data: a subsample of reads from the Mus musculus Clta locus of the 190-cell dataset. The quick run requires Java 1.8 (JAVA_HOME) and minimap2 and samtools in your PATH, as well as racon and poa (with blosum80.mat in the same folder) for the consensus calling step. The test script should take under 5 minutes to run; output files are written to the ./output_dir directory.


git clone https://github.com/ucagenomix/sicelore.git
cd sicelore
chmod +x quickrun.sh
dos2unix quickrun.sh
export JAVA_HOME=<path to Java 1.8>
export PATH=$PATH:<minimap2path>:<samtoolspath>:<raconpath>:<poapath>
./quickrun.sh

SiCeLoRe v2 uses spoa and provides molecule consensus sequences in .fastq format.


git clone https://github.com/ucagenomix/sicelore.git
cd sicelore
chmod +x quickrun.v2.sh
dos2unix quickrun.v2.sh
export JAVA_HOME=<path to Java 1.8>
export PATH=$PATH:<minimap2path>:<samtoolspath>:<raconpath>:<spoapath>
./quickrun.v2.sh

1) Parsing of Illumina Data

Genome-mapped short read data generated by the 10x Genomics Cell Ranger software (typically the "possorted_genome_bam.bam" file) are parsed. Information on the cell barcodes and UMIs associated with each gene or genomic region is saved in a serialized nested Java Hashtable, which is required for barcode and UMI assignment to Nanopore reads.

Required files

IlluminaParser.jar
Libraries in the ./lib folder
Genome-mapped Illumina short read data with assigned barcodes and UMIs (e.g. the BAM file generated by 10x Genomics Cell Ranger).
The 10x Genomics Cell Ranger tsv file that contains the list of cell-associated barcodes.

Usage


java -Xmx15000m -jar IlluminaParser.jar --inFileIllumina possorted_genome_bam.bam \
--tsv barcodes.tsv --outFile parsedForNanopore_v0.2.obj --cellBCflag CB --umiFlag UB --geneFlag GN

Parameters

-b,--cellBCflag (required)

SAM tag for the cell barcode in the Illumina BAM file. This is the assigned (corrected) cell barcode, not the raw read sequence of the barcode; in Cell Ranger BAM files it is the CB tag. Cell barcodes in Cell Ranger BAM files end with "-1". The "-" and any following characters are ignored, so barcodes from other single cell sequencing systems do not need the "-1" suffix.

-u,--umiFlag (required)

SAM tag for the UMI in the Illumina BAM file. This is the assigned UMI, not the raw read sequence of the UMI; in Cell Ranger BAM files it is the UB tag.

-g,--geneFlag (required)

SAM tag for the gene name in the Illumina BAM file.

-i,--inFileIllumina (required)

Path of the BAM file with genome-mapped Illumina data generated by 10x Genomics Cell Ranger.

-t,--tsv (required)

The 10x Genomics tsv file listing the barcodes that are associated with a cell, one barcode per line. 10x Genomics barcode tsv files have "-1" appended to each barcode sequence; the "-1" is not required and can be omitted if a non-10x Genomics system was used.

-o,--outFile (required)

Full path of the output file where the Illumina barcode/UMI data are saved. This file is required for barcode and UMI assignment to Nanopore SAM records.
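
The suffix handling described for --cellBCflag and --tsv (the "-" and everything after it is ignored) can be pictured with a one-line sed filter. This is an illustration only: IlluminaParser handles the suffix internally, the function name `strip_barcode_suffix` is hypothetical, and the two barcodes are made-up examples.

```shell
#!/bin/sh
# Illustration only (not SiCeLoRe code): drop the first "-" and
# everything after it from each barcode read on standard input.
strip_barcode_suffix() {
  sed 's/-.*$//'
}

# Made-up barcodes: one with the 10x-style "-1" suffix, one without.
printf 'AAACCTGAGAAACCAT-1\nAAACCTGAGAAACCGC\n' | strip_barcode_suffix
```

Both input lines come out as bare barcode sequences, which is why a barcode list from a non-10x Genomics system can be supplied without the suffix.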

2) Scan for poly(A) and adapters in Nanopore reads

Scans the Nanopore fastq reads for poly(A/T) and adapter sequences and writes stranded (forward) reads for those in which both poly(A) and adapter were found.

By default, it searches within 100 nt of both ends of the read for a poly(A) (or poly(T)) stretch of >= 15 nt with >= 75% As. If poly(A) is found, it then searches for the 10x Genomics adapter sequence "CTTCCGATCT" downstream of the poly(A).

When poly(A) and adapter are found at one end, the read is written stranded (forward) into a "pass" folder.

Failed reads are written unstranded into a "failed" folder.

This is an optional step. Cell barcode and UMI assignment also works with non-stranded records.
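
The default poly(A) criteria described above can be sketched as a sliding-window check. This is a simplified illustration under stated assumptions, not the NanoporeReadScanner implementation: it reads ">= 15 nt with >= 75% As" as a fixed 15-nt window, it does not search for the adapter, and the function name `has_polya_or_t` is hypothetical.

```shell
#!/bin/sh
# Simplified sketch (not the NanoporeReadScanner code).
# $1 = read sequence; optional: $2 = min length (15),
# $3 = min fraction (0.75), $4 = window from each read end (100).
# Exit status 0 if an A-rich or T-rich window is found, 1 otherwise.
has_polya_or_t() {
  printf '%s\n' "$1" |
    awk -v len="${2:-15}" -v frac="${3:-0.75}" -v win="${4:-100}" '
      # Return 1 if any len-nt window of region holds >= frac*len copies of base.
      function rich(region, base,    i, w, n) {
        for (i = 1; i + len - 1 <= length(region); i++) {
          w = substr(region, i, len)
          n = gsub(base, base, w)   # number of occurrences of base in the window
          if (n >= frac * len) return 1
        }
        return 0
      }
      {
        head = substr($0, 1, win)             # first win nt of the read
        start = length($0) - win + 1
        if (start < 1) start = 1
        tail = substr($0, start)              # last win nt of the read
        found = rich(head, "A") || rich(head, "T") || rich(tail, "A") || rich(tail, "T")
        exit(found ? 0 : 1)
      }'
}

# Made-up read ending in 20 As.
if has_polya_or_t "ACGTACGTACGTACGTAAAAAAAAAAAAAAAAAAAA"; then
  echo "poly(A/T) found"
fi
```

The optional arguments mirror the --polyAlength, --fractionAT and --windowAT defaults, so the effect of loosening or tightening a threshold can be tried on a few reads before changing ReadScannerConfig.xml.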

Required files

NanoporeReadScanner.jar
Libraries in the ./lib folder
Config file: ReadScannerConfig.xml (Most default settings can be changed there).

If no ReadScannerConfig.xml file is found in the current path (working directory), the software takes the default config file from the directory where the applications (jars) are installed.
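
The lookup order for the config file can be pictured as follows. This is a hypothetical illustration, not SiCeLoRe code: the function name `resolve_scanner_config` and the install path `/opt/sicelore/Jar` are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical illustration of the lookup order described above.
# $1 = working directory, $2 = directory where the jars are installed.
resolve_scanner_config() {
  if [ -f "$1/ReadScannerConfig.xml" ]; then
    # A config file in the working directory takes precedence.
    echo "$1/ReadScannerConfig.xml"
  else
    # Otherwise fall back to the default next to the jars.
    echo "$2/ReadScannerConfig.xml"
  fi
}

# /opt/sicelore/Jar is an assumed install location.
resolve_scanner_config "$PWD" "/opt/sicelore/Jar"
```

In practice this means a copy of ReadScannerConfig.xml placed in the run directory is a way to keep per-run settings without editing the installed default.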

Usage


java -jar <path>/NanoporeReadScanner.jar -d <directory to start recursive search for fastq files>

Parameters

-d,--inDir (either this or --fastqFiles required)

Directory at which to start the file search. Starting from this directory, fastq files matching the RegEx pattern given in --pattern are collected recursively.

-i,--fastqFiles (either this or --inDir required)

Comma-separated list of fastq files.

-v,--pattern (optional)

fastq file name pattern to search for when parsing folders recursively. Defaults to ".{1,}.fastq".

-f,--fractionAT (optional)

Minimum AT fraction. Defaults to the value in ReadScannerConfig.xml: 0.75.

-p,--polyAlength (optional)

Minimum length of the poly(A). Defaults to the value in ReadScannerConfig.xml: 15.

-w,--windowAT (optional)

Window, measured from the extremities of the read, in which to search for AT. Defaults to the value in ReadScannerConfig.xml: 100 nt.

-o,--outDir

Output directory. "failed" and "passed" sub-folders are created there, holding the failed reads and the stranded passed reads (poly(A) and adapter found) respectively.

If "null" is given as the output directory, nothing is written and only some statistics are generated.

3) Mapping of Nanopore reads to the reference genome with minimap2

fastq splitting into chunks for parallelization

Prior to mapping, the fastq files are split into chunks with fastp. This step can be omitted for small runs (< 5 million reads).


fastp -i nanopore_reads.fastq -Q -A --thread 1 --split_prefix_digits=4 --out1=sub.fastq --split=8

parallel minimap2 mapping

Command shown for fastq batch "0001.sub.fastq":


minimap2 -ax splice -uf --MD --sam-hit-only -t 20 --junc-bed junctions.bed $BUILD.mmi 0001.sub.fastq > 0001.sub.sam
samtools view -Sb 0001.sub.sam -o 0001.sub.unsorted.bam
samtools sort 0001.sub.unsorted.bam -o 0001.sub.bam
samtools index 0001.sub.bam

--junc-bed (required)

BED file consisting of annotated introns and their strands. With this option, minimap2 prefers splicing in annotations.

It can be generated with `paftools.js gff2bed -j ann.gtf` (paftools.js is part of the minimap2 distribution).
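
To apply the mapping commands above to every chunk, one option is to generate them per chunk and review them before execution. The sketch below is a dry run that only prints commands; the function name `print_mapping_cmds` is hypothetical, the default `BUILD=genome` is a placeholder for your own index prefix, and the chunk names assume the fastp call shown earlier (--split=8 with a 4-digit prefix).

```shell
#!/bin/sh
# Dry run (hypothetical helper): print the mapping commands for every
# fastq chunk given as argument, without executing them.
# BUILD is assumed to hold the prefix of the minimap2 index ($BUILD.mmi).
print_mapping_cmds() {
  for pm_fq in "$@"; do
    pm_base=${pm_fq%.fastq}   # e.g. 0001.sub.fastq -> 0001.sub
    echo "minimap2 -ax splice -uf --MD --sam-hit-only -t 20 --junc-bed junctions.bed ${BUILD}.mmi ${pm_fq} > ${pm_base}.sam"
    echo "samtools view -Sb ${pm_base}.sam -o ${pm_base}.unsorted.bam"
    echo "samtools sort ${pm_base}.unsorted.bam -o ${pm_base}.bam"
    echo "samtools index ${pm_base}.bam"
  done
}

# Placeholder index prefix; set BUILD to your own reference build.
BUILD=${BUILD:-genome}

# Chunk names 0001.sub.fastq ... 0008.sub.fastq, as assumed above.
for i in 1 2 3 4 5 6 7 8; do
  print_mapping_cmds "$(printf '%04d.sub.fastq' "$i")"
done
```

The printed lines can be written to one script per chunk and submitted to whatever job scheduler is available, which keeps the four commands for a chunk in their required order.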

4) Tag Nanopore SAM records with gene names, read sequence and quality values

add gene names to Nanopore SAM records

uses AddGeneNameTag (Sicelore-2.0.jar)

Add gene names to Nanopore SAMrecords GE tag using **A
