ConsensusCruncher

ConsensusCruncher is a tool that suppresses errors in next-generation sequencing data by using unique molecular identifiers (UMIs) to amalgamate reads derived from the same DNA template into a consensus sequence.

Generate Convert Improve

Install / Use

/learn @pughlab/ConsensusCruncher

About this skill

Quality Score

0/100

README

ConsensusCruncher

ConsensusCruncher is a tool that suppresses errors in next-generation sequencing data by using unique molecular identifers (UMIs) to amalgamate reads derived from the same DNA template into a consensus sequence.

To learn more about ConsensusCruncher and its applications: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkz474/5498633

For a full documentation of ConsensusCruncher, please see our Read the Docs

Quick start

Dependencies

This pipeline requires the following dependencies:

| Program | Version | Purpose | | ------- | ------- | ------------------------------------------ | | picrd| picard/1.9.1 or higher | Adding read group @RG in the alignment files |
| Python | 3.5.1 | Run ConsensusCruncher | | BWA | 0.7.15 | Align reads | | Samtools| 1.3.1 | Sorting and indexing bamfiles |

All required python libraries can be installed by running pip install -r requirements.txt

Configuration

Set up config.ini with the appropriate configurations for [fastq2bam] and [consensus] modes. Alternatively, you can provide command-line arguments.

ConsensusCruncher.py processes one sample (2 paired-end FASTQ files or 1 BAM file) at a time. A sample script to generate shell scripts for multiple samples is available here.

Running ConsensusCruncher

Individual files

Run ConsensusCruncher.py [-c CONFIG] fastq2bam with required input parameters:

  --fastq1 FASTQ1       FASTQ containing Read 1 of paired-end reads. [MANDATORY]
  --fastq2 FASTQ2       FASTQ containing Read 2 of paired-end reads. [MANDATORY]
  -o OUTPUT, --output OUTPUT
                        Output directory, where barcode extracted FASTQ and
                        BAM files will be placed in subdirectories 'fastq_tag'
                        and 'bamfiles' respectively (dir will be created if
                        they do not exist). [MANDATORY]
  -n FILENAME, --name FILENAME
                        Output filename. If none provided, default will
                        extract output name by taking everything left of '_R'.
  -b BWA, --bwa BWA     Path to executable bwa. [MANDATORY]
  -r REF, --ref REF     Reference (BWA index). [MANDATORY]
  -s SAMTOOLS, --samtools SAMTOOLS
                        Path to executable samtools [MANDATORY]
  -p PATTERN, --bpattern PATTERN
                        Barcode pattern (N = random barcode bases, A|C|G|T =
                        fixed spacer bases). [Pattern or list must be provided]
  -l LIST, --blist LIST
                        List of barcodes (Text file with unique barcodes on
                        each line). [Pattern or list must be provided]

BARCODE DESIGN: You can input either a barcode list or barcode pattern or both. If both are provided, barcodes will first be matched with the list and then the constant spacer bases will be removed before the barcode is added to the header. N = random / barcode bases A | C | G | T = constant spacer bases e.g. ATNNGT means barcode is flanked by two spacers matching 'AT' in front and 'GT' behind.

DESCRIPTION: This script extracts molecular barcode tags and removes spacers from unzipped FASTQ files found in the input directory (file names must contain "R1" or "R2"). Barcode extracted FASTQ files are written to the 'fastq_tag' directory and are subsequently aligned with BWA mem. Bamfiles are written to the 'bamfile" directory under the project folder.

Run ConsensusCruncher.py [-c CONFIG] consensus with the required input parameters:

  -h, --help            show this help message and exit
  -i BAM, --input BAM   Input BAM file with barcodes extracted into header. [mandatory]
  -o OUTPUT, --output OUTPUT
                        Output directory, where a folder will be created for
                        the BAM file and consensus sequences. [mandatory]
  -s SAMTOOLS, --samtools SAMTOOLS
                        Path to executable samtools. [mandatory]
  --scorrect {True,False}
                        Singleton correction, default: True.
  -b BEDFILE, --bedfile Separator file to split bamfile into chunks for processing.
                        Default: hg19 cytoband (You can find other cytobands for your 
                        genome of interest on UCSC
                        http://hgdownload.cse.ucsc.edu/downloads.html).
                        For small BAM files, you may choose to turn off data splitting 
                        with '-b False' and process everything all at once (Division of 
                        data is only required for large data sets to offload the
                        memory burden).
  --cutoff CUTOFF       Consensus cut-off, default: 0.7 (70% of reads must
                        have the same base to form a consensus).
  --cleanup {True,False}
                        Remove intermediate files.

This script amalgamates duplicate reads in bamfiles into single-strand consensus sequences (SSCS), which are subsequently combined into duplex consensus sequences (DCS). Singletons (reads lacking duplicate sequences) are corrected, combined with SSCS to form SSCS + SC, and further collapsed to form DCS + SC. Finally, files containing all unique molecules (a.k.a. no duplicates) are created for SSCS and DCS.

Multiple files

script generator will create sh scripts for each file in a fastq directory.

The following parameters need to be changed in the config file: name, bwa, ref, samtools, bpattern (alternatively if a barcode list is used instead, remove bpattern and add blist as parameter). Please note: fastq1, fastq2, output, bam, and c_output can be ignored as those will be updated using the generate_scripts.sh file.
Update generate_scripts.sh with input, output, and code_dir.
Run generate_scripts.sh to create sh files and then run those scripts.

Overview

Example

In order to create consensus sequences, we first need to process fastq files into bam files. Sample fastq files can be found under the test folder. Please note these fastqs are only for testing purposes. For the full fastqs used in our paper, please download the data from the NCBI Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra/) under access numbers SRP140497 and SRP141184.

Fastqs to Bams

Given fastq as input files, fastq2bam mode removes the spacer region and extracts the barcode tag from each sequencing read into the header with extract_barcode.py.

REPO="[insert path to ConsensusCruncher repo]"
BWAPATH="[insert path to BWA]"
BWAINDEX="[insert path to BWA INDEX]"
BWAPATH="[insert path to SAMTOOLS]"

python ConsensusCruncher.py fastq2bam --fastq1 $REPO/test/fastq/LargeMid_56_L005_R1.fastq --FASTQ2 $REPO/test/fastq/LargeMid_56_L005_R2.fastq -o $REPO/test -b $BWAPATH -r $BWAIndex -s $SAMTOOLS -bpattern NNT

In the sample dataset, we utilized 2-bp (NN) barcodes and 1-bp (T) spacers. While the barcodes for each read can be one of 16 possible combinations (4^2), the spacer is an invariant "T" base used to ligate barcodes onto each end of a DNA fragment. Thus, a spacer filter is imposed to remove faulty reads. Barcodes from read 1 and read 2 are extracted and combined together before being added to the header.

READ FROM SEQUENCER
Read1:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193 1:N:0:ACGTCACA   [<-- HEADER]
ATTAAGCCCCAGGCAGTTGCTAATGATGGGAGCTTAGTGCACAAGGGCTGGGCCTCCCTCTTGGAGCTGAACATTGTTTCTTGGGGACGGCTGTGCCCACCTCAGCGGGGAGGCAAGGATTAAATC  [<-- SEQUENCE]
+
BCCCCGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGEGGGGGBGGGGGGGGGGGGGGGGGGGGGGGEGG1:FGFGGGGGGGGG/CB>DG@GGGGGGG<DGGGGAAGGEGGB>DGGGEGGG/@G  [<-- QUALITY SCORE]

Read2:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193 2:N:0:ACGTCACA
GGTGGGCTCCAGCCCTGATTTCCTCCCCCAGCCCTGCAGGGCTCAGGTCCAGAGGACACAAGTTTAACTTGCGGGTGGTCACTTGCCTCGTGCGGTGACGCCATGGTGCCCTCTCTGTGCAGCGCA
+
BBBBCGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGB:FCGGGGGGGGGGEGGGGGGGG=FCGG:@GGGEGBGGGAGFGDE@FGGGGGFGFGEGDGGGFCGGDEBGGGGGGGEG=EGGGEEGGG#

------

AFTER BARCODE EXTRACTION AND SPACER ("T") REMOVAL
Read1:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193|ATGG/1
AAGCCCCAGGCAGTTGCTAATGATGGGAGCTTAGTGCACAAGGGCTGGGCCTCCCTCTTGGAGCTGAACATTGTTTCTTGGGGACGGCTGTGCCCACCTCAGCGGGGAGGCAAGGATTAAATC
+
CCGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGEGGGGGBGGGGGGGGGGGGGGGGGGGGGGGEGG1:FGFGGGGGGGGG/CB>DG@GGGGGGG<DGGGGAAGGEGGB>DGGGEGGG/@G

Read2:
@HWI-D00331:196:C900FANXX:5:1101:1332:2193|ATGG/2
GGGCTCCAGCCCTGATTTCCTCCCCCAGCCCTGCAGGGCTCAGGTCCAGAGGACACAAGTTTAACTTGCGGGTGGTCACTTGCCTCGTGCGGTGACGCCATGGTGCCCTCTCTGTGCAGCGCA
+
BCGGGGEGGGGFGGGGGGGGGGGGGGGGGGGGGGB:FCGGGGGGGGGGEGGGGGGGG=FCGG:@GGGEGBGGGAGFGDE@FGGGGGFGFGEGDGGGFCGGDEBGGGGGGGEG=EGGGEEGGG#

FASTQ files with extracted barcodes are placed in the fastq_tag directory and are subsequently aligned with BWA to generate BAMs in the bamfiles folder.

. 
├── bamfiles 
├── fastq
├── fastq_tag
└── qsub

ConsensusCruncher

consensus mode creates a consensus directory and folders for each bam file.

BAM files undergo consensus construction through the workflow illustrated above. Output BAMs are grouped according to type of error suppression (SSCS vs DCS) and whether Singleton Correction (SC) was implemented.

. 
├── bamfiles 
├── consensus 
│   ├── LargeMid_56_L005 
│   │   ├── dcs 
│   │   ├── dcs_SC 
│   │   ├── sscs 
│   │   └── sscs_SC 
... 
│   ├── LargeMid_62_L006

Related Skills

node-connect

340.5k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

84.2k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

340.5k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

84.2k

Commit, push, and open a PR