Umccrise

:snake: DRAGEN Tumor/Normal workflow post-processing


UMCCR WGS tumor/normal reporting

umccrise is a Snakemake workflow that post-processes results from the Illumina DRAGEN WGS tumor/normal pipeline and generates HTML reports helpful for researchers and curators at UMCCR.

Summary
Detailed Workflow
History
Example Reports
Usage
Installation
Reference data
- Versioning
- Syncing with AWS S3
Testing
AWS
Advanced usage
Updating
Development
Docker
Building reference data
- PURPLE
- GNOMAD
- PCGR
- Problem regions
- Coding regions (SAGE)
- Ensembl annotation
- Hotspots
- Other HMF files
- Fusions
- SnpEff
- DVC

Summary

In summary, umccrise can:

Filter artefacts and germline leakage from somatic variant calls
Run PCGR to annotate, prioritize and report somatic variants
Run CPSR to annotate, prioritize and report germline variants
Filter, annotate, prioritize and report structural variants (SVs) from Manta
Run PURPLE to call copy number variants (CNVs), recover SVs, and infer tumor purity & ploidy
Generate a MultiQC report that summarizes quality control statistics in the context of background "gold standard" samples
Generate a cancer report with mutational signatures, inferred HRD status, circos plots, prioritized copy number and structural variant calls
Run CACAO to calculate coverage in common hotspots, as well as goleft indexcov to estimate coverage problems
Run Conpair to estimate tumor/normal concordance and sample contamination
Run oviraptor to detect viral integration sites and affected genes

Detailed Workflow

See workflow.md for a detailed description of the workflow.

History

See HISTORY.md for the version history.

Example Reports

Below are example reports for a HCC1395/HCC1395BL cell line tumor/normal pair sequenced and validated by the SEQC-II consortium.

1. MultiQC (quality control metrics and plots)

2. Cancer report (mutational signatures, circos plots, CNV, SV, oncoviruses)

3. CPSR (germline variants)

4. PCGR (somatic variants)

5. CACAO (coverage reports)

Usage

Given input data from DRAGEN somatic and germline output folders, or a custom set of BAM or VCF files, umccrise can be run with:

umccrise <input-data ...> -o umccrised

For more options, see Advanced usage.

Installation

Create a umccrise directory, clone the umccrise GitHub repo into it, and install the required conda environments:

mkdir umccrise
cd umccrise
git clone https://github.com/umccr/umccrise umccrise.git
bash umccrise.git/install.sh

The above will generate a load_umccrise.sh script that can be sourced to load the umccrise conda environment on demand:

source load_umccrise.sh

Reference data

umccrise needs a 64G bundle of reference data to run. From within the UMCCR AWS setup, sign in to AWS and run umccrise_refdata_pull:

aws sso login --profile sso-dev-admin
umccrise_refdata_pull
export UMCCRISE_GENOMES=${PWD}/refdata/genomes

Alternatively, you can specify a custom path with --genomes <path>. The path can be a tarball and will be automatically extracted.

The path can also be a location on S3 or GDS, prefixed with s3:// or gds://. E.g.:

umccrise /input --genomes s3://umccr-refdata-dev/genomes

Versioned locations are also checked. For the case above, assuming that the reference_data package version is 1.0.2, umccrise checks the following locations in the order specified:

s3://umccr-refdata-dev/genomes_102
s3://umccr-refdata-dev/genomes_10
s3://umccr-refdata-dev/genomes
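
The lookup order above can be sketched in a few lines of Python. This is an illustration only, not umccrise's own code: the function name candidate_locations is hypothetical, and the suffix rule (all version digits, then major and minor digits, then the unversioned path) is inferred from the single 1.0.2 example.

```python
from typing import List


def candidate_locations(base: str, version: str) -> List[str]:
    """Return the locations to try, most specific first (hypothetical helper)."""
    base = base.rstrip("/")
    parts = version.split(".")
    full = "".join(parts)              # "1.0.2" -> "102"
    major_minor = "".join(parts[:2])   # "1.0.2" -> "10"
    candidates = [f"{base}_{full}", f"{base}_{major_minor}", base]
    # Drop duplicates while keeping order (e.g. for a two-part version like "1.0").
    return list(dict.fromkeys(candidates))


for location in candidate_locations("s3://umccr-refdata-dev/genomes", "1.0.2"):
    print(location)
```

The first location in the list that exists would be the one used; how umccrise tests for existence on S3 or GDS is not covered by this sketch.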

umccrise will sync the reference data locally into a ~/umccrise_genomes directory. You can symlink any other path to that path if you want a different location. If the data is already downloaded, umccrise will only attempt to update the changed files or upload new ones. To avoid attempts to check S3/GDS again at all, specify the downloaded location directly: --genomes ~/umccrise_genomes

Another way to specify the reference data is through the $UMCCRISE_GENOMES environment variable.

If you have access to UMCCR's AWS account, you can sync the reference data from s3://umccr-refdata-dev. If you have access to UMCCR's NCI Gadi account, you can sync the data from /g/data3/gx8/extras/umccrise/genomes. Otherwise, you can build the bundle from scratch following the details below.

Versioning

The reference data is versioned as a python package at https://github.com/umccr/reference_data

Syncing with AWS S3

ref_data_version=1.0.0
aws s3 sync hg38 s3://umccr-refdata-dev/genomes_${ref_data_version//./}/hg38
# `aws s3 sync` operates on directories; copy the single manifest file with `cp`
aws s3 cp hg38-manifest.txt s3://umccr-refdata-dev/genomes_${ref_data_version//./}/hg38-manifest.txt

Testing

Load the umccrise environment, clone the repo with toy test data, and run nosetests:

source load_umccrise.sh
git clone https://github.com/umccr/umccrise_test_data
TEST_OPTS="-c -j2" nosetests -s umccrise_test_data/test.py

AWS

umccrise on AWS is run via AWS Batch in a defined compute environment. This is set up and maintained via the umccrise Terraform Stack. This stack also defines the version of umccrise that is used within AWS and how umccrise jobs are triggered.

Advanced usage

Inputs with named arguments

Inputs can be provided to umccrise as positional arguments (see Usage) or as named arguments (see examples below). Named arguments are useful for DRAGEN input, which consists of two paired input directories (somatic and germline). The patient and sample identifiers can also be set explicitly for DRAGEN data; in some instances this is required because the identifiers cannot be inferred automatically.

# DRAGEN input with named arguments
umccrise --dragen_somatic_dir PATH --dragen_germline_dir PATH -o umccrised/

# Explicitly setting subject identifier for provided DRAGEN input
umccrise --dragen_somatic_dir PATH --dragen_germline_dir PATH --dragen_subject_id IDENTIFIER -o umccrised/
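
When umccrise is launched from a script rather than typed by hand, the same named arguments can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of umccrise: the helper name dragen_command, the directory paths, and the identifier SBJ00001 are placeholders, and only the flags shown in the examples above are used.

```python
import shlex
from typing import List, Optional


def dragen_command(somatic_dir: str, germline_dir: str, out_dir: str,
                   subject_id: Optional[str] = None) -> List[str]:
    """Assemble the umccrise argument list for a DRAGEN run (hypothetical helper)."""
    cmd = [
        "umccrise",
        "--dragen_somatic_dir", somatic_dir,
        "--dragen_germline_dir", germline_dir,
    ]
    if subject_id is not None:
        # Only needed when the identifier cannot be inferred automatically.
        cmd += ["--dragen_subject_id", subject_id]
    cmd += ["-o", out_dir]
    return cmd


# Placeholder paths and identifier, printed as a shell-quoted command line.
cmd = dragen_command("dragen/somatic", "dragen/germline", "umccrised/",
                     subject_id="SBJ00001")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from within the loaded umccrise environment; keeping the arguments as a list avoids shell quoting problems with unusual paths.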

Controlling the number of CPUs

To limit the number of CPUs umccrise may use, set the -j option:

umccrise <input-folder> -j30

Running selected stages

The umccrise workflow includes multiple processing stages that can optionally be run in isolation. The following stages are run by default:

conpair
structural
somatic, germline (part of small_variants)
pcgr
cpsr
purple
mosdepth, goleft, cacao (part of coverage)
oncoviruses
cancer_report
multiqc

The following stages are optionally available and can be enabled with -T:

microbiome
immuno

Example:

# Run only multiqc and PCGR:
umccrise /bcbio/final/ -T multiqc -T pcgr

To exclude stages, use -E:

# Run all default stages except the `conpair` report for contamination and T/N concordance:
umccrise /bcbio/final/ -E conpair
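
A misspelled stage name is easy to miss on a long command line, so a launcher script may want to check names before calling umccrise. The following is a minimal sketch under stated assumptions: the helper stage_flags is hypothetical, the stage names are copied from the lists above, and the group names (small_variants, coverage) are deliberately left out because this section does not say whether they are accepted by -T or -E.

```python
from typing import Iterable, List

# Stage names as listed in this section (group names omitted).
DEFAULT_STAGES = [
    "conpair", "structural", "somatic", "germline", "pcgr", "cpsr", "purple",
    "mosdepth", "goleft", "cacao", "oncoviruses", "cancer_report", "multiqc",
]
OPTIONAL_STAGES = ["microbiome", "immuno"]
KNOWN_STAGES = set(DEFAULT_STAGES) | set(OPTIONAL_STAGES)


def stage_flags(include: Iterable[str] = (), exclude: Iterable[str] = ()) -> List[str]:
    """Turn stage names into repeated -T / -E arguments (hypothetical helper)."""
    include, exclude = list(include), list(exclude)
    unknown = [name for name in include + exclude if name not in KNOWN_STAGES]
    if unknown:
        raise ValueError("unknown stage(s): " + ", ".join(unknown))
    flags: List[str] = []
    for name in include:
        flags += ["-T", name]
    for name in exclude:
        flags += ["-E", name]
    return flags


print(stage_flags(include=["multiqc", "pcgr"]))
print(stage_flags(exclude=["conpair"]))
```

The returned list can be appended to the rest of the umccrise arguments. The check reflects only the names documented here, so it would need updating if the set of stages changes.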

Custom input

umccrise supports bcbio-nextgen and DRAGEN projects as input. However, you can also feed custom files as multiple positional arguments. VCF and BAM files are supported. The sample name will be extracted from VCF and BAM headers. For now, the VCF file is assumed to contain T/N somatic small variant calls, and the BAM file is assumed to be from the tumor.

umccrise sample1.bam sample2.bam sample1.vcf.gz sample3.vcf.gz -o umccrised -j10

You can also provide a TSV file as input. If any input file has the extension .tsv (e.g. umccrise input.tsv), it is assumed to be a TSV file with a header and any of the following columns, in arbitrary order:

sample
wgs (WGS tumor BAM, required)
normal (WGS normal BAM, required)
exome (optional tumor BAM)
exome_normal (optional normal BAM)
rna (optiona
