Umccrise
:snake: DRAGEN Tumor/Normal workflow post-processing
UMCCR WGS tumor/normal reporting
umccrise is a Snakemake workflow that post-processes results from the Illumina DRAGEN WGS tumor/normal pipeline and generates HTML reports helpful for researchers and curators at UMCCR.
- Summary
- Detailed Workflow
- History
- Example Reports
- Usage
- Installation
- Reference data
- Testing
- AWS
- Advanced usage
- Updating
- Development
- Docker
- Building reference data
Summary
In summary, umccrise can:
- Filter artefacts and germline leakage from somatic variant calls
- Run PCGR to annotate, prioritize and report somatic variants
- Run CPSR to annotate, prioritize and report germline variants
- Filter, annotate, prioritize and report structural variants (SVs) from Manta
- Run PURPLE to call copy number variants (CNVs), recover SVs, and infer tumor purity & ploidy
- Generate a MultiQC report that summarizes quality control statistics in the context of background "gold standard" samples
- Generate a cancer report with mutational signatures, inferred HRD status, circos plots, prioritized copy number and structural variant calls
- Run CACAO to calculate coverage in common hotspots, as well as goleft indexcov to estimate coverage problems
- Run Conpair to estimate tumor/normal concordance and sample contamination
- Run oviraptor to detect viral integration sites and affected genes
Detailed Workflow
See workflow.md for a detailed description of the workflow.
History
See HISTORY.md for the version history.
Example Reports
Below are example reports for an HCC1395/HCC1395BL cell line tumor/normal pair sequenced and validated by the SEQC-II consortium.
1. MultiQC (quality control metrics and plots) 
2. Cancer report (mutational signatures, circos plots, CNV, SV, oncoviruses) 
Usage
Given input data from DRAGEN somatic and germline output folders, or a
custom set of BAM or VCF files, umccrise can be run with:
umccrise <input-data ...> -o umccrised
For more options, see Advanced usage.
Installation
Create a umccrise directory, then clone the umccrise GitHub repo and install
the required conda environments:
mkdir umccrise
cd umccrise
git clone https://github.com/umccr/umccrise umccrise.git
bash umccrise.git/install.sh
The above will generate a load_umccrise.sh script that can be sourced to load
the umccrise conda environment on demand:
source load_umccrise.sh
Reference data
umccrise needs a 64G bundle of reference data to run. From within the UMCCR AWS
setup, sign in to AWS and run umccrise_refdata_pull:
aws sso login --profile sso-dev-admin
umccrise_refdata_pull
export UMCCRISE_GENOMES=${PWD}/refdata/genomes
Alternatively, you can specify a custom path with --genomes <path>. The path
can be a tarball and will be automatically extracted.
The path can also be a location on S3 or GDS, prefixed with s3:// or gds://.
E.g.:
umccrise /input --genomes s3://umccr-refdata-dev/genomes
Versioned locations are also checked. For the case above, umccrise will check the following locations in the order specified:
s3://umccr-refdata-dev/genomes_102, s3://umccr-refdata-dev/genomes_10, and s3://umccr-refdata-dev/genomes, assuming that the reference_data package version is 1.0.2.
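The lookup order described above can be sketched in shell. This is only an illustration of the naming pattern in the example, not the code umccrise uses internally; the function name candidate_genome_locations is hypothetical, and a three-part MAJOR.MINOR.PATCH version is assumed.

```shell
# Hypothetical helper (not part of umccrise): print the candidate
# reference-data locations for a base path and a MAJOR.MINOR.PATCH version,
# most specific first.
candidate_genome_locations() (
    base=$1                 # e.g. s3://umccr-refdata-dev/genomes
    version=$2              # e.g. 1.0.2
    major=${version%%.*}    # text before the first dot
    rest=${version#*.}      # text after the first dot
    minor=${rest%%.*}
    patch=${rest#*.}
    printf '%s\n' \
        "${base}_${major}${minor}${patch}" \
        "${base}_${major}${minor}" \
        "${base}"
)

candidate_genome_locations s3://umccr-refdata-dev/genomes 1.0.2
```

The function body runs in a subshell, so the temporary variables do not leak into the calling shell.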
umccrise will sync the reference data locally into a ~/umccrise_genomes
directory. You can symlink any other path to that path if you want a different
location. If the data is already downloaded, umccrise will only attempt to
update changed files or download new ones. To avoid attempts to check S3/GDS
again at all, specify the downloaded location directly:
--genomes ~/umccrise_genomes
You can also specify the reference data location through the
$UMCCRISE_GENOMES environment variable.
If you have access to UMCCR's AWS account, you can sync the reference data from
s3://umccr-refdata-dev. If you have access to UMCCR's NCI Gadi account, you
can sync the data from /g/data3/gx8/extras/umccrise/genomes. Otherwise, you
can build the bundle from scratch following the
details below.
Versioning
The reference data is versioned as a Python package at https://github.com/umccr/reference_data.
Syncing with AWS S3
ref_data_version=1.0.0
aws s3 sync hg38 s3://umccr-refdata-dev/genomes_${ref_data_version//./}/hg38
aws s3 cp hg38-manifest.txt s3://umccr-refdata-dev/genomes_${ref_data_version//./}/hg38-manifest.txt
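The ${ref_data_version//./} expansion in the commands above is bash-specific: it removes every dot from the version string, so 1.0.0 becomes 100. A minimal sketch of a portable equivalent, useful for checking the destination prefix before syncing (the variable name version_suffix is an assumption made for this example):

```shell
ref_data_version=1.0.0

# Portable equivalent of the bash expansion ${ref_data_version//./}:
# delete every "." from the version string.
version_suffix=$(printf '%s' "$ref_data_version" | tr -d '.')

# Print the destination prefix so it can be inspected before running a sync.
echo "s3://umccr-refdata-dev/genomes_${version_suffix}/hg38"
```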
Testing
Load the umccrise environment, clone the repo with toy test data, and run nosetests:
source load_umccrise.sh
git clone https://github.com/umccr/umccrise_test_data
TEST_OPTS="-c -j2" nosetests -s umccrise_test_data/test.py
AWS
umccrise on AWS is run via AWS Batch in a defined compute environment. This is set up and maintained via the umccrise Terraform Stack. This stack also defines the version of umccrise that is used within AWS and how umccrise jobs are triggered.
Advanced usage
Inputs with named arguments
Inputs can be provided to umccrise as positional arguments (see Usage) or as named arguments (see examples below). Named arguments are useful for DRAGEN input, which has two paired input directories (somatic and germline). The patient and sample identifiers can also be set explicitly for DRAGEN data. In some instances this is required, because the identifiers cannot be inferred automatically.
# DRAGEN input with named arguments
umccrise --dragen_somatic_dir PATH --dragen_germline_dir PATH -o umccrised/
# Explicitly setting subject identifier for provided DRAGEN input
umccrise --dragen_somatic_dir PATH --dragen_germline_dir PATH --dragen_subject_id IDENTIFIER -o umccrised/
Controlling the number of CPUs
To set the number of allowed CPUs to use, set the -j option:
umccrise <input-folder> -j30
Running selected stages
The umccrise workflow includes multiple processing stages that can optionally be run in isolation. The following stages are run by default:
- conpair
- structural
- somatic, germline (part of small_variants)
- pcgr
- cpsr
- purple
- mosdepth, goleft, cacao (part of coverage)
- oncoviruses
- cancer_report
- multiqc
The following stages are optionally available and can be enabled with -T:
- microbiome
- immuno
Example:
# Run only multiqc and PCGR:
umccrise /bcbio/final/ -T multiqc -T pcgr
To exclude stages, use -E:
# Runs all default stages excluding `conpair` report for contamination and T/N concordance
umccrise /bcbio/final/ -E conpair
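Because -T and -E are given once per stage, a long stage list can be easier to manage with a small helper. The sketch below only builds and prints the command line and does not run umccrise; the function name stage_flags is hypothetical and not part of umccrise.

```shell
# Hypothetical helper (not part of umccrise): turn a list of stage names
# into repeated flags, e.g. "-T multiqc -T pcgr".
stage_flags() (
    flag=$1     # -T to select stages, -E to exclude them
    shift
    out=""
    for stage in "$@"; do
        # Append "<flag> <stage>", separated by a space after the first item.
        out="${out:+$out }${flag} ${stage}"
    done
    printf '%s\n' "$out"
)

# Print the command instead of running it, so it can be reviewed first.
echo "umccrise /bcbio/final/ $(stage_flags -T multiqc pcgr)"
```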
Custom input
umccrise supports bcbio-nextgen and DRAGEN projects as input. However, you can also feed custom files as multiple positional arguments. VCF and BAM files are supported. The sample name will be extracted from VCF and BAM headers. For now, the VCF file is assumed to contain T/N somatic small variant calls, and the BAM file is assumed to be from the tumor.
umccrise sample1.bam sample2.bam sample1.vcf.gz sample3.vcf.gz -o umccrised -j10
You can also provide a TSV file as input. If any input file has the extension
.tsv (e.g. umccrise input.tsv), it is assumed to be a TSV file with a
header and any of the following columns, in arbitrary order:
- sample
- wgs (WGS tumor BAM, required)
- normal (WGS normal BAM, required)
- exome (optional tumor BAM)
- exome_normal (optional normal BAM)
- rna (optiona