Umccrise

:snake: DRAGEN Tumor/Normal workflow post-processing

UMCCR WGS tumor/normal reporting

umccrise is a Snakemake workflow that post-processes results from the Illumina DRAGEN WGS tumor/normal pipeline and generates HTML reports helpful for researchers and curators at UMCCR.

Summary

In summary, umccrise can:

  • Filter artefacts and germline leakage from somatic variant calls
  • Run PCGR to annotate, prioritize and report somatic variants
  • Run CPSR to annotate, prioritize and report germline variants
  • Filter, annotate, prioritize and report structural variants (SVs) from Manta
  • Run PURPLE to call copy number variants (CNVs), recover SVs, and infer tumor purity & ploidy
  • Generate a MultiQC report that summarizes quality control statistics in the context of background "gold standard" samples
  • Generate a cancer report with mutational signatures, inferred HRD status, circos plots, prioritized copy number and structural variant calls
  • Run CACAO to calculate coverage in common hotspots, as well as goleft indexcov to estimate coverage problems
  • Run Conpair to estimate tumor/normal concordance and sample contamination
  • Run oviraptor to detect viral integration sites and affected genes

Detailed Workflow

See workflow.md for a detailed description of the workflow.

History

See HISTORY.md for the version history.

Example Reports

Below are example reports for an HCC1395/HCC1395BL cell line tumor/normal pair sequenced and validated by the SEQC-II consortium.

1. MultiQC (quality control metrics and plots)
2. Cancer report (mutational signatures, circos plots, CNV, SV, oncoviruses)
3. CPSR (germline variants)
4. PCGR (somatic variants)
5. CACAO (coverage reports)

Usage

Given input data from DRAGEN somatic and germline output folders, or a custom set of BAM or VCF files, umccrise can be run with:

umccrise <input-data ...> -o umccrised

For more options, see Advanced usage.

Installation

Create a umccrise directory, clone the umccrise GitHub repo into it, and install the required conda environments:

mkdir umccrise
cd umccrise
git clone https://github.com/umccr/umccrise umccrise.git
bash umccrise.git/install.sh

The above will generate a load_umccrise.sh script that can be sourced to load the umccrise conda environment on demand:

source load_umccrise.sh

Reference data

umccrise needs a 64G bundle of reference data to run. From within the UMCCR AWS setup, sign in to AWS and run umccrise_refdata_pull:

aws sso login --profile sso-dev-admin
umccrise_refdata_pull
export UMCCRISE_GENOMES=${PWD}/refdata/genomes

Alternatively, you can specify a custom path with --genomes <path>. The path can be a tarball and will be automatically extracted.

The path can also be a location on S3 or GDS, prefixed with s3:// or gds://. E.g.:

umccrise /input --genomes s3://umccr-refdata-dev/genomes

Versioned locations are also checked. For the case above, assuming that the reference_data package version is 1.0.2, umccrise will check the following locations in the order specified:

  • s3://umccr-refdata-dev/genomes_102
  • s3://umccr-refdata-dev/genomes_10
  • s3://umccr-refdata-dev/genomes
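The lookup order above can be sketched in a few lines of Python. This is only an illustration, not umccrise's actual code: the function name candidate_genome_locations is hypothetical, and the rule it applies (full version with dots removed, then major and minor only, then the bare path) is inferred from the single 1.0.2 example.

```python
from typing import List


def candidate_genome_locations(base: str, version: str) -> List[str]:
    """Return the locations to try, most specific first.

    Assumption: the version suffix is the dotted version with dots removed,
    followed by a shorter suffix built from the major and minor parts only.
    """
    parts = version.split(".")
    full_suffix = "".join(parts)       # "1.0.2" -> "102"
    short_suffix = "".join(parts[:2])  # "1.0.2" -> "10"
    base = base.rstrip("/")
    return [f"{base}_{full_suffix}", f"{base}_{short_suffix}", base]


for location in candidate_genome_locations("s3://umccr-refdata-dev/genomes", "1.0.2"):
    print(location)
```

The first location that exists would be the one used; checking for existence on S3 or GDS is left out of the sketch.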

umccrise will sync the reference data locally into a ~/umccrise_genomes directory. To store the data elsewhere, make ~/umccrise_genomes a symlink to the location you prefer. If the data is already downloaded, umccrise will only attempt to update the changed files or download new ones. To skip the S3/GDS check entirely, pass the downloaded location directly: --genomes ~/umccrise_genomes

You can also specify the reference data location through the $UMCCRISE_GENOMES environment variable.
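As a rough illustration of how the two ways of pointing at the reference data might interact, the sketch below picks a location and labels it as remote or local. The function name resolve_genomes is hypothetical, and the precedence (an explicit --genomes value wins over $UMCCRISE_GENOMES) is an assumption that this README does not state.

```python
import os
from typing import Mapping, Optional, Tuple

# Prefixes this README names for remote reference data locations.
REMOTE_PREFIXES = ("s3://", "gds://")


def resolve_genomes(
    cli_value: Optional[str], env: Mapping[str, str]
) -> Optional[Tuple[str, str]]:
    """Return (kind, location), or None when nothing was specified."""
    # Assumption: the command-line value takes priority over the environment.
    location = cli_value or env.get("UMCCRISE_GENOMES")
    if not location:
        return None
    kind = "remote" if location.startswith(REMOTE_PREFIXES) else "local"
    return kind, location


# With no --genomes value, fall back to whatever the environment provides.
print(resolve_genomes(None, os.environ))
```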

If you have access to UMCCR's AWS account, you can sync the reference data from s3://umccr-refdata-dev. If you have access to UMCCR's NCI Gadi account, you can sync the data from /g/data3/gx8/extras/umccrise/genomes. Otherwise, you can build the bundle from scratch following the details below.

Versioning

The reference data is versioned as a Python package at https://github.com/umccr/reference_data

Syncing with AWS S3

ref_data_version=1.0.0
aws s3 sync hg38 s3://umccr-refdata-dev/genomes_${ref_data_version//./}/hg38
aws s3 cp hg38-manifest.txt s3://umccr-refdata-dev/genomes_${ref_data_version//./}/hg38-manifest.txt

Testing

Load the umccrise environment, clone the repo with toy test data, and run nosetests:

source load_umccrise.sh
git clone https://github.com/umccr/umccrise_test_data
TEST_OPTS="-c -j2" nosetests -s umccrise_test_data/test.py

AWS

umccrise on AWS is run via AWS Batch in a defined compute environment. This is set up and maintained via the umccrise Terraform Stack. This stack also defines the version of umccrise that is used within AWS and how umccrise jobs are triggered.

Advanced usage

Inputs with named arguments

Inputs can be provided to umccrise as a positional argument (see Usage) or alternatively as named arguments (see examples below). This is useful when dealing with DRAGEN input, which has two paired input directories (somatic and germline). The patient and sample identifiers can also be set explicitly for DRAGEN data; in some instances this is required because the identifiers cannot be inferred automatically.

# DRAGEN input with named arguments
umccrise --dragen_somatic_dir PATH --dragen_germline_dir PATH -o umccrised/

# Explicitly setting subject identifier for provided DRAGEN input
umccrise --dragen_somatic_dir PATH --dragen_germline_dir PATH --dragen_subject_id IDENTIFIER -o umccrised/
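When umccrise is launched from a script, the same command line can be assembled programmatically. The helper below is a hypothetical sketch: build_umccrise_cmd is not part of umccrise, it uses only the flags shown in this README, and the paths and subject identifier in the example are placeholders.

```python
import shlex
from typing import List, Optional


def build_umccrise_cmd(
    somatic_dir: str,
    germline_dir: str,
    out_dir: str,
    subject_id: Optional[str] = None,
    jobs: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for a DRAGEN run with named arguments."""
    cmd = [
        "umccrise",
        "--dragen_somatic_dir", somatic_dir,
        "--dragen_germline_dir", germline_dir,
    ]
    if subject_id is not None:
        # Only needed when the identifier cannot be inferred from the input.
        cmd += ["--dragen_subject_id", subject_id]
    if jobs is not None:
        cmd.append(f"-j{jobs}")
    cmd += ["-o", out_dir]
    return cmd


cmd = build_umccrise_cmd(
    "dragen/somatic", "dragen/germline", "umccrised/", subject_id="SUBJECT_ID"
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the umccrise environment is loaded.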

Controlling the number of CPUs

To limit the number of CPUs umccrise may use, set the -j option:

umccrise <input-folder> -j30

Running selected stages

The umccrise workflow includes multiple processing stages that can optionally be run in isolation. The following stages are run by default:

  • conpair
  • structural
  • somatic, germline (part of small_variants)
  • pcgr
  • cpsr
  • purple
  • mosdepth, goleft, cacao (part of coverage)
  • oncoviruses
  • cancer_report
  • multiqc

The following stages are optional and can be enabled with -T:

  • microbiome
  • immuno

Example:

# Run only multiqc and PCGR:
umccrise /bcbio/final/ -T multiqc -T pcgr

To exclude stages, use -E:

# Run all default stages except `conpair` (contamination and T/N concordance report)
umccrise /bcbio/final/ -E conpair
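The way -T and -E interact can be modelled roughly as follows. This is a sketch under stated assumptions rather than umccrise's implementation: select_stages is a hypothetical name, passing any -T value is assumed to replace the default set (consistent with the "run only" example above), and accepting the group names small_variants and coverage as shorthand is also an assumption. In the real workflow Snakemake may additionally pull in stages that a requested stage depends on.

```python
from typing import Dict, List, Sequence

DEFAULT_STAGES = [
    "conpair", "structural", "somatic", "germline", "pcgr", "cpsr", "purple",
    "mosdepth", "goleft", "cacao", "oncoviruses", "cancer_report", "multiqc",
]
OPTIONAL_STAGES = ["microbiome", "immuno"]
# Group names taken from the stage list; treating them as selectable is an assumption.
GROUPS: Dict[str, List[str]] = {
    "small_variants": ["somatic", "germline"],
    "coverage": ["mosdepth", "goleft", "cacao"],
}


def _expand(names: Sequence[str]) -> List[str]:
    """Replace group names with their member stages."""
    expanded: List[str] = []
    for name in names:
        expanded.extend(GROUPS.get(name, [name]))
    return expanded


def select_stages(include: Sequence[str] = (), exclude: Sequence[str] = ()) -> List[str]:
    """Return the stages to run for the given -T (include) and -E (exclude) values."""
    known = DEFAULT_STAGES + OPTIONAL_STAGES
    requested = _expand(include)
    excluded = _expand(exclude)
    unknown = [name for name in requested + excluded if name not in known]
    if unknown:
        raise ValueError(f"unknown stage(s): {', '.join(unknown)}")
    selected = requested if requested else DEFAULT_STAGES
    # Report stages in the order they are listed in this README.
    return [name for name in known if name in selected and name not in excluded]


print(select_stages(include=["multiqc", "pcgr"]))
print(select_stages(exclude=["conpair"]))
```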

Custom input

umccrise supports bcbio-nextgen and DRAGEN projects as input. However, you can also feed custom files as multiple positional arguments. VCF and BAM files are supported. The sample name will be extracted from VCF and BAM headers. For now, the VCF file is assumed to contain T/N somatic small variant calls, and the BAM file is assumed to be from the tumor.

umccrise sample1.bam sample2.bam sample1.vcf.gz sample3.vcf.gz -o umccrised -j10

You can also provide a TSV file as input. If an input file has the .tsv extension (e.g. umccrise input.tsv), it is assumed to be a TSV file with a header row and any of the following columns, in arbitrary order:

  • sample
  • wgs (WGS tumor BAM, required)
  • normal (WGS normal BAM, required)
  • exome (optional tumor BAM)
  • exome_normal (optional normal BAM)
  • rna (optiona
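A TSV with the columns described above could be read and checked along these lines. The sketch is illustrative only: read_input_tsv is a hypothetical helper, it validates just the two columns marked as required (wgs and normal), and the file paths in the example are made up.

```python
import csv
import io
from typing import Dict, List, TextIO

# Columns this README marks as required.
REQUIRED_COLUMNS = ("wgs", "normal")


def read_input_tsv(handle: TextIO) -> List[Dict[str, str]]:
    """Parse a tab-separated input sheet and check the required columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not row[column]:
                raise ValueError(f"line {line_no}: empty value for '{column}'")
        rows.append(row)
    return rows


example = (
    "sample\twgs\tnormal\n"
    "sample1\t/data/sample1_tumor.bam\t/data/sample1_normal.bam\n"
)
rows = read_input_tsv(io.StringIO(example))
print(rows[0]["sample"], rows[0]["wgs"], rows[0]["normal"])
```

Because csv.DictReader keys each row by header name, the column order in the file does not matter.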
