Cawdor

Cancer analysis workflow for WGS and WTS sequencing data


Introduction

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. The structure is inspired by bcbio-nextgen, a Python-based NGS analysis framework; the Nextflow implementation is inspired by Sarek, a cancer analysis workflow, and by nf-core templates; some bioinformatics approaches are borrowed from the Hartwig Medical Foundation pipeline. Post-processing was originally part of the Snakemake-based workflow umccrise.

Variant calling is currently supported for WGS tumor/normal paired samples:

  • Somatic variant calling: VarDict, Mutect2, Strelka2, SAGE
  • Structural variant calling: Manta
  • CNV and heterogeneity: PURPLE
  • Germline variant calling: VarDict, GATK4 Haplotype Caller, Strelka2

Installation

Install Java 8, or create a conda environment (the nextflow package pulls in Java 8):

conda create -n cawdor -c bioconda nextflow
conda activate cawdor

If you want to run on the [NCI](https://opus.nci.org.au/display/Help/Raijin+User+Guide#RaijinUserGuide-InteractivePBSJobs) cluster, clone and compile a custom Nextflow instance:

git clone https://github.com/vladsaveliev/nextflow
cd nextflow
make compile pack
cp build/releases/nextflow-* $CONDA_PREFIX/bin/nextflow
chmod +x $CONDA_PREFIX/bin/nextflow

For convenience, create a loader script with the following contents:

unset PYTHONPATH
unset PERL5LIB
export PATH=$CONDA_PREFIX/bin:$CONDA_PREFIX/../../bin:$PATH
export CONDA_PREFIX=$CONDA_PREFIX
export NXF_HOME=$(pwd)/.nextflow
export NXF_WORK=$(pwd)/scratch
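
For example, the loader can be saved to a file and sourced at the start of each session (the file name load_cawdor.sh is just an illustration, not part of the pipeline):

```shell
# Save the loader script (hypothetical name) and source it in the current shell.
cat > load_cawdor.sh <<'EOF'
unset PYTHONPATH
unset PERL5LIB
export PATH=$CONDA_PREFIX/bin:$CONDA_PREFIX/../../bin:$PATH
export CONDA_PREFIX=$CONDA_PREFIX
export NXF_HOME=$(pwd)/.nextflow
export NXF_WORK=$(pwd)/scratch
EOF
. ./load_cawdor.sh   # must be sourced, not executed, so the exports persist
echo "NXF_HOME=$NXF_HOME"
```

Sourcing (rather than running) the script is what makes the exported variables visible to the nextflow commands you launch afterwards.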

Also check the Nextflow and nf-core documentation:

  1. Installation
  2. Pipeline configuration
  3. Troubleshooting

Usage

Cawdor consists of several subworkflows: align.nf to align reads and get alignment QC, somatic.nf to call somatic variants (SNVs, indels, SVs, and CNVs), germline.nf to call germline variants, and postprocess.nf to annotate and prioritise variants, generate reports and QC.

The typical command for running the pipeline is as follows:

nextflow run align.nf --samplesDir /samples -profile raijin --outDir Results --genome GRCh37

This will launch the pipeline with the raijin cluster configuration profile. See below for more information about profiles.

Note that the pipeline will create the following files in your working directory:

work/            # Directory containing the nextflow working files
Results/         # Finished results (configurable, see below)
.nextflow.log    # Log file from Nextflow
.nextflow/       # Folder with other Nextflow hidden files

Specifying input data

Input files can be either raw FastQ files or aligned BAM files. To specify them, provide either a directory or a TSV file.

Samples directory

To run, specify the input directory with --samplesDir. The directory is searched recursively for FastQ files named *_R1_*.fastq.gz, each with a matching pair named with _R2_ instead of _R1_:

nextflow run align.nf --samplesDir /samples

For multiple patients, organize the folder with one subfolder per sample:

ID
+--sample1
+-----sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+-----sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+-----sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
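
A layout like the one above can be sketched for a dry run with placeholder files (the paths and names are the README's own examples, abbreviated to two samples; they are not real data):

```shell
# Recreate a minimal per-sample layout under a scratch directory.
mkdir -p samples/sample1 samples/sample2
touch samples/sample1/sample1_lib_flowcell-index_lane_R1_1000.fastq.gz \
      samples/sample1/sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
touch samples/sample2/sample2_lib_flowcell-index_lane_R1_1000.fastq.gz \
      samples/sample2/sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
# Every *_R1_* file must have a matching *_R2_* mate for the recursive search to pair them.
find samples -name '*_R1_*.fastq.gz' | wc -l
```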

FastQ filename structure:

  • sample_lib_flowcell-index_lane_R1_1000.fastq.gz and
  • sample_lib_flowcell-index_lane_R2_1000.fastq.gz

Where:

  • sample = sample id
  • lib = identifier of the library preparation
  • flowcell = identifier of the flow cell for the sequencing run
  • lane = identifier of the lane of the sequencing run

Read group information will be parsed from FastQ file names as follows:

  • RGID = "sample_lib_flowcell_index_lane"
  • RGPL = "Illumina"
  • PU = sample
  • RGLB = lib
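
As a sketch of that convention, the fields can be split out of a file name in the shell (the example name and the exact RGID layout are assumptions based on the pattern above, not the pipeline's actual parsing code):

```shell
# Split an underscore-delimited FastQ name into read-group fields.
# The name follows the <sample>_<lib>_<flowcell>-<index>_<lane>_R1_1000 pattern.
fq="sample1_lib1_FC01-ATCACG_L001_R1_1000.fastq.gz"
base=${fq%.fastq.gz}
oldIFS=$IFS; IFS=_; set -- $base; IFS=$oldIFS
sample=$1; lib=$2; fcidx=$3; lane=$4
flowcell=${fcidx%-*}; index=${fcidx#*-}    # flowcell and index are joined by '-'
rgid="${sample}_${lib}_${flowcell}_${index}_${lane}"
echo "RGID=$rgid RGPL=Illumina PU=$sample RGLB=$lib"
```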

Samples TSV file

Another option is to specify a TSV file with --samples, with one row per sample and lane:

nextflow run align.nf --samples samples.tsv

The TSV file should have at least one tab-separated line:

SUBJECT_ID_1	0	SAMPLE_1_N	1	/samples/normal1_1.fastq.gz	/samples/normal1_2.fastq.gz
SUBJECT_ID_1	1	SAMPLE_1_T	3	/samples/tumor1_1.fastq.gz	/samples/tumor1_2.fastq.gz
SUBJECT_ID_2	0	SAMPLE_2_N	2	/samples/normal2_1.fastq.gz	/samples/normal2_2.fastq.gz
SUBJECT_ID_2	1	SAMPLE_2_T	4	/samples/tumor2_1.fastq.gz	/samples/tumor2_2.fastq.gz

The columns are:

  1. Subject (batch) id
  2. Status: 0 if normal, 1 if tumor
  3. Sample ID: text identifier of the sample, typically reflecting its type (e.g. SAMPLE_1_N vs SAMPLE_1_T)
  4. Lane ID, used when the sample is multiplexed across several lanes
  5. First set of reads
  6. Second set of reads
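
A quick sanity check of the TSV shape can be sketched with awk (the file name samples.tsv and the checks are illustrative, not part of the pipeline):

```shell
# Write two example rows with real tab separators, then validate the shape:
# every row must have 6 columns and a status (column 2) of 0 or 1.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  SUBJECT_ID_1 0 SAMPLE_1_N 1 /samples/normal1_1.fastq.gz /samples/normal1_2.fastq.gz \
  SUBJECT_ID_1 1 SAMPLE_1_T 3 /samples/tumor1_1.fastq.gz /samples/tumor1_2.fastq.gz \
  > samples.tsv
awk -F'\t' 'NF != 6 || ($2 != 0 && $2 != 1) { bad++ } END { exit bad ? 1 : 0 }' samples.tsv \
  && echo "samples.tsv looks well-formed"
```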

To run from BAM files, create a 5-column TSV file:

SUBJECT_ID_1	0	SAMPLE_1_N	1	/samples/normal_1.bam
SUBJECT_ID_1	1	SAMPLE_1_T	3	/samples/tumor_1.bam
SUBJECT_ID_2	0	SAMPLE_2_N	2	/samples/normal_2.bam
SUBJECT_ID_2	1	SAMPLE_2_T	4	/samples/tumor_2.bam

Another option is to specify one folder per sample, which is useful when you have many lanes:

SUBJECT_ID_1	0	SAMPLE_1_N	1	/samples/sample_n_1
SUBJECT_ID_1	1	SAMPLE_1_T	3	/samples/sample_t_1
SUBJECT_ID_2	0	SAMPLE_2_N	2	/samples/sample_n_2
SUBJECT_ID_2	1	SAMPLE_2_T	4	/samples/sample_t_2

Reference genomes

The pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with Docker or on AWS, the configuration is set up to use the AWS-iGenomes resource.

--genome (using iGenomes)

There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.

You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:

  • Human
    • --genome GRCh37
    • --genome hg38

Presets exist for the raijin and spartan environments. For other machines, provide the location of the genomes with the --genomes_base option; the directory should have the following structure:

bwaIndex         = "${params.genomes_base}/${params.genome}/${params.genome}.fa.{amb,ann,bwt,pac,sa}"
genomeDict       = "${params.genomes_base}/${params.genome}/${params.genome}.dict"
genomeFasta      = "${params.genomes_base}/${params.genome}/${params.genome}.fa"
genomeIndex      = "${params.genomes_base}/${params.genome}/${params.genome}.fa.fai"
intervals        = "${params.genomes_base}/${params.genome}/wgs_calling_regions_CAW.list"
dbsnp            = "${params.genomes_base}/${params.genome}/dbsnp-151.vcf.gz"
dbsnpIndex       = "${params.genomes_base}/${params.genome}/dbsnp-151.vcf.gz.tbi"
vepCacheVersion  = "94"
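
Assuming that layout, a small pre-flight check can confirm the files exist before launching (a sketch only; the file list is abbreviated to the core references, and the paths are placeholders):

```shell
# Placeholder paths: point genomes_base/genome at your actual reference location.
genomes_base=./genomes
genome=GRCh37
# For this demo, create the core files the config expects; skip this with real references.
mkdir -p "$genomes_base/$genome"
touch "$genomes_base/$genome/$genome.fa" \
      "$genomes_base/$genome/$genome.fa.fai" \
      "$genomes_base/$genome/$genome.dict"
# Check that each expected file is present before starting a run.
missing=0
for f in "$genome.fa" "$genome.fa.fai" "$genome.dict"; do
  [ -e "$genomes_base/$genome/$f" ] || { echo "missing: $f"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "core reference files present"
```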

-profile

Use this parameter to choose a configuration profile. Profiles provide configuration presets for different compute environments. Multiple profiles can be loaded as a comma-separated list, for example -profile raijin,conda; the order matters, since later profiles override earlier ones.

If -profile is not specified, the pipeline runs locally and expects all software to be installed and available on the PATH.

  • raijin
    • Uses the NCI PBS Pro scheduler as executor; also knows about available resources, the default location of the conda environment, and reference genomes.
  • spartan
    • Uses the Spartan Slurm scheduler as executor; also knows about available resources, the default location of the conda environment, and reference genomes.
  • awsbatch
    • A generic configuration profile to be used with AWS Batch.
  • conda
    • A generic configuration profile to be used with conda
    • Pulls most software from Bioconda
  • docker
  • singularity
  • test
    • A profile with a complete configuration for automated testing
    • Includes links to test data so needs no other parameters

Example: running on NCI Raijin:

nextflow run align.nf --samplesDir ../Sarek/Sarek-data/testdata/tin --outDir Results --genome smallGRCh37 -profile raijin

You can also run on NCI on a local node, using one CPU, by overriding the -process.* Nextflow options:

nextflow run align.nf --samplesDir ../Sarek/Sarek-data/testdata/tin --outDir Results --genome smallGRCh37 -profile raijin -process.cpus=1 -process.executor=local

Other command line parameters

--outDir

The output directory where the results will be saved.
