Cawdor
Cancer analysis workflow for WGS and WTS sequencing data
Introduction
The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a portable manner. The structure is inspired by bcbio-nextgen, a Python-based NGS analysis framework; the Nextflow implementation is inspired by Sarek, a cancer analysis workflow, and nf-core templates; some bioinformatics approaches are borrowed from the Hartwig Medical Foundation pipeline. Post-processing was originally part of the Snakemake-based workflow umccrise.
Currently supported variant calling for WGS tumor/normal paired samples:
- Somatic variant calling: VarDict, Mutect2, Strelka2, SAGE
- Structural variant calling: Manta
- CNV and heterogeneity: PURPLE
- Germline variant calling: VarDict, GATK4 Haplotype Caller, Strelka2
Installation
Install Java 8 or create a conda environment (the nextflow package brings Java 8):
conda create -n cawdor -c bioconda nextflow
conda activate cawdor
If you want to run on the NCI Raijin cluster (https://opus.nci.org.au/display/Help/Raijin+User+Guide#RaijinUserGuide-InteractivePBSJobs), clone and compile a custom Nextflow instance:
git clone https://github.com/vladsaveliev/nextflow
cd nextflow
make compile pack
cp build/releases/nextflow-* $CONDA_PREFIX/bin/nextflow
chmod +x $CONDA_PREFIX/bin/nextflow
For convenience, create a loader script with the following contents:
unset PYTHONPATH
unset PERL5LIB
export PATH=$CONDA_PREFIX/bin:$CONDA_PREFIX/../../bin:$PATH
export CONDA_PREFIX=$CONDA_PREFIX
export NXF_HOME=$(pwd)/.nextflow
export NXF_WORK=$(pwd)/scratch
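For example, the loader can be saved to a file and sourced at the start of each session (the file name cawdor_env.sh below is arbitrary, and CONDA_PREFIX is assumed to point at the activated environment):

```shell
# Save the loader script (name is arbitrary), then source it in each session.
cat > cawdor_env.sh <<'EOF'
unset PYTHONPATH
unset PERL5LIB
export PATH=$CONDA_PREFIX/bin:$CONDA_PREFIX/../../bin:$PATH
export NXF_HOME=$(pwd)/.nextflow
export NXF_WORK=$(pwd)/scratch
EOF
. ./cawdor_env.sh
echo "$NXF_HOME"   # now points at .nextflow under the current directory
```

Sourcing (rather than executing) the script is what makes the exported variables visible to subsequent nextflow invocations in the same shell.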
Also check nextflow and nf-core documentation:
- Installation
- Pipeline configuration
- Troubleshooting
Usage
Cawdor consists of several subworkflows: align.nf to align reads and get alignment QC, somatic.nf to call somatic variants (SNVs, indels, SVs, and CNVs), germline.nf to call germline variants, and postprocess.nf to annotate and prioritise variants, generate reports and QC.
The typical command for running the pipeline is as follows:
nextflow run align.nf --samplesDir /samples -profile raijin --outDir Results --genome GRCh37
This will launch the pipeline with the raijin cluster configuration profile. See below for more information about profiles.
Note that the pipeline will create the following files in your working directory:
work/ # Directory containing the nextflow working files
Results/ # Finished results (configurable, see below)
.nextflow.log # Log file from Nextflow
.nextflow/ # Folder with other Nextflow hidden files
Specifying input data
Input files can be either raw FastQ files or aligned BAM files. To specify input files, you can either provide a directory, or a TSV file.
Samples directory
To run, you can specify the input directory with --samplesDir. The directory is searched recursively for FastQ files named *_R1_*.fastq.gz, each with a matching pair named with _R2_ instead of _R1_:
nextflow run align.nf --samplesDir /samples
For multiple samples, organise the folder into one subfolder per sample:
ID
+--sample1
+-----sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+-----sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+-----sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
FastQ filename structure:
sample_lib_flowcell-index_lane_R1_1000.fastq.gz and sample_lib_flowcell-index_lane_R2_1000.fastq.gz
Where:
- sample = sample ID
- lib = identifier of the library preparation
- flowcell = identifier of the flow cell for the sequencing run
- lane = identifier of the lane of the sequencing run
Read group information will be parsed from FastQ file names as follows:
- RGID = "sample_lib_flowcell_index_lane"
- RGPL = "Illumina"
- PU = sample
- RGLB = lib
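As an illustration only (the pipeline does this parsing in Nextflow; the filename below is a made-up example of the naming convention), the mapping can be sketched in shell:

```shell
# Hypothetical example filename following the convention above:
fq=sampleA_lib1_FC01-ACGT_L001_R1_1000.fastq.gz
base=${fq%.fastq.gz}
sample=$(echo "$base" | cut -d_ -f1)      # sampleA
lib=$(echo "$base" | cut -d_ -f2)         # lib1
fc_index=$(echo "$base" | cut -d_ -f3)    # FC01-ACGT (flowcell-index)
lane=$(echo "$base" | cut -d_ -f4)        # L001
# Assemble RGID as "sample_lib_flowcell_index_lane":
rgid=$(echo "${sample}_${lib}_${fc_index}_${lane}" | tr '-' '_')
printf '@RG\tID:%s\tPL:Illumina\tPU:%s\tLB:%s\n' "$rgid" "$sample" "$lib"
```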
Samples TSV file
Another option is to pass a TSV file, with one row per sample, using --samples:
nextflow run align.nf --samples samples.tsv
The TSV file should have at least one tab-separated line:
SUBJECT_ID_1 0 SAMPLE_1_N 1 /samples/normal1_1.fastq.gz /samples/normal1_2.fastq.gz
SUBJECT_ID_1 1 SAMPLE_1_T 3 /samples/tumor1_1.fastq.gz /samples/tumor1_2.fastq.gz
SUBJECT_ID_2 0 SAMPLE_2_N 2 /samples/normal2_1.fastq.gz /samples/normal2_2.fastq.gz
SUBJECT_ID_2 1 SAMPLE_2_T 4 /samples/tumor2_1.fastq.gz /samples/tumor2_2.fastq.gz
The columns are:
- Subject (batch) id
- Status: 0 if normal, 1 if tumor
- Sample ID: text identifier of the sample
- Lane ID: used when the sample is multiplexed across several lanes
- First set of reads
- Second set of reads
To run from BAM files, create a 5-column TSV file:
SUBJECT_ID_1 0 SAMPLE_1_N 1 /samples/normal_1.bam
SUBJECT_ID_1 1 SAMPLE_1_T 3 /samples/tumor_1.bam
SUBJECT_ID_2 0 SAMPLE_2_N 2 /samples/normal_2.bam
SUBJECT_ID_2 1 SAMPLE_2_T 4 /samples/tumor_2.bam
Another option is to specify one folder per sample, which is useful when a sample spans many lanes:
SUBJECT_ID_1 0 SAMPLE_1_N 1 /samples/sample_n_1
SUBJECT_ID_1 1 SAMPLE_1_T 3 /samples/sample_t_1
SUBJECT_ID_2 0 SAMPLE_2_N 2 /samples/sample_n_2
SUBJECT_ID_2 1 SAMPLE_2_T 4 /samples/sample_t_2
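Whichever layout you use, a quick sanity check of the TSV can save a failed run. The snippet below is a sketch, not part of the pipeline; it accepts both the 6-column FastQ layout and the 5-column BAM/folder layout:

```shell
# Write a small example TSV (tab-separated) and validate it: every row must
# have 5 or 6 columns and a status of 0 (normal) or 1 (tumor).
printf 'SUBJECT_ID_1\t0\tSAMPLE_1_N\t1\t/samples/normal_1.bam\n' >  samples.tsv
printf 'SUBJECT_ID_1\t1\tSAMPLE_1_T\t3\t/samples/tumor_1.bam\n'  >> samples.tsv
awk -F'\t' '(NF != 5 && NF != 6) || ($2 != 0 && $2 != 1) { bad = 1 }
            END { exit bad }' samples.tsv && echo "samples.tsv looks well-formed"
```

A common pitfall the check catches is columns separated by spaces instead of tabs, which collapses the row into a single field.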
Reference genomes
The pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with Docker or AWS, the configuration is set up to use the AWS-iGenomes resource.
--genome (using iGenomes)
There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.
You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:
- Human
  - --genome GRCh37
  - --genome hg38
Presets exist for raijin and spartan environments. For other machines, provide the location of the genomes with --genomes_dir option, with the directory having the following structure:
bwaIndex = "${params.genomes_base}/${params.genome}/${params.genome}.fa.{amb,ann,bwt,pac,sa}"
genomeDict = "${params.genomes_base}/${params.genome}/${params.genome}.dict"
genomeFasta = "${params.genomes_base}/${params.genome}/${params.genome}.fa"
genomeIndex = "${params.genomes_base}/${params.genome}/${params.genome}.fa.fai"
intervals = "${params.genomes_base}/${params.genome}/wgs_calling_regions_CAW.list"
dbsnp = "${params.genomes_base}/${params.genome}/dbsnp-151.vcf.gz"
dbsnpIndex = "${params.genomes_base}/${params.genome}/dbsnp-151.vcf.gz.tbi"
vepCacheVersion = "94"
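Before launching with a custom genomes directory, it can help to verify the tree matches this layout. The following is a sketch only; the paths are dummies created for the demonstration, whereas a real directory would hold actual FASTA, index, and dbSNP files:

```shell
# Create a dummy tree mimicking the expected layout, then check it.
genomes_dir=$(mktemp -d)   # stand-in for the real genomes directory
genome=GRCh37
mkdir -p "$genomes_dir/$genome"
touch "$genomes_dir/$genome/$genome.fa" \
      "$genomes_dir/$genome/$genome.fa.fai" \
      "$genomes_dir/$genome/$genome.dict" \
      "$genomes_dir/$genome/dbsnp-151.vcf.gz" \
      "$genomes_dir/$genome/dbsnp-151.vcf.gz.tbi"
missing=0
for f in "$genome.fa" "$genome.fa.fai" "$genome.dict" \
         dbsnp-151.vcf.gz dbsnp-151.vcf.gz.tbi; do
  [ -e "$genomes_dir/$genome/$f" ] || { echo "missing: $f"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "reference layout OK"
```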
-profile
Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Note that multiple profiles can be loaded, for example: -profile docker - the order of arguments is important!
If -profile is not specified at all, the pipeline runs locally and expects all software to be installed and available on the PATH.
- raijin - uses the NCI PBSPro scheduler as executor; also knows about the available resources and the default locations of the conda environment and reference genomes
- spartan - uses the Spartan Slurm scheduler as executor; also knows about the available resources and the default locations of the conda environment and reference genomes
- awsbatch - a generic configuration profile to be used with AWS Batch
- docker - a generic configuration profile to be used with Docker; pulls software from DockerHub: vladsaveliev/cawdor
- singularity - a generic configuration profile to be used with Singularity; pulls software from DockerHub: vladsaveliev/cawdor
- test - a profile with a complete configuration for automated testing; includes links to test data so needs no other parameters
Example: running on NCI Raijin:
nextflow run align.nf --samplesDir ../Sarek/Sarek-data/testdata/tin --outDir Results --genome smallGRCh37 -profile raijin
You can also run on NCI on a local node using one CPU - just set the -process.* Nextflow options:
nextflow run align.nf --samplesDir ../Sarek/Sarek-data/testdata/tin --outDir Results --genome smallGRCh37 -profile raijin -process.cpus=1 -process.executor=local
Other command line parameters
--outDir
The output directory where the results will be saved.