Cawdor

Cancer analysis workflow for WGS and WTS sequencing data


Introduction

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. The structure is inspired by bcbio-nextgen, a Python-based NGS analysis framework; the Nextflow implementation is inspired by Sarek, a cancer analysis workflow, and by nf-core templates; some bioinformatics approaches are borrowed from the Hartwig Medical Foundation pipeline. Post-processing was originally part of the Snakemake-based workflow umccrise.

Variant calling is currently supported for WGS tumor/normal paired samples:

  • Somatic variant calling: VarDict, Mutect2, Strelka2, SAGE
  • Structural variant calling: Manta
  • CNV and heterogeneity: PURPLE
  • Germline variant calling: VarDict, GATK4 Haplotype Caller, Strelka2

Installation

Install Java 8, or create a conda environment (the nextflow package pulls in Java 8):

conda create -n cawdor -c bioconda nextflow
conda activate cawdor

If you want to run on the [NCI](https://opus.nci.org.au/display/Help/Raijin+User+Guide#RaijinUserGuide-InteractivePBSJobs) cluster, clone and compile a custom Nextflow instance:

git clone https://github.com/vladsaveliev/nextflow
cd nextflow
make compile pack
cp build/releases/nextflow-* $CONDA_PREFIX/bin/nextflow
chmod +x $CONDA_PREFIX/bin/nextflow

For convenience, create a loader script with the following contents:

unset PYTHONPATH
unset PERL5LIB
export PATH=$CONDA_PREFIX/bin:$CONDA_PREFIX/../../bin:$PATH
export CONDA_PREFIX=$CONDA_PREFIX
export NXF_HOME=$(pwd)/.nextflow
export NXF_WORK=$(pwd)/scratch
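
For example, the loader can be saved to a file and sourced at the start of each session (the file name load_cawdor.sh is just an illustration, not part of the pipeline):

```shell
# Save the loader script (hypothetical name) and source it in the current shell.
cat > load_cawdor.sh <<'EOF'
unset PYTHONPATH
unset PERL5LIB
export PATH=$CONDA_PREFIX/bin:$CONDA_PREFIX/../../bin:$PATH
export CONDA_PREFIX=$CONDA_PREFIX
export NXF_HOME=$(pwd)/.nextflow
export NXF_WORK=$(pwd)/scratch
EOF
. ./load_cawdor.sh   # must be sourced, not executed, so the exports persist
echo "NXF_HOME=$NXF_HOME"
```

Sourcing (rather than running) the script is what makes the exported variables visible to the nextflow commands you launch afterwards.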

Also check the Nextflow and nf-core documentation:

  1. Installation
  2. Pipeline configuration
  3. Troubleshooting

Usage

Cawdor consists of several subworkflows: align.nf to align reads and get alignment QC, somatic.nf to call somatic variants (SNVs, indels, SVs, and CNVs), germline.nf to call germline variants, and postprocess.nf to annotate and prioritise variants, generate reports and QC.

The typical command for running the pipeline is as follows:

nextflow run align.nf --samplesDir /samples -profile raijin --outDir Results --genome GRCh37

This will launch the pipeline with the raijin cluster configuration profile. See below for more information about profiles.

Note that the pipeline will create the following files in your working directory:

work/            # Directory containing the nextflow working files
Results/         # Finished results (configurable, see below)
.nextflow.log    # Log file from Nextflow
.nextflow/       # Folder with other Nextflow hidden files

Specifying input data

Input files can be either raw FastQ files or aligned BAM files. To specify them, provide either a directory or a TSV file.

Samples directory

To run, specify the input directory with --samplesDir. The directory is searched recursively for FastQ files named *_R1_*.fastq.gz, each with a matching pair named with _R2_ instead of _R1_:

nextflow run align.nf --samplesDir /samples

For multiple patients, organize the folder with one subfolder per sample:

ID
+--sample1
+-----sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample2
+-----sample2_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
+--sample3
+-----sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R1_1000.fastq.gz
+-----sample3_lib_flowcell-index_lane_R2_1000.fastq.gz
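
A layout like the one above can be sketched for a dry run with placeholder files (the paths and names are the README's own examples, abbreviated to two samples; they are not real data):

```shell
# Recreate a minimal per-sample layout under a scratch directory.
mkdir -p samples/sample1 samples/sample2
touch samples/sample1/sample1_lib_flowcell-index_lane_R1_1000.fastq.gz \
      samples/sample1/sample1_lib_flowcell-index_lane_R2_1000.fastq.gz
touch samples/sample2/sample2_lib_flowcell-index_lane_R1_1000.fastq.gz \
      samples/sample2/sample2_lib_flowcell-index_lane_R2_1000.fastq.gz
# Every *_R1_* file must have a matching *_R2_* mate for the recursive search to pair them.
find samples -name '*_R1_*.fastq.gz' | wc -l
```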

FastQ filename structure:

  • sample_lib_flowcell-index_lane_R1_1000.fastq.gz and
  • sample_lib_flowcell-index_lane_R2_1000.fastq.gz

Where:

  • sample = sample id
  • lib = identifier of the library preparation
  • flowcell = identifier of the flow cell for the sequencing run
  • lane = identifier of the lane of the sequencing run

Read group information will be parsed from FastQ file names as follows:

  • RGID = "sample_lib_flowcell_index_lane"
  • RGPL = "Illumina"
  • PU = sample
  • RGLB = lib
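
As a sketch of that convention, the fields can be split out of a file name in the shell (the example name and the exact RGID layout are assumptions based on the pattern above, not the pipeline's actual parsing code):

```shell
# Split an underscore-delimited FastQ name into read-group fields.
# The name follows the <sample>_<lib>_<flowcell>-<index>_<lane>_R1_1000 pattern.
fq="sample1_lib1_FC01-ATCACG_L001_R1_1000.fastq.gz"
base=${fq%.fastq.gz}
oldIFS=$IFS; IFS=_; set -- $base; IFS=$oldIFS
sample=$1; lib=$2; fcidx=$3; lane=$4
flowcell=${fcidx%-*}; index=${fcidx#*-}    # flowcell and index are joined by '-'
rgid="${sample}_${lib}_${flowcell}_${index}_${lane}"
echo "RGID=$rgid RGPL=Illumina PU=$sample RGLB=$lib"
```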

Samples TSV file

Another option is to specify a TSV file with --samples, with one row per sample and lane:

nextflow run align.nf --samples samples.tsv

The TSV file should have at least one tab-separated line:

SUBJECT_ID_1	0	SAMPLE_1_N	1	/samples/normal1_1.fastq.gz	/samples/normal1_2.fastq.gz
SUBJECT_ID_1	1	SAMPLE_1_T	3	/samples/tumor1_1.fastq.gz	/samples/tumor1_2.fastq.gz
SUBJECT_ID_2	0	SAMPLE_2_N	2	/samples/normal2_1.fastq.gz	/samples/normal2_2.fastq.gz
SUBJECT_ID_2	1	SAMPLE_2_T	4	/samples/tumor2_1.fastq.gz	/samples/tumor2_2.fastq.gz

The columns are:

  1. Subject (batch) id
  2. Status: 0 if normal, 1 if tumor
  3. Sample ID: text identifier of the sample, typically reflecting its type (e.g. SAMPLE_1_N vs SAMPLE_1_T)
  4. Lane ID, used when the sample is multiplexed across several lanes
  5. First set of reads
  6. Second set of reads
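
A quick sanity check of the TSV shape can be sketched with awk (the file name samples.tsv and the checks are illustrative, not part of the pipeline):

```shell
# Write two example rows with real tab separators, then validate the shape:
# every row must have 6 columns and a status (column 2) of 0 or 1.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  SUBJECT_ID_1 0 SAMPLE_1_N 1 /samples/normal1_1.fastq.gz /samples/normal1_2.fastq.gz \
  SUBJECT_ID_1 1 SAMPLE_1_T 3 /samples/tumor1_1.fastq.gz /samples/tumor1_2.fastq.gz \
  > samples.tsv
awk -F'\t' 'NF != 6 || ($2 != 0 && $2 != 1) { bad++ } END { exit bad ? 1 : 0 }' samples.tsv \
  && echo "samples.tsv looks well-formed"
```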

To run from BAM files, create a 5-column TSV file:

SUBJECT_ID_1	0	SAMPLE_1_N	1	/samples/normal_1.bam
SUBJECT_ID_1	1	SAMPLE_1_T	3	/samples/tumor_1.bam
SUBJECT_ID_2	0	SAMPLE_2_N	2	/samples/normal_2.bam
SUBJECT_ID_2	1	SAMPLE_2_T	4	/samples/tumor_2.bam

Another option is to specify one folder per sample, which is useful when you have many lanes:

SUBJECT_ID_1	0	SAMPLE_1_N	1	/samples/sample_n_1
SUBJECT_ID_1	1	SAMPLE_1_T	3	/samples/sample_t_1
SUBJECT_ID_2	0	SAMPLE_2_N	2	/samples/sample_n_2
SUBJECT_ID_2	1	SAMPLE_2_T	4	/samples/sample_t_2

Reference genomes

The pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with Docker or on AWS, the configuration is set up to use the AWS-iGenomes resource.

--genome (using iGenomes)

There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.

You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:

  • Human
    • --genome GRCh37
    • --genome hg38

Presets exist for the raijin and spartan environments. For other machines, provide the location of the genomes with the --genomes_base option; the directory should have the following structure:

bwaIndex         = "${params.genomes_base}/${params.genome}/${params.genome}.fa.{amb,ann,bwt,pac,sa}"
genomeDict       = "${params.genomes_base}/${params.genome}/${params.genome}.dict"
genomeFasta      = "${params.genomes_base}/${params.genome}/${params.genome}.fa"
genomeIndex      = "${params.genomes_base}/${params.genome}/${params.genome}.fa.fai"
intervals        = "${params.genomes_base}/${params.genome}/wgs_calling_regions_CAW.list"
dbsnp            = "${params.genomes_base}/${params.genome}/dbsnp-151.vcf.gz"
dbsnpIndex       = "${params.genomes_base}/${params.genome}/dbsnp-151.vcf.gz.tbi"
vepCacheVersion  = "94"
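
Assuming that layout, a small pre-flight check can confirm the files exist before launching (a sketch only; the file list is abbreviated to the core references, and the paths are placeholders):

```shell
# Placeholder paths: point genomes_base/genome at your actual reference location.
genomes_base=./genomes
genome=GRCh37
# For this demo, create the core files the config expects; skip this with real references.
mkdir -p "$genomes_base/$genome"
touch "$genomes_base/$genome/$genome.fa" \
      "$genomes_base/$genome/$genome.fa.fai" \
      "$genomes_base/$genome/$genome.dict"
# Check that each expected file is present before starting a run.
missing=0
for f in "$genome.fa" "$genome.fa.fai" "$genome.dict"; do
  [ -e "$genomes_base/$genome/$f" ] || { echo "missing: $f"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "core reference files present"
```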

-profile

Use this parameter to choose a configuration profile. Profiles provide configuration presets for different compute environments. Multiple profiles can be loaded as a comma-separated list, for example -profile raijin,conda; the order matters, since later profiles override earlier ones.

If -profile is not specified, the pipeline runs locally and expects all software to be installed and available on the PATH.

  • raijin
    • Uses the NCI PBS Pro scheduler as executor; also knows about available resources, the default location of the conda environment, and reference genomes.
  • spartan
    • Uses the Spartan Slurm scheduler as executor; also knows about available resources, the default location of the conda environment, and reference genomes.
  • awsbatch
    • A generic configuration profile to be used with AWS Batch.
  • conda
    • A generic configuration profile to be used with conda
    • Pulls most software from Bioconda
  • docker
  • singularity
  • test
    • A profile with a complete configuration for automated testing
    • Includes links to test data so needs no other parameters

Example: running on NCI Raijin:

nextflow run align.nf --samplesDir ../Sarek/Sarek-data/testdata/tin --outDir Results --genome smallGRCh37 -profile raijin

You can also run on NCI on a local node, using one CPU, by overriding the -process.* Nextflow options:

nextflow run align.nf --samplesDir ../Sarek/Sarek-data/testdata/tin --outDir Results --genome smallGRCh37 -profile raijin -process.cpus=1 -process.executor=local

Other command line parameters

--outDir

The output directory where the results will be saved.
