Gcap
GCAP (Gene-level Circular Amplicon Prediction) is the first tool to implement extrachromosomal DNA detection from whole-exome sequencing (WES) data and absolute copy number profiles. https://shixiangwang.r-universe.dev/gcap
GCAP: Gene-level Circular Amplicon Prediction
In a nutshell, gcap provides an end-to-end workflow for predicting
circular amplicons (also known as ecDNA, extrachromosomal DNA) at the gene level with a machine learning approach,
then classifying cancer samples into different focal amplification (fCNA) types.
It accepts input from WES data (tumor-normal paired BAM files with corresponding .bai index files),
allele-specific copy number data (e.g., results from ASCAT or Sequenza), or even
absolute integer copy number data (e.g., results from ABSOLUTE). The former two data
sources are preferred as input to gcap.
Installation
Install alleleCount (WES bam data only)
alleleCount is required to run ASCAT on WES BAM data. If you haven't installed conda or miniconda, install it first, then install alleleCount in a terminal with:
conda create -n cancerit -c bioconda cancerit-allelecount
NOTE: gcap sets the default alleleCount path to
~/miniconda3/envs/cancerit/bin/alleleCounter. If you use conda (rather than miniconda) or another installation approach, please set the path when you call the corresponding functions.
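Before starting a long run, it may be worth confirming where alleleCounter actually lives. The following is a minimal sketch in base R; `find_allelecounter()` is a hypothetical helper, not a gcap function. It tries the default path from the note above and then falls back to the system PATH.

```r
# Hypothetical helper (not part of gcap): locate the alleleCounter executable.
find_allelecounter <- function(default = "~/miniconda3/envs/cancerit/bin/alleleCounter") {
  candidate <- path.expand(default)
  if (file.exists(candidate)) {
    return(candidate)
  }
  # Fall back to PATH; Sys.which() returns "" when nothing is found
  on_path <- unname(Sys.which("alleleCounter"))
  if (nzchar(on_path)) {
    return(on_path)
  }
  NA_character_
}

exe <- find_allelecounter()
if (is.na(exe)) {
  message("alleleCounter not found; install it or pass its path explicitly")
} else {
  message("Using alleleCounter at: ", exe)
}
```

The resulting path could then be supplied through the `allelecounter_exe` argument shown in the workflow templates in the pipeline section.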
Install ASCAT (required)
Latest ASCAT v3
From v1.2, GCAP uses the latest version of ASCAT. Install ASCAT v3 in R console from GitHub with:
# install.packages("remotes")
remotes::install_github('VanLoo-lab/ascat/ASCAT')
We provide pre-generated reference files, but you may sometimes want to generate the reference data yourself. In that case, please refer to https://github.com/VanLoo-lab/ascat for instructions on generating the data required for allele-specific copy number calling.
Reference files:
The reference files are required in ASCAT for copy number calling.
The prediction model was built with data based on the hg38 genome build, so hg38-based BAM files are the recommended input.
A fixed version of ASCAT v3
In our manuscript, we used a fixed version of ASCAT for GCAP data pre-processing (modified and adapted for the GCAP workflow on HPC). It is not compatible with R >= 4.3.
# This is a forked version of ASCAT
remotes::install_github("ShixiangWang/ascat@v3-for-gcap-v1", subdir = "ASCAT")
# An ASCAT version with a loose SAM flag, sometimes useful
# remotes::install_github("ShixiangWang/ascat@v3-f1", subdir = "ASCAT")
# See https://github.com/ShixiangWang/gcap/issues/27
Reference files:
Alternatives to ASCAT
The latest version of GCAP also supports Sequenza and FACETS for preprocessing BAM data; please refer to their respective documentation for usage.
Install GCAP (required)
Install gcap in R console:
# r-universe
install.packages('gcap', repos = c('https://shixiangwang.r-universe.dev', 'https://cloud.r-project.org'))
# or GitHub
remotes::install_github("ShixiangWang/gcap")
To work with the fixed version of ASCAT, you have to install a gcap version no newer than commit
42f216d (tag v1.1.5), i.e., use remotes::install_github("ShixiangWang/gcap@v1.1.5"), and the R version should be below 4.3.
If you would like to use the CLI program in a shell terminal, run the following code in your R console after installation:
gcap::deploy()
Two scripts, gcap-bam.R and gcap-ascn.R, will be linked into /usr/local/bin/.
Use whichever one matches your input data.
NOTE
For users with package GetoptLong version >= 1.1.0, a main command is implemented
and also linked to /usr/local/bin/ when calling deploy(). So you can type gcap as
a unified interface.
$ gcap
gcap (v1.0.0)
Usage: gcap [command] [options]
Commands:
bam Run GCAP workflow with tumor-normal paired BAM files
ascn Run GCAP workflow with curated allele-specific copy number data
----------
Citation:
GCAP
URL:
https://github.com/ShixiangWang/gcap
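To confirm that deployment worked, one option is to check from R whether the linked commands are visible on the PATH. This is a minimal sketch using only base R; the command names are taken from the text above, and whether `gcap` itself is present depends on your GetoptLong version.

```r
# Command names from the deployment notes; Sys.which() returns "" for anything not on PATH
cli_tools <- c("gcap", "gcap-bam.R", "gcap-ascn.R")
found <- Sys.which(cli_tools)

status <- data.frame(
  tool = cli_tools,
  path = unname(found),
  available = unname(nzchar(found)),
  stringsAsFactors = FALSE
)
print(status)
```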
NOTE: gcap uses xgboost < 1.6. If you have a newer version installed, you can install a compatible version with:
install.packages("https://cran.r-project.org/src/contrib/Archive/xgboost/xgboost_1.5.2.1.tar.gz", repos = NULL)
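To see whether a downgrade is needed at all, the installed version can be compared against the 1.6 threshold. A minimal sketch follows; `xgboost_is_compatible()` is a hypothetical helper name, not something gcap provides.

```r
# Hypothetical helper: TRUE when a version string satisfies the "< 1.6" requirement
xgboost_is_compatible <- function(version) {
  package_version(version) < package_version("1.6")
}

if (requireNamespace("xgboost", quietly = TRUE)) {
  installed <- as.character(utils::packageVersion("xgboost"))
  if (!xgboost_is_compatible(installed)) {
    message("xgboost ", installed, " is installed; gcap expects a version below 1.6")
  }
} else {
  message("xgboost is not installed")
}
```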
Example
Run the following code to see a quick example:
library(gcap)
data("ascn")
rv <- gcap.ASCNworkflow(ascn, outdir = tempdir(), model = "XGB11")
rv
Pipeline (WES bam data only)
- for one tumor-normal pair, you can refer to one-pair.R. test-workflow/debug contains a full workflow for data obtained from SRA.
- for multiple tumor-normal pairs, you can refer to two-pair.R.
To run gcap from BAM files, a machine with at least 80 GB of RAM is required for
the alleleCount step. If you use multiple threads, note that parallel
computation is used in only part of the workflow, so you need to balance the nthread
setting against the computing power of your machine.
It generally takes ~0.5 h to finish one case (tumor-normal pair).
In our practice, when processing multiple cases, setting nthread = 22 and
letting gcap handle the cases directly (instead of writing a loop yourself) is
good enough.
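Since each case takes a while, it can help to verify the inputs before handing several cases to gcap. The sketch below checks that every BAM file has a .bai index next to it, as the input requirements call for. The helper `has_bam_index()`, the sample sheet layout, and all paths are hypothetical.

```r
# Hypothetical helper (not part of gcap): TRUE for each BAM with an index next to it.
# Both common naming conventions are checked: sample.bam.bai and sample.bai.
has_bam_index <- function(bam) {
  file.exists(paste0(bam, ".bai")) |
    file.exists(paste0(tools::file_path_sans_ext(bam), ".bai"))
}

# Hypothetical sample sheet; replace with your own IDs and paths
samples <- data.frame(
  id = c("case1", "case2"),
  tumour = c("/path/to/case1_T.bam", "/path/to/case2_T.bam"),
  normal = c("/path/to/case1_N.bam", "/path/to/case2_N.bam"),
  stringsAsFactors = FALSE
)

bams <- c(samples$tumour, samples$normal)
missing_idx <- bams[!has_bam_index(bams)]
if (length(missing_idx) > 0) {
  message("BAM files without a .bai index:\n", paste(missing_idx, collapse = "\n"))
}
```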
A recommended setting for Slurm is given as:
#!/bin/bash
#SBATCH -N 1
#SBATCH -o output-%J.o
#SBATCH -n 22
#SBATCH --mem=102400
Templates of practical calls with the provided hg38 and hg19 annotations are given below:
# hg38 ----------------
gcap.workflow(
  tumourseqfile = tfile, normalseqfile = nfile, tumourname = tn, normalname = nn, jobname = id,
  outdir = outdir,
  allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter",
  g1000allelesprefix = file.path(
    "/data/wsx/data/1000G_loci_hg38/",
    "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"
  ),
  g1000lociprefix = file.path(
    "/data/wsx/data/1000G_loci_hg38/",
    "1kg.phase3.v5a_GRCh38nounref_loci_chrstring_chr"
  ),
  GCcontentfile = "/data/wsx/data/GC_correction_hg38.txt",
  replictimingfile = "/data/wsx/data/RT_correction_hg38.txt",
  skip_finished_ASCAT = TRUE,
  skip_ascat_call = FALSE,
  result_file_prefix = "xxx",
  extra_info = df,
  include_type = FALSE,
  genome_build = "hg38",
  model = "XGB11"
)
# hg19 ----------------
gcap.workflow(
  tumourseqfile = tfile, normalseqfile = nfile, tumourname = tn, normalname = nn, jobname = id,
  outdir = outdir,
  allelecounter_exe = "~/miniconda3/envs/cancerit/bin/alleleCounter",
  g1000allelesprefix = file.path(
    "/data/wsx/data/1000G_loci_hg19/",
    "1000genomesAlleles2012_chr"
  ),
  g1000lociprefix = file.path(
    "/data/wsx/data/1000G_loci_hg19/",
    "1000genomesloci2012chrstring_chr"
  ),
  GCcontentfile = "/data/wsx/data/GC_correction_hg19.txt",
  replictimingfile = "/data/wsx/data/RT_correction_hg19.txt",
  skip_finished_ASCAT = TRUE,
  skip_ascat_call = FALSE,
  result_file_prefix = "xxx",
  extra_info = NULL,
  include_type = FALSE,
  genome_build = "hg19",
  model = "XGB11"
)
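The `g1000allelesprefix` and `g1000lociprefix` arguments in the templates above are prefixes rather than full file names. The sketch below assumes the per-chromosome files are named `<prefix><chromosome>.txt`; this is an assumption about the reference file layout, so check it against your own files. `missing_reference_files()` is a hypothetical helper, not part of gcap or ASCAT.

```r
# Hypothetical helper: list expected per-chromosome files that do not exist.
# Assumes files are named <prefix><chromosome>.txt (an assumption, verify locally).
missing_reference_files <- function(prefix, chroms = c(1:22, "X")) {
  expected <- paste0(prefix, chroms, ".txt")
  expected[!file.exists(expected)]
}

# Example prefix copied from the hg38 template; adjust to your own location
alleles_prefix <- file.path(
  "/data/wsx/data/1000G_loci_hg38",
  "1kg.phase3.v5a_GRCh38nounref_allele_index_chr"
)
missing <- missing_reference_files(alleles_prefix)
message(length(missing), " expected allele file(s) not found for this prefix")
```

Running the same check on the loci prefix before submitting a job can catch a mistyped path early instead of partway through a run.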
Pipeline (Allele specific or absolute copy number data)
Please refer to ?gcap.ASCNworkflow().
Functions
For more customized and advanced control of the analysis, you can read the structured documentation at the package site.
Logging
For easier debugging and rechecking, the logging information from your
gcap operations is saved to a separate file. You can use the following
commands to get the file path and print the logged messages. Please note that
you have to use ::: to access these functions, as they are not exported from gcap.
> gcap:::get_log_file()
[1] "~/Library/Logs/gcap/gcap.log"
> gcap:::cat_log_file()
Docker image
A docker image is available in ghcr along with its corresponding Dockerfile. This image comes pre-installed with all the necessary software. However, users are responsible for mapping the required reference files and input data files on their own. The Dockerfile can be customized according to the user's specific requirements, as permitted by the license we provide.
docker pull ghcr.io/shixiangwang/gcap:latest
Related tools
- gcaputils: gcap utils for downstream analysis and visualization.
- DoAbsolute: Automate Absolute Copy Number Calling using 'ABSOLUTE' package.
- sigminer: An easy-to-use and scalable toolkit for genomic alteration signature (a.k.a. mutational signature) analysis and visualization in R.
Output
gcap outputs two data tables
