Somaticseq
An ensemble approach to accurately detect somatic mutations using SomaticSeq
Install / Use
/learn @bioinform/SomaticseqREADME
SomaticSeq
SomaticSeq is an ensemble somatic SNV/indel caller that has the ability to use machine learning to filter out false positives from other callers. It also comes with a suite of genomic utilities. The detailed documentation is located in docs/Manual.pdf.
- It was published in Fang, L.T., Afshar, P.T., Chhibber, A. et al. An ensemble approach to accurately detect somatic mutations using SomaticSeq. Genome Biol 16, 197 (2015).
- Feel free to report issues and/or ask questions at the Issues page.
Training data for benchmarking and/or model building
In 2021, the FDA-led MAQC-IV/SEQC2 Consortium has produced multi-center multi-platform whole-genome and whole-exome sequencing data sets for a pair of tumor-normal reference samples (HCC1395 and HCC1395BL), along with the high-confidence somatic mutation call set. This work was published in Fang, L.T., Zhu, B., Zhao, Y. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat Biotechnol 39, 1151-1160 (2021) / PMID:34504347 / Free Read-Only Link. The following are some of the use cases for these resources:
- Use high-confidence call set as the "ground truth" to investigate how different sample preparations, sequencing library kits, and bioinformatic algorithms affect the accuracy of the somatic mutation pipelines, and develop best practices, e.g., Xiao W. et al. Nat Biotechnol 2021.
- Use high-confidence call set as the "ground truth" to build accurate and robust machine learning models for somatic mutation detections, e.g., NeuSomatic by Sahraeian S.M.E. et al. Genome Biol 2022, DeepSomatic by Park J. et al. 2024.
- Use the bam files and high-confidence call set to benchmark a workflow, e.g., Benchmarking NVIDIA Clara Parabricks Somatic Variant Calling Pipeline on AWS, NVIDIA Docs Hub, nf-core/sarek, etc.
Click for more details of the SEQC2's somatic mutation project.
Recommendation of how to use SEQC2 data to create SomaticSeq classifiers.
<hr> <table style="width: 100%;"> <tr> <td>Briefly explaining SomaticSeq v1.0</td> <td>SEQC2 somatic mutation reference data and call sets</td> <td>How to run <a href="https://precision.fda.gov/home/apps/app-G7XVKQQ02v051q5PK3yQYJKJ-1">SomaticSeq v3.6.3</a> on precisionFDA</td> </tr> <tr> <td><a href="https://youtu.be/MnJdTQWWN6w"><img src="docs/SomaticSeqYoutube.png" width="400" /></a></td> <td><a href="https://youtu.be/nn0BOAONRe8"><img src="docs/workflow400.png" width="400" /></a></td> <td><a href="https://youtu.be/fLKokuMGTvk"><img src="docs/precisionfda.png" width="400" /></a></td> </tr> <tr> <td></td> <td></td> <td>Run in <a href="https://youtu.be/F6TSdg0OffM">train or prediction mode</a></td> </tr> </table> <hr>Installation
Dependencies
This dockerfile reveals the dependencies
- Python 3, plus pysam, numpy, scipy, pandas, and xgboost libraries.
- BEDTools: required when parallel processing is invoked, and/or when any bed files are used as input files.
- Optional: dbSNP VCF file (if you want to use dbSNP membership as a feature).
- Optional: R and ada are required for AdaBoost, whereas XGBoost (default) is implemented in python.
- To install SomaticSeq, clone this repo,
cd somaticseq, and then runpip install .(To install extra packages for development:pip install '.[dev]'). A number of commands prefixed withsomaticseq_will be placed into the PATH.
To install using pip
Make sure to install bedtools separately.
pip install somaticseq
To install the bioconda version
SomaticSeq can also be found on
,
which has
so far. To
,
which also automatically installs a bunch of 3rd-party somatic mutation callers:
conda install -c bioconda somaticseq
To install from github source with conda
conda create --name venv -c bioconda python bedtools
conda activate venv
git clone git@github.com:bioinform/somaticseq.git
cd somaticseq
pip install -e .
Test your installation
If installed successfully, you will be able to run somaticseq --help in the
terminal. Also make sure bedtools is executable. There are some toy data sets
and test scripts in example that should finish in <1 minute
if installed properly.
Run SomaticSeq with an example command
-
At minimum, given the results of the individual mutation caller(s), SomaticSeq will extract sequencing features for the combined call set. Required inputs for command
somaticseqare:--output-directoryand--genome-reference, then- Either
pairedorsingleto invoke paired or single sample mode,- if
paired:--tumor-bam-file, and--normal-bam-fileare both required. - if
single:--bam-fileis required.
- if
Everything else is optional (though without a single VCF file from at least one caller, SomaticSeq does nothing).
-
The following four files will be created into the output directory:
Consensus.sSNV.vcf,Consensus.sINDEL.vcf,Ensemble.sSNV.tsv, andEnsemble.sINDEL.tsv.
-
If you're searching for pipelines to run those individual somatic mutation callers, feel free to take advantage of our Dockerized Somatic Mutation Workflow as a start.
- Important note: multi-argument options (e.g.,
--extra-hyperparametersor--features-excluded) cannot be placed immediately beforepairedorsingle, because those options would try to "grab"pairedorsingleas an additional argument.
- Important note: multi-argument options (e.g.,
# Merge caller results and extract SomaticSeq features
somaticseq \
--output-directory $OUTPUT_DIR \
--genome-reference GRCh38.fa \
--inclusion-region genome.bed \
--exclusion-region blacklist.bed \
--threads 24 \
paired \
--tumor-bam-file tumor.bam \
--normal-bam-file matched_normal.bam \
--mutect2-vcf MuTect2/variants.vcf \
--varscan-snv VarScan2/variants.snp.vcf \
--varscan-indel VarScan2/variants.indel.vcf \
--jsm-vcf JointSNVMix2/variants.snp.vcf \
--somaticsniper-vcf SomaticSniper/variants.snp.vcf \
--vardict-vcf VarDict/variants.vcf \
--muse-vcf MuSE/variants.snp.vcf \
--lofreq-snv LoFreq/variants.snp.vcf \
--lofreq-indel LoFreq/variants.indel.vcf \
--scalpel-vcf Scalpel/variants.indel.vcf \
--strelka-snv Strelka/variants.snv.vcf \
--strelka-indel Strelka/variants.indel.vcf \
--arbitrary-snvs additional_snv_calls_1.vcf.gz additional_snv_calls_2.vcf.gz ... \
--arbitrary-indels additional_indel_calls_1.vcf.gz additional_indel_calls_2.vcf.gz ...
-
For all of those input VCF files, both
.vcfand.vcf.gzare acceptable. SomaticSeq also accepts.cram, but some callers may only take.bam. -
--arbitrary-snvsand--arbitrary-indelsare added since v3.7.0. It allows users to input any arbitrary VCF file(s) from caller(s) that we did not explicitly incorporate. SNVs and indels have to be separated.- If your caller puts SNVs and indels in the same output VCF file, you may
split it using a SomaticSeq utility script, e.g.,
somaticseq_split_vcf -infile small_variants.vcf -snv snvs.vcf -indel indels.vcf. As usual, input can be either.vcfor.vcf.gz, but output will be.vcf. - For those VCF file(s), any calls not labeled REJECT or LowQual will be
considered a bona fide somatic mutation call. REJECT calls will be
skipped. LowQual calls will be considered, but will not have a value of
1inif_Callermachine learning feature.
- If your caller puts SNVs and indels in the same output VCF file, you may
split it using a SomaticSeq utility script, e.g.,
-
--inclusion-regionor--exclusion-regionwill requirebedtoolsin your path. -
--algorithmdefaults toxgboostas v3.6.0, but can also beada(AdaBoost in R). XGBoost supports multi-threading and can be orders of magnitude faster than AdaBoost, and seems to be about the same in terms of accuracy, so we changed the default fromadatoxgboostas v3.6.0 and that's what we recommend now. -
To split the job into multiple threads, place
--threads Xbefore thepairedoption to indicate X threads. It simply creates multiple BED file (each consisting of 1/X of
Related Skills
node-connect
334.1kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
82.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
334.1kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
82.1kCommit, push, and open a PR
