YACHT
A mathematically characterized hypothesis test for organism presence/absence in a metagenome
YACHT is a mathematically rigorous hypothesis test for the presence or absence of organisms in a metagenomic sample, based on Average Nucleotide Identity (ANI). Identifying whether a specific microbe is actually present in a metagenomic sample is often complicated by sequencing noise, low-abundance organisms, and high genomic similarity between species. Traditional profiling tools rely on simple thresholds that can lead to high false-positive rates. YACHT is intended for several groups of users: microbiome researchers dealing with low-biomass samples, synthetic biologists who need to validate the composition of mock communities, and genomics researchers looking for specific metagenome-assembled genomes (MAGs) of interest within large sequencing datasets.
The associated publication can be found here: https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873
And the preprint can be found at: https://doi.org/10.1101/2023.04.18.537298.
Please cite via:
Koslicki, D., White, S., Ma, C., & Novikov, A. (2024). YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample. Bioinformatics, 40(2), btae047.
Quick demonstration
We provide a demo to show how to use YACHT. Please follow the command lines below to try it out:
NUM_THREADS=64 # Adjust based on your machine's capabilities
cd demo # the 'demo' folder can be downloaded via command 'yacht download demo' if it doesn't exist
# build k-mer sketches for the query sample and ref genomes
yacht sketch sample --infile ./query_data/query_data.fq --kmer 31 --scaled 1000 --outfile sample.sig.zip
yacht sketch ref --infile ./ref_genomes --kmer 31 --scaled 1000 --outfile ref.sig.zip
# preprocess the reference genomes (training step)
yacht train --ref_file ref.sig.zip --ksize 31 --num_threads ${NUM_THREADS} --ani_thresh 0.95 --prefix 'demo_ani_thresh_0.95' --outdir ./ --force
# run YACHT algorithm to check the presence of reference genomes in the query sample (inference step)
yacht run --json demo_ani_thresh_0.95_config.json --sample_file sample.sig.zip --significance 0.99 --num_threads ${NUM_THREADS} --min_coverage_list 1 0.6 0.2 0.1 --outdir ./
# convert result to CAMI profile format (Optional)
yacht convert --yacht_output_dir ./results --sheet_name min_coverage0.2 --genome_to_taxid toy_genome_to_taxid.tsv --mode cami --sample_name 'MySample' --outfile_prefix cami_result --outdir ./
The output will be stored in the results folder containing:
- result.xlsx: An Excel file recording the presence of reference genomes, with a separate spreadsheet for each minimum coverage value (1, 0.6, 0.2, 0.1).
- result_all.txt: A TXT file containing all unfiltered results for all user-given min_coverage values.
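After the demo finishes, a quick sanity check can confirm that both output files were written. The following is a minimal sketch, assuming the results folder is ./results as in the commands above; the function name check_yacht_outputs is hypothetical and not part of the YACHT command line.

```shell
# check_yacht_outputs is a hypothetical helper, not part of the YACHT CLI.
# It reports which of the two documented output files are missing from the
# given results folder (default: ./results) and fails if any is absent.
check_yacht_outputs() {
    local results_dir="${1:-./results}"
    local missing=0
    local f
    for f in result.xlsx result_all.txt; do
        if [ ! -f "${results_dir}/${f}" ]; then
            echo "missing: ${results_dir}/${f}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Example (run from the demo folder after 'yacht run'):
#   check_yacht_outputs ./results && echo "demo outputs found"
```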
Contents
- YACHT
- Quick start
- Installation
- Usage
- YACHT Commands Overview
- YACHT workflow
- Creating sketches of your reference database genomes (yacht sketch ref)
- Creating sketches of your sample (yacht sketch sample)
- Preprocess the reference genomes (yacht train)
- Run the YACHT algorithm (yacht run)
- Convert YACHT result to other popular output formats (yacht convert)
Installation
Conda Installation
YACHT is available on Conda and can be installed via the steps below:
# create conda environment
conda create -n yacht_env
# activate environment
conda activate yacht_env
# install YACHT
conda install -c conda-forge -c bioconda yacht
Manual installation
YACHT requires Python >3.6 (and <3.12) with the following core genomics dependencies: sourmash (>=4.8.3), sourmash_plugin_branchwater, and pytaxonkit. The full list of dependencies can be found in the environment configuration. To ensure a clean and isolated workspace, we recommend using a virtual environment. This can be accomplished using either Conda or Mamba, a faster alternative to Conda.
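Before a manual installation, it may help to confirm that the interpreter falls inside the supported range. The sketch below reads ">3.6 (and <3.12)" as minor versions 3.7 through 3.11, which is an assumption about how the range is meant; python_version_supported is a hypothetical helper and not part of YACHT.

```shell
# python_version_supported is a hypothetical helper, not part of YACHT.
# It takes a version string such as "3.10" or "3.10.4" and succeeds only for
# Python 3.7 through 3.11 (our reading of the stated ">3.6 and <3.12" range).
python_version_supported() {
    local major="${1%%.*}"   # text before the first dot
    local rest="${1#*.}"     # text after the first dot
    local minor="${rest%%.*}" # minor version without any patch level
    [ "${major}" -eq 3 ] && [ "${minor}" -ge 7 ] && [ "${minor}" -le 11 ]
}

# Example:
#   ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
#   python_version_supported "${ver}" || echo "unsupported Python: ${ver}" >&2
```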
Using Conda
To create your Conda environment and install YACHT, follow these steps:
# Clone the YACHT repository
git clone https://github.com/KoslickiLab/YACHT.git
cd YACHT
# Create a new virtual environment named 'yacht_env'
conda env create -f env/yacht_env.yml
# Activate the newly created environment
conda activate yacht_env
# Install YACHT within the environment
pip install .
Using Mamba
If you prefer using Mamba instead of Conda, simply replace conda with mamba in the above commands.
Using Docker
Using Dockerfile:
docker build --tag 'yacht' .
docker run -it --entrypoint=/bin/bash yacht -i
conda activate yacht_env
Using Act:
To run YACHT on Docker with Act, simply execute "act" from the main YACHT folder, or "act --container-architecture linux/amd64" if you are on a macOS system.
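The choice between the two act invocations can be made automatically from the kernel name. This is a minimal sketch that only prints the command to run; act_command is a hypothetical helper, and it assumes that "Darwin" (the value reported by uname -s on macOS) is the only platform needing the extra flag.

```shell
# act_command is a hypothetical helper; it prints the act invocation
# suggested above for the given kernel name and does not run act itself.
# With no argument it falls back to the output of `uname -s`.
act_command() {
    local os_name="${1:-$(uname -s)}"
    if [ "${os_name}" = "Darwin" ]; then
        # macOS: request a linux/amd64 container, as recommended above
        echo "act --container-architecture linux/amd64"
    else
        echo "act"
    fi
}

# Example (from the main YACHT folder):
#   eval "$(act_command)"
```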
Commands
YACHT can be run via the command line yacht <module>. The main modules include: download, sketch, train, run, and convert.
- The download module has three submodules: demo, default_ref_db, and pretrained_ref_db:

  - demo can automatically download the demo files to a specified folder:

    # Example
    yacht download demo --outfolder ./demo

  - default_ref_db can automatically download pre-generated sketches of reference genomes from GTDB or GenBank as our input reference databases.

    # Example for downloading the k31 sketches of representative genomes of GTDB rs214 version
    yacht download default_ref_db --database gtdb --db_version rs214 --gtdb_type reps --k 31 --outfolder ./

    | Parameter | Explanation |
    | ----------------- | ------------------------------------------------------------ |
    | database | two options for default reference databases: 'genbank' or 'gtdb' |
    | db_version | the version of database, options: "genbank-2022.03", "rs202", "rs207", "rs214" |
    | ncbi_organism | the NCBI organism for the NCBI reference genome, options: "archaea", "bacteria", "fungi", "virus", "protozoa" |
    | gtdb_type | for GTDB database, chooses "representative" genome version or "full" genome version |
    | k | the length of k-mer |
    | outfolder | the path to the folder where the downloaded file will be saved |

  - pretrained_ref_db can automatically download our pre-trained reference genome database that can be directly used as input for the yacht run module.

    # Example for downloading the pretrained reference database that was trained from GTDB rs214 representative genomes with k=31 and ani_threshold=0.9995
    yacht download pretrained_ref_db --database gtdb --db_version rs214 --k 31 --ani_thresh 0.9995 --outfolder ./

    | Parameter | Explanation |
    | ----------------- | ------------------------------------------------------------ |
    | database | two options for default reference databases: 'genbank' or 'gtdb' |
    | db_version | the version of database, options: "genbank-2022.03", "rs214" |
    | ncbi_organism | the NCBI organism for the NCBI reference genome, options: "archaea", "bacteria", "fungi", "virus", "protozoa" |
    | ani_thresh | the cutoff by which two organisms are considered indistinguishable (default: 0.95) |
    | k | the length of k-mer |
    | outfolder | the path to the folder where the downloaded file will be saved |
- The sketch module (<ins>note that it is a simple wrapper to sourmash</ins>) has two submodules: ref and sample:

  - ref is used to sketch fasta files and build a reference database from them

    # Example for sketching multiple fasta files as reference genomes in a given folder
    yacht sketch ref --infile ./
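The option tables for yacht download default_ref_db can be turned into a small pre-check that catches a mismatched database and version before a download is attempted. This is only a sketch: valid_default_ref_db is a hypothetical helper, and the pairing of "genbank-2022.03" with genbank and of the "rs" versions with gtdb is an assumption inferred from the names, not something the tables state explicitly.

```shell
# valid_default_ref_db is a hypothetical pre-check, not part of YACHT.
# It accepts a database name and a db_version and succeeds only for the
# combinations listed in the default_ref_db table, under the assumption
# that each version belongs to the database its name suggests.
valid_default_ref_db() {
    local database="$1" db_version="$2"
    case "${database}:${db_version}" in
        genbank:genbank-2022.03) return 0 ;;
        gtdb:rs202|gtdb:rs207|gtdb:rs214) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   valid_default_ref_db gtdb rs214 \
#     && yacht download default_ref_db --database gtdb --db_version rs214 \
#          --gtdb_type reps --k 31 --outfolder ./
```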
