YACHT

A mathematically characterized hypothesis test for organism presence/absence in a metagenome

YACHT is a mathematically rigorous hypothesis test for the presence or absence of organisms in a metagenomic sample, based on Average Nucleotide Identity (ANI). Determining whether a specific microbe is actually present in a metagenomic sample is often complicated by sequencing noise, low-abundance organisms, and high genomic similarity between species. Traditional profiling tools rely on simple thresholds that can lead to high false-positive rates. YACHT is intended for several groups of users: microbiome researchers working with low-biomass samples, synthetic biologists validating the composition of mock communities, and genomics researchers looking for specific metagenome-assembled genomes (MAGs) of interest within large sequencing datasets.

The associated publication can be found here: https://academic.oup.com/bioinformatics/article/40/2/btae047/7588873

The preprint can be found at: https://doi.org/10.1101/2023.04.18.537298.

Please cite via:

Koslicki, D., White, S., Ma, C., & Novikov, A. (2024). YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample. Bioinformatics, 40(2), btae047.


Quick demonstration

We provide a demo to show how to use YACHT. Run the commands below to try it out:

NUM_THREADS=64 # Adjust based on your machine's capabilities

cd demo # the 'demo' folder can be downloaded via command 'yacht download demo' if it doesn't exist

# build k-mer sketches for the query sample and ref genomes
yacht sketch sample --infile ./query_data/query_data.fq --kmer 31 --scaled 1000 --outfile sample.sig.zip
yacht sketch ref --infile ./ref_genomes --kmer 31 --scaled 1000 --outfile ref.sig.zip

# preprocess the reference genomes (training step)
yacht train --ref_file ref.sig.zip --ksize 31 --num_threads ${NUM_THREADS} --ani_thresh 0.95 --prefix 'demo_ani_thresh_0.95' --outdir ./ --force

# run YACHT algorithm to check the presence of reference genomes in the query sample (inference step)
yacht run --json demo_ani_thresh_0.95_config.json --sample_file sample.sig.zip --significance 0.99 --num_threads ${NUM_THREADS} --min_coverage_list 1 0.6 0.2 0.1 --outdir ./

# convert result to CAMI profile format (Optional)
yacht convert --yacht_output_dir ./results --sheet_name min_coverage0.2 --genome_to_taxid toy_genome_to_taxid.tsv --mode cami --sample_name 'MySample' --outfile_prefix cami_result --outdir ./

The output is stored in the results folder, which contains:

  • result.xlsx: an Excel file recording the presence of reference genomes, with one spreadsheet per minimum coverage value (here 1, 0.6, 0.2, and 0.1).
  • result_all.txt: a text file containing the unfiltered results for all user-given min_coverage values.
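
A quick way to confirm that the run step produced both files is to check for them before moving on to the convert step. The sketch below assumes the ./results location used in the demo; check_yacht_outputs is a hypothetical helper name, not a YACHT command:

```shell
# Hypothetical helper (not part of the yacht CLI): verify that a results
# folder contains the two non-empty output files described above.
check_yacht_outputs() {
    local results_dir="${1:-./results}"
    local missing=0
    local name
    for name in result.xlsx result_all.txt; do
        if [ ! -s "${results_dir}/${name}" ]; then
            echo "missing or empty: ${results_dir}/${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Example usage after 'yacht run':
#   check_yacht_outputs ./results && echo "YACHT outputs look complete"
```

The function returns a non-zero status if either file is absent or empty, so it can gate later steps in a script.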

Installation

Conda Installation

YACHT is available on Conda and can be installed via the steps below:

# create conda environment
conda create -n yacht_env

# activate environment
conda activate yacht_env

# install YACHT
conda install -c conda-forge -c bioconda yacht

Manual installation

YACHT requires Python >3.6 (and <3.12) with the following core genomics dependencies: sourmash (>=4.8.3), sourmash_plugin_branchwater, and pytaxonkit. The full list of dependencies can be found in the environment configuration. To ensure a clean and isolated workspace, we recommend using a virtual environment. This can be accomplished using either Conda or Mamba, a faster alternative to Conda.
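
Before a manual install, it can help to confirm that the interpreter falls inside the supported range. The following is a minimal sketch, assuming the bounds are applied at minor-version granularity (that is, 3.7 through 3.11); python_version_ok is a hypothetical helper name, not something YACHT provides:

```shell
# Hypothetical helper: succeed when a "major.minor[.patch]" version string
# satisfies the stated requirement of Python >3.6 and <3.12.
# Assumption: the bounds are compared on the minor version only.
python_version_ok() {
    local major="${1%%.*}"   # text before the first dot
    local rest="${1#*.}"     # text after the first dot
    local minor="${rest%%.*}"
    [ "${major}" -eq 3 ] && [ "${minor}" -gt 6 ] && [ "${minor}" -lt 12 ]
}

# Example usage with the active interpreter:
#   v="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
#   python_version_ok "$v" || echo "Python $v is outside the supported range" >&2
```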

Using Conda

To create your Conda environment and install YACHT, follow these steps:

# Clone the YACHT repository
git clone https://github.com/KoslickiLab/YACHT.git
cd YACHT

# Create a new virtual environment named 'yacht_env'
conda env create -f env/yacht_env.yml

# Activate the newly created environment
conda activate yacht_env

# Install YACHT within the environment
pip install .

Using Mamba

If you prefer using Mamba instead of Conda, simply replace conda with mamba in the above commands.
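
One way to keep a single setup script working with either tool is to select the command at run time. This is a sketch only; pick_env_manager is a hypothetical helper name, and it relies on the substitution described above (mamba accepting the same subcommands as conda):

```shell
# Hypothetical helper: print "mamba" when it is on PATH, otherwise "conda".
pick_env_manager() {
    if command -v mamba >/dev/null 2>&1; then
        echo mamba
    else
        echo conda
    fi
}

# Example usage from the cloned YACHT folder:
#   "$(pick_env_manager)" env create -f env/yacht_env.yml
```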

Using Docker

Using Dockerfile:

docker build --tag 'yacht' .
docker run -it --entrypoint=/bin/bash yacht -i
conda activate yacht_env

Using Act:

To run YACHT on Docker with Act, simply execute "act" from the main YACHT folder, or "act --container-architecture linux/amd64" if you are on macOS.


Commands

YACHT can be run via the command line yacht <module>. The main modules include: download, sketch, train, run, and convert.

  • The download module has three submodules: demo, default_ref_db, and pretrained_ref_db:

    • demo can automatically download the demo files to a specified folder:
    # Example
    yacht download demo --outfolder ./demo
    
    • default_ref_db can automatically download pre-generated sketches of reference genomes from GTDB or GenBank as our input reference databases.
    # Example for downloading the k31 sketches of representative genomes of GTDB rs214 version 
    yacht download default_ref_db --database gtdb --db_version rs214 --gtdb_type reps --k 31 --outfolder ./
    

    | Parameter | Explanation |
    | ----------------- | ------------------------------------------------------------ |
    | database | two options for default reference databases: 'genbank' or 'gtdb' |
    | db_version | the version of the database, options: "genbank-2022.03", "rs202", "rs207", "rs214" |
    | ncbi_organism | the NCBI organism for the NCBI reference genome, options: "archaea", "bacteria", "fungi", "virus", "protozoa" |
    | gtdb_type | for the GTDB database, chooses the "representative" genome version or the "full" genome version |
    | k | the length of k-mer |
    | outfolder | the path to the folder where the downloaded file will be saved |

    • pretrained_ref_db can automatically download our pre-trained reference genome database, which can be used directly as input for the yacht run module.
    # Example for downloading the pretrained reference database that was trained from GTDB rs214 representative genomes with k=31 and ani_threshold=0.9995
    yacht download pretrained_ref_db --database gtdb --db_version rs214 --k 31 --ani_thresh 0.9995 --outfolder ./
    

    | Parameter | Explanation |
    | ----------------- | ------------------------------------------------------------ |
    | database | two options for default reference databases: 'genbank' or 'gtdb' |
    | db_version | the version of the database, options: "genbank-2022.03", "rs214" |
    | ncbi_organism | the NCBI organism for the NCBI reference genome, options: "archaea", "bacteria", "fungi", "virus", "protozoa" |
    | ani_thresh | the cutoff by which two organisms are considered indistinguishable (default: 0.95) |
    | k | the length of k-mer |
    | outfolder | the path to the folder where the downloaded file will be saved |

  • The sketch module (note that it is a simple wrapper to sourmash) has two submodules: ref and sample:

    • ref is used to sketch FASTA files for use as a reference database
    # Example for sketching multiple fasta files as reference genomes in a given folder
    yacht sketch ref --infile ./
    
