SkillAgentSearch skills...

Strainy

Graph-based assembly phasing

Install / Use

/learn @katerinakazantseva/Strainy

README

[![CC BY-NC-SA 4.0][cc-by-nc-sa-shield]][cc-by-nc-sa]

Strainy

Version 1.2

Strainy is a tool for phasing and assembly of bacterial strains from long-read sequencing data (either Oxford Nanopore or PacBio). Given a reference (or collapsed de novo assembly) and set of aligned reads as input, Strainy produces multi-allelic phasing, individual strain haplotypes and strain-specific variant calls.

Compared to the current metagenomic strain profiling approaches, Strainy offers multiple unique features. First, it provides direct phasing of strains-specific variants, enabling evolutionary analysis of haplotypes rather than individual mutations. Second, Strainy provides a method to quantity strains and distinguish the most abundant and/or most divergent strains. Third, Strainy enables assembly-based analysis, which is useful in the absence of a high quality reference genome, especially in cross-samples comparison.

Contents

Installation

The recommended way of installing is through conda or mamba . Strainy is distributed via bioconda channel and can be installed using one of the following:

conda install bioconda::strainy

or

mamba install strainy

The other way is to install from the repository:

git clone https://github.com/katerinakazantseva/strainy
cd strainy
git submodule update --init
make -C submodules/Flye
conda env create -f environment.yml -n strainy

Once installed, you will need to activate the conda environment prior to running:

conda activate strainy
./strainy.py -h

Note that if you use an M1 conda installation, you should run conda config --add subdirs osx-64 before installation. Find details here

Quick usage

After repository cloning/installation, you should be able to run:

conda activate strainy
./strainy.py --gfa_ref test_set/toy.gfa --fastq test_set/toy.fastq.gz --output out_strainy --mode hifi --threads 8

The --gfa specifices input strain-collapsed graph (e.g. from de novo metagenomic assembly). --fastq specifies the matching long-read data. --output is the output directory, --mode speficies read type (either hifi or nano), --threads specifices the number of threads o use.

The output directory out_strainy will contain strain-level assmebly strain_contigs.gfa along with phased alignment alignment_phased.bam, strain variants calls strain_variants.vcf and other info. See below for the details on how to interpret strainy output.

Also see a more detailed tutorial for the example of how to use Strainy.

Docker container

Alternatively, you can use a Docker container (using the example provided in test_set Strainy directory):

ST_DIR=`pwd`
docker run -v $ST_DIR:$ST_DIR -u `id -u`:`id -g` mkolmogo/strainy:1.0 strainy --gfa $ST_DIR/test_set/toy.gfa --fastq $ST_DIR/test_set/toy.fastq.gz -o $ST_DIR/out_strainy --threads 8 --mode hifi

Strainy input

Strainy supports PacBio HiFi, Nanopore R9 (Guppy5+) and R10 sequencing.

The two main inputs to Strainy are:

  • GFA file A collapsed de novo metagenomic assembly that can be produced with metaFlye. For metaFlye parameters, please see Preparing de novo metagenomic assemblies. Alternatively, a reference in fasta format could be provided as input (--fasta_ref).

  • Reads (fasta/fastq) containing reads that need to be assmebled / phased. In the case of improving collapsed de novo assembly, same reads should be used for the assembler input.

  • Optinal alignment / variant calls. A user can provide their own alignment in bam format and variant calls in vcf format (that will be used for phasing). If these files are provided, splitting of long unitigs must be disabled by adding --unitig-split-length 0. It is recommended to split long sequences prior to producing alignment / variant calls, for example using the Strainy --only-split option. Long unitigs (50 kb+) may significantly slow down Strainy.

Preparing de novo metagenomic assemblies

We have developed Strainy using metaFlye metagenomic assembly graphs as input. The recommended set of parameters is --meta --keep-haplotypes --no-alt-contigs -i 0.

Note that -i 0 disables metaFlye's polishing procedure, which we found to improve read assignment to bubble branches during minimap2 realignment. --keep-haplotypes retains structural variations between strains on the assembly graph. --no-alt-contigs disables the output of "alternative" contigs, which can later confuse the read aligner.

Parameters description

Required

| Argument | Description | | ------------- | ------------- | |-o, --output |Output directory| |-g, --gfa_ref |Input assembly graph (.gfa) (may be produced with metaFlye or minigraph)| |-f, --fasta_ref |Input reference fasta (for linear genome)| |-q, --fastq |FASTQ file containing reads ( PacBio HiFi or Nanopore sequencing)| |-m, --mode |Type of the reads {hifi,nano}|

Optional

| Argument | Description | | ------------- | ------------- | | --snp | .vcf file, with variants of the desired allele frequency. If not provided, Strainy will use the built-in pileup-based caller| |-b, --bam | .bam file generated by aligning the input reads to the input graph, minimap2 will be used to generate a .bam file if not provided| | -a, --allele-frequency | Allele frequency threshold for built-in pileup-based caller. Will only work if --snp is not used (default: None)| | -d, --cluster-divergence | The maximum number of total mismatches allowed in the cluster per 1 kbp. Should be selected depending on SNP rates and their accuracy. Higher values can reduce high fragmentation at the cost of clustering accuracy (default: None)| | --unitig-split-length |The length (in kb) which the unitigs that are longer will be split, set 0 to disable (default: 50 kb)| |--min-unitig-coverage |The minimum coverage threshold for phasing unitigs, unitigs with lower coverage will not be phased (default: 20)| |--max-unitig-coverage |The maximum coverage threshold for phasing unitigs, unitigs with higher coverage will not be phased (default: 500)| |-t, --threads | Number of threads to use (default: 4)| |--debug | Enables debug mode for extra logs and output | |-s, --stage | Stage to run: phase, transform or e2e (phase + transform) (default: e2e)|

Output files

  • alignment_phased.bam Phased input alignemnt. Phased is defined via YC tag, which could also be used for IGV visualization. The matching reference could be found at preprocessing_data/gfa_converted.fasta. If (unphased) alignment was not provided as input, Strainy produced alignment against the input gfa using minimap2.

  • strain_unitigs.gfa Tranformed graph that incorporates assmebled strain haplotypes. These are "finer" strain unitigs that match the CSV tables with additional info below.

  • strain_contigs.gfa A simplified and extended version of teh graph after final simplification. Some strain unitigs are joined into longer contigs, so these sequenced may no longer have IDs matching to the CSV tables below.

  • strain_variants.vcf A file with variant produced from assembled strain haplotypes in VCF format. In the INFO field, ALT_HAP refers to strain unitigs that support the ALT version of the variant, and REF_HAP correspond to the list of unitigs that contain no variant (reference state).

  • reference_unitig_info_table.csv Additional statistics for reference sequences (such as length, coverage, SNP rate).

  • phased_unitig_info_table.csv Statistics for each phased strain unitig (matching the strain_unitigs.gfa file). For each strain unitig, its length, coverage, SNP rate and other statistics are reported.

  • multiplicity_stats.txt Dataset-level summary of strain multiplicity and strain divergence, along with other assembly-based statistics.

Strainy tutorial

Here we illustrate Strainy usage scenario using the simulated metagenomic dataset of five E. coli strains. Download the input data:

wget https://zenodo.org/records/11187925/files/strainy_ecoli_example.tar.gz
tar -xvf strainy_ecoli_example.tar.gz

It contains simulated reads and metaFlye assembly graph. Optionally, if you want to reproduce the metaFlye assembly, you can run:

flye --nano-hq strainy_ecoli_example/ecoli_5strain_sim_badread.fastq.gz -o metaflye -t 30 --meta --no-alt-contigs --keep-haplotypes -i 0

Then, you can run Strainy using:

./strainy.py --gfa_ref strainy_ecoli_example/ecoli_5strain_metaflye_hap.gfa --fastq strainy_ecoli_example/ecoli_5strain_sim_badread.fastq.gz --mode nano -t 30 --output strainy_out

This run may take ~2h in 30 threads. If you don't want to wait, you can download results from here:

wget https://zenodo.org/records/11187925/files/strainy_ecoli_out.tar.gz
tar -xvf strainy_ecoli_out.tar.gz

Now, let's take a look at the results! multiplicity_info.txt will contain some assembly stats, and the information about strain multiplicity:

Reference utgs input:	len: 6459094	num: 515	N50:28856
Reference utgs select:	len: 6367432	num: 433	N50:29273
Reference utgs phased:	len: 5439417	num: 319	N50:31133
Strain utgs asmembled:	len: 20620769	num: 1729	N50:16393

Multiplicity
Mul	RefSeqLength
1	   1653400	********************
2	    455200	*****
3	    770300	*********
4	   1634900	*******************
5	   154

Related Skills

View on GitHub
GitHub Stars93
CategoryDevelopment
Updated5d ago
Forks6

Languages

Python

Security Score

85/100

Audited on Apr 3, 2026

No findings