zol (& fai)

zol (& fai): tools for targeted searching and evolutionary investigations of gene clusters (sets of co-located genes - e.g. biosynthetic gene clusters, viruses/phages, operons, etc.).

First, fai allows users to search for homologous/orthologous instances of a query gene cluster in a database of (meta-)genomes. Similar tools exist, including convenient webservers (which we highlight and recommend as alternatives on this documentation page), but fai also has some unique or rarer options. Mainly, fai checks whether gene cluster hits in target (meta-)genomes lie on scaffold/contig edges and takes this into account during both detection and downstream assessment. For example, fai marks individual coding genes and gene cluster instances that are on the edge of a scaffold/contig, and these marks can then be used as a filter in zol. This is important for calculating the conservation of genes across homologous gene clusters!

After finding homologous instances of a gene cluster - using fai or other software - users often wish to investigate the similarity between instances. This is often done through pairwise similarity assessment and visualization with tools such as clinker, gggenomes, etc. While these tools are great, if you have found 100s or 1000s of gene cluster instances, such visualizations can become overwhelming and computationally expensive to render. To simplify the identification of interesting functional, evolutionary, and conservation patterns across 100s to 1000s of homologous gene cluster instances, we developed zol to perform de novo ortholog group predictions and create detailed color-formatted XLSX spreadsheets summarizing the information. More recently, we have also introduced scalable visualization tools (cgc & cgcg) that allow for simpler assessment of information represented across thousands of homologous gene cluster instances.

Citation:

zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters. Nucleic Acids Research 2025. Rauf Salamzade, Patricia Q Tran, Cody Martin, Abigail L Manson, Michael S Gilmore, Ashlee M Earl, Karthik Anantharaman, Lindsay R Kalan

In addition, please cite important dependency software or databases for your specific analysis accordingly.

Usage: zol-suite [-h] [--list-programs] [--version] <program> ...

The zol suite - a comprehensive bioinformatics toolkit for gene cluster analysis.

/================================\
|| ________      ________       ||
|||\_____  \    |\   ____\      ||
|| \|___/  /|   \ \  \___|_     ||
||     /  / /    \ \_____  \    ||
||    /  /_/__    \|____|\  \   ||
||   |\________\    ____\_\  \  ||
||    \|_______|   |\_________\ ||
||                 \|_________| ||
\================================/

Author: Rauf Salamzade
Lab: Kalan Lab; University of Wisconsin - Madison; McMaster University

This interface provides access to all ZOL tools through a single command-line interface.
Each tool has its own specific arguments and functionality.

Typical order of operations:

1.) Run prepTG to prepare a database of target genomes for searches using fai.
2.) Run fai to find additional instances of a gene cluster of interest in the prepTG database.
3.) Run zol to perform comparative gene cluster analysis on the results from fai.
4.) Run cgc and cgcg to visualize the results from zol.

For help with a specific program, use: zol-suite <program> --help

Positional Arguments:
  <program>        ZOL program to run
    abon           Automated analysis of conservation/novelty for a sample's biosynthetic
                   gene clusters.
    apos           Automated analysis of conservation/novelty for a sample's plasmids.
    atpoc          Automated analysis of conservation/novelty for a sample's prophages.
    cgc            Visualization of zol results along a consensus gene cluster sequence.
    cgcg           Visualization of zol results as a graphical network.
    fai            Find additional instances of gene clusters in a genome database using
                   flexible alignment and synteny criteria.
    prepTG         Prepare a database of target genomes for searches using fai.
    regex          Extract a genomic region (as GenBank) from a genome file (FASTA or GenBank)
                   based on scaffold and coordinate inputs (experimental).
    salt           Support assessment for lateral transfer of gene clusters (experimental).
    zol            Perform comparative gene cluster analysis.
    zol-scape      Run zol analysis on BiG-SCAPE results.

Options:
  -h, --help       show this help message and exit
  --list-programs  List all available programs and exit
  --version        show program's version number and exit
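
As a rough sketch of how the "Typical order of operations" in the help text above can be explored, the following loop prints the help page of each stage in workflow order. It relies only on the documented `zol-suite <program> --help` form; the helper name `help_command` and the fallback message are illustrative assumptions, not part of the suite.

```shell
#!/usr/bin/env bash
# Sketch: show the help page for each stage of the typical workflow, in order.
# Only `zol-suite <program> --help` is used, as documented in the help text.
set -u

# Stages in the order given under "Typical order of operations".
WORKFLOW_STAGES="prepTG fai zol cgc cgcg"

# Build the help command for one stage (hypothetical helper name).
help_command() {
    printf 'zol-suite %s --help\n' "$1"
}

for stage in $WORKFLOW_STAGES; do
    if command -v zol-suite >/dev/null 2>&1; then
        # zol-suite is installed: print the real help page for this stage.
        zol-suite "$stage" --help
    else
        # zol-suite is not on PATH: show what would be run instead.
        echo "would run: $(help_command "$stage")"
    fi
done
```

Reading the help pages in this order is a convenient way to see which arguments each stage expects before running it on real data.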

[!CAUTION] Please avoid using versions 1.5.1 to 1.5.3, in which zol can get stuck in a while loop and write a large file. This issue is resolved in v1.5.4.

[!IMPORTANT] We recently updated zol to v1.6.0 - which introduces several key updates, including a unified interface that can be issued as zol-suite, improved PEP8 compliance for backend code, and lighter databases constructed using prepTG.

Short Note on Resource Requirements:

Different programs in the zol suite have different resource requirements. Moving forward, the default settings in the zol program itself should usually allow for low memory usage and faster runtime. For thousands of gene cluster instances, we recommend either using the dereplication/reinflation approach (see the manuscript for a comparison of evolutionary statistics between this approach and full processing) or using DIAMOND linclust clustering to determine protein clusters/families (not true ortholog groups). Disk space is generally not a major concern for zol analysis, but when working with thousands of gene clusters, temporary files can get large.

Available disk space is, however, the primary concern for fai and prepTG. This mostly affects users interested in constructing and searching large databases (containing over a thousand genomes). Generally, prepTG and fai are designed to work on metagenomic as well as genomic datasets and do not have high memory usage, but genomic files add up in disk space and DIAMOND alignment files can get quite large as well.

Installation:

Bioconda (Recommended):

Note: for some setups at least, it is critical to specify the conda-forge channel before the bioconda channel so that channel priority is configured properly and the installation succeeds.

Recommended: For a significantly faster installation process, install mamba in your base conda environment and use it in place of conda in the commands below.
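
A minimal sketch of the mamba substitution, assuming mamba accepts the same `create` arguments as conda: it picks mamba when it is on the PATH, falls back to conda otherwise, and prints the resulting command instead of running it. The helper names `pick_installer` and `create_command` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: prefer mamba when it is installed, otherwise fall back to conda.
# One common way to get mamba (an assumption, check your own setup first):
#   conda install -n base -c conda-forge mamba
set -u

# Pick the package manager (hypothetical helper name).
pick_installer() {
    if command -v mamba >/dev/null 2>&1; then
        echo mamba
    else
        echo conda
    fi
}

# Build the create command; conda-forge is listed before bioconda on purpose.
create_command() {
    printf '%s create -n zol_env -c conda-forge -c bioconda zol\n' "$1"
}

# Printed rather than executed, so the command can be reviewed first.
create_command "$(pick_installer)"
```

Printing the command first makes it easy to confirm the channel order before committing to a long installation.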

# 1. install and activate zol

# On Linux:
conda create -n zol_env -c conda-forge -c bioconda zol
conda activate zol_env

# 2. depending on internet speed, this can take 20-30 minutes
# end product will be ~40 GB! You can also run in minimal mode
# (which will only download Pfam & PGAP HMM models ~8.5 GB)
# using the -m argument. 
setup_annotation_dbs.py [-m]

# 3. run interface program
zol-suite [-h]
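
Because step 2 above downloads roughly 40 GB (or about 8.5 GB in minimal mode), it can help to check free disk space first. The sketch below uses POSIX `df -Pk` for this. The path to check is an assumption: point it at the filesystem where your environment and databases will actually be stored. The helper names `free_gb` and `has_space` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check free disk space before downloading the annotation databases.
set -u

# Free space, in whole GB, on the filesystem holding the given path.
# Uses POSIX `df -Pk` (1024-byte blocks, one output line per filesystem).
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeeds when at least $2 GB are free at path $1.
has_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

TARGET_DIR="${1:-.}"   # assumption: defaults to the current directory
FULL_GB=40             # approximate full size quoted in the install notes
MINIMAL_GB=9           # ~8.5 GB minimal mode, rounded up

if has_space "$TARGET_DIR" "$FULL_GB"; then
    echo "enough space for the full databases: setup_annotation_dbs.py"
elif has_space "$TARGET_DIR" "$MINIMAL_GB"; then
    echo "enough space for minimal mode only: setup_annotation_dbs.py -m"
else
    echo "less than ${MINIMAL_GB} GB free at ${TARGET_DIR}"
fi
```

The thresholds are the approximate sizes from the install notes, so leaving some extra headroom is sensible.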

[!TIP] When you create a conda environment using -n, the environment will typically be stored in your home directory. However, because the databases can be large, you might prefer to instead set up the conda environment somewhere else on your system with more space, using -p. For instance, conda create -p /path/to/drive_with_more_space/zol_conda_env/ -c conda-forge -c bioconda zol. Then, next time aroun
