<a href="https://bioinf.shenwei.me/LexicMap"><img src="logo.svg" width="36"/></a> LexicMap: efficient sequence alignment against millions of prokaryotic genomes

LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences (>150 bp) against up to millions of prokaryotic genomes.

Documents: https://bioinf.shenwei.me/LexicMap

For the latest features and improvements, please download the pre-release binaries.

Please cite:

Wei Shen, John A. Lees, Zamin Iqbal. (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap. Nature Biotechnology. https://doi.org/10.1038/s41587-025-02812-8

Features

  1. The accuracy of LexicMap is comparable to that of Blastn, MMseqs2, and Minimap2. It
    • performs base-level alignment and returns qcovGnm, qcovHSP, pident, evalue, and bitscore in both TSV and pairwise alignment formats (see output format).
    • returns all possible matches, including multiple copies of a gene in a genome.
  2. The alignment is fast and memory-efficient, scaling to millions of prokaryotic genomes.
  3. LexicMap is easy to install: binary files with no dependencies are provided for Linux, Windows, and macOS (x86 and ARM CPUs).
  4. LexicMap is easy to use (see tutorials, usages, and FAQs).
    • Database building requires only a simple command, accepting input from files, a file list, or even a directory.
    • Sequence searching supports limiting the search by TaxId(s) and provides a progress bar.
    • Several utility commands are available to resume unfinished indexing, explore the index data, merge search results, extract matched subsequences and more.

Introduction

Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.

  1. Existing full alignment tools face challenges of high memory consumption and slow speeds.
  2. Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis.
  3. Mapping tools, or those utilizing compressed full-text indexes, return only the most similar matches.
  4. Prefilter+Align strategies suffer from limited sensitivity in the prefiltering step.

Methods: (algorithm overview)

  1. A rewritten and improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently.
    • We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee.
    • We added support for suffix matching of seeds, making seeds much more tolerant to mutations. Any two 31-bp seeds sharing a common prefix or suffix of ≥15 bp can be matched.
  2. A hierarchical index enables fast and low-memory variable-length seed matching (prefix + suffix matching).
  3. A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment.
  4. A reimplemented Wavefront alignment algorithm is used for base-level alignment.
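
The prefix/suffix rule for seeds in step 1 can be illustrated with a small shell sketch. This is not LexicMap's implementation, which matches seeds through its hierarchical index; the function names `common_prefix_len`, `common_suffix_len`, and `seeds_match` are hypothetical, and only the 15-bp threshold is taken from the text above.

```shell
# Illustration only: a naive pairwise check, not how LexicMap matches seeds.

# Print the length of the longest common prefix of two strings.
common_prefix_len() {
    local a=$1 b=$2 i=0
    while [ "$i" -lt "${#a}" ] && [ "$i" -lt "${#b}" ] \
        && [ "${a:i:1}" = "${b:i:1}" ]; do
        i=$((i + 1))
    done
    echo "$i"
}

# Print the length of the longest common suffix of two strings.
common_suffix_len() {
    local a=$1 b=$2 i=0
    while [ "$i" -lt "${#a}" ] && [ "$i" -lt "${#b}" ] \
        && [ "${a:${#a}-i-1:1}" = "${b:${#b}-i-1:1}" ]; do
        i=$((i + 1))
    done
    echo "$i"
}

# Succeed if two seeds share a prefix or suffix of at least MIN bases
# (MIN defaults to 15, the threshold quoted above).
seeds_match() {
    local a=$1 b=$2 min=${3:-15}
    [ "$(common_prefix_len "$a" "$b")" -ge "$min" ] ||
        [ "$(common_suffix_len "$a" "$b")" -ge "$min" ]
}
```

For example, `seeds_match "$seed_a" "$seed_b" && echo matched` reports a match when the two seeds agree on their first or last 15 bases, even if the remaining bases differ because of mutations.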

Results:

  1. LexicMap enables efficient indexing and searching of both the GenBank+RefSeq and AllTheBacteria datasets (2.3 and 1.9 million prokaryotic assemblies, respectively).

  2. Blastn cannot search all 2,340,672 GenBank+RefSeq prokaryotic genomes on common servers, as it requires >2000 GB of RAM (see performance).

    With LexicMap v0.7.0 (48 CPUs, indexes and queries on HDDs):

    |Query               |Genome hits|Genome hits<br/>(high-similarity)|Genome hits<br/>(medium-similarity)|Genome hits<br/>(low-similarity)|Time       |RAM     |
    |:-------------------|----------:|--------------------------------:|----------------------------------:|-------------------------------:|----------:|-------:|
    |A 1.3-kb marker gene|41,718     |11,746                           |115                                |29,857                          |3m:06s     |3.97 GB |
    |A 1.5-kb 16S rRNA   |1,955,167  |245,884                          |501,691                            |1,207,592                       |32m:59s    |11.09 GB|
    |A 52.8-kb plasmid   |560,330    |96                               |15,370                             |544,864                         |52m:22s    |14.48 GB|
    |1003 AMR genes      |30,967,882 |7,636,386                        |4,858,063                          |18,473,433                      |15h:52m:08s|24.86 GB|

    Notes:

    1. Default parameters are used, which return all possible matches.
    2. Only the best alignment of a genome is used to evaluate alignment similarity:
      • high-similarity: (a) qcov >= 90% (genes) or 70% (plasmids), (b) pident>=90%.
      • medium-similarity: (a) not belonging to high-similarity, (b) qcov >= 50% (genes) or 30% (plasmids), (c) pident>=80%.
      • low-similarity: the remaining.
    3. The search time varies in different computing environments and mainly depends on the I/O speed and the number of threads.
    4. Memory use has been lower since v0.8.0.
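
The similarity classes in note 2 can be written as a small decision rule. The sketch below assumes the caller has already picked the best alignment of each genome and passes its query coverage and percent identity; the notes do not say which coverage column is meant, so that choice is left to the caller. The function name `classify_hit` and its argument order are hypothetical.

```shell
# classify_hit KIND QCOV PIDENT
#   KIND    "gene" or "plasmid" (selects the query-coverage thresholds)
#   QCOV    query coverage (%) of the genome's best alignment
#   PIDENT  percent identity of the genome's best alignment
# Prints "high", "medium", or "low" following the thresholds in note 2.
classify_hit() {
    awk -v kind="$1" -v qcov="$2" -v pident="$3" 'BEGIN {
        # Plasmids use looser coverage thresholds than genes.
        hi  = (kind == "plasmid") ? 70 : 90
        mid = (kind == "plasmid") ? 30 : 50
        if (qcov + 0 >= hi && pident + 0 >= 90)       print "high"
        else if (qcov + 0 >= mid && pident + 0 >= 80) print "medium"
        else                                          print "low"
    }'
}
```

For instance, `classify_hit gene 84.211 89.091` falls into the medium class, because the coverage is below 90% but still above 50% and the identity is above 80%.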

More documents: https://bioinf.shenwei.me/LexicMap.

Quick start

Building an index (see the tutorial of building an index).

# From a directory with multiple genome files
lexicmap index -I genomes/ -O db.lmi

# From a file list with one file per line
lexicmap index -S -X files.txt -O db.lmi
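
A file list such as `files.txt` can be produced with standard tools. The helper below is a sketch: the name `make_genome_list` is hypothetical, and the set of file extensions is an assumption that should be adjusted to match the genome files at hand.

```shell
# make_genome_list DIR
# Print one genome file path per line, sorted, for every file under DIR
# whose name ends in a common FASTA extension (assumed list, adjust as needed).
make_genome_list() {
    find "$1" -type f \( \
        -name '*.fa' -o -name '*.fasta' -o -name '*.fna' \
        -o -name '*.fa.gz' -o -name '*.fasta.gz' -o -name '*.fna.gz' \
    \) | sort
}
```

Running `make_genome_list genomes/ > files.txt` writes a list that can then be passed to `lexicmap index` with `-X files.txt` as shown above.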

Querying (see the tutorial of searching).

# For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --align-min-match-pident 80 --min-qcov-per-hsp 70 --min-qcov-per-genome 70 \
    --top-n-genomes 10000

# For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
    --align-min-match-pident 70 --min-qcov-per-hsp 0  --min-qcov-per-genome 50 \
    --align-min-match-len 1000 \
    --top-n-genomes 0

Sample output (queries are a few Nanopore Q20 reads). See output format details.

query                qlen   hits   sgenome           sseqid              qcovGnm   cls   hsp   qcovHSP   alenHSP   pident   gaps   qstart   qend   sstart    send      sstr   slen      evalue      bitscore
------------------   ----   ----   ---------------   -----------------   -------   ---   ---   -------   -------   ------   ----   ------   ----   -------   -------   ----   -------   ---------   --------
ERR5396170.1000004   190    1      GCF_000227465.1   NC_016047.1         84.211    1     1     84.211    165       89.091   5      14       173    4189372   4189536   -      4207222   1.93e-63    253     
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    1     1     99.623    801       97.628   9      4        796    1138907   1139706   +      1887974   0.00e+00    1431    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    2     2     99.623    801       97.628   9      4        796    32607     33406     +      1887974   0.00e+00    1431    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    3     3     99.623    801       97.628   9      4        796    134468    135267    -      1887974   0.00e+00    1431    
ERR5396170.1000006   796    3      GCF_013394085.1   NZ_CP040910.1       99.623    4     4     99.623  
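
Search results can be post-processed with standard tools. The sketch below assumes that the file written by `-o` is tab-delimited and starts with the header row shown above (the sample output is displayed with aligned columns for readability). Columns are looked up by name rather than position; the function names `filter_hits` and `count_genome_hits` are hypothetical.

```shell
# filter_hits MIN_PIDENT MIN_QCOVHSP < result.tsv
# Keep the header plus rows whose pident and qcovHSP reach the thresholds.
filter_hits() {
    awk -F '\t' -v min_pident="$1" -v min_qcov="$2" '
        NR == 1 {
            # Map column names to positions using the header row.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("pident" in col) || !("qcovHSP" in col)) {
                print "filter_hits: pident/qcovHSP column not found" > "/dev/stderr"
                exit 1
            }
            print
            next
        }
        $(col["pident"]) + 0 >= min_pident + 0 &&
            $(col["qcovHSP"]) + 0 >= min_qcov + 0
    '
}

# count_genome_hits < result.tsv
# Print "query<TAB>number of distinct genomes hit", sorted by query.
# Assumes the header contains the "query" and "sgenome" columns.
count_genome_hits() {
    awk -F '\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Count each (query, genome) pair once, even with several HSPs.
        !seen[$(col["query"]), $(col["sgenome"])]++ { n[$(col["query"])]++ }
        END { for (q in n) print q "\t" n[q] }
    ' | sort
}
```

For example, `filter_hits 90 90 < query.fasta.lexicmap.tsv | count_genome_hits` counts, for each query, the genomes with at least one alignment of ≥90% identity and ≥90% HSP coverage.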