LexicMap
LexicMap: efficient sequence alignment against millions of prokaryotic genomes
<a href="https://bioinf.shenwei.me/LexicMap"><img src="logo.svg" width="36"/></a> LexicMap: efficient sequence alignment against millions of prokaryotic genomes
LexicMap is a nucleotide sequence alignment tool for efficiently querying gene, plasmid, viral, or long-read sequences (>150 bp) against up to millions of prokaryotic genomes.
Documents: https://bioinf.shenwei.me/LexicMap
For the latest features and improvements, please download the pre-release binaries.
Please cite:
Wei Shen, John A. Lees, Zamin Iqbal. (2025) Efficient sequence alignment against millions of prokaryotic genomes with LexicMap. Nature Biotechnology. https://doi.org/10.1038/s41587-025-02812-8
Table of contents
- Features
- Introduction
- Quick start
- Performance
- Installation
- Algorithm overview
- Citation
- Limitations
- Terminology differences
- Support
- License
- Related projects
Features
- The accuracy of LexicMap is comparable with Blastn, MMseqs2, and Minimap2. It
  - performs base-level alignment, with `qcovGnm`, `qcovHSP`, `pident`, `evalue`, and `bitscore` returned, both in TSV and pairwise alignment format (output format).
  - provides a genome-wide query coverage metric (`qcovGnm`), which enables accurate interpretation of search results, particularly for circular queries (such as plasmid, virus, and mtDNA) against both complete and fragmented assemblies.
  - returns all possible matches, including multiple copies of a gene in a genome.
- The alignment is fast and memory-efficient, scalable to up to millions of prokaryotic genomes.
- LexicMap is easy to install: we provide dependency-free binaries for Linux, Windows, and macOS (x86 and ARM CPUs).
- LexicMap is easy to use (see tutorials, usages, and FAQs).
- Database building requires only a simple command, accepting input from files, a file list, or even a directory.
- Sequence searching supports limiting the search by TaxId(s) and provides a progress bar.
- Several utility commands are available to resume unfinished indexing, explore the index data, merge search results, extract matched subsequences, and more.
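To make the idea of a genome-wide query coverage concrete, here is a minimal Python sketch. It assumes that coverage is the percentage of query bases covered by the union of the query intervals (`qstart`, `qend`) of all HSPs in one genome; that is an illustrative assumption, not LexicMap's documented formula, and `query_coverage` is a hypothetical helper name.

```python
def query_coverage(intervals, qlen):
    """Percent of query bases covered by the union of 1-based, inclusive
    (qstart, qend) intervals. Illustrative only; not LexicMap's own code."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Disjoint from the running interval: flush it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return 100.0 * covered / qlen


# A single HSP covering query positions 14..173 of a 190-bp read.
print(round(query_coverage([(14, 173)], 190), 3))

# A hypothetical 1000-bp circular query split into two HSPs at the point
# where the assembly was linearized: together they cover the whole query.
print(round(query_coverage([(1, 400), (401, 1000)], 1000), 3))
```

The second call shows why a per-genome metric helps with circular queries: neither HSP alone covers the query, but their union does.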
Introduction
Motivation: Alignment against a database of genomes is a fundamental operation in bioinformatics, popularised by BLAST. However, given the increasing rate at which genomes are sequenced, existing tools struggle to scale.
- Existing full alignment tools face challenges of high memory consumption and slow speeds.
- Alignment-free large-scale sequence searching tools only return the matched genomes, without the vital positional information for downstream analysis.
- Mapping tools, or those utilizing compressed full-text indexes, return only the most similar matches.
- Prefilter+Align strategies can lose sensitivity in the prefiltering step.
Methods: (algorithm overview)
- A rewritten and improved version of the sequence sketching method LexicHash is adopted to compute alignment seeds accurately and efficiently.
- We solved the sketching deserts problem of LexicHash seeds to provide a window guarantee.
- We added support for suffix matching of seeds, making seeds much more tolerant to mutations. Any 31-bp seed with a common ≥15 bp prefix or suffix can be matched.
- A hierarchical index enables fast and low-memory variable-length seed matching (prefix + suffix matching).
- A pseudo alignment algorithm is used to find similar sequence regions from chaining results for alignment.
- A reimplemented Wavefront alignment algorithm is used for base-level alignment.
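The prefix + suffix matching criterion in the list above can be sketched in a few lines of Python. This is a brute-force pairwise check of the stated rule (a common prefix or suffix of at least 15 bp between two 31-bp seeds). It is not the hierarchical index LexicMap uses, and the names `seeds_match` and `mutate` are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def seeds_match(a, b, min_len=15):
    """True if two seeds share a prefix or a suffix of at least min_len bases."""
    prefix = common_prefix_len(a, b)
    suffix = common_prefix_len(a[::-1], b[::-1])  # common suffix via reversal
    return prefix >= min_len or suffix >= min_len


def mutate(seq, pos):
    """Return seq with the base at 0-based position pos substituted."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return seq[:pos] + swap[seq[pos]] + seq[pos + 1:]


seed = ("ACGT" * 8)[:31]                           # an arbitrary 31-bp seed
print(seeds_match(seed, mutate(seed, 20)))         # 20-bp common prefix
print(seeds_match(seed, mutate(seed, 10)))         # 20-bp common suffix
print(seeds_match(seed, mutate(mutate(seed, 10), 20)))  # only 10 bp at each end
```

With prefix-only matching, the substitution at position 10 would break the match. Allowing suffix matching recovers it, which is the tolerance to mutations described above.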
Results:
- LexicMap enables efficient indexing and searching of both the RefSeq+GenBank and the AllTheBacteria datasets (2.3 and 1.9 million prokaryotic assemblies, respectively).
- When searching all 2,340,672 GenBank+RefSeq prokaryotic genomes, Blastn cannot run on common servers because it requires >2000 GB RAM (see performance).
  With LexicMap v0.7.0 (48 CPUs, indexes and queries on HDDs):
|Query               |Genome hits|Genome hits<br/>(high-similarity)|Genome hits<br/>(medium-similarity)|Genome hits<br/>(low-similarity)|Time       |RAM     |
|:-------------------|----------:|--------------------------------:|----------------------------------:|-------------------------------:|----------:|-------:|
|A 1.3-kb marker gene|41,718     |11,746                           |115                                |29,857                          |3m:06s     |3.97 GB |
|A 1.5-kb 16S rRNA   |1,955,167  |245,884                          |501,691                            |1,207,592                       |32m:59s    |11.09 GB|
|A 52.8-kb plasmid   |560,330    |96                               |15,370                             |544,864                         |52m:22s    |14.48 GB|
|1003 AMR genes      |30,967,882 |7,636,386                        |4,858,063                          |18,473,433                      |15h:52m:08s|24.86 GB|
Notes:
- Default parameters are used, so all possible matches are returned.
- Only the best alignment of a genome is used to evaluate alignment similarity:
  - high-similarity: (a) qcov >= 90% (genes) or 70% (plasmids), (b) pident >= 90%.
  - medium-similarity: (a) not in the high-similarity class, (b) qcov >= 50% (genes) or 30% (plasmids), (c) pident >= 80%.
  - low-similarity: the remaining hits.
- The search time varies across computing environments and mainly depends on the I/O speed and the number of threads.
- Memory use has been lower since v0.8.0.
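The similarity classes defined in the notes above translate directly into a small decision function. The sketch below encodes only those thresholds. The function name `classify_hit` and the `query_type` argument are hypothetical and not part of LexicMap.

```python
def classify_hit(qcov, pident, query_type="gene"):
    """Classify the best alignment of a genome using the thresholds in the notes.

    qcov and pident are percentages; query_type is "gene" or "plasmid".
    """
    if query_type == "plasmid":
        high_qcov, medium_qcov = 70.0, 30.0
    else:
        high_qcov, medium_qcov = 90.0, 50.0

    if qcov >= high_qcov and pident >= 90.0:
        return "high"
    if qcov >= medium_qcov and pident >= 80.0:
        return "medium"
    return "low"


print(classify_hit(95.0, 97.0))             # gene with high coverage and identity
print(classify_hit(75.0, 92.0))             # gene below the 90% coverage cut-off
print(classify_hit(75.0, 92.0, "plasmid"))  # same values, plasmid thresholds
```

The last two calls use identical values but land in different classes, because plasmids use the lower coverage cut-offs.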
More documents: https://bioinf.shenwei.me/LexicMap.
Quick start
Building an index (see the tutorial of building an index).
# From a directory with multiple genome files
lexicmap index -I genomes/ -O db.lmi
# From a file list with one file per line
lexicmap index -S -X files.txt -O db.lmi
Querying (see the tutorial of searching).
# For short queries like genes or long reads, returning top N hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
--align-min-match-pident 80 --min-qcov-per-hsp 70 --min-qcov-per-genome 70 \
--top-n-genomes 10000
# For longer queries like plasmids, returning all hits.
lexicmap search -d db.lmi query.fasta -o query.fasta.lexicmap.tsv \
--align-min-match-pident 70 --min-qcov-per-hsp 0 --min-qcov-per-genome 50 \
--align-min-match-len 1000 \
--top-n-genomes 0
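When searches are launched from a script, the two presets above can be assembled programmatically. The Python sketch below uses only the flags shown in the commands above. The helper `build_search_cmd` is hypothetical, and the snippet only prints the command rather than running it.

```python
import shlex


def build_search_cmd(db, query, out, min_pident, min_qcov_hsp,
                     min_qcov_genome, top_n, min_match_len=None):
    """Assemble a `lexicmap search` argument list from the flags used above."""
    cmd = [
        "lexicmap", "search", "-d", db, query, "-o", out,
        "--align-min-match-pident", str(min_pident),
        "--min-qcov-per-hsp", str(min_qcov_hsp),
        "--min-qcov-per-genome", str(min_qcov_genome),
    ]
    if min_match_len is not None:
        cmd += ["--align-min-match-len", str(min_match_len)]
    cmd += ["--top-n-genomes", str(top_n)]
    return cmd


# Preset for short queries (genes, long reads), returning the top 10000 genomes.
gene_cmd = build_search_cmd("db.lmi", "query.fasta", "query.fasta.lexicmap.tsv",
                            80, 70, 70, 10000)

# Preset for longer queries (plasmids), returning all hits.
plasmid_cmd = build_search_cmd("db.lmi", "query.fasta", "query.fasta.lexicmap.tsv",
                               70, 0, 50, 0, min_match_len=1000)

print(shlex.join(gene_cmd))
print(shlex.join(plasmid_cmd))
```

To run a command, pass the list to `subprocess.run(cmd, check=True)`; this assumes the `lexicmap` binary is on `PATH`.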
Sample output (queries are a few Nanopore Q20 reads). See output format details.
query qlen hits sgenome sseqid qcovGnm cls hsp qcovHSP alenHSP pident gaps qstart qend sstart send sstr slen evalue bitscore
------------------ ---- ---- --------------- ----------------- ------- --- --- ------- ------- ------ ---- ------ ---- ------- ------- ---- ------- --------- --------
ERR5396170.1000004 190 1 GCF_000227465.1 NC_016047.1 84.211 1 1 84.211 165 89.091 5 14 173 4189372 4189536 - 4207222 1.93e-63 253
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 1 1 99.623 801 97.628 9 4 796 1138907 1139706 + 1887974 0.00e+00 1431
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 2 2 99.623 801 97.628 9 4 796 32607 33406 + 1887974 0.00e+00 1431
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 3 3 99.623 801 97.628 9 4 796 134468 135267 - 1887974 0.00e+00 1431
ERR5396170.1000006 796 3 GCF_013394085.1 NZ_CP040910.1 99.623 4 4 99.623
