Fmsi
FMSI is a highly memory efficient exact k-mer set index based on masked superstrings and the masked Burrows-Wheeler transform
Install / Use
/learn @OndrejSladky/FmsiREADME
FMSI
<!-- vim-markdown-toc GFM -->- Introduction
- Installation
- Usage
- Commands overview
- How it works?
- Testing
- Issues
- Changelog
- Licence
- Contact
Introduction
FMSI is a highly memory-efficient tool for performing membership queries on single $k$-mer sets. FMSI uses masked superstrings for storing the $k$-mer sets to ensure high compressibility for a wide range of different $k$-mer sets, and implements FMS-index, a simplification of the FM-index. It supports both streaming and single queries. The functionality implemented in FMSI is based on the following papers:
The memory consumption for FMSI are (w/o kLCP which is additional 1bit/superstring char):
- Queries: 2.41 bits / canonical 31-mer with human genome; ~3 bits / canonical 31-mer for E. coli pangenome (1.17G k-mers from 89k genome); ~7 bits / canonical 31-mer for SARS-CoV-2
- Construction: 46 GB for human genome
[1] Ondřej Sladký, Pavel Veselý, and Karel Břinda: FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT). bioRxiv 2024.10.30.621029; doi: 10.1101/2024.10.30.621029.
[2] Ondřej Sladký, Pavel Veselý, and Karel Břinda: Towards Efficient k-Mer Set Operations via Function-Assigned Masked Superstrings. bioRxiv 2024.03.06.583483, 2024. https://doi.org/10.1101/2024.03.06.583483
[3] Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717
To construct an index (the fmsi index subcommand), FMSI accepts as input (see the -p parameter) a masked superstring of the $k$-mer set.
The masked superstring can be computed by KmerCamel🐫.
It then stores the index file in files with the same prefix and the .fmsi.[component] extension.
There are these components:
.ac_gt,.acand.gtfor storing the nodes of the wavelet tree of BWT.maskfor storing the RRR-compressed SA-transformed mask.klcpfor storing the kLCP array (optional).miscfor storing counts, dollar position and k
To query the index (the fmsi query subcommand), FMSI accepts a text file with $k$-mers on separate lines to query (see the -q parameter).
In a very near future, we will however update this to accept a FASTA file and allow for streaming queries.
Additionally, FMSI experimentally supports set operations using the $f$-masked superstrings framework. However, this feature is currently only experimental and requires rather significant time and memory.
Installation
Prerequisities:
- GCC 4.8+ or equivalent
- Zlib
Installation: FMSI can be installed using bioconda:
conda install bioconda::fmsi
Alternatively, FMSI can be built directly from source:
git clone --recursive https://github.com/OndrejSladky/fmsi
cd fmsi && make -j
export PATH="$PATH:$(pwd)"
Usage
k-mer membership queries (stable)
-
Compute an optimized masked superstring for all k-mers from a given FASTA file by KmerCamel🐫.
kmercamel compute -k 31 -o /dev/null -M ms.fa yourfile.fa -
Create an FMS index from the masked superstring:
fmsi index -k 31 ms.fa -
Query the index from a given query FASTA file with k-mers:
fmsi query -O -q query.fa -k 31 ms.fa
Note that the path for FMSI (i.e., ms.fa) needs to be the last argument.
Output format
For each line of queries, FMSI outputs an identifier and one bitstring, where 1 corresponds to a present k-mer and 0 corresponds to not present k-mer. Invalid k-mers, i.e., those that contain non-ACGT characters are also indicated with 0s.
Specific-case usage
If you need support for streaming queries, use the -S for speed enhancements at the cost of additional bit per superstring character.
If your mask superstring does not maximizes the number of ones in the mask, omit the -O optimization flag for query as otherwise you might get incorrect results.
We, however, recommend to optimize the mask using kmercamel optimize.
k-mer lookup queries (stable)
-
Compute a default masked superstring for all k-mers from a given FASTA file by KmerCamel🐫.
kmercamel compute -k 31 -o ms.fa yourfile.fa -
Create an FMS index from the masked superstring:
fmsi index -k 31 ms.fa -
Query the index from a given query FASTA file with k-mers:
fmsi lookup -q query.fa -k 31 ms.fa
Output format
For each line of lookup queries, FMSI outputs an identifier and a comma-separated list of unique integer for each present k-mer between 0 and number of k-mers - 1. K-mers that are not present in the index are marked with -1.
k-mer set operations (experimental)
Basic: k-mer set operations managed by FMSI
If you wish to perform set operations or answer membership queries without the need to understand the details of the $f$-masked superstring framework, FMSI can manage the details for you.
This can be done in a few simple steps:
- Compute a masked superstring.
- This can be done by KmerCamel🐫; simply run
kmercamel -c -k 31 -p fasta.fa -o ms.fa(with appropriate values for-kand-p). - If you obtained the masked superstring in a different way, make sure it minimizes the number of ones; if you're unsure, you can use
kmercamel optimize -c -a zeros -k 31 -p ms_more_ones.fa -o ms.fa. No need to optimize superstrings directly computed by KmerCamel🐫.
- This can be done by KmerCamel🐫; simply run
- Index the masked superstring with
fmsi index ms.fa. - Perform the set operations you wish with
fmsi [operation] -k 31 -p ms1.fa -p ms2.fa -r ms.fa. Possible operations (use the names instead of[operation]) are:unionto compute the union.interto compute the intersectiondiffto compute the set differencesymdiffto compute the symmetric difference
- Query the index with
fmsi query -q query.fa -k 31 ms.fa. - To get back the underlying masked superstring, use
fmsi export ms.fa.
If you use FMSI this way, it ensures that operations in any order and queries to any index are computed correctly, while keeping the memory usage for queries as low as possible. Furthermore, exported $f$-masked superstrings are always, or-masked superstrings, which are the default masked superstrings.
The only downside to this approach is that each set operation uses compaction, which is the most time- and memory- consuming part of the process, which in some use cases might cause slowdowns which are not necessary. If this is your case, you probably want to stick to the advanced usage, managing the functions and building block methods yourself.
Advanced: externally managed k-mer set operations
If you wish to manage the operations yourself, the workflow is quite similar to the basic usage, with the following changes:
- Underlying $f$-MS concatenation can be done with
fmsi merge. Details on which $f$ should be used is described in Chapter 4 of the paper. - Compaction and change back to the or-masked superstring can be done with
fmsi compact. - If you query the index with different function than or, use the
-fargument. The same applies to compaction.
Commands overview
To run the tool, run fmsi [command]
The recognized commands are:
indexCreates a BWT based index of the given masked superstring.queryQueries a $k$-mer against an index.exportExport the underlying masked superstring.unionPerforms union on two indexes. Expects or-MS.interPerforms intersection on two indexes. Expects or-MS.diffPerforms set difference on two indexes. Expects or-MS.symdiffPerforms symmetric difference on two indexes. Expects or-MS.mergeMerge several indexes.compactCompacts an index.
Each command has its own set of arguments, which can be displayed by running fmsi [command] -h.
How it works?
FMSI builds on the representation of k-mer sets via masked superstrings, which ensures a generally compact representation of k-mers, requiring typically 1-1.4 characters per k-mer for genomes and pan-genomes.
FMSI then builds a simplified FM-index, which omits the sampled suffix array, which is the key to the memory efficiency. It instead computes the SA-transformed mask, which makes it possible to determine the presence of a k-mer without the need to compute the original coordinates.
Additionally, FMSI optionally constructs the kLCP array (a binary version of the truncated LCP array) to allow O(1) time for a k-mer in streaming queries.
The memory consumption of the index is split as follows:
- $2.125$ bits per super
Related Skills
node-connect
338.0kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.4kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
338.0kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.4kCommit, push, and open a PR
