Fmindex
Efficient substring searches on text corpora using a compressed index
FM-Index
An `FM-index <http://en.wikipedia.org/wiki/FM-index>`_ is a compressed suffix
array that offers fast substring queries.
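To make the idea concrete, here is a minimal pure-Python sketch of the backward search that an FM-index uses to count substring occurrences. It is an illustration only, not the sdsl-lite implementation or this module's API: the names ``build_fm_index`` and ``count`` are hypothetical, the construction is naive (it sorts all suffixes), and nothing is compressed.

```python
def build_fm_index(text, sentinel="\0"):
    """Build the tables needed for backward search (naive, uncompressed)."""
    # Assumption: the sentinel does not occur in the text and sorts first.
    text += sentinel
    # Suffix array by plain sorting; fine for a demo, too slow for a corpus.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = [text[i - 1] for i in sa]
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count(index, pattern):
    """Count occurrences of a non-empty pattern, including overlapping ones."""
    first, occ, n = index
    lo, hi = 0, n  # current range of matching rows in the suffix array
    for c in reversed(pattern):  # extend the match one character to the left
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("abracadabra")
print(count(index, "abra"))
```

Each query takes a number of steps proportional to the pattern length rather than the corpus size, which is why the index pays off once it is built and queried many times.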
This is a Python wrapper around
`sdsl-lite <https://github.com/simongog/sdsl-lite>`_ that provides an FM-index
for a corpus of text files. The module is efficient when performing
a large number (say 10,000) of substring searches. For fewer than
1,000 substring searches on a corpus, it is better to use the Aho-Corasick
algorithm, as used by ``fgrep`` and the ``acora`` Python module;
cf. https://github.com/scoder/acora
Both a character-based and a word-based version are available. The character-based version offers full-text search. The word-based version converts each space-separated token to an integer (i.e., words are never matched partially). This works best for texts which are tokenized, one sentence per line, with space-separated tokens.
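The word-based idea can be sketched in plain Python as follows. This is a hedged illustration of mapping tokens to integers and matching whole tokens only; ``encode_corpus`` and ``count_phrase`` are hypothetical helpers, not functions of this module, and the search here is a linear scan rather than an FM-index lookup.

```python
def encode_corpus(lines):
    """Map each whitespace-separated token to an integer ID."""
    vocab = {}    # token -> integer ID, assigned in order of first appearance
    encoded = []  # one list of integer IDs per input line
    for line in lines:
        encoded.append([vocab.setdefault(tok, len(vocab))
                        for tok in line.split()])
    return vocab, encoded


def count_phrase(vocab, encoded, phrase):
    """Count occurrences of a token sequence; tokens never match partially."""
    ids = [vocab.get(tok) for tok in phrase.split()]
    if not ids or None in ids:
        return 0  # empty query, or a token that never occurs in the corpus
    n = len(ids)
    return sum(sent[i:i + n] == ids
               for sent in encoded
               for i in range(len(sent) - n + 1))


vocab, encoded = encode_corpus(["the cat sat", "the category"])
print(count_phrase(vocab, encoded, "cat"))  # "category" is not a match
```

Because whole tokens are compared as integers, a query such as ``cat`` cannot match inside ``category``, which a character-based search would report.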
Example
An example application shows how to perform a set of queries from a file against a number of files::

    python fmgrep.py <queries> <files>

The result is similar to ``fgrep -c -f queries files``, although the
counts will differ because grep counts a line with multiple matches as a
single match.
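The difference between the two ways of counting can be illustrated with a small Python sketch. The helper names are hypothetical, and the assumption that an index-based count includes overlapping occurrences is mine, not something the README states.

```python
def count_matching_lines(lines, query):
    """Count lines containing the query at least once, like grep -c."""
    return sum(query in line for line in lines)


def count_occurrences(lines, query):
    """Count every occurrence of the query, including overlapping ones."""
    total = 0
    for line in lines:
        start = line.find(query)
        while start != -1:
            total += 1
            # Advance by one character so overlapping matches are found too.
            start = line.find(query, start + 1)
    return total


lines = ["abab ab", "cd", "ab"]
print(count_matching_lines(lines, "ab"), count_occurrences(lines, "ab"))
```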
Installation
Requires sdsl::

    git clone https://github.com/andreasvc/sdsl-lite.git
    cd sdsl-lite
    ./install.sh $HOME/.local

and Cython::

    pip install --user cython

To install, run::

    make

References
- http://en.wikipedia.org/wiki/FM-index
- https://github.com/simongog/sdsl-lite