LSH
Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
Install / Use
/learn @mattilyra/LSHREADME
pylsh
pylsh is a Python implementation of locality sensitive hashing with minhash. It is very useful for detecting near duplicate documents.
The implementation uses the MurmurHash v3 library to create document finger prints.
Cython is needed if you want to regenerate the .cpp files for the hashing and shingling code. By default the setup script uses the pregenerated .cpp sources, you can change this with the USE_CYTHON flag in setup.py
NumPy is needed to run the code.
The MurmurHash3 library is distributed under the MIT license. More information https://github.com/aappleby/smhasher
examples
For an overview of how LSH works and how to set the parameters see this notebook. The notebook is also available in the examples directory.
installation
> git clone https://github.com/mattilyra/LSH
> cd LSH
> python setup.py install
Related Skills
node-connect
340.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
340.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.1kCommit, push, and open a PR
