Wshiml

📚 Word shingling for near duplicate document detection

An OCaml implementation of the word-shingling approach to near-duplicate detection described in http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
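To make the idea concrete, here is a minimal sketch of k-word shingling and the Jaccard coefficient as the linked chapter describes them. The names `words`, `shingles` and `jaccard` are made up for illustration and are not Wshiml's API; the sketch also assumes k >= 1 and tokenises on single spaces only.

```ocaml
(* Illustrative only: these names are hypothetical, not Wshiml's API. *)
module SS = Set.Make (String)

(* Split on single spaces, dropping empty tokens. *)
let words (s : string) : string list =
  String.split_on_char ' ' s |> List.filter (fun w -> w <> "")

(* The set of all runs of k consecutive words, each joined into one string.
   Assumes k >= 1. *)
let shingles (k : int) (doc : string) : SS.t =
  let ws = Array.of_list (words doc) in
  let n = Array.length ws in
  let rec go i acc =
    if i + k > n then acc
    else
      let sh = String.concat " " (Array.to_list (Array.sub ws i k)) in
      go (i + 1) (SS.add sh acc)
  in
  go 0 SS.empty

(* Jaccard coefficient |A inter B| / |A union B|; taken to be 0 when both
   sets are empty. *)
let jaccard (a : SS.t) (b : SS.t) : float =
  let u = SS.cardinal (SS.union a b) in
  if u = 0 then 0.0
  else float_of_int (SS.cardinal (SS.inter a b)) /. float_of_int u
```

With these definitions, `shingles 4 "a rose is a rose is a rose"` contains three distinct shingles, and two documents count as near duplicates when their `jaccard` value exceeds a chosen threshold.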

Building requires OASIS. Run:

./configure    # optionally with --prefix
make
make install

To build the example command-line program, run:

./configure --enable-cli
make
make install
find-similar-docs --help

The command-line program requires cmdliner. The rest of the software has no dependencies apart from OASIS, which is needed when building from git.

On Debian/Ubuntu, you can install all build dependencies with:

sudo apt install oasis libcmdliner-ocaml-dev

So far the code is fairly unoptimised beyond what is described in http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html. It takes 7 s (4 s with super-shingling) to cluster 1,100 documents totalling 766,937 words on an old 2.8 GHz AMD machine.
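For readers unfamiliar with the optimisations in that chapter, the following is a rough sketch of min-hash sketches and super-shingles. It is not taken from Wshiml: the function names are hypothetical, `Hashtbl.seeded_hash` stands in for the random permutations of the textbook scheme, and the hash that combines a window of sketch values is an arbitrary choice.

```ocaml
(* Illustrative only: hypothetical names, not Wshiml's API. *)

(* For each seed 0 .. n-1, keep the minimum seeded hash over all shingles.
   An empty shingle list yields max_int in every position. *)
let sketch (n : int) (shingles : string list) : int array =
  Array.init n (fun seed ->
      List.fold_left
        (fun m s -> min m (Hashtbl.seeded_hash seed s))
        max_int shingles)

(* Fraction of positions on which two sketches agree, an estimate of the
   Jaccard coefficient. Assumes both sketches have the same non-zero length. *)
let estimate (a : int array) (b : int array) : float =
  let same = ref 0 in
  Array.iteri (fun i x -> if x = b.(i) then incr same) a;
  float_of_int !same /. float_of_int (Array.length a)

(* Super-shingles: sort the sketch, then hash every window of g consecutive
   values. Assumes g >= 1; returns [] when the sketch is shorter than g. *)
let super_shingles (g : int) (sk : int array) : int list =
  let sorted = Array.copy sk in
  Array.sort compare sorted;
  let n = Array.length sorted - g + 1 in
  List.init (max n 0) (fun i ->
      Array.fold_left (fun h x -> (h * 31) + x) 17 (Array.sub sorted i g))
```

The point of the super-shingles is to avoid comparing every pair of documents: only pairs that share at least one super-shingle need their similarity computed, which is consistent with the speed-up reported above.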

API documentation

is here.
