
Deduplication

Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.

Install / Use

/learn @Marcnuth/Deduplication
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

deduplication


Remove duplicate documents via popular algorithms such as SimHash, SpotSig, Shingling, etc.

Install

Run the following commands:

# install current library
pip install deduplication

# install the required pretrained spaCy models
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm

Example

SimHash

from deduplication import simhash

hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)
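
For readers unfamiliar with the technique, the following is a minimal, standalone sketch of the general SimHash idea: hash each token, let every token vote on each bit, and keep the sign of the vote as the fingerprint. It is not this package's implementation. The names `simhash_sketch`, `hamming_distance`, and `N_BITS`, the whitespace tokenizer, and the use of MD5 as the token hash are all assumptions made for illustration.

```python
import hashlib

N_BITS = 64  # fingerprint width (assumption for this sketch)


def simhash_sketch(text: str) -> int:
    """Illustrative SimHash fingerprint over whitespace tokens.

    Hypothetical helper, not the `deduplication` package's code.
    """
    counts = [0] * N_BITS
    for token in text.lower().split():
        # Stable per-token hash: first 8 bytes of the MD5 digest as an integer.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 for every set bit and -1 for every clear bit.
        for i in range(N_BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when the votes for that position are positive.
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = simhash_sketch("this is text")
fp2 = simhash_sketch("this is another text")
print(hamming_distance(fp1, fp2))
```

Two texts are usually treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; a suitable threshold depends on the data.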

L-SimHash

from deduplication import lsimhash

hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')
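
Shingling

The description above also lists shingling, but no usage example for it appears here. The following is a standalone sketch of the general technique (word-level k-shingles compared by Jaccard similarity), not a call into this package. The function names `shingles` and `jaccard` and the default `k=3` are assumptions made for illustration.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of word-level k-shingles (k consecutive tokens) of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # A text shorter than k tokens yields a single shingle, or none if empty.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


doc1 = shingles("this is a very long article text")
doc2 = shingles("this is a very long article body")
print(jaccard(doc1, doc2))
```

A similarity close to 1.0 suggests the two texts are near-duplicates; the cutoff is a tuning choice.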

Citation

SimHash

Sadowski C, Levin G. SimHash: Hash-based similarity detection. Technical report, Google, 2007.

GitHub Stars: 19
Category: Content
Updated: 3mo ago
Forks: 6

Languages

Python

Security Score

92/100

Audited on Dec 10, 2025

No findings