Deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
Install
Run the following commands:
# install current library
pip install deduplication
# install required pretrained NLP models
python -m spacy download xx_ent_wiki_sm
python -m spacy download en_core_web_sm
Example
SimHash
from deduplication import simhash
hashvalue1 = simhash('this is text')
hashvalue2 = simhash('this is another text', n_block=4)
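To make the idea behind SimHash concrete, here is a minimal, self-contained sketch of the general technique. It is not this package's implementation and does not model the `n_block` parameter. The names `simhash_sketch` and `hamming_distance`, the whitespace tokenization, the MD5 token hash, and the 64-bit width are all assumptions chosen for illustration.

```python
import hashlib

N_BITS = 64  # fingerprint width; an assumption for this sketch


def simhash_sketch(text: str) -> int:
    """Illustrative SimHash fingerprint (hypothetical helper, not the library API)."""
    weights = [0] * N_BITS
    for token in text.lower().split():
        # Hash each token to a stable 64-bit integer (first 8 bytes of MD5).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(N_BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash_sketch("this is text")
b = simhash_sketch("this is another text")
print(hamming_distance(a, b))
```

Texts that share most of their tokens tend to produce fingerprints with a small Hamming distance, so near-duplicate detection usually reduces to comparing that distance against a threshold chosen for the corpus.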
L-SimHash
from deduplication import lsimhash
hashvalue = lsimhash('this is very long article texts. maybe with a lot of sentences.')
Citation
SimHash
Sadowski C, Levin G. Simhash: Hash-based similarity detection. Technical report, Google, 2007.
