Stopwords
Removes the most frequent words (stop words) from text content, based on curated lists of language statistics.
Install / Use
go get github.com/bbalet/stopwords
stopwords is a Go package that removes stop words from text content. If instructed to do so, it will also remove HTML tags and parse HTML entities. The objective is to prepare text for use by natural language processing algorithms or text-comparison algorithms such as SimHash.
It uses curated lists of the most frequent words in these languages:
- Arabic
- Bulgarian
- Czech
- Danish
- English
- Finnish
- French
- German
- Hungarian
- Italian
- Japanese
- Khmer
- Latvian
- Norwegian
- Persian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Spanish
- Swedish
- Thai
- Turkish
If the function is called with an unsupported language code, it doesn't fail: it applies the English stop word list to the content.
How to use this package?
You can find an example at https://github.com/bbalet/gorelated, where the stopwords package is used in conjunction with the SimHash algorithm to find related content for a static website generator:
import (
"github.com/bbalet/stopwords"
)
//Example with two strings containing <p> HTML tags
//"la", "un", etc. are stop words without lexical value in French
string1 := []byte("<p>la fin d'un bel après-midi d'été</p>")
string2 := []byte("<p>cet été, nous avons eu un bel après-midi</p>")
//Return content where HTML tags and French stop words have been removed
cleanContent := stopwords.Clean(string1, "fr", true)
//Get two simhashes representing the content of each string
hash1 := stopwords.Simhash(string1, "fr", true)
hash2 := stopwords.Simhash(string2, "fr", true)
//Hamming distance between the two hashes (difference between contents)
distance := stopwords.CompareSimhash(hash1, hash2)
//Clean string1 and string2, then compute the Levenshtein distance between them
distance2 := stopwords.LevenshteinDistance(string1, string2, "fr", true)
Where "fr" is the ISO 639-1 code for French (a BCP 47 tag is accepted as well): https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
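CompareSimhash returns the Hamming distance between the two 64-bit hashes: the number of bit positions at which they differ, with a small distance suggesting similar content. As an illustration of what that distance measures (a sketch using the standard library, not the package's implementation), it can be computed with math/bits:

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming counts the bit positions at which two 64-bit simhashes differ;
// this is the quantity stopwords.CompareSimhash reports for its arguments.
func hamming(a, b uint64) uint64 {
	return uint64(bits.OnesCount64(a ^ b))
}

func main() {
	// 0 and 7 (binary 111) differ in exactly three bit positions.
	fmt.Println(hamming(0, 7)) // 3
}
```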
How to load a custom list of stop words from a file/string?
This package comes with predefined lists of stop words. However, two functions allow you to use your own list of words:
stopwords.LoadStopWordsFromFile(filePath, langCode, separator)
stopwords.LoadStopWordsFromString(wordsList, langCode, separator)
They will overwrite the predefined words for the given language.
You can find an example in the file stopwords.txt.
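For illustration, here is a self-contained sketch of what loading a custom list amounts to conceptually: splitting the word list on the separator and filtering tokens against the resulting set. buildStopSet is a hypothetical helper written for this sketch, not part of the package:

```go
package main

import (
	"fmt"
	"strings"
)

// buildStopSet splits wordsList on separator and collects the lowercased,
// trimmed words into a set, mirroring what loading a custom list provides.
func buildStopSet(wordsList, separator string) map[string]bool {
	set := make(map[string]bool)
	for _, w := range strings.Split(wordsList, separator) {
		w = strings.TrimSpace(strings.ToLower(w))
		if w != "" {
			set[w] = true
		}
	}
	return set
}

func main() {
	stop := buildStopSet("the,a,of", ",")
	var kept []string
	for _, w := range strings.Fields("the art of war") {
		if !stop[strings.ToLower(w)] {
			kept = append(kept, w)
		}
	}
	fmt.Println(strings.Join(kept, " ")) // art war
}
```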
How to overwrite the word segmenter?
If you don't want to strip Unicode characters in the 'Number, Decimal Digit'
category, call the function DontStripDigits before using the package:
stopwords.DontStripDigits()
If you want to use your own segmenter, you can overwrite the regular expression:
stopwords.OverwriteWordSegmenter(`[\pL]+`)
Limitations
Please note that this library doesn't segment text into words. If you need word segmentation prior to using stopwords, use another library that provides a binding to the ICU library.
The curated lists contain the most frequently used words across general topics; they were not built from a corpus limited to any specialized domain.
Credits
Most of the lists were built by the IR Multilingual Resources group at UniNE: http://members.unine.ch/jacques.savoy/clef/index.html
License
stopwords is released under the BSD license.