Texthero
Text preprocessing, representation and visualization from zero to hero.
Texthero is a Python toolkit for working with text-based datasets quickly and effortlessly. It is simple to learn and designed to be used on top of Pandas, with the same expressiveness and power, and it is extensively documented. Texthero is modern and conceived for programmers of the 2020s with little or no knowledge of linguistics.
You can think of Texthero as a tool to help you understand and work with text-based datasets. Given a tabular dataset, it's easy to grasp the main concepts. Given a text dataset, it's harder to gain quick insights into the underlying data. With Texthero, preprocessing text data, mapping it into vectors, and visualizing the resulting vector space takes just a couple of lines.
Texthero includes tools for:
- Preprocessing text data: it offers out-of-the-box solutions and is also flexible enough for custom solutions.
- Natural Language Processing: keyphrase and keyword extraction, and named entity recognition.
- Text representation: TF-IDF, term frequency, and custom word-embeddings (wip)
- Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modeling (wip) and interpretation.
- Text visualization: vector space visualization, place localization on maps (wip).
Texthero is free, open-source and well documented (and that's what we love most by the way!).
We hope you will enjoy working with Texthero as much as we enjoyed developing it.
<h2 align="center">Hablas español? क्या आप हिंदी बोलते हैं? 日本語が話せるのか?</h2>Texthero has been developed for the whole NLP community. We know how hard it is to deal with different NLP tools (NLTK, SpaCy, Gensim, TextBlob, Sklearn): that's why we developed Texthero, to simplify things.
The next main milestone is to provide multilingual support, and for this big step we need the help of all of you. ¿Hablas español? Sie sprechen Deutsch? 你会说中文? 日本語が話せるのか? Fala português? Parli Italiano? Вы говорите по-русски? If so, or if you speak another language not mentioned here, you can help us develop multilingual support! Even if you haven't contributed before or have just started with NLP, contact us or open a GitHub issue. There is always a first time :) We promise you will learn a lot, and, ... who knows? It might help you find your new job as an NLP developer!
Your help and feedback are crucial for improving the Python toolkit and providing an even better experience. If you have any problem or suggestion, please open a GitHub issue; we will be glad to support and help you.
<h2 align="center">Beta version</h2>Texthero's community is growing fast. Texthero is still in beta, though; a faster and better version will be released soon, and it will bring some major changes.
For instance, to give more granular control over the pipeline, from the next version on all preprocessing functions will require already tokenized text as their argument. This will be a major change.
Once the stable version (Texthero 2.0) is released, backward compatibility will be respected. Until then, backward compatibility will be maintained, but with weaker guarantees.
If you want to be part of this fast-growing movement, do not hesitate to contribute: CONTRIBUTING!
<h2 align="center">Installation</h2>Install texthero via pip:
pip install texthero
☝️Under the hood, Texthero makes use of multiple NLP and machine learning toolkits such as Gensim, NLTK, SpaCy and scikit-learn. You don't need to install them all separately; pip will take care of that.
<h2 align="center">Getting started</h2>For faster performance, make sure you have installed SpaCy version >= 2.2. Also, make sure you have a recent version of Python: the newer, the better.
The best way to learn Texthero is through the <a href="https://texthero.org/docs/getting-started">Getting Started</a> docs.
If you are an advanced Python user, help(texthero) should do the job.
<h3>1. Text cleaning, TF-IDF representation and Visualization</h3>
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['pca'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
    .pipe(hero.pca)
)

hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")
<p align="center">
<img src="https://github.com/jbesomi/texthero/raw/master/github/scatterplot_bbcsport.svg">
</p>
<h3>2. Text preprocessing, TF-IDF, K-means and Visualization</h3>
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.tfidf)
)

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")
<p align="center">
<img src="https://github.com/jbesomi/texthero/raw/master/github/scatterplot_bbcsport_kmeans.svg">
</p>
<h3>3. Simple pipeline for text cleaning</h3>
>>> import texthero as hero
>>> import pandas as pd
>>> text = "This sèntencé (123 /) needs to [OK!] be cleaned! "
>>> s = pd.Series(text)
>>> s
0 This sèntencé (123 /) needs to [OK!] be cleane...
dtype: object
Remove all digits:
>>> s = hero.remove_digits(s)
>>> s
0 This sèntencé ( /) needs to [OK!] be cleaned!
dtype: object
By default, remove_digits removes only standalone blocks of digits, so the digits in the string "hello123" will not be removed. If you want to remove all digits, set only_blocks to False.
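The difference between the two modes can be illustrated with a small standard-library sketch. The helper `remove_digits_sketch` and its regular expressions are assumptions made for illustration, not Texthero's actual implementation; only the `only_blocks` parameter name comes from the text above.

```python
import re


def remove_digits_sketch(text: str, only_blocks: bool = True) -> str:
    """Hypothetical stand-in for the behaviour described above."""
    # With only_blocks=True, match digits only when they form a whole
    # "word" (delimited by word boundaries), so "hello123" is untouched.
    # With only_blocks=False, match every run of digits.
    pattern = r"\b\d+\b" if only_blocks else r"\d+"
    # Replacing with a space is an assumption of this sketch.
    return re.sub(pattern, " ", text)


sample = "hello123 costs 456 dollars"
print(remove_digits_sketch(sample))                     # "hello123" is kept
print(remove_digits_sketch(sample, only_blocks=False))  # all digits are gone
```

The word-boundary pattern is one simple way to express "a block of digits"; other definitions are possible.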
Remove all types of brackets and their content.
>>> s = hero.remove_brackets(s)
>>> s
0 This sèntencé needs to be cleaned!
dtype: object
Remove diacritics.
>>> s = hero.remove_diacritics(s)
>>> s
0 This sentence needs to be cleaned!
dtype: object
Remove punctuation.
>>> s = hero.remove_punctuation(s)
>>> s
0 This sentence needs to be cleaned
dtype: object
Remove extra white-spaces.
>>> s = hero.remove_whitespace(s)
>>> s
0 This sentence needs to be cleaned
dtype: object
Sometimes we also want to get rid of stop-words.
>>> s = hero.remove_stopwords(s)
>>> s
0 This sentence needs cleaned
dtype: object
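To make the idea of a cleaning pipeline concrete, the steps above (except stop-word removal) can be chained as plain functions. This is a minimal standard-library sketch under stated assumptions: the function bodies, the `STEPS` list and `clean_sketch` are hypothetical and are not Texthero's code, although they are named after the steps shown in the walkthrough.

```python
import re
import string
import unicodedata


def remove_digit_blocks(text):
    # Replace standalone blocks of digits with a space.
    return re.sub(r"\b\d+\b", " ", text)


def remove_brackets(text):
    # Drop (), [], {} and <> together with their (non-nested) content.
    return re.sub(r"\([^()]*\)|\[[^\[\]]*\]|\{[^{}]*\}|<[^<>]*>", "", text)


def remove_diacritics(text):
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text):
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_whitespace(text):
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


# The order matters: brackets must go before punctuation is stripped.
STEPS = [
    remove_digit_blocks,
    remove_brackets,
    remove_diacritics,
    remove_punctuation,
    remove_whitespace,
]


def clean_sketch(text):
    for step in STEPS:
        text = step(text)
    return text


print(clean_sketch("This sèntencé (123 /) needs to [OK!] be cleaned!   "))
# prints: This sentence needs to be cleaned
```

Keeping each step as a small function that takes and returns text is what makes it easy to reorder, drop, or add steps.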
<h2 align="center">API</h2>
Texthero is composed of four modules: preprocessing.py, nlp.py, representation.py and visualization.py.
<h3>1. Preprocessing</h3>Scope: prepare text data for further analysis.
Full documentation: preprocessing
<h3>2. NLP</h3>Scope: provide classic natural language processing tools such as named_entity and noun_phrases.
Full documentation: nlp
<h3>3. Representation</h3>Scope: map text data into vectors and do dimensionality reduction.
Supported representation algorithms:
- Term frequency (count)
- Term frequency-inverse document frequency (tfidf)
Supported clustering algorithms:
- K-means (kmeans)
- Density-Based Spatial Clustering of Applications with Noise (dbscan)
- Meanshift (meanshift)
Supported dimensionality reduction algorithms:
- Principal component analysis (pca)
- t-distributed stochastic neighbor embedding (tsne)
- Non-negative matrix factorization (nmf)
Full documentation: representation
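For readers new to the two representations listed above, the following standard-library sketch shows what term frequency and TF-IDF compute on a toy corpus. The function `tfidf_sketch`, the whitespace tokenization and the smoothed IDF formula are assumptions chosen for illustration; Texthero's own weighting and normalization may differ.

```python
import math
from collections import Counter


def tfidf_sketch(docs):
    """Return (vocabulary, rows) with one TF-IDF vector per document."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)

    # One common smoothed variant: idf = ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # raw term frequency (the "count" part)
        rows.append([term_freq[t] * idf[t] for t in vocab])
    return vocab, rows


vocab, rows = tfidf_sketch(["football match", "tennis match"])
print(vocab)
for row in rows:
    print(row)
```

A term that appears in every document ("match") gets the lowest weight, while terms specific to one document ("football", "tennis") are weighted higher, which is the property the clustering and dimensionality reduction steps rely on.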
<h3>4. Visualization</h3>Scope: summarize the main facts about the text data and visualize them. This module is opinionated. It's handy for anyone who needs a quick way to visualize text data on screen, for instance during exploratory data analysis (EDA) of text.
Supported functions:
- Text scatterplot (scatterplot)
- Most common words (top_words)
Full documentation: visualization
<h2 align="center">FAQ</h2> <h5>Why Texthero</h5>Sometimes we just want things done, right? Texthero helps wit
