pyfasttext
Warning! pyfasttext is no longer maintained. Use the official Python binding from the fastText repository instead: https://github.com/facebookresearch/fastText/tree/master/python
Yet another Python binding for fastText.
The binding supports Python 2.6, 2.7 and Python 3. It requires Cython.
Numpy and cysignals are optional dependencies.
pyfasttext has been tested successfully on Linux and Mac OS X.
Warning: if you want to compile pyfasttext on Windows, do not compile with the cysignals module because it does not support this platform.
Installation
To compile pyfasttext, make sure you have one of the following compilers:
- GCC (g++) with C++11 support.
- LLVM (clang++) with (at least) partial C++17 support.
Simplest way to install pyfasttext: use pip
Just type these lines:
pip install cython
pip install pyfasttext
Possible compilation error
If you have a compilation error, you can try to install cysignals manually:
pip install cysignals
Then retry installing pyfasttext with the pip command given above.
Cloning
pyfasttext uses git submodules.
So, you need to add the --recursive option when you clone the repository.
git clone --recursive https://github.com/vrasneur/pyfasttext.git
cd pyfasttext
Requirements for Python 2.7
Python 2.7 support relies on the future module: pyfasttext needs bytes objects, which are not available natively in Python 2.
You can install the future module with pip.
pip install future
Building and installing manually
First, install all the requirements:
pip install -r requirements.txt
Then, build and install with setup.py:
python setup.py install
Building and installing without optional dependencies
pyfasttext can export word vectors as numpy ndarrays, but this feature can be disabled at compile time.
To compile without numpy, set the USE_NUMPY environment variable to 0 (or empty), like this:
USE_NUMPY=0 python setup.py install
Likewise, to compile without cysignals, set the USE_CYSIGNALS environment variable to 0 (or empty).
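The "0 (or empty) disables the feature" convention can be sketched in a few lines of Python. This is only an illustration: feature_enabled is a hypothetical helper, not a function from pyfasttext, and the way the project's own setup.py reads these variables may differ.

```python
import os

def feature_enabled(name, environ=None):
    """Hypothetical helper: False when the variable is "0" or empty, True otherwise."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    # An unset variable leaves the optional feature enabled (the default build).
    if value is None:
        return True
    return value not in ("", "0")

# The second argument stands in for the real environment so the sketch is easy to try.
print(feature_enabled("USE_NUMPY", {}))
print(feature_enabled("USE_NUMPY", {"USE_NUMPY": "0"}))
print(feature_enabled("USE_CYSIGNALS", {"USE_CYSIGNALS": ""}))
```

Under this assumption, only an explicit 0 or an empty value switches the dependency off; any other value, or no value at all, keeps it on.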
Usage
How to load the library?
>>> from pyfasttext import FastText
How to load an existing model?
>>> model = FastText('/path/to/model.bin')
or
>>> model = FastText()
>>> model.load_model('/path/to/model.bin')
Word representation learning
You can use all the options provided by the fastText binary (input, output, epoch, lr, ...).
Just use keyword arguments in the training methods of the FastText object.
Training using Skipgram
>>> model = FastText()
>>> model.skipgram(input='data.txt', output='model', epoch=100, lr=0.7)
Training using CBoW
>>> model = FastText()
>>> model.cbow(input='data.txt', output='model', epoch=100, lr=0.7)
Word vectors
Word vectors access
Vector for a given word
By default, a single word vector is returned as a standard Python array (array.array) of floats.
>>> model['dog']
array('f', [-1.308749794960022, -1.8326224088668823, ...])
Numpy ndarray
The model.get_numpy_vector(word) method returns the word vector as a numpy ndarray.
>>> model.get_numpy_vector('dog')
array([-1.30874979, -1.83262241, ...], dtype=float32)
If you want a normalized vector (i.e. the vector divided by its norm), there is an optional boolean parameter named normalized.
>>> model.get_numpy_vector('dog', normalized=True)
array([-0.07084749, -0.09920666, ...], dtype=float32)
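To make "the vector divided by its norm" concrete, here is a minimal pure-Python sketch. It assumes the norm is the Euclidean (L2) norm; normalize is a hypothetical helper for illustration and is not part of pyfasttext.

```python
import math

def normalize(vector):
    """Divide a vector by its Euclidean (L2) norm (assumed norm, see lead-in)."""
    norm = math.sqrt(sum(x * x for x in vector))
    # A zero vector has no direction; return a copy to avoid dividing by zero.
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

unit = normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```

A normalized vector has length 1, so comparing two normalized vectors with a dot product gives their cosine similarity directly.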
Words for a given vector
The inverse operation of model[word] or model.get_numpy_vector(word) is model.words_for_vector(vector, k).
It returns a list of the k words closest to the provided vector. The default value for k is 1.
>>> king = model.get_numpy_vector('king')
>>> man = model.get_numpy_vector('man')
>>> woman = model.get_numpy_vector('woman')
>>> model.words_for_vector(king + woman - man, k=1)
[('queen', 0.77121970653533936)]
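The idea behind such a lookup can be sketched on a toy vocabulary. The README does not say which similarity measure words_for_vector uses, so cosine similarity here is an assumption; nearest_words, cosine and the three-dimensional toy vectors are all made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_words(vectors, query, k=1):
    """Return the k (word, similarity) pairs closest to the query vector."""
    scored = [(word, cosine(vec, query)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors, chosen by hand so that king + woman - man lands on queen.
toy = {
    'king':  [1.0, 1.0, 0.0],
    'queen': [1.0, 0.0, 1.0],
    'man':   [0.0, 1.0, 0.0],
    'woman': [0.0, 0.0, 1.0],
}
query = [a + b - c for a, b, c in zip(toy['king'], toy['woman'], toy['man'])]
best = nearest_words(toy, query, k=1)
print(best[0][0])  # queen
```

A real model scores every word in its vocabulary this way, which is why precomputed normalized vectors are useful for large vocabularies.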
Get the number of words in the model
>>> model.nwords
500000
Get all the word vectors in a model
>>> for word in model.words:
... print(word, model[word])
Numpy ndarray
If you want all the word vectors as a big numpy ndarray, you can use the numpy_normalized_vectors member. Note that all these vectors are normalized.
>>> model.nwords
500000
>>> model.numpy_normalized_vectors
array([[-0.07549749, -0.09407753, ...],
[ 0.00635979, -0.17272158, ...],
...,
[-0.01009259, 0.14604086, ...],
[ 0.12467574, -0.0609326 , ...]], dtype=float32)
>>> model.numpy_normalized_vectors.shape
(500000, 100) # (number of words, dimension)
Misc operations with word vectors
Word similarity
>>> model.similarity('dog', 'cat')
0.75596606254577637
Most similar words
>>> model.nearest_neighbors('dog', k=2)
[('dogs', 0.7843924736976624), ('cat', 0.75596606254577637)]
Analogies
The model.most_similar() method works like the one in gensim.
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], k=1)
[('queen', 0.77121970653533936)]
Text classification
Supervised learning
>>> model = FastText()
>>> model.supervised(input='/path/to/input.txt', output='/path/to/model', epoch=100, lr=0.7)
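The input file format is not described here. As a sketch, assuming the file follows fastText's usual convention of one example per line with each label prefixed by `__label__`, training lines could be prepared like this. format_example is a hypothetical helper, not part of pyfasttext.

```python
def format_example(labels, text, prefix='__label__'):
    """Build one training line: prefixed labels followed by the text."""
    # One example per line is assumed, so embedded newlines and
    # repeated whitespace in the text are collapsed to single spaces.
    clean = ' '.join(text.split())
    tags = ' '.join(prefix + label for label in labels)
    return tags + ' ' + clean

lines = [
    format_example(['LABEL1'], 'first sentence'),
    format_example(['LABEL2', 'LABEL3'], 'second\nsentence'),
]
for line in lines:
    print(line)
```

Writing these lines, one per row, to the file passed as input gives the training method something to read.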
Get all the labels
>>> model.labels
['LABEL1', 'LABEL2', ...]
Get the number of labels
>>> model.nlabels
100
Prediction
To obtain the k most likely labels from test sentences, there are multiple model.predict_*() methods.
The default value for k is 1. If you want to obtain all the possible labels, use None for k.
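The role of k, including the special value None, can be illustrated with a small stand-alone sketch. top_k and the scores dictionary are hypothetical and only mimic the selection step; they do not call pyfasttext.

```python
def top_k(scores, k=1):
    """Return the k highest-scoring (label, probability) pairs, or all if k is None."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    # Slicing with None as the upper bound keeps the whole ranked list.
    return ranked[:k]

scores = {'LABEL1': 0.7, 'LABEL2': 0.2, 'LABEL3': 0.1}
print(top_k(scores))           # [('LABEL1', 0.7)]
print(top_k(scores, k=2))      # [('LABEL1', 0.7), ('LABEL2', 0.2)]
print(len(top_k(scores, k=None)))  # 3
```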
Labels and probabilities
If you have a list of strings (or an iterable object), use this:
>>> model.predict_proba(['first sentence\n', 'second sentence\n'], k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]
If you want to test a single string, use this:
>>> model.predict_proba_single('first sentence\n', k=2)
[('LABEL1', 0