SkillAgentSearch skills...

Pyfasttext

Yet another Python binding for fastText

Install / Use

/learn @vrasneur/Pyfasttext
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

pyfasttext

Warning! pyfasttext is no longer maintained: use the official Python binding from the fastText repository: https://github.com/facebookresearch/fastText/tree/master/python

Yet another Python binding for fastText.

The binding supports Python 2.6, 2.7 and Python 3. It requires Cython.

Numpy and cysignals are also dependencies, but are optional.

pyfasttext has been tested successfully on Linux and Mac OS X.
Warning: if you want to compile pyfasttext on Windows, do not compile with the cysignals module because it does not support this platform.

Table of Contents

Installation

To compile pyfasttext, make sure you have the following compiler:

  • GCC (g++) with C++11 support.
  • LLVM (clang++) with (at least) partial C++17 support.

Simplest way to install pyfasttext: use pip

Just type these lines:

pip install cython
pip install pyfasttext

Possible compilation error

If you have a compilation error, you can try to install cysignals manually:

pip install cysignals

Then, retry to install pyfasttext with the already mentioned pip command.

Cloning

pyfasttext uses git submodules.
So, you need to add the --recursive option when you clone the repository.

git clone --recursive https://github.com/vrasneur/pyfasttext.git
cd pyfasttext

Requirements for Python 2.7

Python 2.7 support relies on the future module: pyfasttext needs bytes objects, which are not available natively in Python2.
You can install the future module with pip.

pip install future

Building and installing manually

First, install all the requirements:

pip install -r requirements.txt

Then, build and install with setup.py:

python setup.py install

Building and installing without optional dependencies

pyfasttext can export word vectors as numpy ndarrays, however this feature can be disabled at compile time.

To compile without numpy, pyfasttext has a USE_NUMPY environment variable. Set this variable to 0 (or empty), like this:

USE_NUMPY=0 python setup.py install

If you want to compile without cysignals, likewise, you can set the USE_CYSIGNALS environment variable to 0 (or empty).

Usage

How to load the library?

>>> from pyfasttext import FastText

How to load an existing model?

>>> model = FastText('/path/to/model.bin')

or

>>> model = FastText()
>>> model.load_model('/path/to/model.bin')

Word representation learning

You can use all the options provided by the fastText binary (input, output, epoch, lr, ...).
Just use keyword arguments in the training methods of the FastText object.

Training using Skipgram

>>> model = FastText()
>>> model.skipgram(input='data.txt', output='model', epoch=100, lr=0.7)

Training using CBoW

>>> model = FastText()
>>> model.cbow(input='data.txt', output='model', epoch=100, lr=0.7)

Word vectors

Word vectors access

Vector for a given word

By default, a single word vector is returned as a regular Python array of floats.

>>> model['dog']
array('f', [-1.308749794960022, -1.8326224088668823, ...])
Numpy ndarray

The model.get_numpy_vector(word) method returns the word vector as a numpy ndarray.

>>> model.get_numpy_vector('dog')
array([-1.30874979, -1.83262241, ...], dtype=float32)

If you want a normalized vector (i.e. the vector divided by its norm), there is an optional boolean parameter named normalized.

>>> model.get_numpy_vector('dog', normalized=True)
array([-0.07084749, -0.09920666, ...], dtype=float32)
Words for a given vector

The inverse operation of model[word] or model.get_numpy_vector(word) is model.words_for_vector(vector, k).
It returns a list of the k words closest to the provided vector. The default value for k is 1.

>>> king = model.get_numpy_vector('king')
>>> man = model.get_numpy_vector('man')
>>> woman = model.get_numpy_vector('woman')
>>> model.words_for_vector(king + woman - man, k=1)
[('queen', 0.77121970653533936)]
Get the number of words in the model
>>> model.nwords
500000
Get all the word vectors in a model
>>> for word in model.words:
...   print(word, model[word])
Numpy ndarray

If you want all the word vectors as a big numpy ndarray, you can use the numpy_normalized_vectors member. Note that all these vectors are normalized.

>>> model.nwords
500000
>>> model.numpy_normalized_vectors
array([[-0.07549749, -0.09407753, ...],
       [ 0.00635979, -0.17272158, ...],
       ..., 
       [-0.01009259,  0.14604086, ...],
       [ 0.12467574, -0.0609326 , ...]], dtype=float32)
>>> model.numpy_normalized_vectors.shape
(500000, 100) # (number of words, dimension)

Misc operations with word vectors

Word similarity
>>> model.similarity('dog', 'cat')
0.75596606254577637
Most similar words
>>> model.nearest_neighbors('dog', k=2)
[('dogs', 0.7843924736976624), ('cat', 75596606254577637)]
Analogies

The model.most_similar() method works similarly as the one in gensim.

>>> model.most_similar(positive=['woman', 'king'], negative=['man'], k=1)
[('queen', 0.77121970653533936)]

Text classification

Supervised learning

>>> model = FastText()
>>> model.supervised(input='/path/to/input.txt', output='/path/to/model', epoch=100, lr=0.7)

Get all the labels

>>> model.labels
['LABEL1', 'LABEL2', ...]

Get the number of labels

>>> model.nlabels
100

Prediction

To obtain the k most likely labels from test sentences, there are multiple model.predict_*() methods.
The default value for k is 1. If you want to obtain all the possible labels, use None for k.

Labels and probabilities

If you have a list of strings (or an iterable object), use this:

>>> model.predict_proba(['first sentence\n', 'second sentence\n'], k=2)
[[('LABEL1', 0.99609375), ('LABEL3', 1.953126549381068e-08)], [('LABEL2', 1.0), ('LABEL3', 1.953126549381068e-08)]]

If you want to test a single string, use this:

>>> model.predict_proba_single('first sentence\n', k=2)
[('LABEL1', 0

Related Skills

View on GitHub
GitHub Stars224
CategoryEducation
Updated1mo ago
Forks30

Languages

Python

Security Score

100/100

Audited on Feb 15, 2026

No findings