<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
sense2vec: Contextually-keyed word vectors
sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. For more details, check out our blog post. To explore the semantic similarities across all Reddit comments from 2015 and 2019, see the interactive demo.
🦆 Version 2.0 (for spaCy v3) out now! Read the release notes here.
✨ Features

- Query vectors for multi-word phrases based on part-of-speech tags and entity labels.
- spaCy pipeline component and extension attributes.
- Fully serializable so you can easily ship your sense2vec vectors with your spaCy model packages.
- Optional caching of nearest neighbors for super fast "most similar" queries.
- Train your own vectors using a pretrained spaCy model, raw text and GloVe or Word2Vec via fastText (details).
- Prodigy annotation recipes for evaluating models, creating lists of similar multi-word phrases and converting them to match patterns, e.g. for rule-based NER or to bootstrap NER annotation (details & examples).
🚀 Quickstart
Standalone usage
from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)
# [('machine_learning|NOUN', 0.8986967),
# ('computer_vision|NOUN', 0.8636297),
# ('deep_learning|NOUN', 0.8573361)]
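The query strings above follow a phrase|SENSE pattern, with the words of a multi-word phrase joined by underscores. The helpers below are a minimal sketch of that convention, useful for building queries and for turning results back into readable text. The names to_key and from_key are hypothetical and written for this example only; they are not imported from the library.

```python
from typing import Tuple


def to_key(phrase: str, sense: str) -> str:
    """Build a 'phrase|SENSE' key, joining the words with underscores.

    Hypothetical helper that mirrors the key format shown in the example above.
    """
    return "_".join(phrase.split()) + "|" + sense


def from_key(key: str) -> Tuple[str, str]:
    """Split a 'phrase|SENSE' key back into (phrase, sense)."""
    # Split on the last pipe so that a pipe inside the phrase would survive.
    phrase, sense = key.rsplit("|", 1)
    return phrase.replace("_", " "), sense


query = to_key("natural language processing", "NOUN")
print(query)                              # natural_language_processing|NOUN
print(from_key("machine_learning|NOUN"))  # ('machine learning', 'NOUN')
```

One caveat of this sketch: a phrase that contains a literal underscore will not round-trip, because from_key turns every underscore back into a space.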
Usage as a spaCy pipeline component
⚠️ Note that this example describes usage with spaCy v3. For usage with spaCy v2, install
sense2vec==1.0.3 and check out the v1.x branch of this repo.
import spacy
nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")
doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
# [(('machine learning', 'NOUN'), 0.8986967),
# (('computer vision', 'NOUN'), 0.8636297),
# (('deep learning', 'NOUN'), 0.8573361)]
Interactive demos
<img width="34%" src="https://user-images.githubusercontent.com/13643239/68093565-1bb6ea80-fe97-11e9-8192-e293acc290fe.png" align="right" />
To try out our pretrained vectors trained on Reddit comments, check out the interactive sense2vec demo.
This repo also includes a Streamlit demo script for
exploring vectors and the most similar phrases. After installing streamlit,
you can run the script with streamlit run and one or more paths to
pretrained vectors as positional arguments on the command line. For
example:
pip install streamlit
streamlit run https://raw.githubusercontent.com/explosion/sense2vec/master/scripts/streamlit_sense2vec.py /path/to/vectors
Pretrained vectors
To use the vectors, download the archive(s) and pass the extracted directory to
Sense2Vec.from_disk or Sense2VecComponent.from_disk. The vector files are
attached to the GitHub release. Large files have been split into multi-part
downloads.
| Vectors            |   Size | Description                  | 📥 Download (zipped)   |
| ------------------ | -----: | ---------------------------- | ---------------------- |
| s2v_reddit_2019_lg |   4 GB | Reddit comments 2019 (01-07) | part 1, part 2, part 3 |
| s2v_reddit_2015_md | 573 MB | Reddit comments 2015         | part 1                 |
To merge the multi-part archives, you can run the following:
cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
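On systems without cat (for example a default Windows shell), the same merge can be sketched with the Python standard library. The function name merge_parts is hypothetical, and the sketch assumes that the parts sort into the right order by file name, as the shell glob above does.

```python
import glob
import shutil
from typing import List


def merge_parts(pattern: str, target: str) -> List[str]:
    """Concatenate all files matching `pattern`, in sorted order, into `target`.

    Hypothetical helper: returns the list of part files that were merged.
    """
    parts = sorted(p for p in glob.glob(pattern) if p != target)
    if not parts:
        # Fail before creating an empty target file.
        raise FileNotFoundError("no files match " + pattern)
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large files are never held in memory.
                shutil.copyfileobj(src, out)
    return parts


# Example call, equivalent to the cat command above:
# merge_parts("s2v_reddit_2019_lg.tar.gz.*", "s2v_reddit_2019_lg.tar.gz")
```

The merged file is a regular .tar.gz archive and can be unpacked with any tar tool or with Python's tarfile module.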
⏳ Installation & Setup
sense2vec releases are available on pip:
pip install sense2vec
To use pretrained vectors, download
one of the vector packages, unpack the .tar.gz archive
and point from_disk to the extracted data directory:
from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
👩‍💻 Usage
Usage with spaCy v3
The easiest way to use the library and vectors is to plug them into your spaCy
pipeline. The sense2vec package exposes a Sense2VecComponent, which can be
initialised with the shared vocab and added to your spaCy pipeline as a
custom pipeline component. With spaCy v3, the component is added by its
string name, sense2vec, using nlp.add_pipe.
By default, components are added to the end of the pipeline, which is the
recommended position for this component, since it needs access to the dependency
parse and, if available, named entities.
import spacy
from sense2vec import Sense2VecComponent
nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")
The component will add several
extension attributes and methods
to spaCy's Token and Span objects that let you retrieve vectors and
frequencies, as well as most similar terms.
doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
For entities, the entity labels are used as the "sense" (instead of the token's part-of-speech tag):
doc = nlp("A sentence about Facebook and Google.")
for ent in doc.ents:
    assert ent._.in_s2v
    most_similar = ent._.s2v_most_similar(3)
Available attributes
The following extension attributes are exposed on the Doc object via the ._
property:
| Name | Attribute Type | Return Type | Description |
| ------------- | -------------- | ---- | ----------------------------------------------------------------------------------- |
| s2v_phrases | property | list | All sense2vec-compatible phrases in the given Doc (noun phrases, named entities). |
The following attributes are available via the ._ property of Token and
Span objects – for example token._.in_s2v:
| Name | Attribute Type | Return Type | Description |
| ------------------ | -------------- | ------------------ | ---------------------------------------------------------------------------------- |
| in_s2v | property | bool | Whether a key exists in the vector map. |
| s2v_key | property | unicode | The sense2vec key of the given object, e.g. "duck\|NOUN". |
| s2v_vec | property | ndarray[float32] | The vector of the given key. |
| s2v_freq | property | int | The frequency of the given key. |
| s2v_other_senses | property | list | Available other senses, e.g. "duck\|VERB" for "duck\|NOUN". |
| s2v_most_similar | method | list | Get the n most similar terms. Returns a list of ((word, sense), score) tuples. |
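
To make the return shapes in the table concrete, here is a toy, pure-Python sketch of what "other senses" and "most similar" lookups compute over a map from keys to vectors. The three-dimensional vectors, the TOY_VECTORS dict and all function names are made up for illustration, and cosine similarity is an assumption of this sketch. The library has its own implementation, so treat this only as a model of the behaviour described above.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical toy data: real sense2vec vectors have far more dimensions.
TOY_VECTORS: Dict[str, List[float]] = {
    "duck|NOUN": [1.0, 0.0, 0.0],
    "duck|VERB": [0.0, 1.0, 0.0],
    "goose|NOUN": [0.9, 0.1, 0.0],
    "dodge|VERB": [0.1, 0.9, 0.0],
}


def parse_key(key: str) -> Tuple[str, str]:
    """Turn 'word|SENSE' into (word, sense), restoring spaces in phrases."""
    word, sense = key.rsplit("|", 1)
    return word.replace("_", " "), sense


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def other_senses(key: str) -> List[str]:
    """All keys that share the word of `key` but carry a different sense."""
    word = key.rsplit("|", 1)[0]
    return [k for k in TOY_VECTORS if k != key and k.rsplit("|", 1)[0] == word]


def most_similar(key: str, n: int = 3) -> List[Tuple[Tuple[str, str], float]]:
    """The n nearest keys as ((word, sense), score) tuples, best first."""
    query = TOY_VECTORS[key]
    scored = [
        (parse_key(k), cosine(query, v))
        for k, v in TOY_VECTORS.items()
        if k != key
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


print(other_senses("duck|NOUN"))  # ['duck|VERB']
print(most_similar("duck|NOUN", n=2)[0][0])  # ('goose', 'NOUN')
```

Each entry returned by most_similar here has the same ((word, sense), score) shape as the s2v_most_similar row in the table.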
