sense2vec: Contextually-keyed word vectors

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. For more details, check out our blog post. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo.

🦆 Version 2.0 (for spaCy v3) out now! Read the release notes here.

✨ Features

Query vectors for multi-word phrases based on part-of-speech tags and entity labels.
spaCy pipeline component and extension attributes.
Fully serializable so you can easily ship your sense2vec vectors with your spaCy model packages.
Optional caching of nearest neighbors for super fast "most similar" queries.
Train your own vectors using a pretrained spaCy model, raw text and GloVe or Word2Vec via fastText (details).
Prodigy annotation recipes for evaluating models, creating lists of similar multi-word phrases and converting them to match patterns, e.g. for rule-based NER or to bootstrap NER annotation (details & examples).

🚀 Quickstart

Standalone usage

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]
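
The keys in the example above join a phrase and a sense label with `|` and replace spaces with underscores. As a rough illustration of that convention, here is a minimal sketch in plain Python. `to_key` and `from_key` are hypothetical helpers written for this example, not part of the sense2vec API, and the library's own key handling may differ in detail.

```python
import re
from typing import Tuple


def to_key(phrase: str, sense: str) -> str:
    """Join a phrase and a sense label into a 'phrase|SENSE' key."""
    # Runs of whitespace become a single underscore,
    # as in "natural_language_processing|NOUN".
    return re.sub(r"\s+", "_", phrase.strip()) + "|" + sense


def from_key(key: str) -> Tuple[str, str]:
    """Split a 'phrase|SENSE' key back into (phrase, sense)."""
    # Split on the last "|" so the sense is always the final segment.
    phrase, sense = key.rsplit("|", 1)
    return phrase.replace("_", " "), sense


print(to_key("natural language processing", "NOUN"))
print(from_key("machine_learning|NOUN"))
```

Building the key this way makes it easy to check `key in s2v` before looking up a vector, as the `assert` in the example above does.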

Usage as a spaCy pipeline component

⚠️ Note that this example describes usage with spaCy v3. For usage with spaCy v2, install sense2vec==1.0.3 and check out the v1.x branch of this repo.

import spacy

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
# [(('machine learning', 'NOUN'), 0.8986967),
#  (('computer vision', 'NOUN'), 0.8636297),
#  (('deep learning', 'NOUN'), 0.8573361)]

Interactive demos

To try out our pretrained vectors trained on Reddit comments, check out the interactive sense2vec demo.

This repo also includes a Streamlit demo script for exploring vectors and the most similar phrases. After installing streamlit, you can run the script with streamlit run and one or more paths to pretrained vectors as positional arguments on the command line. For example:

pip install streamlit
streamlit run https://raw.githubusercontent.com/explosion/sense2vec/master/scripts/streamlit_sense2vec.py /path/to/vectors

Pretrained vectors

To use the vectors, download the archive(s) and pass the extracted directory to Sense2Vec.from_disk or Sense2VecComponent.from_disk. The vector files are attached to the GitHub release. Large files have been split into multi-part downloads.

| Vectors              |   Size | Description                  | 📥 Download (zipped)   |
| -------------------- | -----: | ---------------------------- | ---------------------- |
| s2v_reddit_2019_lg   |   4 GB | Reddit comments 2019 (01-07) | part 1, part 2, part 3 |
| s2v_reddit_2015_md   | 573 MB | Reddit comments 2015         | part 1                 |

To merge the multi-part archives, you can run the following:

cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
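
On systems without `cat`, the same merge can be sketched in Python using only the standard library. This is a minimal sketch under two assumptions: all parts sit in one directory, and sorting their file names gives the right order (the same order the shell glob above would produce). `merge_parts` is a hypothetical helper, not something shipped with sense2vec.

```python
import shutil
from pathlib import Path


def merge_parts(directory: str, archive_name: str) -> Path:
    """Concatenate '<archive_name>.*' parts, sorted by name, into '<archive_name>'."""
    base = Path(directory)
    target = base / archive_name
    # The pattern requires a suffix after the archive name,
    # so the merged target itself is never picked up as a part.
    parts = sorted(base.glob(archive_name + ".*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {archive_name} in {base}")
    with target.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream each part so large files are not loaded into memory.
                shutil.copyfileobj(src, out)
    return target
```

For example, `merge_parts("/path/to/downloads", "s2v_reddit_2019_lg.tar.gz")` would write the merged archive next to its parts, where `/path/to/downloads` is a placeholder for wherever you saved them.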

⏳ Installation & Setup

sense2vec releases are available on pip:

pip install sense2vec

To use pretrained vectors, download one of the vector packages, unpack the .tar.gz archive and point from_disk to the extracted data directory:

from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

👩‍💻 Usage

Usage with spaCy v3

The easiest way to use the library and vectors is to plug them into your spaCy pipeline. The sense2vec package exposes a Sense2VecComponent, which can be initialised with the shared vocab and added to your spaCy pipeline as a custom pipeline component. By default, components are added to the end of the pipeline, which is the recommended position for this component, since it needs access to the dependency parse and, if available, named entities.

import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

The component will add several extension attributes and methods to spaCy's Token and Span objects that let you retrieve vectors and frequencies, as well as most similar terms.

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)

For entities, the entity labels are used as the "sense" (instead of the token's part-of-speech tag):

doc = nlp("A sentence about Facebook and Google.")
for ent in doc.ents:
    assert ent._.in_s2v
    most_similar = ent._.s2v_most_similar(3)

Available attributes

The following extension attributes are exposed on the Doc object via the ._ property:

| Name          | Attribute Type | Type | Description                                                                         |
| ------------- | -------------- | ---- | ----------------------------------------------------------------------------------- |
| s2v_phrases   | property       | list | All sense2vec-compatible phrases in the given Doc (noun phrases, named entities).   |

The following attributes are available via the ._ property of Token and Span objects – for example token._.in_s2v:

| Name               | Attribute Type | Return Type        | Description                                                                        |
| ------------------ | -------------- | ------------------ | ---------------------------------------------------------------------------------- |
| in_s2v             | property       | bool               | Whether a key exists in the vector map.                                            |
| s2v_key            | property       | unicode            | The sense2vec key of the given object, e.g. "duck\|NOUN".                          |
| s2v_vec            | property       | ndarray[float32]   | The vector of the given key.                                                       |
| s2v_freq           | property       | int                | The frequency of the given key.                                                    |
| s2v_other_senses   | property       | list               | Available other senses, e.g. "duck\|VERB" for "duck\|NOUN".                        |
| s2v_most_similar   | method         | list               | Get the n most similar terms. Returns a list of ((word, sense), score) tuples.     |
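
Since `s2v_most_similar` returns `((word, sense), score)` tuples, the result unpacks directly in a loop. The sketch below shows one way to post-process such a list: dropping low-scoring entries and grouping phrases by sense. The sample values are copied from the pipeline example earlier in this README; `group_by_sense` and its `min_score` threshold are hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Result = Tuple[Tuple[str, str], float]

# Sample output in the shape returned by s2v_most_similar.
results: List[Result] = [
    (("machine learning", "NOUN"), 0.8986967),
    (("computer vision", "NOUN"), 0.8636297),
    (("deep learning", "NOUN"), 0.8573361),
]


def group_by_sense(items: List[Result], min_score: float = 0.0) -> Dict[str, List[str]]:
    """Group phrases by sense, keeping only entries scoring at least min_score."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for (word, sense), score in items:
        if score >= min_score:
            grouped[sense].append(word)
    return dict(grouped)


print(group_by_sense(results, min_score=0.86))
# {'NOUN': ['machine learning', 'computer vision']}
```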
