SimCSE
[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
SimCSE: Simple Contrastive Learning of Sentence Embeddings
This repository contains the code and pre-trained models for our paper SimCSE: Simple Contrastive Learning of Sentence Embeddings.
**************************** Updates ****************************
- 8/31: Our paper has been accepted to EMNLP! Please check out our updated paper (with updated numbers and baselines).
- 5/12: We updated our unsupervised models with new hyperparameters and better performance.
- 5/10: We released our sentence embedding tool and demo code.
- 4/23: We released our training code.
- 4/20: We released our model checkpoints and evaluation code.
- 4/18: We released our paper. Check it out!
Quick Links
- Overview
- Getting Started
- Model List
- Use SimCSE with Huggingface
- Train SimCSE
- Bugs or Questions?
- Citation
- SimCSE Elsewhere
Overview
We propose a simple contrastive learning framework that works with both unlabeled and labeled data. Unsupervised SimCSE simply takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. Our supervised SimCSE incorporates annotated pairs from NLI datasets into contrastive learning by using entailment pairs as positives and contradiction pairs as hard negatives. The following figure is an illustration of our models.
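The objective described above can be illustrated with a small, dependency-free sketch. This is not the repository's training code: the function names, the toy vectors, and the temperature value are assumptions chosen for illustration. Each anchor is scored against every positive in the batch (its own positive is the target, the others act as in-batch negatives), and optional hard negatives are appended to the candidates, as in the supervised setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchors, positives, hard_negatives=None, temperature=0.05):
    """Mean cross-entropy loss where positives[i] is the target for anchors[i].

    temperature is an illustrative default, not a value taken from this README.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarities to every positive in the batch.
        logits = [cosine(anchor, p) / temperature for p in positives]
        if hard_negatives is not None:
            # Hard negatives only enlarge the denominator.
            logits += [cosine(anchor, n) / temperature for n in hard_negatives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy vectors standing in for two dropout-perturbed encodings of the same sentences.
view_one = [[1.0, 0.0], [0.0, 1.0]]
view_two = [[0.9, 0.1], [0.1, 0.9]]
print(contrastive_loss(view_one, view_two))
```

In the unsupervised case the two views would come from encoding the same sentence twice with different dropout masks; in the supervised case the positives and hard negatives would be entailment and contradiction sentences respectively.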

Getting Started
We provide an easy-to-use sentence embedding tool based on our SimCSE model (see our Wiki for detailed usage). To use the tool, first install the simcse package from PyPI
pip install simcse
Or directly install it from our code
python setup.py install
Note that if you want to enable GPU encoding, you should install the correct version of PyTorch that supports CUDA. See PyTorch official website for instructions.
After installing the package, you can load our model with just two lines of code
from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")
See model list for a full list of available models.
Then you can use our model for encoding sentences into embeddings
embeddings = model.encode("A woman is reading.")
Compute the cosine similarities between two groups of sentences
sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)
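To make the idea of comparing two groups concrete, here is a minimal sketch of a pairwise cosine similarity computation over toy vectors. It does not call the simcse package and makes no claim about what `model.similarity` returns; the helper names and the three-dimensional vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity(group_a, group_b):
    """Return a len(group_a) x len(group_b) grid of cosine similarities."""
    return [[cosine_similarity(a, b) for b in group_b] for a in group_a]


# Toy vectors standing in for real sentence embeddings.
toy_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_b = [[0.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
matrix = pairwise_similarity(toy_a, toy_b)
for row in matrix:
    print(["%.3f" % score for score in row])
```

Row i, column j holds the similarity between the i-th vector of the first group and the j-th vector of the second group.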
Or build an index for a group of sentences and search among them
sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")
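For intuition, an index-and-search step can be sketched as a brute-force scan over normalized vectors. The `ToyIndex` class below is hypothetical and is not how simcse or faiss implement search; the `threshold` and `top_k` parameters and their default values are assumptions made for this example.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


class ToyIndex:
    """Hypothetical brute-force index over precomputed embeddings."""

    def __init__(self, items, vectors):
        self.items = list(items)
        self.vectors = [normalize(v) for v in vectors]

    def search(self, query_vector, threshold=0.6, top_k=5):
        """Return up to top_k (item, score) pairs scoring at least threshold."""
        query = normalize(query_vector)
        scored = [
            (item, sum(a * b for a, b in zip(query, vec)))
            for item, vec in zip(self.items, self.vectors)
        ]
        hits = [pair for pair in scored if pair[1] >= threshold]
        hits.sort(key=lambda pair: pair[1], reverse=True)
        return hits[:top_k]


# Toy 2-dimensional vectors standing in for sentence embeddings.
index = ToyIndex(
    ["A woman is reading.", "A man is playing a guitar."],
    [[1.0, 0.0], [0.0, 1.0]],
)
print(index.search([0.1, 0.9]))
```

A library such as faiss replaces the linear scan with an optimized nearest-neighbor structure, which matters once the indexed collection grows large.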
We also support faiss, an efficient similarity search library. Just install the package following instructions here and simcse will automatically use faiss for efficient search.
WARNING: We have found that faiss does not support Nvidia Ampere GPUs (3090 and A100) well. In that case, you should switch to other GPUs or install the CPU version of the faiss package.
We also provide an easy-to-build demo website to show how SimCSE can be used in sentence retrieval. The code is based on DensePhrases' repo and demo (many thanks to the authors of DensePhrases).
Model List
Our released models are listed below. You can import these models by using the simcse package or HuggingFace's Transformers.
| Model | Avg. STS |
|:-------------------------------|:--------:|
| princeton-nlp/unsup-simcse-bert-base-uncased | 76.25 |
| princeton-nlp/unsup-simcse-bert-large-uncased | 78.41 |
| princeton-nlp/unsup-simcse-roberta-base | 76.57 |
| princeton-nlp/unsup-simcse-roberta-large | 78.90 |
| princeton-nlp/sup-simcse-bert-base-uncased | 81.57 |
| princeton-nlp/sup-simcse-bert-large-uncased | 82.21 |
| princeton-nlp/sup-simcse-roberta-base | 82.52 |
| princeton-nlp/sup-simcse-roberta-large | 83.76 |
Note that the results are slightly better than what we have reported in the current version of the paper after adopting a new set of hyperparameters (for hyperparameters, see the training section).
Naming rules: unsup and sup represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.
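As a small illustration of the naming rule, the sketch below splits a model identifier from the table into its supervision mode and backbone. The `parse_model_name` helper is hypothetical and not part of the simcse package; it only relies on the `unsup-simcse-` / `sup-simcse-` pattern visible in the table above.

```python
def parse_model_name(name):
    """Split e.g. 'princeton-nlp/sup-simcse-bert-base-uncased' into its parts."""
    repo = name.split("/")[-1]
    mode, _, backbone = repo.partition("-simcse-")
    if mode not in ("unsup", "sup"):
        raise ValueError("unexpected model name: %r" % name)
    supervised = mode == "sup"
    return {
        "supervised": supervised,
        "training_data": "NLI datasets" if supervised else "Wikipedia corpus",
        "backbone": backbone,
    }


print(parse_model_name("princeton-nlp/unsup-simcse-roberta-large"))
```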
Use SimCSE with Huggingface
Besides using our provided sentence embedding tool, you can also easily import our models with HuggingFace's transformers:
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output
# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))
If you encounter any problem when loading the models directly through HuggingFace's API, you can also download the models manually from the above table and use model = AutoModel.from_pretrained({PATH TO THE DOWNLOADED MODEL}).
Train SimCSE
In the following section, we describe how to train a SimCSE model by using our code.
Requirements
First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use the 1.7.1 build that matches your platform and CUDA version. PyTorch versions higher than 1.7.1 should also work. For example, if you use Linux and CUDA 11 (how to check CUDA version), install PyTorch with the following command,
pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html
If you instead use CUDA <11 or CPU, install PyTorch by the following command,
pip install torch==1.7.1
Then run the following script to install the remaining dependencies,
pip install -r requirements.txt
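After installation, one way to confirm which PyTorch build is active is a short check like the one below. It is a convenience sketch rather than part of this repository; the `describe_torch_install` name is made up for illustration, and the function falls back to a plain message when PyTorch is not importable.

```python
import importlib.util


def describe_torch_install():
    """Summarize the installed PyTorch version and its CUDA support."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"
    import torch

    # torch.version.cuda is None for CPU-only builds.
    return "PyTorch {} | built for CUDA: {} | GPU available: {}".format(
        torch.__version__, torch.version.cuda, torch.cuda.is_available()
    )


print(describe_torch_install())
```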
Evaluation
Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation. See our paper (Appendix B) for evaluation details.
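For readers unfamiliar with the metric, Spearman's correlation is the Pearson correlation of the ranks of two score lists, here predicted similarities and gold annotations. The following is a minimal pure-Python sketch with made-up numbers; it is not the SentEval implementation, and the function names are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def spearman(predicted, gold):
    """Spearman's rank correlation between predicted and gold scores."""
    return pearson(average_ranks(predicted), average_ranks(gold))


# Made-up cosine similarities and human similarity ratings for four pairs.
predicted = [0.82, 0.35, 0.60, 0.10]
gold = [4.5, 2.0, 2.5, 0.5]
print("Spearman: %.4f" % spearman(predicted, gold))
```

Because only ranks matter, the metric rewards embeddings that order sentence pairs the same way annotators do, regardless of the absolute scale of the cosine scores.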
Before evaluation, please download the evaluation datasets by running
cd SentEval/data/downstream/
bash download_dataset.sh
Then return to the root directory. You can evaluate any transformers-based pre-trained model using our evaluation code. For example,
python evaluation.py \
--model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
--pooler cls \
--task_set sts \
--mode test
which is expected to output the results in a tabular format:
------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.30 | 84.67 | 80.19 | 85.40 | 80.82 | 84.26 | 80.39 | 81.58 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
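The Avg. column appears to be the unweighted mean of the seven task scores. A quick arithmetic check on the numbers from the table above (the variable names are only for illustration):

```python
# Scores copied from the example output table.
sts_scores = {
    "STS12": 75.30,
    "STS13": 84.67,
    "STS14": 80.19,
    "STS15": 85.40,
    "STS16": 80.82,
    "STSBenchmark": 84.26,
    "SICKRelatedness": 80.39,
}

average = sum(sts_scores.values()) / len(sts_scores)
print("Avg. = %.2f" % average)  # prints "Avg. = 81.58"
```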
