UCTopic
This repository contains the code of the model UCTopic and an easy-to-use tool UCTopicTool for <strong>Topic Mining</strong>, <strong>Unsupervised Aspect Extraction</strong> and <strong>Phrase Retrieval</strong>.
It accompanies our ACL 2022 paper, UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining.
Overview
We propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at a large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to pretraining is positive-pair construction from our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and thereby further improves phrase representations for topics.
Pretrained Model
Our released model:

| Model | Note |
|:------|:-----|
| uctopic-base | Pretrained UCTopic model based on LUKE-BASE |
Unzip the downloaded file to get the `uctopic-base` folder.
Getting Started
We provide an easy-to-use phrase representation tool based on our UCTopic model. To use the tool, first install the `uctopic` package from PyPI:

```bash
pip install uctopic
```
Or install it directly from our source code:

```bash
python setup.py install
```
UCTopic Model
<strong>Note</strong>: Please make sure your transformers version is 4.7.0 to load our pre-trained checkpoints. You can install the correct transformers version with:

```bash
pip install transformers==4.7.0
```
After installing the package, you can load our model with just two lines of code:

```python
from uctopic import UCTopic
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')
```
The model will automatically download pre-trained parameters from Hugging Face's model hub. If you encounter any problem when loading the model through Hugging Face's API, you can also download the model manually from the table above and load it with `model = UCTopic.from_pretrained({PATH TO THE DOWNLOAD MODEL})`.
To get pre-trained <strong>phrase representations</strong>, our model takes the same inputs as LUKE. Note: please input only <strong>ONE</strong> span at a time; otherwise, performance will decay according to our empirical results.
```python
from uctopic import UCTopicTokenizer, UCTopic

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)]  # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)
```
`phrase_repr` is the phrase embedding (of size `[768]`) of the phrase Los Angeles. `outputs` has the same format as the outputs from LUKE.
UCTopicTool
We provide a tool UCTopicTool built on UCTopic for efficient phrase encoding, topic mining (or unsupervised aspect extraction), and phrase retrieval.
Initialization
UCTopicTool is initialized by giving the `model_name_or_path` and `device`:

```python
from uctopic import UCTopicTool

topic_tool = UCTopicTool('JiachengLi/uctopic-base', device='cuda:0')
```
Phrase Encoding
Phrases are encoded in batches by `UCTopicTool.encode`, which is more efficient than calling `UCTopic` directly.
```python
phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

embeddings = topic_tool.encode(phrases)  # len(embeddings) is equal to len(phrases)
```
Note: Each instance in `phrases` contains only one sentence and one span (character-level position) in the format `[sentence, span]`.
Arguments for `UCTopicTool.encode` are as follows:
- `phrases` (List) - A list of `[sentence, span]` to be encoded.
- `return_numpy` (bool, optional, defaults to `False`) - Return `numpy.array` or `torch.Tensor`.
- `normalize_to_unit` (bool, optional, defaults to `True`) - Normalize all embeddings to unit vectors.
- `keepdim` (bool, optional, defaults to `True`) - Keep the dimension size `[instance_number, hidden_size]`.
- `batch_size` (int, optional, defaults to `64`) - The size of mini-batch in the model.
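The unit normalization that `normalize_to_unit` applies can be sketched with plain NumPy. The small array below is a stand-in for real phrase embeddings (which have hidden size 768), not actual model output:

```python
import numpy as np

# Stand-in for embeddings returned by UCTopicTool.encode with return_numpy=True;
# real embeddings have shape [instance_number, 768].
embeddings = np.array([[3.0, 4.0],
                       [0.0, 2.0]])

# Divide each row by its L2 norm, as normalize_to_unit=True does.
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_embeddings = embeddings / norms

print(np.linalg.norm(unit_embeddings, axis=1))  # each row now has norm 1.0
```

Unit-normalized embeddings make cosine similarity a plain dot product, which is convenient for the retrieval use case below.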
Topic Mining and Unsupervised Aspect Extraction
The method `UCTopicTool.topic_mining` can mine topical phrases or conduct aspect extraction from sentences, with or without given spans.
```python
sentences = ["This place is so much bigger than others!",
             "It was totally packed and loud.",
             "Service was on the slower side.",
             "I ordered 2 mojitos: 1 lime and 1 mango.",
             "The ingredient weren't really fresh."]

spans = [[(0, 10)],                       # This place
         [(15, 21), (26, 30)],            # packed; loud
         [(0, 7)],                        # Service
         [(12, 19), (21, 27), (32, 39)],  # mojitos; 1 lime; 1 mango
         [(4, 14)]]                       # ingredient
# len(sentences) is equal to len(spans)

output_data, topic_phrase_dict = topic_tool.topic_mining(sentences, spans,
                                                         n_clusters=[15, 25])

# predict topics for new phrases
phrases = [["The food here is amazing!", (4, 8)],
           ["Lovely ambiance with live music!", (21, 31)]]

topics = topic_tool.predict_topic(phrases)
```
Note: If `spans` is not given, `UCTopicTool` will extract noun phrases with spaCy.
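Since the spans are character-level `(start, end)` pairs, they can be sanity-checked by slicing each sentence directly (a quick check in plain Python, independent of the model):

```python
sentences = ["This place is so much bigger than others!",
             "I ordered 2 mojitos: 1 lime and 1 mango."]
spans = [[(0, 10)],
         [(12, 19), (21, 27), (32, 39)]]

# Slice each sentence with its spans to recover the phrase strings.
extracted = [sentence[start:end]
             for sentence, span_list in zip(sentences, spans)
             for start, end in span_list]

print(extracted)  # ['This place', 'mojitos', '1 lime', '1 mango']
```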
Arguments for `UCTopicTool.topic_mining` are as follows:

Data arguments:
- `sentences` (List) - A list of sentences for topic mining.
- `spans` (List, optional, defaults to `None`) - A list of span lists corresponding to `sentences`, e.g., `[[(0, 9), (5, 7)], [(1, 2)]]` with `len(sentences)==len(spans)`. If `None`, phrases are automatically mined from noun chunks.

Clustering arguments:
- `n_clusters` (int or List, optional, defaults to `2`) - The number of topics. When `n_clusters` is a list, `n_clusters[0]` and `n_clusters[1]` are the minimum and maximum numbers to search, and `n_clusters[2]` is the search step length (defaults to 1 if not provided).
- `metric` (str, optional, defaults to `"cosine"`) - The metric used to measure the distance between vectors: `"cosine"` or `"euclidean"`.
- `batch_size` (int, optional, defaults to `64`) - The size of mini-batch for phrase encoding.
- `max_iter` (int, optional, defaults to `300`) - The maximum number of k-means iterations.
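The search behavior of a list-valued `n_clusters` can be illustrated with a small helper. Note that `candidate_cluster_counts` is a hypothetical function written here for illustration, not part of the `uctopic` package:

```python
def candidate_cluster_counts(n_clusters):
    """Expand n_clusters into the list of cluster counts to try.

    An int means a single fixed count; a list [min, max] or [min, max, step]
    means a search over that inclusive range (step defaults to 1).
    """
    if isinstance(n_clusters, int):
        return [n_clusters]
    lo, hi = n_clusters[0], n_clusters[1]
    step = n_clusters[2] if len(n_clusters) > 2 else 1
    return list(range(lo, hi + 1, step))

print(candidate_cluster_counts(2))            # [2]
print(candidate_cluster_counts([15, 25]))     # 15, 16, ..., 25
print(candidate_cluster_counts([15, 25, 5]))  # [15, 20, 25]
```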
CCL-finetune arguments:
- `ccl_finetune` (bool, optional, defaults to `True`) - Whether to conduct the CCL-finetuning described in the paper.
- `batch_size_finetune` (int, optional, defaults to `8`) - The size of mini-batch for finetuning.
- `max_finetune_num` (int, optional, defaults to `100000`) - The maximum number of training instances for finetuning.
- `finetune_step` (int, optional, defaults to `2000`) - The number of training steps for finetuning.
- `contrastive_num` (int, optional, defaults to `5`) - The number of negatives in contrastive learning.
- `positive_ratio` (float, optional, defaults to `0.1`) - The ratio of the most confident instances used for finetuning.
- `n_sampling` (int, optional, defaults to `10000`) - The number of sampled examples for cluster-number confirmation and finetuning. Set to `-1` to use the whole dataset.
- `n_workers` (int, optional, defaults to `8`) - The number of workers for preprocessing data.
Returns of `UCTopicTool.topic_mining` are as follows:
- `output_data` (List) - A list of sentences with their corresponding phrases and topic numbers. Each element is `[sentence, [[start1, end1, topic1], [start2, end2, topic2]]]`.
- `topic_phrase_dict` (Dict) - A dictionary mapping each topic to the list of phrases under it, sorted by confidence score, e.g., `{topic: [[phrase1, score1], [phrase2, score2]]}`.
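Given these documented formats, the returned structures can be post-processed with plain Python. The data below is a made-up illustration of the shapes (topic ids and scores are invented), not real model output:

```python
# Hypothetical output in the documented format.
output_data = [["Service was on the slower side.", [[0, 7, 3]]],
               ["It was totally packed and loud.", [[15, 21, 1], [26, 30, 1]]]]
topic_phrase_dict = {1: [["packed", 0.92], ["loud", 0.87]],
                     3: [["service", 0.95]]}

# Recover (topic, phrase) pairs from output_data by slicing each sentence.
mentions = [(topic, sentence[start:end])
            for sentence, span_topics in output_data
            for start, end, topic in span_topics]
print(mentions)

# Top phrase per topic from topic_phrase_dict (phrases are sorted by score).
top_phrases = {topic: phrases[0][0] for topic, phrases in topic_phrase_dict.items()}
print(top_phrases)
```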
The method `UCTopicTool.predict_topic` predicts topic ids for new phrases based on your training results from `UCTopicTool.topic_mining`. The inputs of `UCTopicTool.predict_topic` are the same as those of `UCTopicTool.encode`, and it returns a list of topic ids (int).
Phrase Similarities and Retrieval
The method `UCTopicTool.similarity` computes the cosine similarities between two groups of phrases:
```python
phrases_a = [["This place is so much bigger than others!", (0, 10)],
             ["It was totally packed and loud.", (15, 21)]]

phrases_b = [["Service was on the slower side.", (0, 7)],
             ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
             ["The ingredient weren't really fresh.", (4, 14)]]

similarities = topic_tool.similarity(phrases_a, phrases_b)
```
Arguments for `UCTopicTool.similarity` are as follows:
- `queries` (List) - A list of `[sentence, span]` as queries.
- `keys` (List or `numpy.array`) - A list of `[sentence, span]` as keys, or phrase representations (`numpy.array`) from `UCTopicTool.encode`.
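Passing precomputed key representations supports a simple retrieval loop. A minimal sketch with NumPy follows; the random vectors stand in for unit-normalized embeddings produced by `UCTopicTool.encode`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for unit-normalized phrase embeddings from UCTopicTool.encode
# (real embeddings would come from encoding the key phrases once, up front).
key_embeddings = rng.normal(size=(5, 768))
key_embeddings /= np.linalg.norm(key_embeddings, axis=1, keepdims=True)

query = rng.normal(size=(768,))
query /= np.linalg.norm(query)

# With unit vectors, cosine similarity reduces to a dot product.
scores = key_embeddings @ query
ranked = np.argsort(-scores)  # indices of keys, most similar first
print(ranked)
```

Encoding the key phrases once and reusing the stored `numpy.array` avoids re-encoding the whole collection for every query.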