UCTopic

An easy-to-use tool for phrase encoding and topic mining (unsupervised aspect extraction); Code base for ACL 2022 paper, UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining.

This repository contains the code for the UCTopic model and an easy-to-use tool, UCTopicTool, for <strong>Topic Mining</strong>, <strong>Unsupervised Aspect Extraction</strong> or <strong>Phrase Retrieval</strong>.

The code accompanies our ACL 2022 paper, UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining.

Overview

We propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to pretraining is positive pair construction based on our phrase-oriented assumptions. However, we find that traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and thereby further improves phrase representations for topics.
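To make the idea of drawing negatives from clusters concrete, here is a simplified sketch. It is not the training code from this repository: the function name `sample_cluster_negatives` and the toy cluster assignments are illustrative assumptions. For an anchor phrase, negatives are sampled only from phrases assigned to other clusters, so phrases that likely share the anchor's topic are never used as negatives.

```python
import random
from typing import Dict, List


def sample_cluster_negatives(
    anchor: str,
    cluster_of: Dict[str, int],
    num_negatives: int,
    seed: int = 0,
) -> List[str]:
    """Sample negatives for `anchor` only from other clusters (illustrative)."""
    rng = random.Random(seed)
    anchor_cluster = cluster_of[anchor]
    # Candidates are phrases whose cluster differs from the anchor's cluster.
    candidates = sorted(p for p, c in cluster_of.items() if c != anchor_cluster)
    # Never request more negatives than there are candidates.
    k = min(num_negatives, len(candidates))
    return rng.sample(candidates, k)


# Toy cluster assignments (hypothetical): 0 = food, 1 = service.
cluster_of = {
    "mojitos": 0,
    "ingredient": 0,
    "mango": 0,
    "service": 1,
    "waiter": 1,
}

negatives = sample_cluster_negatives("mojitos", cluster_of, num_negatives=2)
print(negatives)
```

With plain in-batch negatives, "mango" could be drawn as a negative for "mojitos" even though both belong to the same topic; restricting candidates to other clusters avoids that kind of noisy negative.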

Pretrained Model

Our released model:

| Model | Note |
|:-------------------------------|------|
| uctopic-base | Pretrained UCTopic model based on LUKE-BASE |

Unzip the downloaded file to get the uctopic-base folder.

Getting Started

We provide an easy-to-use phrase representation tool based on our UCTopic model. To use the tool, first install the uctopic package from PyPI:

pip install uctopic

Or install it directly from the source code:

python setup.py install

UCTopic Model

<strong>Note</strong>: Please make sure your transformers version is 4.7.0 to load our pre-trained checkpoints.

You can install the correct transformers version with:

pip install transformers==4.7.0
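Before loading a checkpoint, it can help to confirm which transformers version is actually installed. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `version_matches` are illustrative and not part of the uctopic package.

```python
from importlib import metadata
from typing import Optional

REQUIRED_TRANSFORMERS = "4.7.0"  # version stated in the note above


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed: Optional[str], required: str) -> bool:
    """True only when the installed version string equals the required one."""
    return installed is not None and installed == required


found = installed_version("transformers")
if not version_matches(found, REQUIRED_TRANSFORMERS):
    print(f"transformers {REQUIRED_TRANSFORMERS} expected, found {found}")
```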

After installing the package, you can load our model with just two lines of code:

from uctopic import UCTopic
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

The model automatically downloads the pre-trained parameters from the Hugging Face model hub. If you encounter any problem loading the model through the Hugging Face API, you can also download it manually from the table above and use model = UCTopic.from_pretrained({PATH TO THE DOWNLOAD MODEL}).

To get pre-trained <strong>phrase representations</strong>, pass the same inputs as you would to LUKE. Note: please input only <strong>ONE</strong> span at a time; according to our empirical results, passing multiple spans degrades performance.

from uctopic import UCTopicTokenizer, UCTopic

tokenizer = UCTopicTokenizer.from_pretrained('JiachengLi/uctopic-base')
model = UCTopic.from_pretrained('JiachengLi/uctopic-base')

text = "Beyoncé lives in Los Angeles."
entity_spans = [(17, 28)] # character-based entity span corresponding to "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, add_prefix_space=True, return_tensors="pt")
outputs, phrase_repr = model(**inputs)

phrase_repr is the phrase embedding (size [768]) of the phrase Los Angeles. outputs has the same format as the outputs from LUKE.

UCTopicTool

We provide a tool UCTopicTool built on UCTopic for efficient phrase encoding, topic mining (or unsupervised aspect extraction) or phrase retrieval.

Initialization

UCTopicTool is initialized by giving the model_name_or_path and device.

from uctopic import UCTopicTool

topic_tool = UCTopicTool('JiachengLi/uctopic-base', device='cuda:0')

Phrase Encoding

The method UCTopicTool.encode encodes phrases in batches, which is more efficient than calling UCTopic directly.

phrases = [["This place is so much bigger than others!", (0, 10)],
           ["It was totally packed and loud.", (15, 21)],
           ["Service was on the slower side.", (0, 7)],
           ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
           ["The ingredient weren't really fresh.", (4, 14)]]

embeddings = topic_tool.encode(phrases) # len(embeddings) is equal to len(phrases)

Note: Each instance in phrases contains exactly one sentence and one span (character-level position), in the format [sentence, span].
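Writing character offsets by hand is error-prone, so a small helper can compute them. This is a sketch, not part of the uctopic package: `find_span` is a hypothetical name, and it assumes the end offset is exclusive, which is consistent with the spans in the example above (for instance, (15, 21) covers "packed").

```python
from typing import Tuple


def find_span(sentence: str, phrase: str) -> Tuple[int, int]:
    """Return the (start, end) character span of the first match of `phrase`."""
    start = sentence.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {sentence!r}")
    # The end offset is exclusive, so sentence[start:end] == phrase.
    return start, start + len(phrase)


sentence = "It was totally packed and loud."
span = find_span(sentence, "packed")
print(span)  # (15, 21), matching the span used in the example above
instance = [sentence, span]  # one [sentence, span] item for `phrases`
```

Note that `str.find` returns only the first occurrence, so a phrase that appears more than once in a sentence needs its offsets chosen explicitly.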

Arguments for UCTopicTool.encode are as follows:

  • phrase (List) - A list of [sentence, span] to be encoded.
  • return_numpy (bool, optional, defaults to False) - Return numpy.array or torch.Tensor.
  • normalize_to_unit (bool, optional, defaults to True) - Normalize all embeddings to unit vectors.
  • keepdim (bool, optional, defaults to True) - Keep dimension size [instance_number, hidden_size].
  • batch_size (int, optional, defaults to 64) - The size of mini-batch in the model.
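The normalize_to_unit option matters for downstream similarity computations: once embeddings are unit vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates this with toy 2-dimensional vectors standing in for the real embeddings; the helper functions are illustrative, use only the standard library, and assume non-zero vectors.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(vec: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [4.0, 3.0]

cos_raw = cosine(a, b)
cos_unit = dot(normalize(a), normalize(b))  # dot product of unit vectors
print(math.isclose(cos_raw, cos_unit))  # True
```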

Topic Mining and Unsupervised Aspect Extraction

The method UCTopicTool.topic_mining can mine topical phrases or conduct aspect extraction from sentences with or without spans.

sentences = ["This place is so much bigger than others!",
             "It was totally packed and loud.",
             "Service was on the slower side.",
             "I ordered 2 mojitos: 1 lime and 1 mango.",
             "The ingredient weren't really fresh."]

spans = [[(0, 10)],                       # This place
         [(15, 21), (26, 30)],            # packed; loud
         [(0, 7)],                        # Service
         [(12, 19), (21, 27), (32, 39)],  # mojitos; 1 lime; 1 mango
         [(4, 14)]]                       # ingredient
# len(sentences) is equal to len(spans)
output_data, topic_phrase_dict = topic_tool.topic_mining(sentences, spans,
                                                         n_clusters=[15, 25])

# predict topics for new phrases
phrases = [["The food here is amazing!", (4, 8)],
           ["Lovely ambiance with live music!", (21, 31)]]

topics = topic_tool.predict_topic(phrases)

Note: If spans is not given, UCTopicTool extracts noun phrases with spaCy.

Arguments for UCTopicTool.topic_mining are as follows:

Data arguments:

  • sentences (List) - A List of sentences for topic mining.
  • spans (List, optional, defaults to None) - A list of span lists, one per sentence, e.g., [[(0, 9), (5, 7)], [(1, 2)]], with len(sentences)==len(spans). If None, phrases are mined automatically from noun chunks.

Clustering arguments:

  • n_clusters (int or List, optional, defaults to 2) - The number of topics. When n_clusters is a list, n_clusters[0] and n_clusters[1] will be the minimum and maximum numbers to search, n_clusters[2] is the search step length (if not provided, default to 1).
  • metric (str, optional, defaults to "cosine") - The metric used to measure the distance between vectors: "cosine" or "euclidean".
  • batch_size (int, optional, defaults to 64) - The size of mini-batch for phrase encoding.
  • max_iter (int, optional, defaults to 300) - The maximum iteration number of kmeans.

CCL-finetune arguments:

  • ccl_finetune (bool, optional, defaults to True) - Whether to run the CCL finetuning described in the paper.
  • batch_size_finetune (int, optional, defaults to 8) - The size of mini-batch for finetuning.
  • max_finetune_num (int, optional, defaults to 100000) - The maximum number of training instances for finetuning.
  • finetune_step (int, optional, defaults to 2000) - The number of training steps for finetuning.
  • contrastive_num (int, optional, defaults to 5) - The number of negatives in contrastive learning.
  • positive_ratio (float, optional, defaults to 0.1) - The ratio of the most confident instances for finetuning.
  • n_sampling (int, optional, defaults to 10000) - The number of sampled examples for cluster number confirmation and finetuning. Set to -1 to use the whole dataset.
  • n_workers (int, optional, defaults to 8) - The number of workers for preprocessing data.
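The n_clusters argument above accepts either a single integer or a [min, max] or [min, max, step] list. The following sketch shows one way to expand that value into the list of candidate topic numbers. It is not the package's internal code: `candidate_cluster_numbers` is a hypothetical helper, and treating the maximum as inclusive is an assumption.

```python
from typing import List, Union


def candidate_cluster_numbers(n_clusters: Union[int, List[int]]) -> List[int]:
    """Expand an n_clusters value into the topic numbers to try (illustrative)."""
    if isinstance(n_clusters, int):
        return [n_clusters]
    if len(n_clusters) not in (2, 3):
        raise ValueError("expected [min, max] or [min, max, step]")
    low, high = n_clusters[0], n_clusters[1]
    # The step defaults to 1 when it is not provided.
    step = n_clusters[2] if len(n_clusters) == 3 else 1
    # Assumption: the maximum itself is included in the search.
    return list(range(low, high + 1, step))


print(candidate_cluster_numbers(2))            # [2]
print(candidate_cluster_numbers([15, 25, 5]))  # [15, 20, 25]
```

Under these assumptions, the n_clusters=[15, 25] call in the example above would consider every topic number from 15 through 25.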

UCTopicTool.topic_mining returns the following:

  • output_data (List) - A list of sentences and corresponding phrases and topic numbers. Each element is [sentence, [[start1, end1, topic1], [start2, end2, topic2]]].
  • topic_phrase_dict (Dict) - A dictionary of topics and the list of phrases under a topic. The phrases are sorted by their confidence scores. E.g., {topic: [[phrase1, score1], [phrase2, score2]]}.
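To show how these two return values can be consumed, here is a small sketch. The sample values are made up and only follow the documented formats; the helper names are illustrative, and the code assumes the returned offsets are character positions with an exclusive end, like the input spans.

```python
from typing import Dict, List, Tuple

# Hypothetical results in the documented formats (values are made up).
output_data = [
    ["Service was on the slower side.", [[0, 7, 1]]],
    ["It was totally packed and loud.", [[15, 21, 0], [26, 30, 0]]],
]
topic_phrase_dict = {
    0: [["packed", 0.91], ["loud", 0.85]],
    1: [["service", 0.88]],
}


def phrases_with_topics(data: List[list]) -> List[Tuple[str, int]]:
    """Flatten output_data into (phrase_text, topic) pairs."""
    pairs = []
    for sentence, spans in data:
        for start, end, topic in spans:
            # Recover the phrase text from its character offsets.
            pairs.append((sentence[start:end], topic))
    return pairs


def top_phrases(topic_dict: Dict[int, list], k: int) -> Dict[int, List[str]]:
    """Keep the first k phrases per topic (already sorted by confidence)."""
    return {t: [phrase for phrase, _ in items[:k]] for t, items in topic_dict.items()}


print(phrases_with_topics(output_data))
print(top_phrases(topic_phrase_dict, k=1))
```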

The method UCTopicTool.predict_topic predicts topic ids for new phrases based on the results of a previous UCTopicTool.topic_mining run. Its inputs are the same as those of UCTopicTool.encode, and it returns a list of topic ids (int).

Phrase Similarities and Retrieval

The method UCTopicTool.similarity computes the cosine similarities between two groups of phrases:

phrases_a = [["This place is so much bigger than others!", (0, 10)],
             ["It was totally packed and loud.", (15, 21)]]

phrases_b = [["Service was on the slower side.", (0, 7)],
             ["I ordered 2 mojitos: 1 lime and 1 mango.", (12, 19)],
             ["The ingredient weren't really fresh.", (4, 14)]]

similarities = topic_tool.similarity(phrases_a, phrases_b)

Arguments for UCTopicTool.similarity are as follows:

  • queries (List) - A list of [sentence, span] as queries.
  • keys (List or numpy.array) - A list of [sentence, span] as keys or phrase representations (numpy.array) from UCTopicTool.encode.
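For phrase retrieval, the similarity scores can be ranked per query. The sketch below assumes the result is a matrix with one row per query and one column per key, which is an assumption about the return shape rather than something documented here; the scores are made up and `retrieve_top_k` is a hypothetical helper. If the result is a numpy.array or torch.Tensor, converting it with `.tolist()` first yields the nested-list form used here.

```python
from typing import List, Tuple


def retrieve_top_k(
    similarities: List[List[float]], k: int
) -> List[List[Tuple[int, float]]]:
    """For each query row, return the k best (key_index, score) pairs."""
    results = []
    for row in similarities:
        # Rank key indices by descending similarity score.
        ranked = sorted(enumerate(row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:k])
    return results


# Made-up scores: 2 queries (rows) against 3 keys (columns).
similarities = [[0.12, 0.40, 0.33],
                [0.55, 0.20, 0.61]]

top = retrieve_top_k(similarities, k=2)
print(top)  # [[(1, 0.4), (2, 0.33)], [(2, 0.61), (0, 0.55)]]
```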
