🏹 CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions
<p align="center"> <a href="https://cupid.kixlab.org/"><img src="https://img.shields.io/badge/🌐-Website-red" height="23"></a> <a href="https://arxiv.org/abs/2508.01674"><img src="https://img.shields.io/badge/📝-Paper (COLM 2025)-blue" height="23"></a> <a href="https://huggingface.co/datasets/kixlab/CUPID"><img src="https://img.shields.io/badge/🤗-Dataset-green" height="23"></a> <a href="https://huggingface.co/datasets/kixlab/CUPID-Raw"><img src="https://img.shields.io/badge/🤗-Raw Data-green" height="23"></a> <a href="https://huggingface.co/datasets/kixlab/CUPID-Unverified"><img src="https://img.shields.io/badge/🤗-Dataset (Unverified)-green" height="23"></a> <a href="https://huggingface.co/kixlab/prefmatcher-7b"><img src="https://img.shields.io/badge/🤗-Model-orange" height="23"></a> </p> <p align="center"> <img src="assets/pipeline.png" width="95%" height="80%"> </p> <p align="center"> <a href="https://taesookim.com/">Tae Soo Kim</a>, <a href="https://yoonjoolee.com/">Yoonjoo Lee</a>, <a href="https://www.linkedin.com/in/yoonah-park-50ba57249/">Yoonah Park</a>, <a href="https://jihokim.dev/">Jiho Kim</a>, <a href="https://younghokim.net/">Young-Ho Kim</a>, <a href="https://juhokim.com">Juho Kim</a> </p>

We introduce CUPID 🏹, a benchmark for evaluating the capability of Large Language Models (LLMs) to infer and apply personalized, contextual preferences from multi-turn user interactions. Unlike existing approaches that assume static global preferences, CUPID tests models' ability to understand dynamic, context-dependent user preferences revealed through conversational and implicit feedback.
🎯 Overview
CUPID contains 756 human-curated interaction session histories between simulated users and LLM-based AI assistants. Each interaction session involves a specific context factor (e.g., a person, artifact, or organization) and shows the user expressing their preference relevant to that context through multi-turn feedback.
Key Features:
- Contextual Preferences: Tests models' ability to infer preferences that change based on context
- Multi-turn Interactions: Evaluates understanding from conversational feedback rather than explicit statements
- Preference Inference: Assesses capability to extract relevant preferences from prior interactions
- Response Generation: Tests application of inferred preferences to new requests
- Comprehensive Evaluation: Provides metrics to assess model performance at preference inference and response generation
Evaluation Tasks:
- Preference Inference: Given prior interactions, infer the user's contextual preference
- Response Generation: Given prior interactions, generate a response that satisfies the user's contextual preferences
🛠️ Setup
Environment
We recommend using a conda environment:
```bash
conda create -n cupid python=3.9
conda activate cupid
pip install -r requirements.txt
```
API Keys
Set up your API keys for model evaluation:
```bash
export OPENAI_API_KEY="your_openai_key"
export ANTHROPIC_API_KEY="your_anthropic_key"
export TOGETHER_API_KEY="your_together_key"  # For models supported by Together AI
export GOOGLE_API_KEY="your_google_key"      # For Gemini models
```
📊 Dataset
The CUPID dataset is available on HuggingFace: kixlab/CUPID
Dataset Structure:
- 756 instances across diverse personas and contexts
- Human-curated interactions showing contextual preference expression
- Three instance types: consistent, contrastive, and changing preferences
- Rich context factors influencing user preferences (e.g., personal relationships, prior experiences, etc.)
Data Fields:
- `persona_id`: Unique identifier for the user persona
- `current_request`: The request to be answered by the model
- `current_context_factor`: Context influencing the user's preference
- `current_contextual_preference`: Ground-truth preference for this context
- `current_checklist`: Specific criteria for evaluating response alignment
- `prior_interactions`: List of previous interaction sessions showing user feedback
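For illustration, a single benchmark instance might look like the sketch below. Only the field names come from the dataset card; the values (persona ID, request text, checklist items, etc.) are made up for this example:

```python
# A hypothetical CUPID instance illustrating the documented fields.
# Field names follow the dataset card; all values here are invented.
instance = {
    "persona_id": "persona_042",
    "current_request": "Draft a follow-up email to my advisor about the delayed experiment.",
    "current_context_factor": "The user's advisor (person)",
    "current_contextual_preference": "Emails to the advisor should be concise and lead with the key update.",
    "current_checklist": [
        "Does the email state the key update in the first sentence?",
        "Is the email concise rather than padded with pleasantries?",
    ],
    "prior_interactions": [
        {
            "request": "Write an email to my advisor about the paper deadline.",
            "dialogue": ["...multi-turn user/assistant exchange with feedback..."],
        }
    ],
}

# The two evaluation tasks consume the same instance:
#  - Preference Inference: prior_interactions -> inferred preference,
#    compared against current_contextual_preference.
#  - Response Generation: prior_interactions + current_request -> response,
#    judged against current_checklist.
inference_input = instance["prior_interactions"]
generation_input = (instance["prior_interactions"], instance["current_request"])
```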
Check kixlab/CUPID-Raw for the full personas, context factors, and interaction sessions used to synthesize the benchmark.
We also release kixlab/CUPID-Unverified, a non-validated version of CUPID with >3k instances.
🚀 Evaluation
Quick Start
Evaluate a model on the CUPID dataset:
```bash
python -m evaluation.run \
    --results_dir results \
    --model "gpt-4.1-nano-2025-04-14" \
    --evaluator gpt-4o-2024-11-20 \
    --n_workers 4
```
Key Parameters:
- `--model`: Model to evaluate (must have a corresponding class in `evaluation/models/`)
- `--evaluator`: Model used for evaluation functions (preference decomposing and matching, response judging)
- `--use_matcher`: Use our finetuned preference matcher (kixlab/prefmatcher-7b) for preference inference
- `--task`: Run `inference`, `generation`, or `both` evaluation stages
- `--data_dir`: Use custom data instead of the official CUPID dataset (data synthesis explained in the next section)
Adding New Models to Evaluate
To evaluate your own model, create a new model class in `evaluation/models/your_model.py` that inherits from the `Model` class and implements the model's inference logic. The `__call__` method should take a system prompt and a user prompt, and return only the final text response.

- Create a new model class in `evaluation/models/your_model.py`:

```python
from evaluation.models.model import Model, register_model

@register_model
class YourModel(Model):
    model_name = "your-model-name"

    def __call__(self, system_prompt, user_prompt):
        # Your model inference logic here
        return response
```
- Run the evaluation:

```bash
python -m evaluation.run --model your-model-name --results_dir results
```
Evaluation Metrics
CUPID evaluates models on two main tasks:
1. Preference Inference (Precision/Recall/F1)
- Measures how well models can infer the user's preference for the current request from prior interactions
- Compares inferred preference to the ground-truth preference
- Optionally, you can use our finetuned preference matcher for more cost-efficient evaluation
- Our finetuned preference matcher is available on HuggingFace: kixlab/prefmatcher-7b
- First, run the bash script `evaluation/serve_prefmatcher.sh` to serve the model through vLLM
- This will serve the model at `http://localhost:8000`
- Then, run the evaluation script with the `--use_matcher` flag
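As a sketch of how the matching-based metrics can be computed: assume the inferred and ground-truth preferences have each been decomposed into atomic statements, and a matcher (the evaluator model or prefmatcher-7b) has already decided how many inferred statements are equivalent to some ground-truth statement. The counts then yield precision, recall, and F1 in the usual way (this is a generic illustration, not the repo's exact implementation):

```python
def inference_f1(n_matched: int, n_inferred: int, n_ground_truth: int):
    """Precision/recall/F1 over atomic preference statements.

    n_matched:      inferred statements judged equivalent to a ground-truth statement
    n_inferred:     total atomic statements in the model's inferred preference
    n_ground_truth: total atomic statements in the ground-truth preference
    """
    precision = n_matched / n_inferred if n_inferred else 0.0
    recall = n_matched / n_ground_truth if n_ground_truth else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1
```

For example, `inference_f1(2, 3, 4)` gives precision 2/3, recall 1/2, and F1 = 4/7 ≈ 0.571.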
2. Response Generation (Average Score 1-10)
- Evaluates how well generated responses satisfy user preferences
- Scored by LLM-based judges on response-preference alignment
🔧 Synthesis Framework
This repository also includes the synthesis pipeline for CUPID, which can generate additional training/evaluation data.
Generate Synthetic Data
```bash
python -m synthesis.run \
    --output_dir synthetic_data \
    --model "anthropic.claude-3-5-sonnet-20241022-v2:0" \
    --n_personas 10 \
    --n_factors 8 \
    --n_sessions 13 \
    --max_turns 16 \
    --n_workers 4
```
Key Parameters:
- `--output_dir`: Directory to save the generated data
- `--model`: Model to use for data generation
- `--n_personas`: Number of personas to generate (default: 4)
- `--n_factors`: Number of context factors to generate (default: 8)
- `--n_sessions`: Number of interaction sessions to generate (default: 13)
- `--max_turns`: Maximum number of turns in an interaction session (default: 16)
- `--n_workers`: Number of workers to use for data generation (default: 1)
Synthesis Pipeline: Consists of four main steps:
- Persona Generation: Create diverse user personas with different backgrounds and traits
- Context Factors: For each persona, generate context factors that influence preferences
- Session Generation: Create interaction scenarios based on personas and contexts
- Interaction Simulation: Simulate multi-turn conversations with preference feedback
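The four steps above compose roughly as follows. This is a structural sketch with stubbed generators: in the real pipeline each step prompts an LLM, and all function names here are hypothetical.

```python
def generate_personas(n):
    # Stub: the real step prompts an LLM for diverse persona descriptions.
    return [f"persona_{i}" for i in range(n)]

def generate_context_factors(persona, n):
    # Stub: per-persona context factors (people, artifacts, organizations, ...).
    return [f"{persona}/factor_{j}" for j in range(n)]

def generate_sessions(persona, factors, n):
    # Stub: interaction scenarios grounded in the persona's context factors.
    return [{"persona": persona, "factor": factors[k % len(factors)]} for k in range(n)]

def simulate_interaction(session, max_turns):
    # Stub: multi-turn simulated dialogue with preference feedback.
    return {**session, "turns": ["..."] * max_turns}

def run_synthesis(n_personas=4, n_factors=8, n_sessions=13, max_turns=16):
    data = []
    for persona in generate_personas(n_personas):
        factors = generate_context_factors(persona, n_factors)
        for session in generate_sessions(persona, factors, n_sessions):
            data.append(simulate_interaction(session, max_turns))
    return data
```

With the defaults this yields `n_personas * n_sessions` simulated interaction histories, each tied to one persona and one context factor.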
📁 Project Structure
```
cupid/
├── evaluation/          # Evaluation framework
│   ├── models/          # Model implementations
│   ├── modules/         # Evaluation components
│   ├── pipeline/        # Evaluation pipeline
│   └── run.py           # Main evaluation script
├── synthesis/           # Data synthesis framework
│   ├── modules/         # Synthesis components
│   ├── pipeline/        # Synthesis pipeline
│   └── run.py           # Main synthesis script
├── prompts/             # Prompt templates
│   ├── evaluation/      # Evaluation prompts
│   └── synthesis/       # Synthesis prompts
├── utils/               # Utility functions
├── config.py            # Configuration settings
└── requirements.txt     # Dependencies
```
📄 Citation
If you find our work useful, please consider citing our paper!
```bibtex
@article{kim2025cupid,
  title   = {CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions},
  author  = {Kim, Tae Soo and Lee, Yoonjoo and Park, Yoonah and Kim, Jiho and Kim, Young-Ho and Kim, Juho},
  journal = {arXiv preprint arXiv:2508.01674},
  year    = {2025},
}
```