GRACE: Generative Representation Learning via Contrastive Policy Optimization
<div style='display:flex; gap: 0.25rem; flex-wrap: wrap; align-items: center;'> <a href='LICENCE'> <img src='https://img.shields.io/badge/License-Apache%202.0-g.svg'> </a> <a href='https://arxiv.org/pdf/2510.04506'> <img src='https://img.shields.io/badge/Paper-PDF-red'> </a> <a href='https://x.com/SunJiashuo36/status/1975781651646210070'> <img src='https://img.shields.io/twitter/url/https/twitter.com/cloudposse.svg?style=social&label=Follow%20%40Us'> </a> </div>

Our method improves embedding performance while retaining generative performance.
<div style="text-align: center;"> <a href=""> <img src="assets/re-emb.png" alt="" style="width:800px; height:auto;" /> </a> </div>

How to cite
If you extend or use this work, please cite the paper where it was introduced:
```bibtex
@misc{sun2025gracegenerativerepresentationlearning,
      title={GRACE: Generative Representation Learning via Contrastive Policy Optimization},
      author={Jiashuo Sun and Shixuan Liu and Zhaochen Su and Xianrui Zhong and Pengcheng Jiang and Bowen Jin and Peiran Li and Weijia Shi and Jiawei Han},
      year={2025},
      eprint={2510.04506},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.04506},
}
```
🔥 Update
- [2026-01-31]: 🚀 Our paper has been accepted to ICLR 2026! See you in Brazil! 🥳🥳🥳
- [2025-10-02]: 🚀 Our paper is available at https://arxiv.org/pdf/2510.04506.
- [2025-10-02]: 🚀 We release the code for training and evaluation.
Table of Contents
- GRACE Overview
- Project Visualizations
- 📌 Data Processing Pipeline
- 🎯 Reinforcement Learning for Supervised Training
- 🤖 Reinforcement Learning for Unsupervised Training
- 🔍 Inference and Evaluation
- 📄 Licensing and Claims
GRACE Overview
We present GRACE (Generative Representation Learning via Contrastive Policy Optimization), a framework that turns LLMs into interpretable representation learners using policy-gradient optimization. The model first produces an explicit rationale $r$ that analyzes and reasons about the input. From $r$ we derive the final embedding $h$ via mean pooling over hidden states. We recast contrastive learning signals as rewards that increase query–positive similarity and decrease query–negative similarity. Optimizing this reward with standard policy-gradient methods teaches the model to generate faithful rationales while simultaneously learning effective text representations.
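The embedding step described above can be sketched as follows. This is a minimal illustration, not the repo's implementation: we assume the rationale's last-layer hidden states and attention mask are available as arrays, and the function name `mean_pool` and the toy shapes are ours.

```python
import numpy as np

def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Mean-pool the hidden states of the generated rationale into one embedding.

    hidden_states: (seq_len, dim) last-layer states over the rationale tokens.
    attention_mask: (seq_len,) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)
    # Sum only the unmasked positions, then divide by the number of real tokens.
    return (hidden_states * mask).sum(axis=0) / np.maximum(mask.sum(), 1e-9)

# Toy example: three positions, the last one is padding and is ignored.
h = np.array([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]])
m = np.array([1, 1, 0])
emb = mean_pool(h, m)  # → [2.0, 3.0]
```

In practice the hidden states would come from the policy model's forward pass over the generated rationale tokens.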
Contributions:
(1) We present the first empirical evidence that rewards derived from contrastive learning can be leveraged to train policy models, resulting in improved representational capabilities.
(2) We propose a novel methodology that transforms existing LLMs into powerful representation models while preserving their general-purpose capabilities.
(3) This work represents a substantial advancement in text representation interpretability, as the model’s reasoning can be directly inspected through its textual outputs.
(4) Our method yields a significant performance gain, averaging 11.5% over baseline models on the MTEB benchmark.
💡 Preliminaries
Install the dependencies with `pip install -r requirements.txt`.
Our algorithm and pipeline are built on verl (version 0.4.0dev), so after installing the dependencies, run `pip install -e .` at the project root.
⚡ Quickstart
- Install dependencies and the editable package:

  ```bash
  pip install -r requirements.txt
  pip install -e .
  ```

- Prepare data (both supervised and unsupervised modes are supported). The following script wraps the steps in this README:

  ```bash
  bash process_data.sh
  ```

- Train (edit `data.train_files`, `data.val_files`, and `+data.train_mode` in `train.sh` if needed):

  ```bash
  bash train.sh
  ```

- Evaluate on MTEB with vLLM (make sure `MODEL_PATH` in `eval.sh` points to the merged checkpoints):

  ```bash
  bash eval.sh
  ```
📁 Project Structure (brief)
```
GRACE/
  assets/                  # Figures used in README
  scripts/                 # Utilities such as model merging
  verl/                    # Training framework (0.4.0dev-based)
  process_data.py          # Convert raw data to parquet for training
  offline_filter_data.py   # Pre-filter overlong samples
  eval_mteb.py             # vLLM-based MTEB evaluation entry
  train.sh                 # Supervised training example (GRPO)
  eval.sh                  # Inference + evaluation example
  process_data.sh          # End-to-end data prep helper
  requirements.txt
  pyproject.toml           # Packaging for editable install
  README.md
```
📌 Data Processing Pipeline
You can download the data from the following links:
- Supervised data: data from Repetition Improves Language Model Embeddings
- Unsupervised data: wiki1m_for_simcse from SimCSE: Simple Contrastive Learning of Sentence Embeddings
After downloading the data, you can directly run:
```bash
#!/bin/bash
set -e  # Exit on any error

echo "Processing supervised data..."
python process_data.py \
    --input_file echo_data_total.jsonl \
    --local_dir data/supervised \
    --mode supervised \
    --test_ratio 0.01

sleep 3

python offline_filter_data.py \
    --train_parquet data/supervised/train.parquet \
    --val_parquet data/supervised/test.parquet \
    --out_dir data/supervised_filtered_overlong/ \
    --tokenizer_path Qwen/Qwen2.5-1.5B-Instruct \
    --max_len 1024

echo "Processing unsupervised data..."
python process_data.py \
    --input_file wiki1m_for_simcse.txt \
    --local_dir data/unsupervised \
    --mode unsupervised \
    --test_ratio 0.01

sleep 3

python offline_filter_data.py \
    --train_parquet data/unsupervised/train.parquet \
    --val_parquet data/unsupervised/test.parquet \
    --out_dir data/unsupervised_filtered_overlong/ \
    --tokenizer_path Qwen/Qwen2.5-1.5B-Instruct \
    --max_len 1024
```

which can be found at `process_data.sh`.
This operation first converts the data into a format that verl can recognize. In addition, because the dataset is large, we filter out overlong samples in advance so that this step does not slow down training.
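The idea behind the offline length filter can be sketched as below. This is a simplified stand-in, not the actual `offline_filter_data.py`: it filters rows by tokenized length, with a plain callable in place of the real Hugging Face tokenizer loaded from `--tokenizer_path`, and in-memory dicts in place of the parquet rows.

```python
def filter_overlong(rows, tokenize, max_len=1024):
    """Drop samples whose tokenized text exceeds max_len tokens.

    rows: list of dicts with a "text" field (stand-in for the parquet rows).
    tokenize: callable returning a token list (stand-in for an HF tokenizer).
    """
    return [r for r in rows if len(tokenize(r["text"])) <= max_len]

# Toy usage with whitespace tokenization and a tiny token budget.
rows = [
    {"text": "short sample"},
    {"text": "a much longer sample that overflows the budget"},
]
kept = filter_overlong(rows, str.split, max_len=3)
# kept contains only the first row
```

Doing this once offline, rather than per batch during training, is what keeps the RL loop from stalling on length checks.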
🎯 Reinforcement Learning for Supervised Training
To train the model, you can directly run the following command:
```bash
#!/bin/bash
mkdir -p logs

export HF_DATASETS_CACHE="huggingface_cache"
export HF_HOME="huggingface_cache"
export CUDA_VISIBLE_DEVICES=0,1,2,3
export VLLM_TORCH_COMPILE_LEVEL=0
export TORCH_COMPILE_DISABLE=1

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    algorithm.norm_adv_by_std_in_grpo=False \
    data.train_files=data/supervised_filtered_overlong/train.parquet \
    data.val_files=data/supervised_filtered_overlong/test.parquet \
    data.train_batch_size=64 \
    data.val_batch_size=16 \
    data.max_prompt_length=1024 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=False \
    data.truncation='right' \
    +data.train_mode=supervised \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-1.5B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.actor.fsdp_config.offload_policy=True \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=True \
    algorithm.use_kl_in_reward=False \
    reward_model.enable=False \
    reward_model.reward_manager=hidden \
    trainer.critic_warmup=0 \
    +reward_model.reward_kwargs.temperature=0.1 \
    +reward_model.reward_kwargs.with_scale=True \
    +reward_model.reward_kwargs.clustering_weight=0.2 \
    +reward_model.reward_kwargs.cross_group_weight=0.2 \
    trainer.logger='["wandb"]' \
    trainer.project_name='GRACE' \
    trainer.experiment_name='test_exp' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=50 \
    trainer.test_freq=-1 \
    trainer.val_before_train=False \
    trainer.total_epochs=2
```
which can be found at `train.sh`. You need to modify `data.train_files`, `data.val_files`, and `+data.train_mode` before training.
🛠️ We used 4× H100-80G GPUs to train all models.
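Conceptually, the reward manager scores each rollout's embedding against positives and negatives, and GRPO then centers rewards within each group of `rollout.n` samples. The sketch below illustrates that shape under our own simplifications, not the repo's reward manager: cosine similarities scaled by the `temperature` from `reward_kwargs`, an InfoNCE-style reward, and a group-mean baseline (matching `norm_adv_by_std_in_grpo=False`, i.e. no division by the group std). The clustering and cross-group terms from `reward_kwargs` are omitted, and the function names are ours.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_reward(query, positive, negatives, temperature=0.1):
    """InfoNCE-style reward: log-probability of the positive among candidates."""
    logits = np.array([cosine(query, positive)] +
                      [cosine(query, n) for n in negatives]) / temperature
    # Stable log-sum-exp so large similarity/temperature ratios don't overflow.
    m = logits.max()
    return float(logits[0] - (m + np.log(np.exp(logits - m).sum())))

def grpo_advantages(rewards):
    """Group-relative advantage: reward minus the group mean (no std division)."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Toy group of two rollouts: one aligns its embedding with the positive,
# the other with the negative. The better rollout gets a positive advantage.
q = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
neg = np.array([0.0, 1.0])
r_good = contrastive_reward(q, pos, [neg])  # close to 0 (the best possible)
r_bad = contrastive_reward(q, neg, [pos])   # strongly negative
adv = grpo_advantages([r_good, r_bad])
```

Higher query–positive similarity raises the reward, higher query–negative similarity lowers it, and the group-mean baseline turns those rewards into the advantages that drive the policy-gradient update.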
🎯 Reinforcement Learning for Unsupervised Training
To train the unsupervised model, you can directly run the following command:
