RePro

The official code for "Refinement Provenance Inference: Detecting LLM-Refined Training Prompts from Model Behavior".

Refinement Provenance Inference: Detecting LLM-Refined Training Prompts from Model Behavior

Authors:
Bo Yin*, Qi Li*, Runpeng Yu, Xinchao Wang†

National University of Singapore

* Equal contribution.
† Corresponding author: xinchao@nus.edu.sg

[!IMPORTANT] We plan to release the weights (for both the attacker and the fine-tuned models) on Hugging Face and to add support for commercial LLM refiners.

Overview

[Figure: overview]

Quick Start

0. Install

pip install -r requirements.txt

1. Prepare raw benchmark instances

python scripts/prepare_raw.py --dataset gsm8k --out data/gsm8k_raw.jsonl
python scripts/prepare_raw.py --dataset humaneval --out data/humaneval_raw.jsonl

Each JSONL row contains:

id: instance id (problem / function)
x_raw: raw prompt
y: reference output used for teacher forcing
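
The row schema above can be checked with a few lines of Python before running the later steps. This is a minimal sketch, not code from the repository: the helper name `iter_raw_rows` and the sample row are hypothetical, and only the three field names come from the list above.

```python
import json
from typing import Iterable, Iterator

# Field names taken from the schema listed above.
REQUIRED_KEYS = ("id", "x_raw", "y")


def iter_raw_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, checking the required keys."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = [k for k in REQUIRED_KEYS if k not in row]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        yield row


# Hypothetical example row; real rows come from scripts/prepare_raw.py.
sample_lines = ['{"id": "gsm8k-0", "x_raw": "What is 2 + 3?", "y": "5"}']
rows = list(iter_raw_rows(sample_lines))
print(rows[0]["id"], "->", rows[0]["y"])
```

To validate a real file, pass an open file handle, for example `iter_raw_rows(open("data/gsm8k_raw.jsonl"))`.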

2. Create refined prompts

python scripts/refine_prompts.py \
  --dataset gsm8k \
  --in data/gsm8k_raw.jsonl \
  --out data/gsm8k_refined.jsonl \
  --refiner_model <hf-refiner-model-id> \
  --task gsm8k

3. Build victim/shadow pools and mixtures (instance-disjoint)

python scripts/build_mixtures.py \
  --raw data/gsm8k_raw.jsonl \
  --ref data/gsm8k_refined.jsonl \
  --out_dir data/gsm8k_mix \
  --rho 0.5 \
  --seed 42

Outputs:

shadow_train.jsonl, shadow_val.jsonl, shadow_test.jsonl
victim_train.jsonl, victim_val.jsonl, victim_test.jsonl

Each row includes:

z ∈ {0,1}: provenance label (1 = refined, 0 = raw), sampled once per instance and then fixed.
x_train: the prompt used for training; equals x_ref if z=1, otherwise x_raw.
x_raw, x_ref, y: kept for analysis and baselines.
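
The labeling rule above can be sketched in plain Python. This is an illustration under assumptions, not the logic of `scripts/build_mixtures.py`: the function names are hypothetical, `rho` is read as the probability that an instance is refined, and the even shadow/victim split is chosen only for the example.

```python
import random


def build_mixture(raw_rows, refined_by_id, rho, seed):
    """Assign each instance a fixed label z and pick x_train accordingly."""
    rng = random.Random(seed)
    mixed = []
    for row in raw_rows:
        z = 1 if rng.random() < rho else 0  # sampled once per instance
        x_ref = refined_by_id[row["id"]]
        mixed.append({
            "id": row["id"],
            "z": z,
            "x_train": x_ref if z == 1 else row["x_raw"],
            "x_raw": row["x_raw"],
            "x_ref": x_ref,
            "y": row["y"],
        })
    return mixed


def split_instance_disjoint(rows, seed):
    """Split rows into shadow and victim pools that share no instance id.

    The 50/50 ratio is an assumption made for this sketch.
    """
    ids = sorted({r["id"] for r in rows})
    random.Random(seed).shuffle(ids)
    shadow_ids = set(ids[: len(ids) // 2])
    shadow = [r for r in rows if r["id"] in shadow_ids]
    victim = [r for r in rows if r["id"] not in shadow_ids]
    return shadow, victim


# Hypothetical toy data standing in for the raw and refined JSONL files.
raw = [{"id": f"q{i}", "x_raw": f"raw {i}", "y": str(i)} for i in range(10)]
refined = {f"q{i}": f"refined {i}" for i in range(10)}

mixture = build_mixture(raw, refined, rho=0.5, seed=42)
shadow_pool, victim_pool = split_instance_disjoint(mixture, seed=42)
print(len(shadow_pool), len(victim_pool))
```

Seeding the generator keeps z reproducible across runs, which matches the "sampled once per instance and fixed" requirement.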

4. LoRA SFT for shadow / victim

python scripts/sft_lora.py \
  --base_model <hf-base-model-id> \
  --train_jsonl data/gsm8k_mix/shadow_train.jsonl \
  --val_jsonl data/gsm8k_mix/shadow_val.jsonl \
  --out_dir artifacts/shadow_lora \
  --max_steps 500 --lr 2e-4 --ctx_len 768 \
  --lora_r 16 --lora_alpha 32 --lora_dropout 0.05

Repeat for the victim model, swapping in victim_train.jsonl and victim_val.jsonl and setting --out_dir to artifacts/victim_lora.

5. Extract teacher-forced logit features

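Shadow (used for training the attacker; run the same command on shadow_val.jsonl to produce the features_shadow_val.npz file that step 6 reads):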

python scripts/extract_features.py \
  --base_model <hf-base-model-id> \
  --lora_dir artifacts/shadow_lora \
  --data_jsonl data/gsm8k_mix/shadow_train.jsonl \
  --out_npz artifacts/features_shadow_train.npz \
  --ctx_len 768

Victim (used for evaluation):

python scripts/extract_features.py \
  --base_model <hf-base-model-id> \
  --lora_dir artifacts/victim_lora \
  --data_jsonl data/gsm8k_mix/victim_test.jsonl \
  --out_npz artifacts/features_victim_test.npz \
  --ctx_len 768
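
Teacher forcing means the model scores each token of the reference output y while conditioned on the true preceding tokens rather than on its own samples. The sketch below turns a matrix of such logits into per-token log-probabilities and a few summary statistics. The choice of statistics is an assumption made for illustration; the features actually written by `scripts/extract_features.py` may differ.

```python
import numpy as np


def teacher_forced_logprobs(logits, target_ids):
    """Log-probability of each reference token under teacher forcing.

    logits: (T, V) array; row t holds the model's scores for reference token t,
            computed with the true tokens y[:t] as context.
    target_ids: (T,) integer ids of the reference tokens.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return log_probs[np.arange(len(target_ids)), target_ids]


def summarize(token_logprobs):
    """Hypothetical feature vector: mean, std, min, max of token log-probs."""
    return np.array([
        token_logprobs.mean(),
        token_logprobs.std(),
        token_logprobs.min(),
        token_logprobs.max(),
    ])


# Toy input: uniform logits over a 4-token vocabulary, 3 reference tokens.
logits = np.zeros((3, 4))
target_ids = np.array([0, 2, 1])
features = summarize(teacher_forced_logprobs(logits, target_ids))
print(features)
```

With uniform logits every reference token has probability 1/4, so the mean log-probability equals -log 4 and the standard deviation is zero.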

6. Train attacker (supervised contrastive + linear head)

python scripts/train_attacker.py \
  --shadow_train_npz artifacts/features_shadow_train.npz \
  --shadow_val_npz artifacts/features_shadow_val.npz \
  --out_dir artifacts/attacker \
  --epochs 30 --batch_size 256 --lr 1e-3 --temperature 0.1

This saves:

standardize.json (μ, σ from shadow train)
encoder.pt
linear_head.pt
thresholds.json (e.g., threshold for 1% FPR computed on shadow val)
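
The supervised contrastive objective named in this step pulls together embeddings that share a provenance label and pushes apart those that do not, with `--temperature` controlling how sharply similarities are weighted. Below is a NumPy sketch of the standard supervised contrastive loss (Khosla et al., 2020). It illustrates the technique only and is not taken from `scripts/train_attacker.py`; the toy embeddings are hypothetical.

```python
import numpy as np


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)

    # Exclude each anchor from its own denominator.
    sim = np.where(self_mask, -np.inf, sim)
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # Positives: other samples with the same label as the anchor.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_counts[valid]
    return float(per_anchor.mean())


# Two tight clusters; labels either agree with the clusters or cut across them.
emb = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
aligned = supcon_loss(emb, np.array([0, 0, 1, 1]), temperature=0.1)
crossed = supcon_loss(emb, np.array([0, 1, 0, 1]), temperature=0.1)
print(aligned, crossed)
```

The loss is small when same-label embeddings already sit together and large when the labels cut across the clusters, which is the signal the encoder is trained on before the linear head is fitted.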

7. Evaluate on victim

python scripts/eval_victim.py \
  --features_npz artifacts/features_victim_test.npz \
  --attacker_dir artifacts/attacker \
  --out_json artifacts/victim_metrics.json
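
The evaluation reuses two artifacts calibrated on shadow data: the standardization statistics and the low-FPR threshold. The sketch below shows one way such a pipeline could work on synthetic scores. The function names, the score distributions, and the "score > threshold means refined" convention are assumptions for illustration, not the behavior of `scripts/eval_victim.py`.

```python
import numpy as np


def fit_standardizer(x):
    """Per-feature mean and std, as would be estimated on shadow train."""
    return x.mean(axis=0), x.std(axis=0) + 1e-8


def threshold_at_fpr(neg_scores, target_fpr=0.01):
    """Threshold such that at most target_fpr of negatives score above it."""
    s = np.sort(np.asarray(neg_scores))[::-1]            # descending
    allowed = min(int(np.floor(target_fpr * len(s))), len(s) - 1)
    return float(s[allowed])


def rates_at_threshold(scores, labels, threshold):
    """Return (TPR, FPR) when predicting 'refined' for scores > threshold."""
    pred = scores > threshold
    tpr = float(pred[labels == 1].mean())
    fpr = float(pred[labels == 0].mean())
    return tpr, fpr


rng = np.random.default_rng(0)

# Standardize victim features with shadow-train statistics (synthetic data).
shadow_train = rng.normal(loc=5.0, scale=2.0, size=(500, 4))
mu, sigma = fit_standardizer(shadow_train)
victim_feats = (rng.normal(loc=5.0, scale=2.0, size=(200, 4)) - mu) / sigma

# Calibrate on hypothetical attacker scores for raw (z=0) shadow-val rows.
val_neg_scores = rng.normal(0.0, 1.0, size=1000)
thr = threshold_at_fpr(val_neg_scores, target_fpr=0.01)

# Apply the fixed threshold to hypothetical victim scores.
victim_labels = np.array([0] * 100 + [1] * 100)
victim_scores = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
tpr, fpr = rates_at_threshold(victim_scores, victim_labels, thr)
print(f"threshold={thr:.3f} TPR={tpr:.2f} FPR={fpr:.2f}")
```

Because the threshold is fixed on shadow validation data, the FPR observed on the victim is an empirical outcome rather than a guarantee; it stays near the target only if shadow and victim scores are similarly distributed.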

Result

[Figure: results]

Citation

If you find this work useful, please cite:

@misc{yin2026refinementprovenanceinferencedetecting,
      title={Refinement Provenance Inference: Detecting LLM-Refined Training Prompts from Model Behavior}, 
      author={Bo Yin and Qi Li and Runpeng Yu and Xinchao Wang},
      year={2026},
      eprint={2601.01966},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.01966}, 
}
