# JudgeLM

[ICLR 2025 Spotlight] An open-source LLM judge for evaluating LLM-generated answers.
## JudgeLM: Fine-tuned Large Language Models are Scalable Judges
<a target="_blank" href="https://arxiv.org/abs/2310.17631"> <img style="height:22pt" src="https://img.shields.io/badge/-Paper-black?style=flat&logo=arxiv"> </a> <a target="_blank" href="https://github.com/baaivision/JudgeLM"> <img style="height:22pt" src="https://img.shields.io/badge/-Code-green?style=flat&logo=github"> </a> <a target="_blank" href="http://218.91.113.230:9004/"> <img style="height:22pt" src="https://img.shields.io/badge/🤖 Demo-20B2AA?style=flat"> </a> <a target="_blank" href="https://huggingface.co/datasets/BAAI/JudgeLM-100K"> <img style="height:22pt" src="https://img.shields.io/badge/-🤗%20Dataset-red?style=flat"> </a> <a target="_blank" href="https://huggingface.co/BAAI/JudgeLM-7B-v1.0"> <img style="height:22pt" src="https://img.shields.io/badge/-🤗%20Models (7B)-red?style=flat"> </a> <a target="_blank" href="https://huggingface.co/BAAI/JudgeLM-13B-v1.0"> <img style="height:22pt" src="https://img.shields.io/badge/-🤗%20(13B)-red?style=flat"> </a> <a target="_blank" href="https://huggingface.co/BAAI/JudgeLM-33B-v1.0"> <img style="height:22pt" src="https://img.shields.io/badge/-🤗%20(33B)-red?style=flat"> </a> <a target="_blank" href="https://twitter.com/_akhaliq/status/1717718525958037799?s=61&t=Q73fac6D7gyJgMBfcxgPvA"> <img style="height:22pt" src="https://img.shields.io/badge/-Tweet-blue?style=flat&logo=twitter"> </a> <br>Lianghui Zhu<sup>1,2</sup>, Xinggang Wang<sup>1</sup>, Xinlong Wang<sup>2</sup>
<sup>1</sup>HUST, <sup>2</sup>BAAI
## News

- [2025/01] JudgeLM was accepted by ICLR 2025 and presented as a Spotlight. 🎉 The OpenReview page can be found here.
- [2023/10] We released JudgeLM: Fine-tuned Large Language Models are Scalable Judges. Check out the paper.
## Overview

JudgeLM is an open platform for training, serving, and evaluating scalable large language model judges.

- JudgeLM is a scalable language-model judge designed to evaluate LLMs in open-ended scenarios. It achieves an agreement exceeding 90%, surpassing human-to-human agreement.
- The JudgeLM dataset contains 100K judge samples for training and 5K judge samples for validation. All judge samples come with high-quality judgements generated by GPT-4.
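To make the dataset description concrete, the snippet below sketches what a single judge sample might look like. The field names here are hypothetical, chosen for illustration only; consult the JudgeLM-100K dataset card on Hugging Face for the actual schema.

```python
import json

# A hypothetical judge sample; real field names may differ --
# see the dataset card for the authoritative schema.
sample = {
    "question_body": "Explain why the sky is blue.",
    "answer1_body": "Because of Rayleigh scattering of sunlight ...",
    "answer2_body": "The sky reflects the color of the ocean ...",
    "score": [8.0, 2.0],  # GPT-4's scores for answer 1 and answer 2
    "text": "8 2\nAnswer 1 is physically accurate ...",  # GPT-4's judgement
}

# Judge samples of this kind are commonly stored as JSON Lines,
# i.e., one JSON object per line.
line = json.dumps(sample)
decoded = json.loads(line)
print(decoded["score"])
```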
JudgeLM's core features include:

- Training and evaluation code for state-of-the-art LLM judges.
- Broad capabilities for extended tasks, e.g., judging single answers, multimodal models, multiple answers, and multi-turn chat.
- A distributed multi-model serving system with a web UI.
## Install

### From source

1. Clone this repository and navigate to the JudgeLM folder:

```shell
git clone https://github.com/baaivision/JudgeLM
cd JudgeLM
```

2. Install the package:

```shell
conda create -n judgelm python=3.10.10 -y
conda activate judgelm
pip3 install --upgrade pip
pip3 install -e .
pip install flash-attn==2.0.4 --no-build-isolation
```
## Model Weights
JudgeLM is based on LLaMA and should be used under LLaMA's model license.
| Model | w/ reference? | Agreement↑ | Precision↑ | Recall↑ | F1↑ | Consistency↑ |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| JudgeLM-7B | ❎ | 81.11 | 69.67 | 78.39 | 72.21 | 83.57 |
| JudgeLM-7B | ✅ | 84.08 | 75.92 | 82.55 | 78.28 | 84.46 |
| JudgeLM-13B | ❎ | 84.33 | 73.69 | 80.51 | 76.17 | 85.01 |
| JudgeLM-13B | ✅ | 85.47 | 77.71 | 82.90 | 79.77 | 87.23 |
| JudgeLM-33B 🔥 | ❎ | 89.03 | 80.97 | 84.76 | 82.64 | 91.36 |
| JudgeLM-33B 🔥 | ✅ | 89.32 | 84.00 | 86.21 | 84.98 | 92.37 |
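For intuition, agreement-style metrics like those in the table can be computed by mapping each judge's score pair to a verdict and comparing it with GPT-4's verdict on the same sample. The sketch below is our own illustration of that idea, not the repository's evaluation code.

```python
def verdict(score1: float, score2: float) -> str:
    """Map a pair of scores to 'win1', 'win2', or 'tie'."""
    if score1 > score2:
        return "win1"
    if score2 > score1:
        return "win2"
    return "tie"

def agreement(judge_scores, gpt4_scores) -> float:
    """Fraction of samples where the judge's verdict matches GPT-4's."""
    matches = sum(
        verdict(*j) == verdict(*g) for j, g in zip(judge_scores, gpt4_scores)
    )
    return matches / len(judge_scores)

judge = [(8, 2), (5, 5), (3, 7)]
gpt4 = [(9, 4), (6, 6), (7, 3)]
print(agreement(judge, gpt4))  # 2 of the 3 verdicts match
```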
## Evaluation


JudgeLM can judge open-ended answers from LLMs, as well as outputs from multimodal models.
See instructions for running JudgeLM at `judgelm/llm_judge`.
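At a high level, the judge receives a question, the two answers, and optionally a reference answer, and emits a pair of scores followed by a rationale. The function below is only an illustrative sketch of such a pairwise-judging prompt; the actual template used by JudgeLM lives in `judgelm/llm_judge`.

```python
from typing import Optional

def build_judge_prompt(question: str, answer1: str, answer2: str,
                       reference: Optional[str] = None) -> str:
    """Assemble an illustrative pairwise-judging prompt (not the official template)."""
    parts = ["[Question]", question, ""]
    if reference is not None:
        # With-reference judging: give the judge a gold answer to compare against.
        parts += ["[Reference Answer]", reference, ""]
    parts += [
        "[Answer 1]", answer1, "",
        "[Answer 2]", answer2, "",
        "Rate each answer on a 1-10 scale, giving '<score1> <score2>' on the",
        "first line, followed by an explanation of your judgement.",
    ]
    return "\n".join(parts)

prompt = build_judge_prompt("What is 2 + 2?", "4", "5")
print(prompt.splitlines()[0])  # [Question]
```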
## Serving with Web GUI

We use Gradio to provide a web server and UI for users to evaluate LLMs' performance on open-ended tasks. The demo can be tried here.
See instructions for running the JudgeLM web server at `judgelm/serve`.
## Fine-tuning

### Data
The JudgeLM-100K dataset is available at HuggingFace Datasets.
### Code and Hyperparameters

Our code is based on Vicuna, with additional support for judging answer pairs. We use hyperparameters similar to Vicuna's.
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
| --- | ---: | ---: | ---: | ---: | ---: |
| JudgeLM-13B | 128 | 2e-5 | 3 | 2048 | 0 |
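The global batch size of 128 is reached through gradient accumulation rather than a literal batch of 128 per step. With the settings used in the JudgeLM-7B fine-tuning command below (per-device batch 1, 32 accumulation steps, 4 GPUs), the arithmetic works out as follows:

```python
# Values taken from the torchrun command in the next section.
per_device_batch = 1   # --per_device_train_batch_size
accum_steps = 32       # --gradient_accumulation_steps
num_gpus = 4           # --nproc_per_node

global_batch = per_device_batch * accum_steps * num_gpus
print(global_batch)  # 128, matching the table above
```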
### Fine-tuning JudgeLM-7B with Local GPUs

You can use the following command to train JudgeLM-7B with 4 x A100 (40GB) GPUs. Update `--model_name_or_path` with the actual path to the Vicuna weights and `--data_path` with the actual path to the JudgeLM data.
```shell
torchrun --nproc_per_node=4 --master_port=20001 judgelm/train/train_mem.py \
    --model_name_or_path="/share/project/lianghuizhu/vicuna-weights-collection-v1.3/vicuna-7b-v1.3" \
    --data_path /share/project/lianghuizhu/JudgeLM-Project/JudgeLM/judgelm/data/JudgeLM/judgelm_train_100k.jsonl \
    --bf16 True \
    --output_dir="/home/zhulianghui/ProjectC_ChatGPT/alpaca/output/judgelm-debug-evaluator" \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap "LlamaDecoderLayer" \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --run_name 7B-full-model \
    --swap_aug_ratio 0.5 \
    --ref_drop_ratio 0.5
```
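The last two flags correspond to the paper's bias-reduction techniques: swap augmentation (randomly exchanging the two answers to counter position bias) and reference drop (randomly removing the reference answer so the judge learns to work both with and without one). The sketch below is our own simplified illustration of what such augmentation could look like, not the actual code in `judgelm/train`; the sample field names are hypothetical.

```python
import random
from typing import Optional

def augment(sample: dict, swap_ratio: float = 0.5, ref_drop_ratio: float = 0.5,
            rng: Optional[random.Random] = None) -> dict:
    """Illustrative swap augmentation and reference drop for one judge sample."""
    rng = rng or random.Random()
    out = dict(sample)
    if rng.random() < swap_ratio:
        # Swap the answers and their scores so the judge cannot learn
        # to favor a fixed answer position.
        out["answer1"], out["answer2"] = out["answer2"], out["answer1"]
        out["scores"] = out["scores"][::-1]
    if rng.random() < ref_drop_ratio:
        # Drop the reference answer so the judge also learns
        # reference-free judging.
        out["reference"] = None
    return out

sample = {"answer1": "A", "answer2": "B", "scores": [8, 2], "reference": "R"}
augmented = augment(sample, rng=random.Random(1))  # this seed triggers a swap
```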
Tips:

- If you are using V100 GPUs, which are not supported by FlashAttention, you can use the memory-efficient attention implemented in xFormers. Install xformers and replace `judgelm/train/train_mem.py` above with `judgelm/train/train_xformers.py`.
- If you run out of memory due to "FSDP Warning: When using FSDP, it is efficient and recommended...", see solutions here.
- If you run out of memory during model saving, see solutions here.
## Acknowledgement :heart:

This project is based on Vicuna (blog, code), PandaLM (paper, code), and LLM-Blender (paper, code). Thanks for their wonderful work.
## Citation

The code (training, serving, and evaluation) in this repository is mostly developed for or derived from the paper below. Please cite it if you find the repository helpful.