GoodBadGreedy
The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Evaluation of LLMs Should Not Ignore Non-Determinism
Official repo for The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism
Authors: Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin.
TLDR: Are there performance differences between greedy decoding and sampling methods for LLM generation? The answer is YES!
Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the non-determinism of LLM generations, identifying benchmarks’ consistency regarding non-determinism, and examining unique model behaviors.
Here are our findings:
- A notable performance gap is observed between greedy decoding and sampling generation.
- Greedy decoding outperforms sampling on most evaluated benchmarks, except for AlpacaEval.
- Math reasoning and code generation are most impacted by sampling variance.
- The above findings remain consistent across different sizes and families of LLMs.
- Alignment methods, e.g., DPO, can significantly reduce the sampling variance for most benchmarks.
- A high temperature significantly harms the reasoning and code generation capabilities of LLMs, while a higher repetition penalty improves performance on AlpacaEval.
- In the best-of-N sampling setting, 7B-level LMs have the potential to outperform GPT-4-Turbo.
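To make the distinction behind these findings concrete, here is a minimal sketch of greedy decoding versus temperature sampling over a single next-token distribution. It is an illustration only and not code from this repo: the function names (`softmax`, `greedy_pick`, `sample_pick`) and the toy logits are assumptions made for the example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Sampling: draw a token index from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical next-token scores, invented for illustration.
toy_logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)

greedy_token = greedy_pick(toy_logits)
low_t = [sample_pick(toy_logits, 0.2, rng) for _ in range(1000)]
high_t = [sample_pick(toy_logits, 2.0, rng) for _ in range(1000)]

print("greedy token:", greedy_token)
print("share of greedy token at T=0.2:", low_t.count(greedy_token) / 1000)
print("share of greedy token at T=2.0:", high_t.count(greedy_token) / 1000)
```

At a low temperature the sampler almost always agrees with the greedy choice, while at a high temperature it spreads probability over lower-scoring tokens. This is the source of the run-to-run variance discussed above.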
🧩 Structure of This Project
There are two parts in this project: LLM non-determinism analysis and best-of-N experiments.
- analyse: analyses the results of non-deterministic generation on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.
- get_benchmark_rewards: prepares reward data for the best-of-N experiments, using cutting-edge reward models to score the sampled model responses.
- best_of_n_eval: explores the potential of non-deterministic LLM generation with a best-of-N strategy, selecting the best response from N sampled generations.
🛠️ Setup
- Download the LLM samples from Hugging Face.
- Install the dependencies:
pip install -r requirements.txt
📊 LLM Non-Determinism Analysis
We evaluate the non-deterministic generation of LLMs on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.
| Dataset      | Instance Num. | Sample Num. | Metric   |
|--------------|---------------|-------------|----------|
| AlpacaEval 2 | 805           | 16          | LC       |
| Arena-Hard   | 500           | 16          | Win Rate |
| WildBench v2 | 1024          | 16          | WB-Score |
| MixEval      | 4000          | 16          | Score    |
| MMLU-Redux   | 3000          | 32          | Acc      |
| GSM8K        | 1319          | 128         | EM       |
| HumanEval    | 164           | 128         | Pass@1   |
Taking AlpacaEval as an example, you can analyse the 16 sampled generations with:
bash scripts/eval_alpacaeval_sample_baseline.sh <DATA_PATH>/alpaca_eval
<p align="center">
<img src=assets/main.png width=800/>
</p>
From the results, we observe a consistent performance gap between greedy decoding and the sampling method. Greedy decoding generally proves more effective for most tasks, except for AlpacaEval.
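The comparison between greedy decoding and sampling can be summarised with a few statistics. The sketch below assumes you already have one benchmark-level score per sampling run plus a single greedy score. The function name `summarize_runs` and the numbers are hypothetical, and the repo's own scripts may compute and report these quantities differently.

```python
import statistics

def summarize_runs(sample_scores, greedy_score):
    """Summarise run-level scores from sampling against a single greedy score."""
    mean = statistics.mean(sample_scores)
    return {
        "greedy": greedy_score,
        "sample_mean": mean,
        "sample_std": statistics.stdev(sample_scores),  # sample standard deviation
        "sample_best": max(sample_scores),
        "sample_worst": min(sample_scores),
        "greedy_minus_mean": greedy_score - mean,
    }

# Made-up scores for four sampling runs; not results from the paper.
sample_scores = [70.0, 72.0, 68.0, 71.0]
summary = summarize_runs(sample_scores, greedy_score=73.0)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting the spread (standard deviation, best, and worst run) alongside the mean shows how much a single-output evaluation could over- or under-state a model's performance.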
🚀 Best-of-N Evaluation
First, employ off-the-shelf reward models to get the rewards for LLM generations. We have implemented reward modeling code for Starling-RM, Eurus-RM, FsfairX, and ArmoRM. Take AlpacaEval for example:
bash scripts/get_alpacaeval_sample_rewards.sh Meta-Llama-3-8B-Instruct <DATA_PATH>/alpaca_eval
The reward information will be saved in benchmark_rewards/{model_name}/{dataset}/{reward_model_name}.json.
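As a small convenience, the path pattern above can be built programmatically. This is a sketch only: `reward_file` is a hypothetical helper, and the dataset and reward-model names passed to it (`alpaca_eval`, `ArmoRM`) are assumed values, so check the names the scripts actually write.

```python
from pathlib import Path

def reward_file(model_name, dataset, reward_model_name, root="benchmark_rewards"):
    """Build the path benchmark_rewards/{model_name}/{dataset}/{reward_model_name}.json."""
    return Path(root) / model_name / dataset / f"{reward_model_name}.json"

# Assumed example values; the real directory and file names may differ.
path = reward_file("Meta-Llama-3-8B-Instruct", "alpaca_eval", "ArmoRM")
print(path.as_posix())
```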
Then, evaluate in a best-of-N setting:
bash scripts/eval_alpacaeval_sample_reward.sh <DATA_PATH>/alpaca_eval
<p align="center">
<img src=assets/reward.png width=800/>
</p>
With oracle selection, even smaller LLMs such as Llama-3-8B-Instruct can outperform GPT-4-Turbo on MMLU, GSM8K, and HumanEval. Furthermore, cutting-edge reward models can also select superior responses from multiple generations, and this reward-based selection can outperform GPT-4-Turbo on GSM8K with only 8 samples.
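The difference between oracle selection and reward-model selection can be sketched as follows. This is an illustrative example under stated assumptions and not the repo's implementation: each instance is represented as a list of `(reward, is_correct)` pairs, and the name `best_of_n_accuracy` and the toy data are hypothetical.

```python
def best_of_n_accuracy(instances, n):
    """Return (reward-model accuracy, oracle accuracy) using the first n candidates.

    instances: list of per-example lists of (reward, is_correct) pairs.
    """
    rm_hits = 0
    oracle_hits = 0
    for candidates in instances:
        pool = candidates[:n]
        # Reward-model selection: keep the candidate with the highest reward.
        chosen = max(pool, key=lambda c: c[0])
        rm_hits += int(chosen[1])
        # Oracle selection: a hit if any candidate in the pool is correct.
        oracle_hits += int(any(ok for _, ok in pool))
    total = len(instances)
    return rm_hits / total, oracle_hits / total

# Toy data invented for illustration: three examples, three samples each.
instances = [
    [(0.1, False), (0.9, True), (0.3, False)],
    [(0.8, False), (0.2, True), (0.5, False)],
    [(0.4, False), (0.6, False), (0.7, True)],
]

for n in (1, 2, 3):
    rm_acc, oracle_acc = best_of_n_accuracy(instances, n)
    print(f"N={n}: reward-model={rm_acc:.2f}, oracle={oracle_acc:.2f}")
```

Oracle accuracy is an upper bound on what any selector can reach with the same N samples. The gap between the two numbers reflects how often the reward model ranks an incorrect response above a correct one.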
📖 Citation
If you find this repo helpful, please cite our paper:
@article{song2024good,
author={Yifan Song and Guoyin Wang and Sujian Li and Bill Yuchen Lin},
title={The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism},
year={2024},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
