GoodBadGreedy

The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Evaluation of LLMs Should Not Ignore Non-Determinism

🤗 Dataset | 📖 arXiv

Official repo for The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Authors: Yifan Song, Guoyin Wang, Sujian Li, Bill Yuchen Lin.

TLDR: Are there performance differences between greedy decoding and sampling methods for LLM generation? The answer is YES!

Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the non-determinism of LLM generations, identifying benchmarks’ consistency regarding non-determinism, and examining unique model behaviors.

Here are our findings:

  • A notable performance gap is observed between greedy decoding and sampling generation.
  • Greedy decoding outperforms sampling on most evaluated benchmarks, except for AlpacaEval.
  • Math reasoning and code generation are the tasks most affected by sampling variance.
  • The above findings remain consistent across different sizes and families of LLMs.
  • Alignment methods, e.g., DPO, can significantly reduce the sampling variance for most benchmarks.
  • High temperature will significantly harm the reasoning and code generation capabilities of LLMs, while higher repetition penalty leads to improved performance on AlpacaEval.
  • In the best-of-N sampling setting, 7B-level LMs have the potential to outperform GPT-4-Turbo.
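To make the distinction behind these findings concrete, here is a minimal sketch of greedy decoding versus temperature sampling for a single next-token choice. It is a toy illustration, not the generation code used in this repo: the `logits` values, the function names, and the fixed seed are all assumptions made for the example.

```python
import math
import random

def greedy_pick(logits):
    """Greedy decoding: always return the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)

def sample_pick(logits, temperature, rng):
    """Temperature sampling: draw an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical next-token logits over a three-token vocabulary.
logits = [2.0, 1.0, 0.1]
rng = random.Random(0)

# Greedy decoding is deterministic; sampling can return different tokens.
greedy_tokens = {greedy_pick(logits) for _ in range(100)}
sampled_tokens = {sample_pick(logits, 1.0, rng) for _ in range(100)}
```

Raising `temperature` flattens the distribution, so low-probability tokens are drawn more often, which is one way to picture why high temperature hurts reasoning and code generation.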

🧩 Structure of This Project

This project has two parts, LLM non-determinism analysis and best-of-N experiments, organized into three directories:

analyse: analyses the results of non-deterministic generation on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.

get_benchmark_rewards: prepares reward data for the best-of-N experiments, using cutting-edge reward models to score the sampled model responses.

best_of_n_eval: explores the potential of non-deterministic LLM generation with a best-of-N strategy, which selects the best response from N sampled generations.

🛠️ Setup

  1. Download the LLM samples from Hugging Face.
  2. pip install -r requirements.txt

📊 LLM Non-Determinism Analysis

We evaluate the non-deterministic generation of LLMs on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.

| Dataset      | Instance Num. | Sample Num. | Metric   |
|--------------|---------------|-------------|----------|
| AlpacaEval 2 | 805           | 16          | LC       |
| Arena-Hard   | 500           | 16          | Win Rate |
| WildBench v2 | 1024          | 16          | WB-Score |
| MixEval      | 4000          | 16          | Score    |
| MMLU-Redux   | 3000          | 32          | Acc      |
| GSM8K        | 1319          | 128         | EM       |
| HumanEval    | 164           | 128         | Pass@1   |

Taking AlpacaEval as an example, you can analyse the 16 sampled generations with:

bash scripts/eval_alpacaeval_sample_baseline.sh <DATA_PATH>/alpaca_eval
<p align="center"> <img src=assets/main.png width=800/> </p>

From the results, we observe a consistent performance gap between greedy decoding and the sampling method. Greedy decoding generally proves more effective for most tasks, except for AlpacaEval.
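The comparison described above can be sketched as follows: given one greedy score and the scores of several sampling runs on the same benchmark, report the sampling mean, spread, and gap to greedy. This is only an illustrative sketch; `summarize_sampling` is a hypothetical helper, not a function from this repo, and the numbers are toy values rather than results from the paper.

```python
from statistics import mean, pstdev

def summarize_sampling(greedy_score, sample_scores):
    """Compare one greedy score against scores from repeated sampling runs."""
    avg = mean(sample_scores)
    return {
        "greedy": greedy_score,
        "sample_mean": avg,
        "sample_std": pstdev(sample_scores),  # population std over the runs
        "sample_min": min(sample_scores),
        "sample_max": max(sample_scores),
        "greedy_minus_mean": greedy_score - avg,
    }

# Toy numbers for illustration only, not results from the paper.
summary = summarize_sampling(70.0, [66.0, 68.0, 70.0, 72.0])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A positive `greedy_minus_mean` means greedy decoding beats the average sampling run, while `sample_min` and `sample_max` show how far a single sampled evaluation could land from that average.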

🚀 Best-of-N Evaluation

First, use off-the-shelf reward models to score the LLM generations. We have implemented reward modeling code for Starling-RM, Eurus-RM, FsfairX, and ArmoRM. Taking AlpacaEval as an example:

bash scripts/get_alpacaeval_sample_rewards.sh Meta-Llama-3-8B-Instruct <DATA_PATH>/alpaca_eval

The reward information will be saved in benchmark_rewards/{model_name}/{dataset}/{reward_model_name}.json. Then, evaluate in a best-of-N setting:

bash scripts/eval_alpacaeval_sample_reward.sh <DATA_PATH>/alpaca_eval
<p align="center"> <img src=assets/reward.png width=800/> </p>

With oracle selection, even smaller LLMs such as Llama-3-8B-Instruct can outperform GPT-4-Turbo on MMLU, GSM8K, and HumanEval. Furthermore, cutting-edge reward models can also select superior responses from multiple generations, outperforming GPT-4-Turbo on GSM8K with only 8 samples.
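As a rough sketch of the two selection strategies, the snippet below computes best-of-N accuracy for a benchmark with binary correctness. It assumes that oracle selection means counting an instance as solved whenever any of the N samples is correct, and that reward-model selection picks the sample with the highest reward. The `best_of_n` function and its list-of-lists inputs are hypothetical and do not reflect the repo's actual JSON schema; the data are toy values.

```python
def best_of_n(correct, rewards, n):
    """Return (reward-selected accuracy, oracle accuracy) over the first n samples.

    correct[i][j] is 1 if sample j of instance i is correct, else 0.
    rewards[i][j] is the reward-model score of that same sample.
    """
    picked = 0
    oracle = 0
    for c, r in zip(correct, rewards):
        c_n, r_n = c[:n], r[:n]
        # Reward-model selection: keep the sample with the highest reward.
        best = max(range(len(r_n)), key=r_n.__getitem__)
        picked += c_n[best]
        # Oracle selection (assumed): solved if any sample is correct.
        oracle += max(c_n)
    total = len(correct)
    return picked / total, oracle / total

# Toy data for illustration only: 3 instances, 4 samples each.
correct = [[0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
rewards = [[0.2, 0.9, 0.1, 0.5], [0.8, 0.3, 0.6, 0.1], [0.4, 0.2, 0.3, 0.1]]

rm_acc, oracle_acc = best_of_n(correct, rewards, n=4)
```

Under these assumptions the oracle score is an upper bound on what any reward model can reach for a given N, so the gap between the two numbers indicates how much headroom a better reward model would have.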

📖 Citation

If you find this repo helpful, please cite our paper:

@article{song2024good,
    author={Yifan Song and Guoyin Wang and Sujian Li and Bill Yuchen Lin},
    title={The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism},
    year={2024},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
