
💬 DebateLLM - debating LLMs for truth discovery in medicine and beyond

👀 Overview

DebateLLM is a library encompassing a variety of debating protocols and prompting strategies aimed at enhancing the accuracy of Large Language Models (LLMs) on Q&A datasets.

Our research (mostly using GPT-3.5) reveals that no single debate or prompting strategy consistently outperforms the others across all scenarios. It is therefore important to experiment with various approaches to find what works best for each dataset. However, implementing each protocol from scratch is time-consuming, so we built and open-sourced DebateLLM for the research community. It enables researchers to test implementations from the literature on their own problems (medical or otherwise), potentially driving further advances in the intelligent prompting of LLMs.
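The comparison workflow this motivates can be sketched as a simple benchmarking loop. The protocol functions and accuracy helper below are illustrative stand-ins, not DebateLLM's actual API:

```python
from typing import Callable

# Illustrative stand-ins for debate protocols; DebateLLM's real
# implementations run multi-agent debates rather than returning constants.
def society_of_minds(question: str) -> str:
    return "A"

def multi_persona(question: str) -> str:
    return "B"

def accuracy(protocol: Callable[[str], str],
             qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of questions the protocol answers correctly."""
    correct = sum(protocol(q) == answer for q, answer in qa_pairs)
    return correct / len(qa_pairs)

# A tiny toy Q&A set; real experiments would load MedQA, PubMedQA, etc.
qa = [("Q1", "A"), ("Q2", "B"), ("Q3", "A")]
for name, proto in [("SocietyOfMinds", society_of_minds),
                    ("MultiPersona", multi_persona)]:
    print(f"{name}: {accuracy(proto, qa):.2f}")
```

The point of the loop is that the best protocol is an empirical question per dataset, which is exactly what DebateLLM's experiment runner automates.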

We have various system implementations:

<table align="center"> <tr> <td align="center"><img src="imgs/SocietyOfMind.png" style="height:350px; width:auto;"></td> <td align="center"><img src="imgs/Medprompt.png" style="height:350px; width:auto;"></td> <td align="center"><img src="imgs/MultiPersona.png" style="height:350px; width:auto;"></td> </tr> <tr> <td align="center"><a href="https://arxiv.org/abs/2305.14325">Society of Minds</a></td> <td align="center"><a href="https://arxiv.org/abs/2311.16452">Medprompt</a></td> <td align="center"><a href="https://arxiv.org/abs/2307.05300">Multi-Persona</a></td> </tr> </table> <table align="center"> <tr> <td align="center"><img src="imgs/EnsembleRefinement.png" style="height:100px; width:auto;"></td> <td align="center"><img src="imgs/ChatEval.png" style="height:100px; width:auto;"></td> <td align="center"><img src="imgs/SPP.png" style="height:100px; width:auto;"></td> </tr> <tr> <td align="center"><a href="https://arxiv.org/abs/2305.09617">Ensemble Refinement</a></td> <td align="center"><a href="https://arxiv.org/abs/2308.07201">ChatEval</a></td> <td align="center"><a href="https://arxiv.org/abs/2307.05300">Solo Performance Prompting</a></td> </tr> </table>

🔧 Installation

To set up the DebateLLM environment, execute the following command:

make build_venv

🚀 Running an Experiment

To run an experiment:

  1. Activate the Python virtual environment:

    source venv/bin/activate
    
  2. Execute the evaluation script:

    python ./experiments/evaluate.py
    

    You can modify experiment parameters via the Hydra configs located in the conf folder. The main configuration file is conf/config.yaml. Dataset- and system-level changes can be made by updating the configs in conf/dataset and conf/system.
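As a rough illustration, a Hydra-style config for this layout might look like the fragment below. The key names and option values are assumptions for illustration; the actual names depend on the files shipped in conf/dataset and conf/system:

```yaml
# Hypothetical sketch of conf/config.yaml (Hydra defaults list).
defaults:
  - dataset: medqa        # picks a file from conf/dataset
  - system: multi_persona # picks a file from conf/system

# Experiment-level parameters would live alongside the defaults.
max_questions: 50
```

With Hydra, the same choices can also be overridden on the command line (e.g. `python ./experiments/evaluate.py dataset=pubmedqa`), again assuming those group names exist.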

To launch multiple experiments:

python ./scripts/launch_experiments.py

📊 Visualising Results

To visualise the results with Neptune:

  1. Run the visualisation script:
    python ./scripts/visualise_results.py
    
  2. The output results will be saved to ./data/charts/.

📊 Benchmarks

Our benchmarks showcase DebateLLM's performance on the MedQA, PubMedQA, and MMLU datasets, focusing on accuracy versus cost, time efficiency, token economy, and the impact of agent agreement. All experiments use GPT-3.5 unless specified otherwise. These visualisations illustrate the balance between accuracy and computational cost, the speed and quality of responses, linguistic efficiency, and the effects of consensus strategies in medical Q&A contexts. Each dataset highlights the varied capabilities of DebateLLM's strategies.

MedQA Dataset

<div> <img src="./imgs/results/MedQA_Average seconds per question_scatter_plots.png" alt="Average Seconds per Question vs. Accuracy MedQA" width="46.5%"/> <img src="./imgs/results/MedQA_Average tokens per question_scatter_plots.png" alt="Average Tokens per Question vs. Accuracy MedQA" width="46.5%"/> </div> <div> <img src="./imgs/results/MedQA_Total cost_scatter_plots.png" alt="Accuracy vs. Cost MedQA" width="51.8%"/> <img src="./imgs/results/medqa_total_acc_box.png" alt="Total Accuracy Box MedQA" width="41.2%"/> </div>

PubMedQA Dataset

<div> <img src="./imgs/results/PubMedQA_Average seconds per question_scatter_plots.png" alt="Average Seconds per Question vs. Accuracy PubMedQA" width="46.5%"/> <img src="./imgs/results/PubMedQA_Average tokens per question_scatter_plots.png" alt="Average Tokens per Question vs. Accuracy PubMedQA" width="46.5%"/> </div> <div> <img src="./imgs/results/PubMedQA_Total cost_scatter_plots.png" alt="Accuracy vs. Cost PubMedQA" width="51.8%"/> <img src="./imgs/results/pubmedqa_total_acc_box.png" alt="Total Accuracy Box PubMedQA" width="41.2%"/> </div>

MMLU Dataset

<div> <img src="./imgs/results/MMLU_Average seconds per question_scatter_plots.png" alt="Average Seconds per Question vs. Accuracy MMLU" width="46.5%"/> <img src="./imgs/results/MMLU_Average tokens per question_scatter_plots.png" alt="Average Tokens per Question vs. Accuracy MMLU" width="46.5%"/> </div> <div> <img src="./imgs/results/MMLU_Total cost_scatter_plots.png" alt="Accuracy vs. Cost MMLU" width="51.8%"/> <img src="./imgs/results/mmlu_total_acc_box.png" alt="Total Accuracy Box MMLU" width="41.2%"/> </div>

Agent Agreement Analysis

Modulating the agreement intensity yields substantial performance improvements for several systems: approximately 15% for Multi-Persona and approximately 5% for Society of Minds (SoM) on the USMLE dataset. Applying the 90% agreement-intensity prompt to Multi-Persona achieves a new high score on the MedQA dataset, marked as a red cross in the MedQA cost plot.
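In this kind of setup, agreement intensity is modulated through the wording of the debate prompt. The sketch below is a hypothetical illustration of the idea; the exact prompt wording and thresholds used in the paper differ:

```python
# Hypothetical sketch of agreement-intensity prompt modulation.
def debate_prompt(other_answer: str, agreement_intensity: float) -> str:
    """Build a debate-round prompt whose pressure to agree scales with
    agreement_intensity in [0, 1]."""
    if agreement_intensity >= 0.9:
        stance = "You should almost certainly adopt the other agent's answer."
    elif agreement_intensity >= 0.5:
        stance = "Weigh the other agent's answer seriously before deciding."
    else:
        stance = "Be skeptical of the other agent's answer and defend your own."
    return (
        f"Another agent answered: {other_answer}\n"
        f"{stance}\n"
        "Give your final answer."
    )

print(debate_prompt("Option C", 0.9))
```

Sweeping `agreement_intensity` over a grid and measuring accuracy at each setting is how an optimum like the 90% value above would be found.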

<div> <img src="./imgs/results/prompt_agreement_vs_accuracy.png" alt="Agreement Intensity" width="47%"/> <img src="./imgs/results/agreement_vs_accuracy.png" alt="Agreement vs Accuracy" width="47%"/> </div>

The benchmarks indicate the effectiveness of various strategies and models implemented within DebateLLM. For detailed analysis and discussion, refer to our paper.

GPT-4 results

We also assessed GPT-4's capability on the MedQA dataset, applying the optimal agreement-modulation value identified for Multi-Persona with GPT-3.5 on USMLE. The results suggest that these hyperparameter settings transfer effectively to more advanced models. The results are shown below:

<div> <img src="./imgs/results/medqa_gpt4.png" alt="MedQA gpt4" width="51.8%"/> <img src="./imgs/results/medqa_gpt4_total_acc_box.png" alt="Total Accuracy Box MedQA" width="41.2%"/> </div>

Contributing 🤝

Please read our contributing docs for details on how to submit pull requests, our Contributor License Agreement, and community guidelines.

📚 Citing DebateLLM

If you use DebateLLM in your work, please cite our paper:

@article{smit2024mad,
  title={Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs},
  author={Smit, Andries and Duckworth, Paul and Grinsztajn, Nathan and Barrett, Thomas D. and Pretorius, Arnu},
  journal={arXiv preprint arXiv:2311.17371},
  year={2024},
  url={https://arxiv.org/abs/2311.17371}
}

Link to the paper: Benchmarking Multi-Agent Debate between Language Models for Medical Q&A.
