
ToolBench

[ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning.


<div align= "center"> <h1> 🛠️ToolBench🤖</h1> </div> <div align="center">


</div> <p align="center"> <a href="#model">Model</a> • <a href="#data">Data Release</a> • <a href="#web-ui">Web Demo</a> • <a href="#tooleval">Tool Eval</a> • <a href="https://arxiv.org/pdf/2307.16789.pdf">Paper</a> • <a href="#citation">Citation</a> </p> </div> <div align="center"> <img src="assets/ToolLLaMA-logo.png" width="350px"> </div>

🔨This project (ToolLLM) aims to construct open-source, large-scale, high-quality instruction tuning SFT data to facilitate the construction of powerful LLMs with general tool-use capability. We aim to empower open-source LLMs to master thousands of diverse real-world APIs. We achieve this by collecting a high-quality instruction-tuning dataset. It is constructed automatically using the latest ChatGPT (gpt-3.5-turbo-16k), which is upgraded with enhanced function call capabilities. We provide the dataset, the corresponding training and evaluation scripts, and a capable model ToolLLaMA fine-tuned on ToolBench.

2024.8 Update: We have updated the RapidAPI server with a new IP; please make sure you get the latest code. You can also build it locally using the code here.

💁‍♂️💁💁‍♀️ Join Us on Discord!

Read this in 中文.

What's New

  • [2024/3/17] Welcome to StableToolBench: a stable and reliable local ToolBench server based on API response simulation. Dive deeper into the tech behind StableToolBench in the paper here and explore more on the project homepage. The code is available here.

  • [2023/9/29] A new version of ToolEval is released; it is more stable and covers more models, including GPT4! Please refer to ToolEval for more details. In addition, ToolLLaMA-2-7b-v2 is released with stronger tool-use capabilities. Please use the ToolLLaMA-2-7b-v2 model to reproduce our latest experimental results with the new version of ToolEval.

  • [2023/8/30] Data update, with more than 120,000 solution path annotations and intact reasoning thoughts! Please find data.zip on Google Drive.

  • [2023/8/8] No more hallucination! ToolLLaMA-2-7b-v1 (fine-tuned from LLaMA-2-7b) is released with lower API hallucination than ChatGPT.

  • [2023/8/4] We provide a RapidAPI backend service to free you from using your own RapidAPI key and subscribing to the APIs. Please fill out our form. We will review it as soon as possible and send you the ToolBench key to get started!

  • [2023/8/1] Our paper is released.

  • [2023/7/27] A new version of ToolBench is released.

✨Here is an overview of the dataset construction, training, and evaluation.

<br> <div align="center"> <img src="assets/overview.png" width="800px"> </div> <br>

✨✨Features:

  • API Collection: we gather 16464 representational state transfer (REST) APIs from RapidAPI, a platform that hosts massive real-world APIs provided by developers.
  • Instruction Generation: we curate instructions that involve both single-tool and multi-tool scenarios.
  • Answer Annotation: we develop a novel depth-first search-based decision tree (DFSDT) to bolster the planning and reasoning ability of LLMs, which significantly improves the annotation efficiency and successfully annotates those complex instructions that cannot be answered with CoT or ReAct. We provide responses that not only include the final answer but also incorporate the model's reasoning process, tool execution, and tool execution results.
  • API Retriever: we incorporate API retrieval to equip ToolLLaMA with open-domain tool-using abilities.
  • All the data is automatically generated by the OpenAI API and filtered by us; the whole data creation process is easy to scale up.
<br> <div align="center"> <img src="assets/comparison.png" width="800px"> </div> <br>
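To make the API retrieval feature above more concrete, here is a minimal sketch of retrieving candidate APIs for an instruction. It is not the ToolBench retriever: it swaps in a plain bag-of-words cosine similarity as a stand-in for a trained retriever, and the function names (`tokenize`, `cosine`, `retrieve_apis`) and the three API descriptions are hypothetical, invented only for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_apis(instruction, api_docs, top_k=2):
    """Return the names of the top_k APIs whose descriptions best match the instruction."""
    query = Counter(tokenize(instruction))
    scored = [
        (cosine(query, Counter(tokenize(description))), name)
        for name, description in api_docs.items()
    ]
    # Highest score first; ties are broken alphabetically by API name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical API descriptions, invented for this example.
API_DOCS = {
    "get_weather": "current weather forecast for a city",
    "convert_currency": "convert an amount between two currencies",
    "search_flights": "search flights between two airports on a date",
}

print(retrieve_apis("What is the weather forecast in Paris?", API_DOCS, top_k=1))
```

The retrieved names would then be handed to the model as its available tools, which is what lets it work in an open-domain setting instead of with a fixed, hand-picked API list.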

We also provide a demo of using ToolLLaMA:

<div align="center">

https://github.com/OpenBMB/ToolBench/assets/25274507/f1151d85-747b-4fac-92ff-6c790d8d9a31

</div>

Currently, our ToolLLaMA has reached the performance of ChatGPT (turbo-16k) in tool use. In the future, we will continually improve the data quality and increase the coverage of real-world tools.

<div align="center"> <img src="assets/performance.png" width="300px"> </div>

Here is the old version of ToolBench.

Data

👐ToolBench is intended solely for research and educational purposes and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. It is distributed under Apache License 2.0. Below are the statistics of the data:

| Tool Nums | API Nums | Instance Nums | Real API Call | Reasoning Traces |
|-----------|----------|---------------|---------------|------------------|
| 3451 | 16464 | 126486 | 469585 | 4.0 |

We crawl 16000+ real-world APIs from RapidAPI, and curate realistic human instructions that involve them. Below we present a hierarchy of RapidAPI and our instruction generation process.

<br> <div align="center"> <img src="assets/instructiongeneration.png" width="800px"> </div> <br>

ToolBench contains both single-tool and multi-tool scenarios. The multi-tool scenarios can be further categorized into intra-category multi-tool and intra-collection multi-tool. We use the DFSDT method for data creation in all scenarios. Here is an illustration of the data creation process using the DFSDT method:

<div align="center"> <img src="assets/answer_anno.png" width="800px"> </div>
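The general idea behind a depth-first search over a decision tree can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ToolBench implementation: the names `dfs_decision_tree`, `expand`, and `is_final` are hypothetical, and the toy arithmetic task stands in for the real setting, where candidate steps would come from an LLM choosing API calls.

```python
def dfs_decision_tree(state, expand, is_final, max_depth, path=None):
    """Depth-first search that returns the first action path reaching a final state.

    expand(state) returns (action, next_state) candidates in preference order.
    An empty result means the branch is abandoned and the search backtracks.
    """
    path = [] if path is None else path
    if is_final(state):
        return path
    if max_depth == 0:
        return None  # depth budget exhausted: give up on this branch
    for action, next_state in expand(state):
        result = dfs_decision_tree(
            next_state, expand, is_final, max_depth - 1, path + [action]
        )
        if result is not None:
            return result
    return None  # every child failed: backtrack to the parent


# Toy task (an assumption for illustration): reach TARGET from 1 using two "tools".
TARGET = 10


def expand(value):
    """Propose next steps, dropping candidates that overshoot the target."""
    candidates = [("double", value * 2), ("add_three", value + 3)]
    return [(action, nxt) for action, nxt in candidates if nxt <= TARGET]


def is_final(value):
    return value == TARGET


print(dfs_decision_tree(1, expand, is_final, max_depth=4))
# -> ['double', 'double', 'add_three', 'add_three']
```

In this run the search first follows 1 → 2 → 4 → 8, finds that no step from 8 stays within the target, and backtracks to 4 to try `add_three` instead. A single linear chain of steps has no way to recover from that kind of dead end, whereas a depth-first search can abandon the branch and continue from an earlier node.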

Data Release

Please download our dataset using the following link: Google Drive or Tsinghua Cloud. Note that data_0801 is the old version of the data. The file structure is as follows:

```
├── /data/
│  ├── /instruction/
│  ├── /answer/
│  ├── /toolenv/
│  ├── /retrieval/
│  ├── /test_instruction/
│  ├── /test_query_ids/
│  ├── /retrieval_test_query_ids/
│  ├── toolllama_G123_dfs_train.json
│  └── toolllama_G123_dfs_eval.json
├── /reproduction_data/
│  ├── /chatgpt_cot/
│  ├── /chatgpt_dfs/
│  ├── ...
│  └── /toolllama_dfs/
```

Here are some descriptions for the data directory:

  • instruction and answer: The instruction data and solution path annotation data. G1, G2, and G3 refer to single-tool, intra-category multi-tool, and intra-collection multi-tool data respectively. We also have an Atlas Explorer for visualization.
  • toolenv: The tool environment related data, containing API JSON files, API code, and API example responses.
  • retrieval: The data used for tool retrieval is included in this directory.
  • test_instruction and test_query_ids: We sample 200 instances from every test set. The test_instruction directory contains test queries for each test set, and the test_query_ids contains query ids of the test instances in each test set.
  • retrieval_test_query_ids: This directory contains query ids of the test instances for the retriever.
  • toolllama_G123_dfs_train.json and toolllama_G123_dfs_eval.json: Preprocessed data that can be used to train ToolLLaMA directly and reproduce our results. For preprocessing, we split each of the G1, G2, and G3 datasets into train, eval, and test parts, and combine the train parts for training in our main experiments.

Please make sure you have downloaded the necessary data and put the directory (e.g. data/) under ToolBench/, so that the following bash scripts can navigate to the related data.
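Before running the scripts, it can help to confirm that the download landed where they expect it. The check below is a small sketch, not part of the ToolBench codebase: the helper name `missing_data_entries` is hypothetical, and the expected entries are taken from the file structure listed above.

```python
from pathlib import Path

# Entries expected directly under data/, per the file structure above.
EXPECTED_DIRS = [
    "instruction",
    "answer",
    "toolenv",
    "retrieval",
    "test_instruction",
    "test_query_ids",
    "retrieval_test_query_ids",
]
EXPECTED_FILES = [
    "toolllama_G123_dfs_train.json",
    "toolllama_G123_dfs_eval.json",
]


def missing_data_entries(data_root):
    """Return the expected entries that are absent under data_root, sorted by name."""
    root = Path(data_root)
    missing = [name + "/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return sorted(missing)


if __name__ == "__main__":
    # Assumes the script is run from the ToolBench/ repository root.
    missing = missing_data_entries("data")
    print("data/ looks complete" if not missing else f"missing under data/: {missing}")
```

Run from the ToolBench/ directory, it lists any expected entry that is missing so a misplaced archive shows up before a training or evaluation script fails on a path.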

🤖Model

We release ToolLLaMA-2-7b-v2, which is trained on the latest version of the data, and ToolLLaMA-7b-v1 and ToolLLaMA-7b-LoRA-v1, which are trained on the 0801 version of the data. All models are trained on the released dataset in a multi-task fashion. We also release the tool retriever trained under our experimental setting.

🚀Fine-tuning

Install

Clone this repository and navigate to the ToolBench folder.

```bash
git clone git@github.com:OpenBMB/ToolBench.git
cd ToolBench
```

Install Package (python>=3.9)

```bash
pip install -r requirements.txt
```

or for ToolEval only

```bash
pip install -r toolbench/tooleval/requirements.txt
```

Prepare the data and tool environment:

wget --no-check-certificate 'https://drive.google.com/uc?export=downlo