
Complex Function Calling Benchmark (ComplexFuncBench)

<p align="center"> 📄<a href="https://arxiv.org/abs/2501.10132" target="_blank"> Arxiv Paper </a> • 🤗 <a href="https://huggingface.co/papers/2501.10132" target="_blank">HF Paper</a> • 📊 <a href="https://huggingface.co/datasets/THUDM/ComplexFuncBench" target="_blank">Dataset</a> </p>

Introduction

Complex Function Calling Benchmark (ComplexFuncBench) is specially designed for evaluating complex function calling. The ComplexFuncBench dataset encompasses 1,000 complex function calling samples covering five aspects: (1) function calling with multiple steps in a single turn; (2) function calling with user-provided constraints; (3) function calling that requires reasoning about parameter values from implicit information; (4) function calling with long parameter values that exceed 500 tokens; and (5) function calling with a 128k long-context length.

[Figure: Complex Example]

The difference between ComplexFuncBench and other function calling benchmarks is shown in the following table.

| | Real API Response | Multi-Step | Constraints | Parameter Value Reasoning | Long Parameter Reasoning | Long-Context |
| :----------------: | :---------------: | :--------: | :---------: | :-----------------------: | :----------------------: | :----------: |
| API-Bench | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| ToolBench | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| T-Eval | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| BFCL | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ |
| Tool Sandbox | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
| ComplexFuncBench | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

Leaderboard

| Model | Overall Success Rate | Overall Call Acc. | Completeness | Correctness |
| :--------------------------- | :------------------: | :---------------: | :----------: | :---------: |
| Claude-3.5-Sonnet (20241022) | 61.00 | 79.27 | 1.84 | 1.85 |
| GPT-4o (2024-08-06) | 60.50 | 80.55 | 1.66 | 1.75 |
| GLM-4-Long | 57.10 | 76.35 | 1.72 | 1.74 |
| GPT-4-Turbo (2024-04-09) | 49.50 | 71.38 | 1.72 | 1.81 |
| Claude-3.5-Haiku (20241022) | 45.80 | 69.50 | 1.79 | 1.71 |
| Qwen2.5-72B | 40.10 | 58.32 | 1.80 | 1.75 |
| Mistral Large 2 | 20.10 | 48.78 | 0.94 | 1.0 |
| GLM-4-9B | 9.40 | 27.97 | 1.15 | 1.03 |
| Qwen2.5-7B | 5.0 | 18.19 | 1.5 | 1.47 |
| Llama-3.1-405B | 4.00 | 11.87 | 0.43 | 0.30 |
| Llama-3.1-70B | 2.70 | 8.17 | 0.67 | 0.36 |
| Llama-3.1-8B | 0.10 | 1.34 | 0.18 | 0.09 |

Method

Data Collection

The collection of the ComplexFuncBench dataset consists of three stages: coarse generation, fine-grained annotation, and generalization. The dataset contains 1,000 complex function-calling samples, which comprise 600 single-domain samples and 400 cross-domain samples.

[Figure: Data Collection]

Automated Evaluation

The automated evaluation framework `ComplexEval` evaluates models' complex function calling ability and response generation ability simultaneously.

[Figure: Evaluation Pipeline]
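To make the leaderboard columns more concrete, here is a minimal sketch of how per-sample results could be aggregated into the four reported numbers. It is not the `ComplexEval` implementation. The `SampleResult` record is hypothetical, and the sketch assumes that a sample counts as a success only when every one of its calls is correct, that call accuracy pools calls across all samples, and that completeness and correctness are judged on a 0 to 2 scale (which the leaderboard values suggest but this README does not state).

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Hypothetical per-sample record; not the repository's actual schema."""
    call_correct: list[bool]  # one flag per expected function call
    completeness: float       # judged response score (assumed 0-2 scale)
    correctness: float        # judged response score (assumed 0-2 scale)


def summarize(results: list[SampleResult]) -> dict[str, float]:
    """Aggregate per-sample results into leaderboard-style metrics."""
    n = len(results)  # must be non-zero
    total_calls = sum(len(r.call_correct) for r in results)
    correct_calls = sum(sum(r.call_correct) for r in results)
    return {
        # Assumption: a sample succeeds only if all of its calls are correct.
        "success_rate": 100 * sum(all(r.call_correct) for r in results) / n,
        # Assumption: call accuracy is pooled over every call in every sample.
        "call_acc": 100 * correct_calls / total_calls,
        "completeness": sum(r.completeness for r in results) / n,
        "correctness": sum(r.correctness for r in results) / n,
    }
```

Under these assumptions, one failed call in a multi-step sample lowers call accuracy only slightly but turns the whole sample into a failure, which would explain why the success rate column sits well below the call accuracy column for every model.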

How to evaluate on ComplexFuncBench

Preparation

First, download the repository and the dataset. The benchmark dataset is available from Hugging Face Datasets.

git clone https://github.com/THUDM/ComplexFuncBench.git
cd ComplexFuncBench
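The clone above fetches the code only. One possible way to fetch the dataset is sketched below using `snapshot_download` from the `huggingface_hub` package. The dataset ID comes from the Dataset link at the top of this README. The `data` target folder is a hypothetical choice, not a location the repository is known to expect, so adjust it to wherever the evaluation script reads its input.

```python
DATASET_REPO = "THUDM/ComplexFuncBench"  # dataset ID from the link in this README


def download_dataset(local_dir: str = "data") -> str:
    """Download a snapshot of the dataset repo and return its local path.

    `local_dir` is a hypothetical target folder, not one the repository mandates.
    """
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=DATASET_REPO,
        repo_type="dataset",  # the ID refers to a dataset, not a model
        local_dir=local_dir,
    )
```

Calling `download_dataset()` requires network access and `pip install huggingface_hub`.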

Then, install the dependencies.

pip install -r requirements.txt

Serve Model

  • For closed-source models, make sure the corresponding model API keys are set in your environment file .env. To enable response-based evaluation, you also need to subscribe to the Booking API from RapidAPI.

    OPENAI_API_KEY=sk-XXXXXX
    
    RAPID_API_KEY=
    
  • For open-source models, you need to deploy your model via vLLM. For example, to serve THUDM/glm-4-9b-chat, run the following command:

    vllm serve THUDM/glm-4-9b-chat --api-key token-abc123 --tensor-parallel-size 4 --gpu-memory-utilization 0.95 --max_model_len 131072 --trust-remote-code
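Before launching a long evaluation run, it can help to confirm that both setups above are in place. The following is a minimal preflight sketch, not part of the repository. It assumes the .env file uses plain KEY=VALUE lines as shown above, and that the vLLM server exposes the OpenAI-compatible `/v1/models` endpoint. The helper names `parse_env`, `missing_keys`, and `models_request` are hypothetical.

```python
from urllib.request import Request

REQUIRED_KEYS = ("OPENAI_API_KEY", "RAPID_API_KEY")  # the two keys shown above


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


def models_request(base_url: str, api_key: str) -> Request:
    """Build a GET request for the server's model listing endpoint."""
    url = base_url.rstrip("/") + "/models"
    return Request(url, headers={"Authorization": f"Bearer {api_key}"})
```

For example, `missing_keys(parse_env(open(".env").read()))` lists any keys that still need a value, and passing `models_request("http://localhost:8000/v1", "token-abc123")` to `urllib.request.urlopen` should return a JSON model list once the server is up. The `localhost` address is a placeholder for wherever your server runs.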
    

Run Model Inference

python evaluation.py --model_name {model_name} --proc_num {proc_num}

For example, with gpt-4o-2024-08-06 and THUDM/glm-4-9b-chat:

python evaluation.py --model_name gpt-4o-2024-08-06 --proc_num 50
python evaluation.py --model_name THUDM/glm-4-9b-chat --proc_num 50 --vllm_url http://xx.xx.xx.xx:8000/v1
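When evaluating several models in a row, the commands above can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of the repository. It uses only the three flags shown above (`--model_name`, `--proc_num`, `--vllm_url`) and assumes it is run from the repository root, where `evaluation.py` lives.

```python
import subprocess
import sys


def build_eval_command(model_name, proc_num=50, vllm_url=None):
    """Assemble the evaluation.py invocation using only the flags shown above."""
    cmd = [
        sys.executable, "evaluation.py",
        "--model_name", model_name,
        "--proc_num", str(proc_num),
    ]
    if vllm_url:  # only needed for models served through vLLM
        cmd += ["--vllm_url", vllm_url]
    return cmd


def run_all(models):
    """Run each (model_name, vllm_url) pair in turn, stopping on the first failure."""
    for model_name, vllm_url in models:
        subprocess.run(build_eval_command(model_name, vllm_url=vllm_url), check=True)
```

A call such as `run_all([("gpt-4o-2024-08-06", None), ("THUDM/glm-4-9b-chat", "http://xx.xx.xx.xx:8000/v1")])` would then reproduce the two example commands, with the placeholder address replaced by your own server.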

The evaluation results are saved in result/{model_name}.

Export Results

python print_results.py --result_dir {result_dir}

Citation

If you find our work helpful for your research, please consider citing it.

@misc{zhong2025complexfuncbench,
      title={ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario}, 
      author={Lucen Zhong and Zhengxiao Du and Xiaohan Zhang and Haiyi Hu and Jie Tang},
      year={2025},
      eprint={2501.10132},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.10132}, 
}