CELLO
Code and data for the paper "Can Large Language Models Understand Real-World Complex Instructions?"(AAAI2024)
CELLO is a benchmark for systematically evaluating the ComplEx instruction understanding ability of Large Language MOdels (AAAI 2024).
- We design eight features for complex instructions and construct a comprehensive evaluation dataset from real-world scenarios.
- We establish four criteria and develop corresponding metrics, because existing metrics are inadequate, biased, or too strict and coarse-grained.
- We compare the performance of representative Chinese-oriented and English-oriented models in following complex instructions through extensive experiments.
Install Dependencies
conda create -n cello python=3.10.9
conda activate cello
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -r requirements.txt
Evaluate Models
You can evaluate any desired model via the following script, eval.sh:
cd CELLO/
CUDA_VISIBLE_DEVICES=0 python code/eval.py --model_name chatglm --save_name chatglm
All the models are implemented in the folder code/evaluators. All the model results are in the folder results/.
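To evaluate several models in sequence, the command above can be wrapped in a small driver. The following is a minimal sketch and not part of the repository: the helpers `build_eval_command` and `run_eval` are hypothetical, the default of reusing the model name as the save name is an assumption, and only the `--model_name` and `--save_name` flags shown above are used. `chatglm` is the only model name confirmed here; check code/evaluators for the names of other models.

```python
import os
import subprocess


def build_eval_command(model_name, save_name=None, gpu="0"):
    """Return (env_overrides, argv) mirroring the eval.sh command above.

    Hypothetical helper: save_name falls back to model_name (an assumption).
    """
    if save_name is None:
        save_name = model_name
    env_overrides = {"CUDA_VISIBLE_DEVICES": str(gpu)}
    argv = [
        "python", "code/eval.py",
        "--model_name", model_name,
        "--save_name", save_name,
    ]
    return env_overrides, argv


def run_eval(model_name, save_name=None, gpu="0", cwd="CELLO"):
    """Run one evaluation from inside the CELLO/ directory.

    Raises subprocess.CalledProcessError if eval.py exits with an error.
    """
    env_overrides, argv = build_eval_command(model_name, save_name, gpu)
    env = {**os.environ, **env_overrides}  # keep the current environment, pin the GPU
    subprocess.run(argv, cwd=cwd, env=env, check=True)
```

Calling `run_eval("chatglm")` from the directory that contains CELLO/ should reproduce the command above; looping over a list of model names runs them one after another on the same GPU.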
Scoring System
The metrics for our four criteria can be computed with the following script, score.sh:
cd CELLO/
python code/score.py
All the scorers are implemented in the folder code/scorers. All the scoring results are in the folder scores/.
Data
The collected data can be found in the data/ folder. All samples have been anonymized.
Citation
@inproceedings{he2024can,
title={Can Large Language Models Understand Real-World Complex Instructions?},
author={He, Qianyu and Zeng, Jie and Huang, Wenhao and Chen, Lina and Xiao, Jin and He, Qianxi and Zhou, Xunzhe and Liang, Jiaqing and Xiao, Yanghua},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={16},
pages={18188--18196},
year={2024}
}