
ShortcutsBench

ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents


<div align="center"> <h1>🔧ShortcutsBench📱</h1> </div>

Read this in Chinese.

What are Shortcuts?

Shortcuts are workflows built by developers in the Shortcuts app using a user-friendly graphical interface 🖼️ with the provided basic actions. Apple describes them as "a quick way to get one or more tasks done with your apps." 📱

Project Task List (Continuously Updated) 📋

All data, data acquisition processes, data generated during cleaning, cleaning scripts, experiment scripts, results, and related files can be found in the following documents: deves_dataset/dataset_src/README.md (English) or Chinese, deves_dataset/dataset_src_valid_apis/README.md (English) or Chinese, and experiments/README.md (English) or Chinese.

  • [x] ShortcutsBench Paper Main Text
  • [x] ShortcutsBench Paper Appendix
  • [x] Scripts for Data Acquisition, Data Cleaning and Processing, Experiment Code, and Experiment Results
  • [x] We provide shortcuts with bilingual explanations for regular users: listed in users_dataset/${website name}/${category name}/README.md (English) or users_dataset/${website name}/${category name}/README_ZH.md (Chinese). Regular users can find suitable shortcuts for their work or life in our repository, which they can import into the Shortcuts app on Apple devices. Each shortcut includes:
    1. The iCloud link for the shortcut
    2. A description of the shortcut's functionality
    3. The source of the shortcut
  • For shortcut researchers, ShortcutsBench provides: (1) Shortcuts (i.e., golden action sequences); (2) Queries (i.e., tasks assigned to the agent); (3) APIs (i.e., tools available to the agent).
    • [x] Shortcuts

    • [x] Queries. The generated queries are shown in generated_success_queries.json, which can be obtained from Google Drive or Baidu Cloud (password: shortcutsbench).

      The queries are generated based on 1_final_detailed_records_filter_apis_leq_30.json.

    • [x] APIs. The obtained APIs are shown in 4_api_json_filter.json, which can be obtained from Google Drive or Baidu Cloud (password: shortcutsbench).

      4_api_json_filter.json has been manually deduplicated, but a few duplicates remain. The raw unprocessed files extracted directly from the app are in 4_api_json.json, which can be obtained from Google Drive or Baidu Cloud (password: shortcutsbench).
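
Since a few duplicates are said to remain in 4_api_json_filter.json, a quick check along the following lines might help locate them. This is only a minimal sketch, not the project's cleaning script: the file's schema is not described here, so the sketch assumes the data can be flattened into a list of dictionary records, and the field names in the sample (`app`, `action`) are hypothetical.

```python
import json
from collections import defaultdict


def find_duplicates(records):
    """Group records that are identical once serialized with sorted keys.

    Returns a list of index groups, one per set of duplicate records.
    """
    groups = defaultdict(list)
    for index, record in enumerate(records):
        # Canonical form: key order does not affect equality.
        canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
        groups[canonical].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]


# Hypothetical sample records; the real file layout may differ.
sample = [
    {"app": "Notes", "action": "CreateNote"},
    {"action": "CreateNote", "app": "Notes"},  # same content, different key order
    {"app": "Maps", "action": "GetDirections"},
]
duplicate_groups = find_duplicates(sample)
print(duplicate_groups)
```

To try this on the downloaded file, one would load it with `json.load` and pass in a list of records. If the top-level object turns out to be a dictionary, `list(data.values())` may be a reasonable starting point, though that depends on the actual structure.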

How can this project help you?

Apple's Worldwide Developers Conference (WWDC'24) introduced many AI features for Apple devices 🤖. We are very interested in how Apple combines large language models like ChatGPT with its devices to provide users with a smarter experience 💡. In this process, shortcuts will play a significant role! 🚀

As a Shortcut User and Enthusiast 📱

You can find your favorite shortcuts in this dataset 📱 to help you complete various complex tasks with one click!

As a Researcher 🔬

  • Research on building automated workflows: Shortcuts are essentially workflows composed of a series of API calls (actions) provided by Apple and third-party apps 🔍.
  • Research on low-code programming: Shortcuts include features like branches, loops, and variable assignments, while having a user-friendly graphical interface 🖥️.
  • Research on API-based agents: Enabling large language models to autonomously decide whether, when, and how to use APIs based on user queries (tasks) 🔧.
  • Research on fine-tuning large language models using shortcuts to closely integrate language models with phones, computers, and smartwatches, achieving the vision of an "operating system based on large language models" 📈.
  • ......
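
To make the "workflow of API calls" view more concrete, here is a minimal sketch of a shortcut-like action sequence with a variable reference and a simple branch. All names (`Ref`, `Action`, `run_workflow`, and the toy `get_battery`, `format_text`, and `show_alert` actions) are hypothetical illustrations. They do not reflect the actual Shortcuts file format or the ShortcutsBench data schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Ref:
    """Reference to the output of an earlier action, by variable name."""
    name: str


@dataclass
class Action:
    api: str                                              # name of the API (action) to call
    params: Dict[str, Any] = field(default_factory=dict)  # literal values or Ref objects
    output: Optional[str] = None                          # variable that stores the result
    condition: Optional[Callable[[Dict[str, Any]], bool]] = None  # branch guard


def run_workflow(actions: List[Action], apis: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run actions in order, resolving Ref parameters from earlier outputs."""
    variables: Dict[str, Any] = {}
    for action in actions:
        # Skip the action when its guard evaluates to False (a simple "if" branch).
        if action.condition is not None and not action.condition(variables):
            continue
        resolved = {
            key: variables[value.name] if isinstance(value, Ref) else value
            for key, value in action.params.items()
        }
        result = apis[action.api](**resolved)
        if action.output is not None:
            variables[action.output] = result
    return variables


# Toy APIs standing in for real app actions.
shown: List[str] = []
toy_apis = {
    "get_battery": lambda: 15,
    "format_text": lambda level: f"Battery low: {level}%",
    "show_alert": lambda text: shown.append(text),
}

workflow = [
    Action(api="get_battery", output="level"),
    Action(api="format_text", params={"level": Ref("level")}, output="message",
           condition=lambda v: v["level"] < 20),
    Action(api="show_alert", params={"text": Ref("message")},
           condition=lambda v: "message" in v),
]
final_vars = run_workflow(workflow, toy_apis)
print(shown)
```

In this toy setup, an API-based agent would have to produce the `workflow` list itself from a user query, choosing which APIs to call and which earlier outputs to pass along as parameters.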

🌟Advantages of ShortcutsBench Over Existing API-Based Agent Datasets🌟

ShortcutsBench has significant advantages in terms of the authenticity, richness, and complexity of APIs, the validity of queries and corresponding action sequences, the accurate filling of parameter values, the awareness of obtaining information from the system or users, and the overall scale.

To our knowledge, ShortcutsBench is the first large-scale agent benchmark built on real APIs that jointly covers APIs, queries, and their corresponding action sequences. It provides a rich set of real APIs, queries of varying difficulty and task types drawn from real user needs, and high-quality human-annotated action sequences (provided by shortcut developers). It also offers precise parameter value filling, including raw data types, enumeration types, and outputs from previous actions used as parameter values, and it evaluates the agent's awareness of requesting necessary information from the system or users. Moreover, the scale of APIs, queries, and corresponding action sequences in ShortcutsBench rivals or even surpasses that of benchmarks and datasets created by LLMs or modified from existing datasets. A comprehensive comparison between ShortcutsBench and existing benchmarks/datasets is shown in the table below.

[Image: comparison of ShortcutsBench with existing benchmarks/datasets]

If you find this project helpful, please give us a Star ⭐️! Thank you for your support! 🙏

Keywords: Shortcuts, Apple, WWDC'24, Siri, iOS, macOS, wat
