ShortcutsBench

ShortcutsBench: A Large-Scale Real-World Benchmark for API-Based Agents

Install / Use

/learn @EachSheep/ShortcutsBench

<div align="center"> <h1>🔧ShortcutsBench📱</h1> </div>

Read this in 中文.

What are Shortcuts?

Shortcuts are workflows that developers build in the Shortcuts app by combining the provided basic actions through a user-friendly graphical interface 🖼️. Apple describes them as "a quick way to get one or more tasks done with your apps." 📱

Project Task List (Continuously Updated) 📋

All data, data acquisition processes, data generated during cleaning, cleaning scripts, experiment scripts, results, and related files can be found in the following documents: deves_dataset/dataset_src/README.md (English) or Chinese, deves_dataset/dataset_src_valid_apis/README.md (English) or Chinese, and experiments/README.md (English) or Chinese.

- [x] ShortcutsBench Paper Main Text
- [x] ShortcutsBench Paper Appendix
- [x] Scripts for Data Acquisition, Data Cleaning and Processing, Experiment Code, and Experiment Results
- [x] We provide shortcuts with bilingual explanations for regular users: listed in users_dataset/${website name}/${category name}/README.md (English) or users_dataset/${website name}/${category name}/README_ZH.md (Chinese). Regular users can find suitable shortcuts for their work or life in our repository, which they can import into the Shortcuts app on Apple devices. Each shortcut includes:
  1. The iCloud link for the shortcut
  2. A description of the shortcut's functionality
  3. The source of the shortcut

For Shortcut Researchers, ShortcutsBench provides: (1) Shortcuts (i.e., the golden action sequences); (2) Queries (i.e., tasks assigned to the agent); (3) APIs (i.e., tools available to the agent).
- [x] Shortcuts
  - [x] Raw Shortcut Dataset, i.e., the file 1_final_detailed_records_remove_repeat.json, can be downloaded as described in deves_dataset/dataset_src/README.md (English) or deves_dataset/dataset_src/README_ZH.md (Chinese), or directly from Google Drive or Baidu Cloud (password: shortcutsbench).
    
    The APIs involved in the shortcuts in this file may not have corresponding API definition files.
  - [x] Filtered Shortcut Dataset, i.e., the file 1_final_detailed_records_filter_apis.json, can be downloaded as described in deves_dataset/dataset_src/README.md (English) or deves_dataset/dataset_src/README_ZH.md (Chinese), or directly from Google Drive or Baidu Cloud (password: shortcutsbench).
    
    The APIs involved in the shortcuts in this file all have corresponding API definition files. This file is a cleaned version of 1_final_detailed_records_remove_repeat.json. If a shortcut contains APIs without definition files, the shortcut is removed.
  - [x] Shortcuts Dataset <=30, i.e., the file 1_final_detailed_records_filter_apis_leq_30.json, can be downloaded as described in experiments/README.md (English) or experiments/README_ZH.md (Chinese), or directly from Google Drive or Baidu Cloud (password: shortcutsbench).
    
    Because of the context-length limits of language models, the ShortcutsBench paper evaluates only shortcuts of length <=30.
- [x] Queries. The generated queries are shown in generated_success_queries.json, which can be obtained from Google Drive or Baidu Cloud (password: shortcutsbench).
  
  The queries are generated based on 1_final_detailed_records_filter_apis_leq_30.json.
- [x] APIs. The obtained APIs are shown in 4_api_json_filter.json, which can be obtained from Google Drive or Baidu Cloud (password: shortcutsbench).
  
  4_api_json_filter.json has been manually deduplicated, but a few duplicates remain. The raw unprocessed files extracted directly from the app are in 4_api_json.json, which can be obtained from Google Drive or Baidu Cloud (password: shortcutsbench).
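
The two cleaning steps described in the list above (dropping shortcuts that use APIs without definition files, then keeping only shortcuts of length <=30) can be pictured with a minimal Python sketch. The record layout here is a simplified assumption, not the dataset's actual schema: the keys `name`, `actions`, and `api` are hypothetical, and "length" is assumed to mean the number of actions. The real cleaning scripts are the ones documented in the READMEs referenced above.

```python
from __future__ import annotations

# Hypothetical, simplified record layout (NOT the dataset's real schema):
# each shortcut is a dict with a "name" and an "actions" list, and each
# action is a dict whose "api" key names the API it calls.


def has_only_defined_apis(shortcut: dict, defined_apis: set[str]) -> bool:
    """Return True if every action in the shortcut calls a defined API."""
    return all(action["api"] in defined_apis for action in shortcut["actions"])


def filter_shortcuts(
    shortcuts: list[dict], defined_apis: set[str], max_length: int = 30
) -> list[dict]:
    """Drop shortcuts that use undefined APIs, then those longer than max_length.

    Length is assumed here to be the number of actions in the shortcut.
    """
    with_definitions = [s for s in shortcuts if has_only_defined_apis(s, defined_apis)]
    return [s for s in with_definitions if len(s["actions"]) <= max_length]


if __name__ == "__main__":
    # Toy data with made-up API names, for illustration only.
    defined = {"gettext", "showresult"}
    shortcuts = [
        {"name": "ok", "actions": [{"api": "gettext"}, {"api": "showresult"}]},
        {"name": "unknown_api", "actions": [{"api": "gettext"}, {"api": "mystery"}]},
        {"name": "too_long", "actions": [{"api": "gettext"}] * 31},
    ]
    kept = filter_shortcuts(shortcuts, defined)
    print([s["name"] for s in kept])  # ['ok']
```

In this toy run, `unknown_api` is removed by the first step and `too_long` by the second, mirroring the order in which the filtered files are derived from one another.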

How can this project help you?

Apple's Worldwide Developers Conference, WWDC'24, introduced many AI features for Apple devices 🤖. We are very interested in how Apple combines large language models like ChatGPT with its devices to give users a smarter experience 💡. Shortcuts will play a significant role in this process! 🚀

As a Shortcut User and Enthusiast 📱

You can find your favorite shortcuts in this dataset 📱 to help you complete various complex tasks with one click! For example:

🏡 Daily Life 🤹
- Holiday Reminders
- Sign in to Baidu Tieba
- ......
🛍️ Shopping Enthusiasts 🛒
- Buy PUBG Mobile UC
- Copy Taobao Password
- ......
🧑‍🎓 Students 🧮
- Calculator
- Relax Your Mind
- ......
⌨️ Writers 🔣
- Translator
- Create PDF
- ......
🧑‍🔬 Researchers 🏫
- Get arXiv BibTeX Entry
- ......
.....

As a Researcher 🔬

- Research on building automated workflows: Shortcuts are essentially workflows composed of a series of API calls (actions) provided by Apple and third-party apps 🔍.
- Research on low-code programming: Shortcuts include features like branches, loops, and variable assignments, while having a user-friendly graphical interface 🖥️.
- Research on API-based agents: Enabling large language models to autonomously decide whether, when, and how to use APIs based on user queries (tasks) 🔧.
- Research on fine-tuning large language models using shortcuts to closely integrate language models with phones, computers, and smartwatches, achieving the vision of an "operating system based on large language models" 📈.
- ......
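
To make "a workflow composed of a series of API calls" concrete, here is a minimal Python sketch of a shortcut as an ordered list of actions, where a parameter can be a literal value, an enum-like choice, or a reference to an earlier action's output. Everything in it is an assumption for illustration: the API names (`get_text`, `change_case`, `count`), the `output_of` reference marker, and the dict layout are hypothetical and do not reflect the real Shortcuts file format or the dataset schema.

```python
from __future__ import annotations

from typing import Any, Callable

# Hypothetical stand-ins for real APIs; names and signatures are invented
# for illustration only.
APIS: dict[str, Callable[..., Any]] = {
    "get_text": lambda text: text,
    "change_case": lambda text, case: text.upper() if case == "uppercase" else text.lower(),
    "count": lambda text: len(text),
}


def ref(index: int) -> dict:
    """Mark a parameter value as 'the output of the action at this index'."""
    return {"output_of": index}


def run_workflow(actions: list[dict]) -> list[Any]:
    """Run actions in order, resolving references to earlier outputs."""
    outputs: list[Any] = []
    for action in actions:
        params = {}
        for name, value in action["params"].items():
            if isinstance(value, dict) and "output_of" in value:
                # Parameter filled from a previous action's output.
                params[name] = outputs[value["output_of"]]
            else:
                # Parameter filled with a literal (primitive or enum-like) value.
                params[name] = value
        outputs.append(APIS[action["api"]](**params))
    return outputs


workflow = [
    {"api": "get_text", "params": {"text": "Hello"}},  # primitive value
    {"api": "change_case", "params": {"text": ref(0), "case": "uppercase"}},  # previous output + enum-like value
    {"api": "count", "params": {"text": ref(1)}},  # previous output
]
print(run_workflow(workflow))  # ['Hello', 'HELLO', 5]
```

An API-based agent would have to produce a sequence like `workflow` from a user query alone, choosing both the APIs and the way each parameter is filled. Branches and loops are left out of this sketch for brevity.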

🌟Advantages of ShortcutsBench Over Existing API-Based Agent Datasets🌟

ShortcutsBench has significant advantages in terms of the authenticity, richness, and complexity of APIs, the validity of queries and corresponding action sequences, the accurate filling of parameter values, the awareness of obtaining information from the system or users, and the overall scale.

To our knowledge, ShortcutsBench is the first large-scale agent benchmark based on real APIs, considering APIs, queries, and corresponding action sequences. ShortcutsBench provides a rich set of real APIs, queries of varying difficulty and task types, high-quality human-annotated action sequences (provided by shortcut developers), and queries from real user needs. Additionally, it offers precise parameter value filling, including raw data types, enumeration types, and using outputs from previous actions as parameter values, and evaluates the agent's awareness of requesting necessary information from the system or users. Moreover, the scale of APIs, queries, and corresponding action sequences in ShortcutsBench rivals or even surpasses benchmarks and datasets created by LLMs or modified from existing datasets. A comprehensive comparison between ShortcutsBench and existing benchmarks/datasets is shown in the table below.

Example Image

If you find this project helpful, please give us a Star ⭐️! Thank you for your support! 🙏

Keywords: Shortcuts, Apple, WWDC'24, Siri, iOS, macOS, wat
