ScienceBoard
[ICLR 2026] Code, benchmark and environment for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"
<a href = "https://zhuanlan.zhihu.com/p/1914038712540574158"><img src="https://img.shields.io/badge/-%E7%9F%A5%E4%B9%8E-%232f6be0" target="_blank"></a>
Code, environment and data for "ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows"
🗞️ Updates
- 2026-01-26: ScienceBoard has been accepted to ICLR 2026! 🎉
- 2025-08-28: We release CODA, a dual-brain agent that achieves SOTA on ScienceBoard. 🧠
- 2025-06-30: We release new evaluation results (GUI-Actor, UI-TARS-1.5) and agent trajectories. 🎊
- 2025-06-08: ScienceBoard will be presented at WCUA@ICML 2025 as an oral paper! 🚀
- 2025-06-04: We release the virtual machine snapshot of ScienceBoard. 🌏
- 2025-05-27: Initial release of our paper, environment, benchmark, and 🌐 Project Website. Check it out! 🚀
🛠️ Usage
📦 Installation
The framework's infrastructure is based on OSWorld together with VMware Workstation Pro (free for personal use since May 2024) on Ubuntu or Windows. Please make sure that your device meets the minimum requirements of these prerequisites.
- Download VMware Workstation Pro 17 and our pre-made images from Hugging Face;
- Clone this repository and install the required packages:

      git clone https://github.com/OS-Copilot/ScienceBoard
      cd ScienceBoard
      conda create -n sci python=3.11
      conda activate sci
      pip install -r requirements.txt

- We recommend editing the evaluation process directly in `main.py`, keeping sensitive information in environment variables, especially for complicated configs involving `community`.
> [!NOTE]
> `Community` specifies the form of cooperation in which one or more models complete the tasks. You can build your own multi-agent system by creating a new class that inherits from `Community` and implements `__call__()`.
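
The subclass-plus-`__call__()` pattern described in the note can be sketched roughly as follows. The `Community` base class here is a local stand-in, not the class shipped in this repository; its call signature, the `Agent` type, and the `PlannerExecutor` name are all assumptions made for illustration, so check the real class before adapting this.

```python
from typing import Callable, List

# Stand-in base class (assumption): the real `Community` is defined in this
# repository and may use a different constructor and call signature.
class Community:
    def __call__(self, observation: str) -> List[str]:
        raise NotImplementedError


# Hypothetical agent type: any callable that maps a prompt to a reply.
Agent = Callable[[str], str]


class PlannerExecutor(Community):
    """Two-role cooperation: a planner drafts a step, an executor turns it into an action."""

    def __init__(self, planner: Agent, executor: Agent) -> None:
        self.planner = planner
        self.executor = executor

    def __call__(self, observation: str) -> List[str]:
        plan = self.planner(observation)   # first model proposes what to do
        action = self.executor(plan)       # second model produces the concrete action
        return [action]


# Usage with dummy agents standing in for real model calls.
team = PlannerExecutor(
    planner=lambda obs: f"plan for: {obs}",
    executor=lambda plan: f"click()  # {plan}",
)
print(team("screenshot #1"))
```

Keeping the cooperation logic inside `__call__()` means the caller only has to invoke the community object once per step, whether it wraps one model or several.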
⚙️ Env Config
🔐 As a storage location for sensitive info
- Used in our template of `main.py`:
  - `VM_PATH`: path to the VMware `.vmx` file; will be automatically extracted (repeatedly) if set to the path of `VM.zip`;
  - `HTTPX_PROXY`: proxy URL if needed; avoid clashes with `HTTP_PROXY` and `HTTPS_PROXY` on Linux;
  - `OPENAI_API_KEY`: API key for OpenAI GPT;
  - `GOOGLE_API_KEY`: API key for Google Gemini;
  - `ANTHROPIC_API_KEY`: API key for Anthropic Claude;

  and variables for open-source models:

  | Model | Base URL | Name |
  | :-------: | :-------------: | :--------------: |
  | QwenVL | `QWEN_VL_URL` | `QWEN_VL_NAME` |
  | InternVL | `INTERN_VL_URL` | `INTERN_VL_NAME` |
  | QVQ | `QVQ_VL_URL` | `QVQ_VL_NAME` |
  | OS-Atlas | `OS_ACT_URL` | `OS_ACT_NAME` |
  | GUI-Actor | `GUI_ACTOR_URL` | `GUI_ACTOR_NAME` |
  | UI-Tars | `TARS_DPO_URL` | `TARS_DPO_NAME` |

- Used in `sci/Presets.py`:
  - `LEAN_LIB_PATH`: path for Lean 4 REPL;
  - `QT6_LIB_PATH`: dynamic library directory for Qt6;
  - `FFI_LIB_PATH`: dynamic library file for `libffi.so`;
  - `KALG_BIN_PATH`: executable binary file of KAlgebra;
  - `CELE_BIN_PATH`: executable binary file of Celestia;
  - `GIS_BIN_PATH`: executable binary file of Grass GIS.
> [!CAUTION]
> Configs defined in `sci/Presets.py` are only used for debugging under `Raw` settings and are not loaded unless they are actually used.
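
One way to fail early when a variable is missing is to check the environment before starting an evaluation. The sketch below uses the variable names listed above, but the split into "required" and "optional", and the helper names `missing_vars` and `load_config`, are assumptions of this example rather than part of the repository.

```python
import os
from typing import Dict, List, Mapping

# Names come from the list above; treating only VM_PATH as required
# is an assumption of this sketch.
REQUIRED = ["VM_PATH"]
OPTIONAL = ["HTTPX_PROXY", "OPENAI_API_KEY", "GOOGLE_API_KEY", "ANTHROPIC_API_KEY"]


def missing_vars(names: List[str], environ: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in `environ`."""
    return [name for name in names if not environ.get(name)]


def load_config(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the known variables, raising if a required one is missing."""
    missing = missing_vars(REQUIRED, environ)
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    # Keep only variables that are actually set to a non-empty value.
    return {name: environ[name] for name in REQUIRED + OPTIONAL if environ.get(name)}
```

Passing the mapping as a parameter keeps the check testable without touching the real process environment.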
🧪 As a functionality
- `DEBUG_ERR_FACT`: if set to any value, a breakpoint is inserted whenever an exception occurs during evaluation.
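
A flag of this kind is commonly implemented by checking for the variable inside an exception handler. The following is a minimal sketch of that idea, not the repository's actual code: the helper names are hypothetical, and reading "set to any value" as "the variable exists" is an assumption.

```python
import os
from typing import Callable, Mapping, TypeVar

T = TypeVar("T")


def debug_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    # Assumption: "set to any value" means the variable merely has to exist.
    return "DEBUG_ERR_FACT" in environ


def run_with_breakpoint(
    evaluate: Callable[[], T],
    environ: Mapping[str, str] = os.environ,
    debugger: Callable[[], None] = breakpoint,
) -> T:
    """Run `evaluate`; on failure, optionally enter the debugger, then re-raise."""
    try:
        return evaluate()
    except Exception:
        if debug_enabled(environ):
            debugger()  # the built-in breakpoint() unless a replacement is passed
        raise
```

Re-raising after the debugger returns keeps the normal failure path intact, so the flag only adds an inspection step.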
📏 Parameter Config
- `Automata`: a simple encapsulation for `Model` and `Agent`
  - `model_style`: affects the request format and response processing of model calls; you can customize your own style by adding `_request_{style}()` and `_access_{style}()` under `Model`;
  - `overflow_style`: affects how token overflow is detected; you can customize your own style by adding `{style}()` under `Overflow`;
  - `code_style`: affects how code blocks are processed when communicating with models; you can customize your own style by adding `wrap_{style}()` and `extract_{style}()` under `CodeLike`.
- `Tester`: `__init__()` only registers a new config; use `__call__()` for the actual evaluation after init.
  - `tasks_path`: the directory or file path of the task JSON file(s); when a directory path is provided, all `*.json` files under it are loaded recursively;
  - `logs_path`: the directory path for log files, created automatically if it does not exist; its structure mirrors that of `tasks_path`;
  - `community`: the way multiple agents cooperate; use `AllInOne` for the standard setting inherited from OSWorld;
  - `ignore`: if set to `True`, a task is skipped when its log indicates that it has finished (checked by the existence of `result.out`), so you can re-run the same program to retry only the failed cases;
  - `debug`: finish the tasks manually instead of calling models;
  - `relative`: allow the VM to execute `pyautogui` code with relative coordinates; mainly used by InternVL-3.
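
The interaction between `tasks_path`, `logs_path` and `ignore` can be pictured with the standalone sketch below. It is not the `Tester` implementation: the function names are hypothetical, and the exact log layout (one directory per task file, named after the JSON file without its suffix) is an assumption made so the example is concrete.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_tasks(tasks_path: Path) -> Iterator[Path]:
    """Yield a single JSON file, or every *.json file found recursively under a directory."""
    if tasks_path.is_file():
        yield tasks_path
    else:
        yield from sorted(tasks_path.rglob("*.json"))


def log_dir_for(task: Path, tasks_path: Path, logs_path: Path) -> Path:
    # Assumed layout: logs mirror the task tree, one directory per task file.
    root = tasks_path.parent if tasks_path.is_file() else tasks_path
    return logs_path / task.relative_to(root).with_suffix("")


def pending_tasks(
    tasks_path: Path, logs_path: Path, ignore: bool = True
) -> Iterator[Tuple[Path, Path]]:
    """Yield (task file, log directory) pairs that still need to run."""
    for task in iter_tasks(tasks_path):
        log_dir = log_dir_for(task, tasks_path, logs_path)
        if ignore and (log_dir / "result.out").exists():
            continue  # a previous run already finished this task
        log_dir.mkdir(parents=True, exist_ok=True)  # created on demand
        yield task, log_dir
```

With `ignore=True`, running the same loop a second time only revisits tasks whose log directory has no `result.out`, which is the retry behaviour described above.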
🚧 Possible Exceptions
- Error when initializing:

      Traceback (most recent call last):
        File "/usr/lib/python3.11/site-packages/requests/models.py", line 971, in json
          return complexjson.loads(self.text, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.11/json/__init__.py", line 346, in loads
          return _default_decoder.decode(s)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.11/json/decoder.py", line 337, in decode
          obj, end = self.raw_decode(s, idx=_w(s, 0).end())
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.11/json/decoder.py", line 355, in raw_decode
          raise JSONDecodeError("Expecting value", s, err.value) from None
      json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

  The target app had not yet started when initialization was attempted, due to insufficient performance of your device; try assigning a larger value to the `wait` field in the JSON files of the tasks.
- Failed to get accessibility tree:

      Traceback (most recent call last):
        File "os-sci/sci/Tester.py", line 396, in __call__
          counter._pass() if task_info() else counter._fail()
                             ^^^^^^^^^^^
        File "os-sci/sci/Tester.py", line 174, in __call__
          return self.task()
                 ^^^^^^^^^^^
        File "os-sci/sci/base/task.py", line 175, in _avail_wrapper
          return method(self, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "os-sci/sci/base/task.py", line 393, in __call__
          return self.__call()
                 ^^^^^^^^^^^^^
        File "os-sci/sci/base/task.py", line 381, in __call
          stop_type, stop_args = self.predict()
                                 ^^^^^^^^^^^^^^
        File "os-sci/sci/base/task.py", line 175, in _avail_wrapper
          return method(self, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "os-sci/sci/base/log.py", line 462, in record_wrapper
          return_value = method(self)
                         ^^^^^^^^^^^^
        File "os-sci/sci/base/task.py", line 316, in predict
          invalid = self._step(step_index)
                    ^^^^^^^^^^^^^^^^^^^^^^
        File "os-sci/sci/base/task.py", line 260, in _step
          observation = {
                        ^
        File "os-sci/sci/base/task.py", line 261, in <dictcomp>
          obs_type: getattr(self.manager, obs_type)()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "os-sci/sci/vm/vmanager.py", line 152, in _env_wrapper
          return method(self, *args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "os-sci/sci/base/manager.py", line 85, in _assert_wrapper
          result = method(self)
                   ^^^^^^^^^^^^
        File "os-sci/sci/vm/vmanager.py", line 278, in a11y_tree
          a11y_tree = utils.linearize(raw_a11y_tree)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "os-sci/sci/vm/utils.py", line 206, in linearize
          filtered_nodes = filter_nodes(ET.fromstring(a
