
AgentGym

Code and implementations for the ACL 2025 paper "AgentGym: Evolving Large Language Model-based Agents across Diverse Environments" by Zhiheng Xi et al.


AgentGym: Evolving Large Language Model-based Agents across Diverse Environments

<p align="center"> 📃 <a href="https://arxiv.org/abs/2406.04151" target="_blank">Paper</a > • 🌐 <a href="https://agentgym.github.io/" target="_blank">Project Page</a > • 🤗 <a href="https://huggingface.co/datasets/AgentGym/AgentTraj-L" target="_blank">AgentTraj-L</a > • 🤗 <a href="https://huggingface.co/datasets/AgentGym/AgentEval" target="_blank">AgentEval</a > • 🤗 <a href="https://huggingface.co/AgentGym/AgentEvol-7B" target="_blank">Model (AgentEvol-7B)</a ><br> </p >

🔔 News

  • 🎉 [2025/09/10] You can now add your own custom environment to AgentGym and run RL on it! The tutorial is here.
  • 🍺 [2025/09/10] Our paper is released on arXiv: AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning.
  • 🚀 [2025/09/10] AgentGym-RL Framework released! We introduce the reinforcement learning (RL) version of AgentGym, enabling agents to learn directly from interactive environments: AgentGym-RL.
  • 👀 [2025/09/03] AgentGym now provides an interactive frontend for visualization. Researchers can replay and inspect full trajectories, step through agent decision-making, and analyze model behaviors more easily.
  • 🔧 [2025/09/03] We updated several environments to improve stability and robustness, with better support for large-scale parallel execution (e.g., parallel runs in WebArena). Try it for RL!
  • 🥳 [2024/06/07] Our paper is released on arXiv: AgentGym: Evolving Large Language Model-based Agents across Diverse Environments!
  • 🤖 [2024/06/06] Our model is available on Hugging Face: AgentEvol-7B.
  • 💥 [2024/06/06] Our trajectory set and benchmark are available on Hugging Face: AgentTraj-L, AgentEval.
  • ✨ [2024/06/06] The AgentGym suite is released, including the platform code, dataset, benchmark, and training implementations! We welcome community contributions, including new agent environments!
<div align=center><img src="./assets/evolution.png" width="90%" /></div>

🌟 Introduction

Building generalist agents that can handle diverse tasks and evolve themselves across different environments is a long-term goal in the AI community. Large language models (LLMs) are considered a promising foundation for building such agents due to their generalized capabilities.

AgentGym is a new framework featuring a variety of environments and tasks for broad, real-time, unified-format, and concurrent agent exploration. It is designed to help the community easily evaluate and develop generally capable LLM-based agents. It also includes a high-quality trajectory set, AgentTraj, and a benchmark suite, AgentEval. We also propose a novel method, AgentEvol, to investigate the potential of agent self-evolution beyond previously seen data across tasks and environments. Experimental results show that the evolved agents can achieve results comparable to state-of-the-art (SOTA) models.

<div align=center><img src="./assets/agentgym.png" width="90%" /></div>

🎁 AgentGym Suite

AgentGym is a framework designed to help the community easily evaluate and develop generally capable LLM-based agents. It features diverse interactive environments and tasks in a unified format (the ReAct format). It supports real-time feedback and concurrency, and is easily scalable. It includes 14 environments spanning web navigation, text games, household tasks, digital games, embodied tasks, tool use, and programming.
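To make the unified format concrete, here is a minimal sketch of splitting a ReAct-style response into its thought and action. The `Thought:` / `Action:` layout follows the WebShop prompt shown in the trajectory example further down; the `parse_react` helper and `ReActTurn` type are hypothetical names for illustration and are not part of the AgentGym codebase.

```python
import re
from typing import NamedTuple, Optional


class ReActTurn(NamedTuple):
    """One parsed model response (hypothetical helper type)."""
    thought: str
    action: Optional[str]


# Matches "Thought: ... Action: ..." and tolerates whitespace after each label.
_REACT_PATTERN = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>.*)",
    re.DOTALL,
)


def parse_react(response: str) -> ReActTurn:
    """Split a response into its thought and action parts."""
    match = _REACT_PATTERN.search(response)
    if match is None:
        # No recognizable action: keep the raw text as the thought.
        return ReActTurn(thought=response.strip(), action=None)
    return ReActTurn(match["thought"].strip(), match["action"].strip())


# The template string below is taken from the WebShop system prompt.
turn = parse_react("Thought:\nI think ... \n\nAction: \nclick[something]")
print(turn.thought)
print(turn.action)
```

A response without an `Action:` label yields `action=None`, which lets the caller decide whether to retry or treat the turn as invalid.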

| Environment | Traj | Eval | Original Repo | EnvServer |
| ----------- | ---- | ---- | ------------- | --------- |
| WebShop | 3930 | 200 | WebShop-Repo | agentenv-webshop |
| WebArena | 0 | 20 | WebArena | agentenv-webarena |
| MAZE | 215 | 25 | MAZE-Repo | agentenv-lmrlgym |
| Wordle | 955 | 25 | Wordle-Repo | agentenv-lmrlgym |
| ALFWorld | 2420 | 200 | ALFWorld-Repo | agentenv-alfworld |
| SciWorld | 2120 | 200 | SciWorld-Repo | agentenv-sciworld |
| BabyAI | 810 | 90 | BabyAI-Repo | agentenv-babyai |
| TextCraft | 374 | 100 | TextCraft-Repo | agentenv-textcraft |
| Weather | 311 | 20 | Weather-Repo | agentenv-tool |
| Movie | 215 | 20 | Movie-Repo | agentenv-tool |
| Academia | 0 | 20 | Academia-Repo | agentenv-tool |
| Sheet | 0 | 20 | Sheet-Repo | agentenv-tool |
| TODOList | 135 | 20 | TODOList-Repo | agentenv-tool |
| BIRD | 3000 | 200 | BIRD-Repo | agentenv-sqlgym |

Platform

The platform architecture of AgentGym is illustrated in the following figure. In AgentGym, different environments are deployed on different servers or ports and provide encapsulated HTTP services externally. This decouples the environments from other parts.

These services include APIs such as /createEnv to create an environment, /observation to get the current observation from the environment, /available_actions to get the currently available actions, /step to perform an action, and /reset to reset the environment.
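As a rough illustration of how a client might talk to such a service, the sketch below wraps the five endpoints named above. Only the endpoint paths come from this README. The HTTP methods, the JSON payloads, the `id` field, the reply shapes, and the `EnvServerSession` class are all assumptions made for the example, so check the relevant EnvServer package for the real contract.

```python
import json
from typing import Any, Callable, Dict, Optional
from urllib import request
from urllib.parse import urlencode

# A transport takes (method, url, payload) and returns the decoded JSON reply.
Transport = Callable[[str, str, Optional[Dict[str, Any]]], Any]


def http_transport(method: str, url: str, payload: Optional[Dict[str, Any]]) -> Any:
    """Send one JSON request with the standard library and decode the reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = request.Request(url, data=data, method=method,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


class EnvServerSession:
    """Hypothetical thin wrapper; methods and payload fields are assumed."""

    def __init__(self, base_url: str, send: Transport = http_transport) -> None:
        self.base_url = base_url.rstrip("/")
        self.send = send
        self.env_id: Any = None

    def _get(self, path: str) -> Any:
        query = urlencode({"id": self.env_id})  # assumed query parameter
        return self.send("GET", f"{self.base_url}/{path}?{query}", None)

    def create(self) -> Any:
        # Assumption: the reply identifies the newly created environment.
        self.env_id = self.send("POST", f"{self.base_url}/createEnv", {})
        return self.env_id

    def reset(self) -> Any:
        return self.send("POST", f"{self.base_url}/reset", {"id": self.env_id})

    def observation(self) -> Any:
        return self._get("observation")

    def available_actions(self) -> Any:
        return self._get("available_actions")

    def step(self, action: str) -> Any:
        body = {"id": self.env_id, "action": action}
        return self.send("POST", f"{self.base_url}/step", body)


# Demo with a fake transport, so no server is needed to run the sketch.
calls = []


def fake_transport(method: str, url: str, payload: Optional[Dict[str, Any]]) -> Any:
    calls.append((method, url.rsplit("/", 1)[-1].split("?")[0]))
    return 0 if url.endswith("/createEnv") else {"ok": True}


session = EnvServerSession("http://localhost:8000", send=fake_transport)
session.create()
session.reset()
session.observation()
session.available_actions()
session.step("search[keywords]")
print(calls)
```

Injecting the transport keeps the wrapper testable offline; leaving `send` at its default sends real requests to `base_url`.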

We have implemented 14 types of environments, and developers can add new environments to AgentGym by implementing the interfaces above. EnvClients consume the services exposed by the server and wrap them in functions that users can call. AgentController is our core component that connects the agent and the environment. It is responsible for evaluating the agent, collecting data, and training the agent.
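The division of labor between a client and a controller can be pictured with a toy rollout loop. This is only a sketch of the general pattern: `EnvClientLike`, `CountdownEnv`, and `rollout` are hypothetical names, the `(observation, reward, done)` return shape is an assumption, and none of it reflects the actual `AgentController` API.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple


class EnvClientLike(Protocol):
    """Minimal interface this sketch assumes a client exposes."""

    def reset(self) -> str: ...

    def step(self, action: str) -> Tuple[str, float, bool]: ...


@dataclass
class CountdownEnv:
    """Toy stand-in for an environment client: done after `steps_left` steps."""
    steps_left: int = 3

    def reset(self) -> str:
        return f"{self.steps_left} steps left"

    def step(self, action: str) -> Tuple[str, float, bool]:
        self.steps_left -= 1
        done = self.steps_left <= 0
        reward = 1.0 if done else 0.0
        return f"{self.steps_left} steps left", reward, done


def rollout(
    env: EnvClientLike,
    policy: Callable[[str], str],
    max_turns: int = 10,
) -> Tuple[List[Tuple[str, str]], float]:
    """Run one episode, collecting (observation, action) pairs and total reward."""
    trajectory: List[Tuple[str, str]] = []
    total_reward = 0.0
    observation = env.reset()
    for _ in range(max_turns):
        action = policy(observation)
        trajectory.append((observation, action))
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return trajectory, total_reward


# A constant policy stands in for the LLM agent.
trajectory, total_reward = rollout(CountdownEnv(steps_left=3), lambda obs: "click[next]")
print(len(trajectory), total_reward)
```

The same loop serves evaluation (keep only the reward) and data collection (keep the trajectory), which is one reason to route both through a single controller-style component.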

<div align=center><img src="./assets/platform.png" width="60%" /></div>

Benchmark: AgentEval

The AgentEval benchmark is on Hugging Face: AgentGym/AgentEval.

Here is an example:

{
  "conversations": null, 
  "item_id": "webshop_5238"
}
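
When working with benchmark items, it can help to group them by task. The sketch below assumes every `item_id` has the layout `<task>_<index>`, which is inferred from the single example above and may not hold for all environments. `split_item_id` and `group_by_task` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_item_id(item_id: str) -> Tuple[str, int]:
    """Split '<task>_<index>' on the last underscore (assumed ID layout)."""
    task, _, index = item_id.rpartition("_")
    if not task or not index.isdigit():
        raise ValueError(f"unexpected item_id layout: {item_id!r}")
    return task, int(index)


def group_by_task(item_ids: List[str]) -> Dict[str, List[int]]:
    """Collect the numeric indices of the items under each task name."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for item_id in item_ids:
        task, index = split_item_id(item_id)
        grouped[task].append(index)
    return dict(grouped)


# "webshop_5238" comes from the example above; "webshop_17" is made up.
grouped = group_by_task(["webshop_5238", "webshop_17"])
print(grouped)
```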

Trajectory Set: AgentTraj

The trajectory set AgentTraj-L is available on Hugging Face: AgentGym/AgentTraj-L.

Here is an example:

    {
        "conversations": [
            {
                "from": "human",
                "loss": null,
                "value": "You are web shopping.\nI will give you instructions about what to do.\nYou have to follow the instructions.\nEvery round I will give you an observation and a list of available actions, you have to respond an action based on the state and instruction.\nYou can use search action if search is available.\nYou can click one of the buttons in clickables.\nAn action should be of the following structure:\nsearch[keywords]\nclick[value]\nIf the action is not valid, perform nothing.\nKeywords in search are up to you, but the value in click must be a value in the list of available actions.\nRemember that your keywords in search should be carefully designed.\nYour response should use the following format:\n\nThought:\nI think ... \n\nAction: \nclick[something]"
            },
            {
                "from": "gpt",
                "loss": false,
                "value": "Ok."
            },
            {
                "from": "human",
                "loss": null,
                "value": "WebShop [SEP] Instruction: [SEP] Find me men's shorts with elastic waist, classic fit, short sleeve for gym workout with color: navy, and size: x-large, and price lower than 50.00 dollars [SEP] Search"
            },
            {
                "from": "gpt",
                "loss": true,
                "value": "Thought:\nI think I should u
