From Word to World: Can Large Language Models be Implicit Text-based World Models?


Code and data for "From Word to World: Can Large Language Models be Implicit Text-based World Models?".

📰 News

  • [2025/12/21] We released the paper and blog post.
  • [2025/12/22] We released the code, models and data. We verified the evaluation pipeline on ALFWorld using Qwen2.5-7B (world model) and gpt-4o (agent).
  • [2025/12/23] Paper is available on arXiv and Hugging Face.


📌 Overview

<p align="center"> <img src="./assets/main.png" alt="Main Figure" width="90%"> </p>

LLMs as text-based world models for agent learning.

  • (A) Formulation: world modeling as next-state prediction under a fixed text interaction protocol.
  • (B) Evaluation axes: fidelity/consistency, scalability/robustness, and agent utility.
  • (C) Results: strong fidelity and consistency in both single-step predictions and long-horizon rollouts.
  • (D) Scaling: predictable improvements with more training data across text environments.
  • (E) Agent gains: better verification, synthetic data generation, and RL initialization from faithful world models.
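Conceptually, the fixed interaction protocol in (A) treats the world model as a next-state predictor that replaces the real environment inside the agent loop. A minimal sketch, assuming a callable world model and agent policy over text (all names here are illustrative, not the repo's API):

```python
def rollout(world_model, policy, initial_obs, max_rounds=50):
    """Roll out an episode entirely inside a text-based world model.

    Assumed (hypothetical) interfaces:
      world_model(history, action) -> next observation (text)
      policy(history)              -> next action (text)
    """
    history = [("obs", initial_obs)]
    for _ in range(max_rounds):
        action = policy(history)            # agent proposes a text action
        obs = world_model(history, action)  # world model predicts the next state
        history += [("action", action), ("obs", obs)]
        if "DONE" in obs:                   # illustrative termination marker
            break
    return history
```

Under this view, single-step evaluation checks one `world_model(history, action)` call against the real environment, while long-horizon evaluation runs the whole loop above.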

🧩 List of World Model Checkpoints

| Environment | Qwen2.5-7B | Llama3.1-8B |
|---|---|---|
| ALFWorld | X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B | X1AOX1A/WorldModel-Alfworld-Llama3.1-8B |
| SciWorld | X1AOX1A/WorldModel-Sciworld-Qwen2.5-7B | X1AOX1A/WorldModel-Sciworld-Llama3.1-8B |
| TextWorld | X1AOX1A/WorldModel-Textworld-Qwen2.5-7B | X1AOX1A/WorldModel-Textworld-Llama3.1-8B |
| Webshop | X1AOX1A/WorldModel-Webshop-Qwen2.5-7B | X1AOX1A/WorldModel-Webshop-Llama3.1-8B |
| StableToolBench | X1AOX1A/WorldModel-Stabletoolbench-Qwen2.5-7B | X1AOX1A/WorldModel-Stabletoolbench-Llama3.1-8B |
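The checkpoint ids above follow a uniform naming pattern. A small helper (hypothetical, for convenience only; not part of the repo) can build the Hugging Face repo id for a given environment and base model:

```python
# Capitalization of each environment name as it appears in the published repo ids.
ENV_NAMES = {
    "alfworld": "Alfworld",
    "sciworld": "Sciworld",
    "textworld": "Textworld",
    "webshop": "Webshop",
    "stabletoolbench": "Stabletoolbench",
}

def checkpoint_id(env: str, base: str = "Qwen2.5-7B") -> str:
    """Build the Hugging Face repo id for a world-model checkpoint."""
    return f"X1AOX1A/WorldModel-{ENV_NAMES[env]}-{base}"
```

For example, `checkpoint_id("alfworld")` yields `X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B`, the id used in the evaluation commands below.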

🧪 Usage Examples

🛠️ Env Setup

```bash
# install AgentGym-RL
bash scripts/env_setup/uv_agentgym_rl.sh

# install AgentGym environments
bash scripts/env_setup/uv_alfworld.sh
bash scripts/env_setup/uv_sciworld.sh
bash scripts/env_setup/uv_textworld.sh
bash scripts/env_setup/uv_webshop.sh
```

You can verify each environment by launching its server:

```bash
bash scripts/env_server/start_alfworld.sh
bash scripts/env_server/start_sciworld.sh
bash scripts/env_server/start_textworld.sh
bash scripts/env_server/start_webshop.sh
```

📥 Data Download

Note (ALFWorld): to align with AgentGym, we renamed the `put` action to `move` and added a `help` action (see the updated `.twl2` files under `scripts/download_data`). The original `alfworld-download` workflow is therefore not compatible. If you downloaded ALFWorld data before, remove the old data and re-download it with the command below; otherwise evaluation results may be lower than expected.

```bash
source uv_agentgym_rl/bin/activate
python scripts/download_data/download_data.py
```

📊 Evaluation

This repo reports three complementary metrics:

  • Single-step Accuracy: next-state prediction accuracy under the interaction protocol.
  • WM Task Success Rate: agent success when interacting with the learned world model.
  • WM2Real Success Rate: mapping/replay of world-model actions back to the real environment.
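As a reference for what Single-step Accuracy measures, here is a minimal, hypothetical sketch (not the repo's actual scoring script, which may normalize or tokenize states differently): a prediction counts as correct when the predicted next-state text matches the ground truth.

```python
def single_step_accuracy(predictions, references):
    """Fraction of predicted next states that exactly match the ground truth.

    Simplified stand-in for the repo's metric; the real script may apply
    additional text normalization before comparing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(predictions)
```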

Single Step Accuracy

To compute Single Step Accuracy, run:

<details> <summary>Calculate Single Step Accuracy</summary>

```bash
TASK=alfworld         # alfworld, alfworld_valid_seen, alfworld_valid_unseen, sciworld, textworld, webshop, stabletoolbench
MODEL=X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B                  # world model checkpoint
OUTPUT_ROOT=outputs/single_step_accuracy/${TASK}/${MODEL}     # output root directory
bash scripts/single_step_accuracy/run.sh $TASK $MODEL $OUTPUT_ROOT
```

Example output:

```json
{
    "average_accuracy": 0.9987087517934002
}
```

</details>

Long Horizon Rollouts

1. Interaction with Real Environments

To collect trajectories on the training set with real environments, set SPLIT=train.

To compute Real Task Success Rate, run:

<details> <summary>Run via OpenAI API</summary>

```bash
TASK=alfworld         # alfworld, sciworld, textworld, webshop
RUN=0                 # run id for multiple runs, used only to separate output dirs
API_KEY=your_api_key  # your OpenAI API key
API_BASE_URL=your_api_base_url # your OpenAI API base URL
MODEL=gpt-4o          # agent model name
MAX_CONCURRENCY=150   # max concurrency
MAX_ROUND=50          # max rounds per episode
NUM_EXAMPLES=-1       # num examples
SPLIT=test            # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs   # output root directory
# this will auto-launch the environment server
bash scripts/interact_with_real_env/run_openai.sh $TASK $RUN $API_KEY $API_BASE_URL $MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/real_env/$SPLIT/${TASK}/$MODEL/${TASK}_maxround${MAX_ROUND}_run${RUN}/_metrics.json
```

Example output:

```json
{
    "accuracy": 50.50,  # task success rate
    "success": 101.0,   # total successful interactions
    "api_errors": 0,    # API errors
    "total": 200,       # total interactions
    "time_seconds": 1075.621458530426  # time taken in seconds
}
```

</details>

<details> <summary>Run via vLLM server</summary>

```bash
TASK=alfworld         # alfworld, sciworld, textworld, webshop
RUN=0                 # run id for multiple runs, used only to separate output dirs
MODEL=Qwen/Qwen2.5-7B-Instruct       # agent model name
MAX_CONCURRENCY=150   # max concurrency
MAX_ROUND=20          # max rounds per episode, reduced to 20 to avoid exceeding the context length
NUM_EXAMPLES=-1       # num examples
SPLIT=test            # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs   # output root directory
# this will auto-launch the vLLM server and the environment server
bash scripts/interact_with_real_env/run_vllm.sh $TASK $RUN $MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/real_env/$SPLIT/vllm/${TASK}/$MODEL/${TASK}_maxround${MAX_ROUND}_run${RUN}/_metrics.json
```

</details>
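The fields in `_metrics.json` are related in the obvious way: `accuracy` is the success rate in percent, i.e. `100 * success / total`. The sample output above reports 101 successes over 200 episodes, which is 50.5%. A quick sanity check (hypothetical helper, not part of the repo):

```python
def success_rate(success: float, total: int) -> float:
    """Task success rate in percent, matching the `accuracy` field in _metrics.json."""
    return 100 * success / total
```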

2. Interaction with World Models

To compute WM Task Success Rate, run:

<details> <summary>Run via OpenAI API</summary>

```bash
TASK=alfworld         # alfworld, sciworld, textworld, webshop
MODEL=gpt-4o          # agent model name
API_KEY=your_api_key  # your OpenAI API key
API_BASE_URL=your_api_base_url # your OpenAI API base URL
WORLD_MODEL=X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B # world model checkpoint
MAX_CONCURRENCY=150   # max concurrency
MAX_ROUND=50          # max rounds per episode
NUM_EXAMPLES=-1       # num examples
SPLIT=test            # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs   # output root directory
# this will auto-launch the vLLM server for the world model
bash scripts/interact_with_world_model/run.sh $TASK $MODEL $API_KEY $API_BASE_URL $WORLD_MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/world_model/$SPLIT/$TASK/$MODEL/$WORLD_MODEL/$MODEL/_metrics.json
```

Example output:

```json
{
    "task": "alfworld",                    # task name
    "agent_model": "gpt-4o",               # agent model name
    "total_items": 200,                    # total items
    "total_success": 109.0,                # total successful interactions
    "processed_items": 200,                # processed items
    "accuracy": 54.50000000000001,         # task success rate
    "api_errors": 0                        # API errors
}
```

</details>

3. Map WM Actions to Real Environments

To compute WM2Real Success Rate, run:

<details> <summary>This step does not require API calls</summary>
TASK=alfworld                 # alfworld, alfworld_valid_seen, alfworld_va
