From Word to World: Can Large Language Models be Implicit Text-based World Models?


Code and data for "From Word to World: Can Large Language Models be Implicit Text-based World Models?".

📰 News

  • [2025/12/21] We released the paper and blog post.
  • [2025/12/22] We released the code, models and data. We verified the evaluation pipeline on ALFWorld using Qwen2.5-7B (world model) and gpt-4o (agent).
  • [2025/12/23] Paper is available on arXiv and Hugging Face.


📌 Overview

<p align="center"> <img src="./assets/main.png" alt="Main Figure" width="90%"> </p>

LLMs as text-based world models for agent learning.

  • (A) Formulation: world modeling as next-state prediction under a fixed text interaction protocol.
  • (B) Evaluation axes: fidelity/consistency, scalability/robustness, and agent utility.
  • (C) Results: strong fidelity and consistency in both single-step predictions and long-horizon rollouts.
  • (D) Scaling: predictable improvements with more training data across text environments.
  • (E) Agent gains: better verification, synthetic data generation, and RL initialization from faithful world models.
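Conceptually, the fixed interaction protocol in (A) treats the world model as a next-state predictor that replaces the real environment inside the agent loop. A minimal sketch, assuming a callable world model and agent policy over text (all names here are illustrative, not the repo's API):

```python
def rollout(world_model, policy, initial_obs, max_rounds=50):
    """Roll out an episode entirely inside a text-based world model.

    Assumed (hypothetical) interfaces:
      world_model(history, action) -> next observation (text)
      policy(history)              -> next action (text)
    """
    history = [("obs", initial_obs)]
    for _ in range(max_rounds):
        action = policy(history)            # agent proposes a text action
        obs = world_model(history, action)  # world model predicts the next state
        history += [("action", action), ("obs", obs)]
        if "DONE" in obs:                   # illustrative termination marker
            break
    return history
```

Under this view, single-step evaluation checks one `world_model(history, action)` call against the real environment, while long-horizon evaluation runs the whole loop above.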

🧩 List of World Model Checkpoints

| Environment | Qwen2.5-7B | Llama3.1-8B |
|---|---|---|
| ALFWorld | X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B | X1AOX1A/WorldModel-Alfworld-Llama3.1-8B |
| SciWorld | X1AOX1A/WorldModel-Sciworld-Qwen2.5-7B | X1AOX1A/WorldModel-Sciworld-Llama3.1-8B |
| TextWorld | X1AOX1A/WorldModel-Textworld-Qwen2.5-7B | X1AOX1A/WorldModel-Textworld-Llama3.1-8B |
| Webshop | X1AOX1A/WorldModel-Webshop-Qwen2.5-7B | X1AOX1A/WorldModel-Webshop-Llama3.1-8B |
| StableToolBench | X1AOX1A/WorldModel-Stabletoolbench-Qwen2.5-7B | X1AOX1A/WorldModel-Stabletoolbench-Llama3.1-8B |
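The checkpoint ids above follow a uniform naming pattern. A small helper (hypothetical, for convenience only; not part of the repo) can build the Hugging Face repo id for a given environment and base model:

```python
# Capitalization of each environment name as it appears in the published repo ids.
ENV_NAMES = {
    "alfworld": "Alfworld",
    "sciworld": "Sciworld",
    "textworld": "Textworld",
    "webshop": "Webshop",
    "stabletoolbench": "Stabletoolbench",
}

def checkpoint_id(env: str, base: str = "Qwen2.5-7B") -> str:
    """Build the Hugging Face repo id for a world-model checkpoint."""
    return f"X1AOX1A/WorldModel-{ENV_NAMES[env]}-{base}"
```

For example, `checkpoint_id("alfworld")` yields `X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B`, the id used in the evaluation commands below.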

🧪 Usage Examples

🛠️ Env Setup

```bash
# install AgentGym-RL
bash scripts/env_setup/uv_agentgym_rl.sh

# install AgentGym environments
bash scripts/env_setup/uv_alfworld.sh
bash scripts/env_setup/uv_sciworld.sh
bash scripts/env_setup/uv_textworld.sh
bash scripts/env_setup/uv_webshop.sh
```

You can verify each environment by launching its server:

```bash
bash scripts/env_server/start_alfworld.sh
bash scripts/env_server/start_sciworld.sh
bash scripts/env_server/start_textworld.sh
bash scripts/env_server/start_webshop.sh
```

📥 Data Download

Note (ALFWorld): to align with AgentGym, we renamed the `put` action to `move` and added a `help` action (see the updated `.twl2` files under `scripts/download_data`). The original `alfworld-download` workflow is therefore not compatible. If you downloaded ALFWorld data before, remove the old data and re-download it with the command below; otherwise evaluation results may be lower than expected.

```bash
source uv_agentgym_rl/bin/activate
python scripts/download_data/download_data.py
```

📊 Evaluation

This repo reports three complementary metrics:

  • Single-step Accuracy: next-state prediction accuracy under the interaction protocol.
  • WM Task Success Rate: agent success when interacting with the learned world model.
  • WM2Real Success Rate: mapping/replay of world-model actions back to the real environment.
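As a reference for what Single-step Accuracy measures, here is a minimal, hypothetical sketch (not the repo's actual scoring script, which may normalize or tokenize states differently): a prediction counts as correct when the predicted next-state text matches the ground truth.

```python
def single_step_accuracy(predictions, references):
    """Fraction of predicted next states that exactly match the ground truth.

    Simplified stand-in for the repo's metric; the real script may apply
    additional text normalization before comparing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align one-to-one")
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(predictions)
```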

Single Step Accuracy

To compute Single Step Accuracy, run:

<details> <summary>Calculate Single Step Accuracy</summary>

```bash
TASK=alfworld         # alfworld, alfworld_valid_seen, alfworld_valid_unseen, sciworld, textworld, webshop, stabletoolbench
MODEL=X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B                  # world model checkpoint
OUTPUT_ROOT=outputs/single_step_accuracy/${TASK}/${MODEL}     # output root directory
bash scripts/single_step_accuracy/run.sh $TASK $MODEL $OUTPUT_ROOT
```

Example output:

```json
{
    "average_accuracy": 0.9987087517934002
}
```

</details>

Long Horizon Rollouts

1. Interaction with Real Environments

To collect trajectories on the training set with real environments, set SPLIT=train.

To compute Real Task Success Rate, run:

<details> <summary>Run via OpenAI API</summary>

```bash
TASK=alfworld         # alfworld, sciworld, textworld, webshop
RUN=0                 # run id for multiple runs, used only to separate output dirs
API_KEY=your_api_key  # your OpenAI API key
API_BASE_URL=your_api_base_url # your OpenAI API base URL
MODEL=gpt-4o          # agent model name
MAX_CONCURRENCY=150   # max concurrency
MAX_ROUND=50          # max rounds per episode
NUM_EXAMPLES=-1       # num examples
SPLIT=test            # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs   # output root directory
# this will auto-launch the environment server
bash scripts/interact_with_real_env/run_openai.sh $TASK $RUN $API_KEY $API_BASE_URL $MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/real_env/$SPLIT/${TASK}/$MODEL/${TASK}_maxround${MAX_ROUND}_run${RUN}/_metrics.json
```

Example output:

```json
{
    "accuracy": 50.50,  # task success rate
    "success": 101.0,   # total successful interactions
    "api_errors": 0,    # API errors
    "total": 200,       # total interactions
    "time_seconds": 1075.621458530426  # time taken in seconds
}
```

</details>

<details> <summary>Run via vLLM server</summary>

```bash
TASK=alfworld         # alfworld, sciworld, textworld, webshop
RUN=0                 # run id for multiple runs, used only to separate output dirs
MODEL=Qwen/Qwen2.5-7B-Instruct       # agent model name
MAX_CONCURRENCY=150   # max concurrency
MAX_ROUND=20          # max rounds per episode, reduced to 20 to avoid exceeding the context length
NUM_EXAMPLES=-1       # num examples
SPLIT=test            # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs   # output root directory
# this will auto-launch the vLLM server and the environment server
bash scripts/interact_with_real_env/run_vllm.sh $TASK $RUN $MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/real_env/$SPLIT/vllm/${TASK}/$MODEL/${TASK}_maxround${MAX_ROUND}_run${RUN}/_metrics.json
```

</details>
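The fields in `_metrics.json` are related in the obvious way: `accuracy` is the success rate in percent, i.e. `100 * success / total`. The sample output above reports 101 successes over 200 episodes, which is 50.5%. A quick sanity check (hypothetical helper, not part of the repo):

```python
def success_rate(success: float, total: int) -> float:
    """Task success rate in percent, matching the `accuracy` field in _metrics.json."""
    return 100 * success / total
```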

2. Interaction with World Models

To compute WM Task Success Rate, run:

<details> <summary>Run via OpenAI API</summary>

```bash
TASK=alfworld         # alfworld, sciworld, textworld, webshop
MODEL=gpt-4o          # agent model name
API_KEY=your_api_key  # your OpenAI API key
API_BASE_URL=your_api_base_url # your OpenAI API base URL
WORLD_MODEL=X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B # world model checkpoint
MAX_CONCURRENCY=150   # max concurrency
MAX_ROUND=50          # max rounds per episode
NUM_EXAMPLES=-1       # num examples
SPLIT=test            # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs   # output root directory
# this will auto-launch the vLLM server for the world model
bash scripts/interact_with_world_model/run.sh $TASK $MODEL $API_KEY $API_BASE_URL $WORLD_MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/world_model/$SPLIT/$TASK/$MODEL/$WORLD_MODEL/$MODEL/_metrics.json
```

Example output:

```json
{
    "task": "alfworld",                    # task name
    "agent_model": "gpt-4o",               # agent model name
    "total_items": 200,                    # total items
    "total_success": 109.0,                # total successful interactions
    "processed_items": 200,                # processed items
    "accuracy": 54.50000000000001,         # task success rate
    "api_errors": 0                        # API errors
}
```

</details>

3. Map WM Actions to Real Environments

To compute WM2Real Success Rate, run:

<details> <summary>This step does not require API calls</summary>
TASK=alfworld                 # alfworld, alfworld_valid_seen, alfworld_va
