Word2World
From Word to World: Can Large Language Models be Implicit Text-based World Models?
Code and data for "From Word to World: Can Large Language Models be Implicit Text-based World Models?".
📰 News
- [2025/12/21] We released the paper and blog post.
- [2025/12/22] We released the code, models and data. We verified the evaluation pipeline on ALFWorld using Qwen2.5-7B (world model) and gpt-4o (agent).
- [2025/12/23] Paper is available on arXiv and Hugging Face.
🔗 Quick Links
- Environment setup: see Env Setup
- Download data: see Data Download
- Evaluate: see Evaluation
- Train world models: see Training World Models
📑 Table of Contents
- Overview
- World Model Checkpoints
- Usage Examples
- Env Setup
- Data Download
- Evaluation
- Training World Models
- Contact
- Citation
📌 Overview
<p align="center"> <img src="./assets/main.png" alt="Main Figure" width="90%"> </p>

LLMs as text-based world models for agent learning.
- (A) Formulation: world modeling as next-state prediction under a fixed text interaction protocol.
- (B) Evaluation axes: fidelity/consistency, scalability/robustness, and agent utility.
- (C) Results: strong fidelity and consistency in both single-step predictions and long-horizon rollouts.
- (D) Scaling: predictable improvements with more training data across text environments.
- (E) Agent gains: better verification, synthetic data generation, and RL initialization from faithful world models.
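The formulation in (A) treats the world model as a conditional next-state predictor: given the interaction history and an action, it emits the next textual observation. A minimal sketch of that interface, where the `WorldModel` type and the stubbed predictor are illustrative stand-ins (not the repo's actual classes; in the paper's setting the predictor is an LLM):

```python
from typing import Callable, List, Tuple

# A text-based world model maps (history, action) -> next observation.
# Here we stub it with a deterministic function so the loop is runnable.
WorldModel = Callable[[List[Tuple[str, str]], str], str]

def rollout(world_model: WorldModel, init_obs: str, actions: List[str]) -> List[str]:
    """Roll the world model forward over a fixed action sequence."""
    history: List[Tuple[str, str]] = []
    obs = init_obs
    trajectory = [obs]
    for action in actions:
        obs = world_model(history, action)  # next-state prediction
        history.append((action, obs))       # extend the interaction history
        trajectory.append(obs)
    return trajectory

# Stub world model: returns a templated observation (stands in for an LLM call).
def stub_wm(history, action):
    return f"After '{action}', you see the result (step {len(history) + 1})."

traj = rollout(stub_wm, "You are in a kitchen.", ["open fridge", "take apple"])
```

Long-horizon rollouts (C) are exactly this loop run for many steps, with fidelity judged against the real environment's transitions.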
🧩 World Model Checkpoints
| Environment | Qwen2.5-7B | Llama3.1-8B |
|---|---|---|
| ALFWorld | X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B | X1AOX1A/WorldModel-Alfworld-Llama3.1-8B |
| SciWorld | X1AOX1A/WorldModel-Sciworld-Qwen2.5-7B | X1AOX1A/WorldModel-Sciworld-Llama3.1-8B |
| TextWorld | X1AOX1A/WorldModel-Textworld-Qwen2.5-7B | X1AOX1A/WorldModel-Textworld-Llama3.1-8B |
| Webshop | X1AOX1A/WorldModel-Webshop-Qwen2.5-7B | X1AOX1A/WorldModel-Webshop-Llama3.1-8B |
| StableToolBench | X1AOX1A/WorldModel-Stabletoolbench-Qwen2.5-7B | X1AOX1A/WorldModel-Stabletoolbench-Llama3.1-8B |
🧪 Usage Examples
- For an example interaction between the agent and the world model, see scripts/interact_with_world_model/run.py.
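Conceptually, the script alternates agent action proposals with world-model state predictions until a terminal signal or a round limit. A hedged sketch of that loop, where both callables and the `"DONE"` termination convention are hypothetical stand-ins for the repo's actual agent/server interfaces:

```python
def interact(agent, world_model, init_obs, max_rounds=5):
    """Alternate agent actions and world-model observations until done."""
    obs, transcript = init_obs, []
    for _ in range(max_rounds):
        action = agent(obs)        # agent proposes an action from the observation
        obs = world_model(action)  # world model predicts the next observation
        transcript.append((action, obs))
        if "DONE" in obs:          # hypothetical termination convention
            break
    return transcript

# Toy stand-ins for the LLM agent and the served world model.
agent = lambda obs: "finish" if "room" in obs else "look"
world_model = lambda act: "DONE" if act == "finish" else "start room described"

log = interact(agent, world_model, "start: you wake up")
```

In the actual pipeline, both callables are backed by model servers (e.g. an OpenAI-compatible agent endpoint and a vLLM-served world model), but the control flow is the same.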
🛠️ Env Setup
# install AgentGym-RL
bash scripts/env_setup/uv_agentgym_rl.sh
# install AgentGym environments
bash scripts/env_setup/uv_alfworld.sh
bash scripts/env_setup/uv_sciworld.sh
bash scripts/env_setup/uv_textworld.sh
bash scripts/env_setup/uv_webshop.sh
You can verify each environment by launching its server:
bash scripts/env_server/start_alfworld.sh
bash scripts/env_server/start_sciworld.sh
bash scripts/env_server/start_textworld.sh
bash scripts/env_server/start_webshop.sh
📥 Data Download
Note (ALFWorld): to align with AgentGym, we renamed the action `put` → `move` and added a `help` action (see the updated `.twl2` files under `scripts/download_data`). The original `alfworld-download` workflow is therefore not compatible. If you downloaded ALFWorld data before, please remove the old data and re-download using the command below; otherwise evaluation may be lower than expected.
source uv_agentgym_rl/bin/activate
python scripts/download_data/download_data.py
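Because of the `put` → `move` rename, any trajectories collected with the original ALFWorld action set would need their action strings migrated before reuse. A hypothetical migration sketch (the authoritative action grammar is defined by the updated `.twl2` files, not by this regex):

```python
import re

def migrate_action(action: str) -> str:
    """Rename a leading 'put' verb to 'move' (hypothetical migration helper)."""
    return re.sub(r"^put\b", "move", action.strip())

migrated = migrate_action("put apple in/on countertop")
```

The word boundary (`\b`) keeps verbs like `putdown` untouched; only a standalone leading `put` is rewritten.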
📊 Evaluation
This repo reports three complementary metrics:
- Single-step Accuracy: next-state prediction accuracy under the interaction protocol.
- WM Task Success Rate: agent success when interacting with the learned world model.
- WM2Real Success Rate: success rate when actions taken inside the world model are replayed back in the real environment.
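Single-step accuracy reduces to a comparison between each predicted next state and the gold next state. A minimal sketch, assuming exact match after whitespace normalization (the repo's actual matching rule may be more lenient):

```python
def single_step_accuracy(preds, golds):
    """Fraction of predicted next states that exactly match the gold next state."""
    assert len(preds) == len(golds) and preds
    norm = lambda s: " ".join(s.split())  # normalize whitespace before comparing
    hits = sum(norm(p) == norm(g) for p, g in zip(preds, golds))
    return hits / len(preds)

acc = single_step_accuracy(
    ["You open the fridge.", "Nothing happens."],
    ["You open the  fridge.", "You pick up the apple."],
)
```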
Single-step Accuracy
To compute Single-step Accuracy, run:
TASK=alfworld # alfworld, alfworld_valid_seen, alfworld_valid_unseen, sciworld, textworld, webshop, stabletoolbench
MODEL=X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B # world model checkpoint
OUTPUT_ROOT=outputs/single_step_accuracy/${TASK}/${MODEL} # output root directory
bash scripts/single_step_accuracy/run.sh $TASK $MODEL $OUTPUT_ROOT
Example output:
{
"average_accuracy": 0.9987087517934002
}
Long Horizon Rollouts
1. Interaction with Real Environments
To collect trajectories on the training set with real environments, set
SPLIT=train.
To compute Real Task Success Rate, run:
TASK=alfworld # alfworld, sciworld, textworld, webshop
RUN=0 # run id for multiple runs, just for separating output dirs
API_KEY=your_api_key # your OpenAI API key
API_BASE_URL=your_api_base_url # your OpenAI API base URL
MODEL=gpt-4o # agent model name
MAX_CONCURRENCY=150 # max concurrency
MAX_ROUND=50 # max round
NUM_EXAMPLES=-1 # num examples
SPLIT=test # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs # output root directory
# this will auto-launch the environment server
bash scripts/interact_with_real_env/run_openai.sh $TASK $RUN $API_KEY $API_BASE_URL $MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/real_env/$SPLIT/${TASK}/$MODEL/${TASK}_maxround${MAX_ROUND}_run${RUN}/_metrics.json
Example output:
{
"accuracy": 50.50, # task success rate
"success": 101.0, # total successful interactions
"api_errors": 0, # API errors
"total": 200, # total interactions
"time_seconds": 1075.621458530426 # time taken in seconds
}
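The reported `accuracy` is the success count over the total number of interactions, scaled to a percentage. A quick sanity check against the example output above:

```python
def task_success_rate(success: float, total: int) -> float:
    """Task success rate in percent (success count over total interactions)."""
    return 100 * success / total

# Values taken from the example _metrics.json above: 101 successes out of 200.
rate = task_success_rate(success=101.0, total=200)
```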
<details>
<summary>Run via vLLM server</summary>
TASK=alfworld # alfworld, sciworld, textworld, webshop
RUN=0 # run id for multiple runs, just for separating output dirs
MODEL=Qwen/Qwen2.5-7B-Instruct # agent model name
MAX_CONCURRENCY=150 # max concurrency
MAX_ROUND=20 # max round, reduce to 20 to prevent exceeding the context length
NUM_EXAMPLES=-1 # num examples
SPLIT=test # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs # output root directory
# this will auto-launch the vLLM server and the environment server
bash scripts/interact_with_real_env/run_vllm.sh $TASK $RUN $MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/real_env/$SPLIT/vllm/${TASK}/$MODEL/${TASK}_maxround${MAX_ROUND}_run${RUN}/_metrics.json
</details>
2. Interaction with World Models
To compute WM Task Success Rate, run:
TASK=alfworld # alfworld, sciworld, textworld, webshop
MODEL=gpt-4o # agent model name
API_KEY=your_api_key # your OpenAI API key
API_BASE_URL=your_api_base_url # your OpenAI API base URL
WORLD_MODEL=X1AOX1A/WorldModel-Alfworld-Qwen2.5-7B # world model checkpoint
MAX_CONCURRENCY=150 # max concurrency
MAX_ROUND=50 # max round
NUM_EXAMPLES=-1 # num examples
SPLIT=test # train, test (and valid_seen, valid_unseen for ALFWorld only)
OUTPUT_ROOT=outputs # output root directory
# this will auto-launch the vLLM server for the world model
bash scripts/interact_with_world_model/run.sh $TASK $MODEL $API_KEY $API_BASE_URL $WORLD_MODEL $MAX_CONCURRENCY $MAX_ROUND $NUM_EXAMPLES $SPLIT $OUTPUT_ROOT
# metrics will be saved to outputs/interaction/world_model/$SPLIT/$TASK/$MODEL/$WORLD_MODEL/$MODEL/_metrics.json
Example output:
{
"task": "alfworld", # task name
"agent_model": "gpt-4o", # agent model name
"total_items": 200, # total items
"total_success": 109.0, # total successful interactions
"processed_items": 200, # processed items
"accuracy": 54.50000000000001, # task success rate
"api_errors": 0 # API errors
}
3. Map WM Actions to Real Environments
To compute WM2Real Success Rate, run:
TASK=alfworld # alfworld, alfworld_valid_seen, alfworld_valid_unseen, ...