LWM
Large World Model -- Modeling Text and Video with Millions Context
Large World Model (LWM)
Large World Model (LWM) is a general-purpose large-context multimodal autoregressive model. It is trained on a large dataset of diverse long videos and books using RingAttention, and can perform language, image, and video understanding and generation.
Approach
<div align="center"> <img src="./imgs/data.png"/> </div>

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens.

This paper makes the following contributions:
(a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding.
(b) Solutions for overcoming vision-language training challenges, including masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long sequence chat.
(c) A highly optimized implementation with RingAttention, masked sequence packing, and other key features for training on million-length multimodal sequences.
(d) A fully open-sourced family of 7B-parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens.

This work paves the way for training on massive datasets of long video and language to develop an understanding of both human knowledge and the multimodal world, as well as broader capabilities.
LWM Capabilities
<div align="center"> <img src="./imgs/single_needle_1M.png"/> <p> LWM can retrieve facts across a 1M-token context with high accuracy. </p> </div>
<br />
<div align="center"> <img src="./imgs/long_video_chat_main.png"/> <p> LWM can answer questions over a 1-hour YouTube video. </p> </div>
<br />
<div align="center"> <img src="./imgs/image_chat.png"/> <p> LWM can chat with images. </p> </div>
<br />
<div align="center"> <img src="./imgs/image_video_gen.png"/> <p> LWM can generate videos and images from text. </p> </div>

Setup
This codebase is supported on Ubuntu and has not been tested on Windows or macOS. We recommend using TPUs for training and inference, although it is also possible to use GPUs. On TPU, the code is highly optimized with Jax's Pallas and can achieve high MFUs with RingAttention at very large context sizes. On GPU, the code is based on XLA and is not as optimized as it is for TPU.
Install the requirements with:
conda create -n lwm python=3.10
conda activate lwm
pip install -r gpu_requirements.txt
or set up a TPU VM with:
sh tpu_requirements.sh
Available models
There are language-only and vision-language versions, with context sizes of 32K, 128K, 256K, 512K, and 1M tokens. The vision-language models are available only in Jax, and the language-only models are available in both PyTorch and Jax. The table below lists the available models with their context sizes and capabilities:
| Model Name         | Context Size | Language or Vision-Language | Chat or Base | URL            |
|--------------------|--------------|-----------------------------|--------------|----------------|
| LWM-Text-Chat-128K | 128K         | Language                    | Chat         | [Pytorch][Jax] |
| LWM-Text-Chat-256K | 256K         | Language                    | Chat         | [Pytorch][Jax] |
| LWM-Text-Chat-512K | 512K         | Language                    | Chat         | [Pytorch][Jax] |
| LWM-Text-Chat-1M   | 1M           | Language                    | Chat         | [Pytorch][Jax] |
| LWM-Text-128K      | 128K         | Language                    | Base         | [Pytorch][Jax] |
| LWM-Text-256K      | 256K         | Language                    | Base         | [Pytorch][Jax] |
| LWM-Text-512K      | 512K         | Language                    | Base         | [Pytorch][Jax] |
| LWM-Text-1M        | 1M           | Language                    | Base         | [Pytorch][Jax] |
| LWM-Chat-32K       | 32K          | Vision-Language             | Chat         | [Jax]          |
| LWM-Chat-128K      | 128K         | Vision-Language             | Chat         | [Jax]          |
| LWM-Chat-1M        | 1M           | Vision-Language             | Chat         | [Jax]          |
Code structure
Use scan_query_chunk_size and scan_key_chunk_size to control the block size in blockwise compute of the self-attention. Use scan_mlp_chunk_size to control the block size in blockwise compute of the feedforward network. Set scan_attention and scan_mlp to True or False to enable or disable blockwise compute in the self-attention and the feedforward network, respectively.
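To illustrate what a key chunk size controls, here is a minimal pure-Python sketch of blockwise attention for a single query. It is not the repository's Jax implementation: the function names and the `key_chunk_size` argument are hypothetical and only mirror the role of scan_key_chunk_size. The sketch visits keys and values one chunk at a time and keeps a running maximum and running sum, so the full score vector never has to be held at once.

```python
import math

def full_attention(q, keys, values):
    """Reference: softmax(q . k) weighted average of values for one query."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def blockwise_attention(q, keys, values, key_chunk_size):
    """Same result as full_attention, computed one key/value chunk at a time."""
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # sum of exp(score - running_max)
    acc = [0.0] * dim         # weighted sum of values, same scaling
    for start in range(0, len(keys), key_chunk_size):
        k_chunk = keys[start:start + key_chunk_size]
        v_chunk = values[start:start + key_chunk_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_chunk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [acc[d] * scale + sum(w * v[d] for w, v in zip(weights, v_chunk))
               for d in range(dim)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Illustrative data: the chunked result matches the reference up to rounding.
q = [0.3, -1.2]
keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reference = full_attention(q, keys, values)
chunked = blockwise_attention(q, keys, values, key_chunk_size=2)
```

Smaller chunks lower peak memory at the cost of more loop iterations; the output is the same for any chunk size.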
Use mesh_dim=dp,fsdp,tp,sp to control the degree of parallelism and RingAttention. It is a string of 4 comma-separated integers giving the degrees of data parallelism, fully sharded data parallelism, tensor parallelism, and sequence parallelism, in that order.
For example, mesh_dim='1,64,4,1' means 1 data parallelism, 64 fully sharded data parallelism, 4 tensor parallelism, and 1 sequence parallelism. mesh_dim='1,1,4,64' means 1 data parallelism, 1 fully sharded data parallelism, 4 tensor parallelism, and 64 sequence parallelism for RingAttention.
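The mesh_dim format can be made concrete with a small parsing sketch. The helpers `parse_mesh_dim` and `devices_required` are hypothetical and not part of the codebase; the sketch also assumes every entry is a positive integer and that, as with device meshes in general, the product of the four axes equals the number of devices used.

```python
from math import prod

# Axis order as described for mesh_dim: dp, fsdp, tp, sp.
AXES = ("dp", "fsdp", "tp", "sp")

def parse_mesh_dim(mesh_dim):
    """Split a 'dp,fsdp,tp,sp' string into a dict of axis sizes."""
    sizes = [int(part) for part in mesh_dim.split(",")]
    if len(sizes) != len(AXES):
        raise ValueError(f"expected {len(AXES)} integers, got {len(sizes)}")
    if any(size < 1 for size in sizes):
        raise ValueError("this sketch only handles positive axis sizes")
    return dict(zip(AXES, sizes))

def devices_required(mesh):
    """Number of devices implied by the mesh (assumed: product of all axes)."""
    return prod(mesh.values())

fsdp_heavy = parse_mesh_dim("1,64,4,1")   # 64-way FSDP, 4-way tensor parallel
ring_heavy = parse_mesh_dim("1,1,4,64")   # 4-way tensor, 64-way sequence parallel
```

Under that assumption, both example settings describe the same device count and differ only in how those devices are split across the axes.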
Running Jax Models
In this section, we provide instructions on how to run each of the provided scripts. For each script, you may need to fill in your own paths and values for the variables defined at the beginning of the script.
To run each of the following scripts, use bash <script_name>.sh:
- Language model training: bash scripts/run_train_text.sh
- Vision-Language model training: bash scripts/run_train_vision_text.sh
- Single Needle Evals (Language Model): bash scripts/run_eval_needle.sh
- Multi Needle Evals (Language Model): bash scripts/run_eval_needle_multi.sh
- Sampling images (Vision-Language Model): bash scripts/run_sample_image.sh
- Sampling videos (Vision-Language Model): bash scripts/run_sample_video.sh
- Image / Video understanding (Vision-Language Model): bash scripts/run_vision_chat.sh
By default the mesh_dim argument puts all devices on tp (tensor parallelism). For longer sequences, you may want to include sp, which is the last dimension in the mesh_dim.
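As a rough sketch of moving devices from tp to sp, the hypothetical helper below builds a mesh_dim string for a given device count. It assumes dp and fsdp stay at 1 and that tp takes whatever devices sp does not use; neither the function nor that policy comes from the scripts themselves.

```python
def mesh_dim_for(num_devices, sp=1):
    """Build a 'dp,fsdp,tp,sp' string with dp=fsdp=1 (an assumption of this sketch)."""
    if sp < 1 or num_devices % sp != 0:
        raise ValueError("sp must be a positive divisor of num_devices")
    tp = num_devices // sp  # remaining devices go to tensor parallelism
    return f"1,1,{tp},{sp}"

all_tensor = mesh_dim_for(8)             # every device on tp
long_context = mesh_dim_for(256, sp=64)  # most devices on sp for long sequences
```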
When running needle evals, you may need to adjust the theta and max_sequence_length arguments in the scripts depending on the model. The table below shows the correct values for each model.
|                     | LWM-Text-128K / LWM-Text-Chat-128K | LWM-Text-256K / LWM-Text-Chat-256K | LWM-Text-512K / LWM-Text-Chat-512K | LWM-Text-1M / LWM-Text-Chat-1M |
|---------------------|:----------------------------------:|:----------------------------------:|:----------------------------------:|:------------------------------:|
| theta               | 10000000                           | 10000000                           | 25000000                           | 50000000                       |
| max_sequence_length | 131072                             | 262144                             | 524288                             | 1048576                        |
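If you script the evals, the table can be kept as a small lookup. The dictionary and the `needle_settings` helper below are hypothetical conveniences, not part of the codebase; the values are copied from the table above, and the helper assumes the context size is the last hyphen-separated part of the model name.

```python
# theta and max_sequence_length per context size, copied from the table above.
NEEDLE_EVAL_SETTINGS = {
    "128K": {"theta": 10_000_000, "max_sequence_length": 131_072},
    "256K": {"theta": 10_000_000, "max_sequence_length": 262_144},
    "512K": {"theta": 25_000_000, "max_sequence_length": 524_288},
    "1M": {"theta": 50_000_000, "max_sequence_length": 1_048_576},
}

def needle_settings(model_name):
    """Look up settings from a name such as 'LWM-Text-Chat-256K'."""
    context_size = model_name.rsplit("-", 1)[-1]
    if context_size not in NEEDLE_EVAL_SETTINGS:
        raise KeyError(f"no needle eval settings listed for {model_name!r}")
    return NEEDLE_EVAL_SETTINGS[context_size]

settings = needle_settings("LWM-Text-1M")
```

The base and chat variants of each context size share the same values, so one entry per context size is enough.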
An example of filling out a script (run_sample_video.sh) is as follows:
#! /bin/bash
export SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
export PROJECT_DIR="$( cd -- "$( dirname -- "$SCRIPT_DIR" )" &> /dev/null && pwd )"
cd $PROJECT_DIR
export PYTHONPATH="$PYTHONPATH:$PROJECT_DIR"
export llama_tokenizer_path="LargeWorldMod
