MemoryVLA
[ICLR 2026] Code of "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation"
MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang
Tsinghua University, Dexmal, MEGVII, TJU, HiT, StepFun
ICLR 2026
This is the code for the paper "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation".
🏠Project Page | 📑Paper | 🤗Models & Logs
🌟 News
- 🔥 [2026-1-27] Our paper MemoryVLA is accepted by ICLR 2026!
- 🔥 [2025-11-5] The code of MemoryVLA is released! (Both MemoryVLA and MemoryVLA+)
- 🔥 [2025-10-20] Our VLA codebase Dexbotic is released; it now fully integrates MemoryVLA!
- 🔥 [2025-8-26] Our paper MemoryVLA is now on arxiv!
Overview
MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.

We release two versions of the code in separate branches:
- MemoryVLA: built upon the OpenVLA codebase.
- MemoryVLA+: built upon our self-developed Dexbotic codebase, which offers higher simulation performance.
TODO
All components are now available, and we will continue to refine and improve the code.
- [x] Code Release
  - [x] MemoryVLA (OpenVLA codebase)
  - [x] MemoryVLA+ (Dexbotic codebase)
- [x] Model Weights Release
- [x] Dataset Upload to HuggingFace
Contents
This is MemoryVLA based on the OpenVLA codebase; if you want the Dexbotic codebase instead, please use MemoryVLA+.
- Model Zoo & Benchmark Results
- Install
- Training
- Evaluation in SimplerEnv
- Evaluation in LIBERO
- Deployment in The Real World
- FAQ
- Citation
Model Zoo & Benchmark Results
All results use only third-person RGB images and language instructions, without wrist-view images or robot state. MemoryVLA denotes the OpenVLA-codebase version; MemoryVLA+ denotes the Dexbotic-codebase version.
Bridge
| Model      | Spoon | Carrot | Cube | Eggplant | Avg. | CKPT & Logs |
| ---------- | ----- | ------ | ---- | -------- | ---- | ----------- |
| MemoryVLA  | 75.0  | 75.0   | 37.5 | 100.0    | 71.9 | 🤗 HF       |
| MemoryVLA+ | 100.0 | 66.7   | 70.8 | 100.0    | 84.4 | 🤗 HF       |
LIBERO
| Model            | Spatial | Object | Goal | Long-10 | Long-90 | Avg. | CKPT & Logs |
| ---------------- | ------- | ------ | ---- | ------- | ------- | ---- | ----------- |
| MemoryVLA        | 98.4    | 98.4   | 96.4 | 93.4    | 95.6    | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+       | 98.2    | 97.8   | 96.4 | 93.6    | 96.2    | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+ (mix) | 97.2    | 99.2   | 98.4 | 93.2    | 97.2    | 97.1 | 🤗 HF |
Fractal-VM
| Model      | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
| ---------- | -------- | --------- | ----------------- | ------------- | ---- | ----------- |
| MemoryVLA  | 90.7     | 88.0      | 84.7              | 47.2          | 77.7 | 🤗 HF       |
| MemoryVLA+ | 92.0     | 91.7      | 71.8              | -             | -    | 🤗 HF       |
Fractal-VA
| Model      | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
| ---------- | -------- | --------- | ----------------- | ------------- | ---- | ----------- |
| MemoryVLA  | 80.5     | 78.8      | 53.2              | 58.3          | 67.7 | 🤗 HF       |
| MemoryVLA+ | 83.5     | 81.8      | 63.2              | -             | -    | 🤗 HF       |
ManiSkill2
| Model      | Pick Cube | Stack Cube | Pick Single YCB | Pick Single EGAD | Pick Clutter YCB | Avg. | CKPT & Logs |
| ---------- | --------- | ---------- | --------------- | ---------------- | ---------------- | ---- | ----------- |
| MemoryVLA+ | 85        | 75         | 60              | 85               | 45               | 70   | 🤗 HF       |
Install
The code is built with Python 3.10, PyTorch 2.2.0, and CUDA 12.1 (it may run with lower versions, but we have not tested them).
We recommend using Miniconda and setting up an environment:
```shell
conda create --name memvla python=3.10
conda activate memvla
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y
```
If you need the training code, please also install Flash Attention; we use flash-attn==2.5.5:
```shell
# Install Flash Attention 2.5.5; this example is for PyTorch 2.2 + CUDA 12.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
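The wheel filename above encodes the flash-attn version, CUDA build, PyTorch version, ABI flag, and Python tag, and all of them must match your environment. A minimal sketch of how the pieces assemble (the helper name is ours, not part of the repo), useful for picking the right wheel from the flash-attention release page:

```python
def flash_attn_wheel_name(flash="2.5.5", cuda="cu122", torch_v="2.2", py="cp310"):
    """Assemble a flash-attention release wheel filename from its components.

    Defaults reproduce the wheel used in this README (PyTorch 2.2, CUDA 12.x,
    Python 3.10 on linux_x86_64). Adjust the arguments for other setups.
    """
    return (f"flash_attn-{flash}+{cuda}torch{torch_v}"
            f"cxx11abiFALSE-{py}-{py}-linux_x86_64.whl")
```

For example, for Python 3.11 you would pass `py="cp311"` and look for the resulting filename under the v2.5.5 release assets.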
Next, clone our repo and install the required packages:
```shell
git clone https://github.com/shihao1895/MemoryVLA
cd MemoryVLA
pip install -e .
```
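To confirm the key dependencies are importable after installation, a small stdlib-only sanity check can help (hypothetical helper, not part of the repo):

```python
import importlib.util

def check_missing(packages):
    """Return the packages from `packages` that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]

# Example (package list is an assumption based on this README's install steps):
# check_missing(["torch", "torchvision", "flash_attn"]) -> [] when all installed
```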
If you are using an NVIDIA Hopper GPU (e.g., H20) and encounter the error "Floating point exception (core dumped)", try reinstalling the specific cuBLAS version below:
```shell
# Fix for NVIDIA H20: "Floating point exception (core dumped)"
pip install nvidia-cublas-cu12==12.4.5.8
```
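To verify which cuBLAS build actually ended up installed, you can query package metadata with the standard library (hypothetical helper; `nvidia-cublas-cu12` is the distribution name used in the command above):

```python
from importlib import metadata

def package_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# After applying the fix, package_version("nvidia-cublas-cu12")
# should report 12.4.5.8.
```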
Training
- Prepare training datasets in RLDS format:
  - LIBERO (including Spatial, Object, Goal, Long-10, Long-90 suites)
  - Bridge from Open X-Embodiment (OXE)
  - Fractal from Open X-Embodiment (OXE)

  ```shell
  # Make sure you have git-lfs installed (https://git-lfs.com)
  git lfs install

  # Download the LIBERO dataset (processed, ~22 GB)
  git clone https://huggingface.co/datasets/shihao1895/libero-rlds

  # Download the Bridge dataset (processed, ~157 GB)
  git clone https://huggingface.co/datasets/shihao1895/bridge-rlds

  # Download the Fractal dataset (processed)
  git clone https://huggingface.co/datasets/shihao1895/fractal-rlds
  ```
- Download pretrained models: we use the OpenVLA pretrained model for LIBERO training, and the CogACT pretrained model for Bridge and Fractal training.

  ```shell
  # Download OpenVLA pretrained checkpoint (~30 GB)
  git clone https://huggingface.co/openvla/openvla-7b-prismatic

  # Download CogACT pretrained checkpoint (~31 GB)
  git clone https://huggingface.co/CogACT/CogACT-Large
  ```

- Train the model on different datasets.

  Before training, modify several parameters in the corresponding scripts, such as `hf_token`, `wandb_entity`, checkpoint paths, dataset paths, and log directories. We train on a single node with 8× NVIDIA A100 GPUs.

  ```shell
  # Train on the Bridge dataset
  bash script/train/bridge/train_bridge.sh

  # Train on the LIBERO-Spatial dataset
  bash script/train/libero/train_libero_spatial.sh

  # Train on the LIBERO-Object dataset
  bash script/train/libero/train_libero_object.sh

  # Train on the LIBERO-Goal dataset
  bash script/train/l
  ```
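If training later fails to find data, a common cause is that `git lfs` did not actually pull the large TFRecord shards during the dataset clones in step 1. A hedged, stdlib-only sanity check (helper name is ours; it only assumes the RLDS datasets are stored as TFRecord shards):

```python
from pathlib import Path

def list_rlds_shards(dataset_dir):
    """Return sorted TFRecord shard paths under a cloned RLDS dataset directory.

    An empty result for a cloned dataset usually means git-lfs pointers were
    checked out instead of the real files; re-run `git lfs pull` in that repo.
    """
    return sorted(str(p) for p in Path(dataset_dir).rglob("*.tfrecord*"))

# Example: list_rlds_shards("libero-rlds") should return a non-empty list
# after a successful LFS download.
```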
