MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang

Tsinghua University, Dexmal, MEGVII, TJU, HiT, StepFun

ICLR 2026

This is the code for the paper "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation".

🏠Project Page | 📑Paper | 🤗Models & Logs

🌟 News

  • 🔥 [2026-1-27] Our paper MemoryVLA is accepted by ICLR 2026!
  • 🔥 [2025-11-5] The code of MemoryVLA is released! (Both MemoryVLA and MemoryVLA+)
  • 🔥 [2025-10-20] Our VLA codebase Dexbotic is released; it now fully integrates MemoryVLA!
  • 🔥 [2025-8-26] Our paper MemoryVLA is now on arXiv!

Overview

MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.
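The actual model implements this memory with transformer modules; purely as a conceptual sketch (the class name, shapes, attention-style read, and the merge heuristic below are all illustrative assumptions, not the released implementation), a read/write/consolidate loop over a bounded token memory could look like:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class PerceptualCognitiveMemory:
    """Illustrative token memory bank: attention-style read, append-on-write,
    and consolidation by merging the most similar adjacent pair when full."""

    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.dim = dim
        self.tokens = np.zeros((0, dim))

    def read(self, query):
        """Retrieve past context relevant to the current observation token."""
        if len(self.tokens) == 0:
            return query
        scores = self.tokens @ query / np.sqrt(self.dim)  # (N,)
        weights = softmax(scores)
        retrieved = weights @ self.tokens                 # (dim,)
        return query + retrieved  # residual fusion of current and past context

    def write(self, token):
        """Store the current step's token, consolidating if over capacity."""
        self.tokens = np.vstack([self.tokens, token])
        if len(self.tokens) > self.capacity:
            # Merge the most similar adjacent pair so memory stays bounded.
            i = int(np.argmax((self.tokens[:-1] * self.tokens[1:]).sum(axis=1)))
            merged = (self.tokens[i] + self.tokens[i + 1]) / 2.0
            self.tokens = np.vstack(
                [self.tokens[:i], merged[None], self.tokens[i + 2:]]
            )

# Per-step usage: fuse the current observation with memory, then store it.
rng = np.random.default_rng(0)
mem = PerceptualCognitiveMemory(capacity=4, dim=8)
for _ in range(6):
    obs_token = rng.normal(size=8)
    fused = mem.read(obs_token)   # condition the action head on memory
    mem.write(obs_token)          # update memory with the new observation
```

The bounded capacity plus merge step mirrors the intuition of consolidating redundant history rather than storing every frame, which is what makes long-horizon conditioning tractable.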

MemoryVLA Overview

We release two versions of the code in separate branches:

  • MemoryVLA: built upon the OpenVLA codebase.
  • MemoryVLA+: built upon our self-developed Dexbotic codebase, which offers higher simulation performance.

TODO

All components are now available, and we will continue to refine and improve the code.

  • [x] Code Release

    • [x] MemoryVLA (OpenVLA codebase)
    • [x] MemoryVLA+ (Dexbotic codebase)
  • [x] Model Weights Release

  • [x] Dataset Upload to HuggingFace

Contents

This branch contains MemoryVLA based on the OpenVLA codebase; if you need the Dexbotic codebase, please use MemoryVLA+.

Model Zoo & Benchmark Results

All benchmarks use only third-person RGB images and language instructions, without wrist-view images or proprioceptive state. MemoryVLA denotes the OpenVLA-codebase version; MemoryVLA+ denotes the Dexbotic-codebase version.

Bridge

| Model      | Spoon | Carrot | Cube | Eggplant | Avg. | CKPT & Logs |
| ---------- | ----- | ------ | ---- | -------- | ---- | ----------- |
| MemoryVLA  | 75.0  | 75.0   | 37.5 | 100.0    | 71.9 | 🤗 HF       |
| MemoryVLA+ | 100.0 | 66.7   | 70.8 | 100.0    | 84.4 | 🤗 HF       |

LIBERO

| Model            | Spatial | Object | Goal | Long-10 | Long-90 | Avg. | CKPT & Logs |
| ---------------- | ------- | ------ | ---- | ------- | ------- | ---- | ----------- |
| MemoryVLA        | 98.4    | 98.4   | 96.4 | 93.4    | 95.6    | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+       | 98.2    | 97.8   | 96.4 | 93.6    | 96.2    | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+ (mix) | 97.2    | 99.2   | 98.4 | 93.2    | 97.2    | 97.1 | 🤗 HF |

Fractal-VM

| Model      | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
| ---------- | -------- | --------- | ----------------- | ------------- | ---- | ----------- |
| MemoryVLA  | 90.7     | 88.0      | 84.7              | 47.2          | 77.7 | 🤗 HF       |
| MemoryVLA+ | 92.0     | 91.7      | 71.8              | -             | -    | 🤗 HF       |

Fractal-VA

| Model      | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
| ---------- | -------- | --------- | ----------------- | ------------- | ---- | ----------- |
| MemoryVLA  | 80.5     | 78.8      | 53.2              | 58.3          | 67.7 | 🤗 HF       |
| MemoryVLA+ | 83.5     | 81.8      | 63.2              | -             | -    | 🤗 HF       |

ManiSkill2

| Model      | Pick Cube | Stack Cube | Pick Single YCB | Pick Single EGAD | Pick Clutter YCB | Avg. | CKPT & Logs |
| ---------- | --------- | ---------- | --------------- | ---------------- | ---------------- | ---- | ----------- |
| MemoryVLA+ | 85        | 75         | 60              | 85               | 45               | 70   | 🤗 HF       |

Install

The code is built with Python 3.10; we use PyTorch 2.2.0 and CUDA 12.1 (lower versions may work, but we have not tested them).

We recommend using Miniconda and setting up an environment:

conda create --name memvla python=3.10
conda activate memvla

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y

If you need the training code, please also install Flash Attention; we use flash-attn==2.5.5:

# Install Flash Attention 2.5.5, this is an example for pytorch2.2-cuda12.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Next, clone our repo and install the required packages:

git clone https://github.com/shihao1895/MemoryVLA
cd MemoryVLA
pip install -e .

If you are using an NVIDIA Hopper GPU (e.g., H20) and encounter the error
“Floating point exception (core dumped)”, try reinstalling the specific cuBLAS version below:

# Fix for NVIDIA H20: "Floating point exception (core dumped)"
pip install nvidia-cublas-cu12==12.4.5.8
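As a quick sanity check after installation (not part of the repo, just a minimal snippet), you can verify that the pinned packages are importable in the active environment:

```python
# Reports whether the key pinned packages resolve in the current interpreter.
import importlib.util

for pkg in ("torch", "torchvision", "flash_attn"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```

If `flash_attn` shows MISSING, training will fail at startup even though inference-only paths may still work.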

Training

  1. Prepare training dataset with RLDS format:

    # Make sure you have git-lfs installed (https://git-lfs.com)
    git lfs install
    # Download the LIBERO dataset (processed, ~22 GB)
    git clone https://huggingface.co/datasets/shihao1895/libero-rlds
    # Download the Bridge dataset (processed, ~157 GB)
    git clone https://huggingface.co/datasets/shihao1895/bridge-rlds
    # Download the Fractal dataset (processed)
    git clone https://huggingface.co/datasets/shihao1895/fractal-rlds
    
  2. Download the pretrained models. We use the OpenVLA pretrained model for LIBERO training, and the CogACT pretrained model for Bridge and Fractal training.

    # Download OpenVLA pretrained checkpoint (~30 GB)
    git clone https://huggingface.co/openvla/openvla-7b-prismatic
    
    # Download CogACT pretrained checkpoint (~31 GB)
    git clone https://huggingface.co/CogACT/CogACT-Large
    
  3. Train the model on different datasets

    Before training, modify several parameters in the corresponding scripts, such as hf_token, wandb_entity, checkpoint paths, dataset paths, and log directories.

    We train on a single node with 8× NVIDIA A100 GPUs.

    # Train on the Bridge dataset
    bash script/train/bridge/train_bridge.sh
    # Train on the LIBERO-Spatial dataset
    bash script/train/libero/train_libero_spatial.sh
    # Train on the LIBERO-Object dataset
    bash script/train/libero/train_libero_object.sh
    # Train on the LIBERO-Goal dataset
    bash script/train/l
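For reference, the parameters to edit before launching typically look like the fragment below (all values are placeholders and the exact variable names may differ from the released scripts — check `script/train/*.sh`):

```shell
# Hypothetical excerpt of a training script's config section.
HF_TOKEN="hf_xxx"                              # HuggingFace token for gated downloads
WANDB_ENTITY="your-wandb-entity"               # Weights & Biases account/team
PRETRAINED_CKPT="/path/to/openvla-7b-prismatic" # or CogACT-Large for Bridge/Fractal
DATA_ROOT="/path/to/libero-rlds"               # RLDS dataset root from step 1
LOG_DIR="/path/to/logs"                        # checkpoints and training logs
```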
    
