MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, Gao Huang

Tsinghua University, Dexmal, MEGVII, TJU, HiT, StepFun

ICLR 2026

This is the code for the paper "MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation".

🏠Project Page | 📑Paper | 🤗Models & Logs

🌟 News

  • 🔥 [2026-1-27] Our paper MemoryVLA is accepted by ICLR 2026!
  • 🔥 [2025-11-5] The code of MemoryVLA is released! (Both MemoryVLA and MemoryVLA+)
  • 🔥 [2025-10-20] Our VLA codebase Dexbotic is released; it now fully integrates MemoryVLA!
  • 🔥 [2025-8-26] Our paper MemoryVLA is now on arXiv!

Overview

MemoryVLA is a Cognition-Memory-Action framework for robotic manipulation inspired by human memory systems. It builds a hippocampal-like perceptual-cognitive memory to capture the temporal dependencies essential for current decision-making, enabling long-horizon, temporally aware action generation.
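The actual model implements this memory with transformer modules; purely as a conceptual sketch (the class name, shapes, attention-style read, and the merge heuristic below are all illustrative assumptions, not the released implementation), a read/write/consolidate loop over a bounded token memory could look like:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class PerceptualCognitiveMemory:
    """Illustrative token memory bank: attention-style read, append-on-write,
    and consolidation by merging the most similar adjacent pair when full."""

    def __init__(self, capacity, dim):
        self.capacity = capacity
        self.dim = dim
        self.tokens = np.zeros((0, dim))

    def read(self, query):
        """Retrieve past context relevant to the current observation token."""
        if len(self.tokens) == 0:
            return query
        scores = self.tokens @ query / np.sqrt(self.dim)  # (N,)
        weights = softmax(scores)
        retrieved = weights @ self.tokens                 # (dim,)
        return query + retrieved  # residual fusion of current and past context

    def write(self, token):
        """Store the current step's token, consolidating if over capacity."""
        self.tokens = np.vstack([self.tokens, token])
        if len(self.tokens) > self.capacity:
            # Merge the most similar adjacent pair so memory stays bounded.
            i = int(np.argmax((self.tokens[:-1] * self.tokens[1:]).sum(axis=1)))
            merged = (self.tokens[i] + self.tokens[i + 1]) / 2.0
            self.tokens = np.vstack(
                [self.tokens[:i], merged[None], self.tokens[i + 2:]]
            )

# Per-step usage: fuse the current observation with memory, then store it.
rng = np.random.default_rng(0)
mem = PerceptualCognitiveMemory(capacity=4, dim=8)
for _ in range(6):
    obs_token = rng.normal(size=8)
    fused = mem.read(obs_token)   # condition the action head on memory
    mem.write(obs_token)          # update memory with the new observation
```

The bounded capacity plus merge step mirrors the intuition of consolidating redundant history rather than storing every frame, which is what makes long-horizon conditioning tractable.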

MemoryVLA Overview

We release two versions of the code in separate branches:

  • MemoryVLA: built upon the OpenVLA codebase.
  • MemoryVLA+: built upon our self-developed Dexbotic codebase, which offers higher simulation performance.

TODO

All components are now available, and we will continue to refine and improve the code.

  • [x] Code Release

    • [x] MemoryVLA (OpenVLA codebase)
    • [x] MemoryVLA+ (Dexbotic codebase)
  • [x] Model Weights Release

  • [x] Dataset Upload to HuggingFace

Contents

This branch contains MemoryVLA based on the OpenVLA codebase; if you need the Dexbotic codebase, please use MemoryVLA+.

Model Zoo & Benchmark Results

All benchmarks use only third-person RGB images and language instructions, without wrist-view images or proprioceptive state. MemoryVLA denotes the OpenVLA-codebase version; MemoryVLA+ denotes the Dexbotic-codebase version.

Bridge

| Model      | Spoon | Carrot | Cube | Eggplant | Avg. | CKPT & Logs |
| ---------- | ----- | ------ | ---- | -------- | ---- | ----------- |
| MemoryVLA  | 75.0  | 75.0   | 37.5 | 100.0    | 71.9 | 🤗 HF       |
| MemoryVLA+ | 100.0 | 66.7   | 70.8 | 100.0    | 84.4 | 🤗 HF       |

LIBERO

| Model            | Spatial | Object | Goal | Long-10 | Long-90 | Avg. | CKPT & Logs |
| ---------------- | ------- | ------ | ---- | ------- | ------- | ---- | ----------- |
| MemoryVLA        | 98.4    | 98.4   | 96.4 | 93.4    | 95.6    | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+       | 98.2    | 97.8   | 96.4 | 93.6    | 96.2    | 96.5 | 🤗 Spa, 🤗 Obj, 🤗 Goal, 🤗 100 |
| MemoryVLA+ (mix) | 97.2    | 99.2   | 98.4 | 93.2    | 97.2    | 97.1 | 🤗 HF |

Fractal-VM

| Model      | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
| ---------- | -------- | --------- | ----------------- | ------------- | ---- | ----------- |
| MemoryVLA  | 90.7     | 88.0      | 84.7              | 47.2          | 77.7 | 🤗 HF       |
| MemoryVLA+ | 92.0     | 91.7      | 71.8              | -             | -    | 🤗 HF       |

Fractal-VA

| Model      | Coke Can | Move Near | Open/Close Drawer | Put In Drawer | Avg. | CKPT & Logs |
| ---------- | -------- | --------- | ----------------- | ------------- | ---- | ----------- |
| MemoryVLA  | 80.5     | 78.8      | 53.2              | 58.3          | 67.7 | 🤗 HF       |
| MemoryVLA+ | 83.5     | 81.8      | 63.2              | -             | -    | 🤗 HF       |

ManiSkill2

| Model      | Pick Cube | Stack Cube | Pick Single YCB | Pick Single EGAD | Pick Clutter YCB | Avg. | CKPT & Logs |
| ---------- | --------- | ---------- | --------------- | ---------------- | ---------------- | ---- | ----------- |
| MemoryVLA+ | 85        | 75         | 60              | 85               | 45               | 70   | 🤗 HF       |

Install

The code is built with Python 3.10; we use PyTorch 2.2.0 and CUDA 12.1 (lower versions may work, but we have not tested them).

We recommend using Miniconda and setting up an environment:

conda create --name memvla python=3.10
conda activate memvla

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 --index-url https://download.pytorch.org/whl/cu121
conda install -c nvidia cuda-nvcc=12.1 cuda-toolkit=12.1 -y

If you need the training code, please also install Flash Attention; we use flash-attn==2.5.5:

# Install Flash Attention 2.5.5, this is an example for pytorch2.2-cuda12.1
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.5/flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Next, clone our repo and install the required packages:

git clone https://github.com/shihao1895/MemoryVLA
cd MemoryVLA
pip install -e .

If you are using an NVIDIA Hopper GPU (e.g., H20) and encounter the error
“Floating point exception (core dumped)”, try reinstalling the specific cuBLAS version below:

# Fix for NVIDIA H20: "Floating point exception (core dumped)"
pip install nvidia-cublas-cu12==12.4.5.8
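As a quick sanity check after installation (not part of the repo, just a minimal snippet), you can verify that the pinned packages are importable in the active environment:

```python
# Reports whether the key pinned packages resolve in the current interpreter.
import importlib.util

for pkg in ("torch", "torchvision", "flash_attn"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```

If `flash_attn` shows MISSING, training will fail at startup even though inference-only paths may still work.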

Training

  1. Prepare training dataset with RLDS format:

    # Make sure you have git-lfs installed (https://git-lfs.com)
    git lfs install
    # Download the LIBERO dataset (processed, ~22 GB)
    git clone https://huggingface.co/datasets/shihao1895/libero-rlds
    # Download the Bridge dataset (processed, ~157 GB)
    git clone https://huggingface.co/datasets/shihao1895/bridge-rlds
    # Download the Fractal dataset (processed)
    git clone https://huggingface.co/datasets/shihao1895/fractal-rlds
    
  2. Download the pretrained models. We use the OpenVLA pretrained model for LIBERO training, and the CogACT pretrained model for Bridge and Fractal training.

    # Download OpenVLA pretrained checkpoint (~30 GB)
    git clone https://huggingface.co/openvla/openvla-7b-prismatic
    
    # Download CogACT pretrained checkpoint (~31 GB)
    git clone https://huggingface.co/CogACT/CogACT-Large
    
  3. Train the model on different datasets

    Before training, modify several parameters in the corresponding scripts, such as hf_token, wandb_entity, checkpoint paths, dataset paths, and log directories.

    We train on a single node with 8× NVIDIA A100 GPUs.

    # Train on the Bridge dataset
    bash script/train/bridge/train_bridge.sh
    # Train on the LIBERO-Spatial dataset
    bash script/train/libero/train_libero_spatial.sh
    # Train on the LIBERO-Object dataset
    bash script/train/libero/train_libero_object.sh
    # Train on the LIBERO-Goal dataset
    bash script/train/l
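For reference, the parameters to edit before launching typically look like the fragment below (all values are placeholders and the exact variable names may differ from the released scripts — check `script/train/*.sh`):

```shell
# Hypothetical excerpt of a training script's config section.
HF_TOKEN="hf_xxx"                              # HuggingFace token for gated downloads
WANDB_ENTITY="your-wandb-entity"               # Weights & Biases account/team
PRETRAINED_CKPT="/path/to/openvla-7b-prismatic" # or CogACT-Large for Bridge/Fractal
DATA_ROOT="/path/to/libero-rlds"               # RLDS dataset root from step 1
LOG_DIR="/path/to/logs"                        # checkpoints and training logs
```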
    
