VisualRWKV
VisualRWKV is the visual-enhanced version of the RWKV language model, enabling RWKV to handle various visual tasks.
Install / Use
/learn @howard-hou/VisualRWKVREADME
VisualRWKV: A Visual Language Model Based on RWKV
<p align="center"> <img src="./rwkv_emoji.png" alt="Logo" width="200"> </p>VisualRWKV is a visual language model based on the RWKV language model, enabling RWKV to handle various visual tasks.
Key Papers:
- VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models
- Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence
🚀 News and Updates
- 2025.02.10 🔥 VisualRWKV-7.00 checkpoints released! [weights]
- 2024.01.11 🔥 VisualRWKV-7.00 code released! [code]
- 2024.06.25 🔥 VisualRWKV-6.0 checkpoints released! [weights]
- 2024.05.11 🔥 VisualRWKV-6.0 code released! [code]
- 2024.03.25 🔥 VisualRWKV-5.0 released!
📊 VisualRWKV v7.0 Metrics
The following table presents the performance comparison between VisualRWKV v7.0 and its predecessor VisualRWKV v6 across several benchmark datasets.
| Model Name | VQAv2(test-dev) | ScienceQA(IMG) | TextVQA | GQA(acc) | Vision Encoder | |--------------------|--------------------|----------------|---------|----------|----------------------------------------------| | v0700+0b1 | 75.22 | 50.62 | 37.90 | 59.92 | SigLIP+dinov2+Sam | | v0700+0b4 | 77.85 | 54.98 | 41.05 | 62.30 | SigLIP+dinov2+Sam | | v0700+1b5 | 79.84 | 59.74 | 49.49 | 63.20 | SigLIP+dinov2+Sam | | VisualRWKV - v6 1.6B | 73.66 | 57.02 | 48.70 | 58.23 | SigLIP+dinov2+Sam | | VisualRWKV - v6 3B | 71.52 | 65.34 | 48.68 | 59.56 | CLIP | | VisualRWKV - v6 7B | 75.82 | 68.22 | 51.01 | 64.27 | CLIP |
🏗️ Architecture
<p align="center"> <img src="./VisualRWKV-arch.png" alt="VisualRWKV Architecture" width="800"> </p>🦄 Model Zoo
VisualRWKV weights, checkpoints, and related results can be found in the Model Zoo.
💻 Installation
1. Clone the repository
Clone the repo and navigate to the VisualRWKV folder. Version 7.00 is the stable release.
git clone https://github.com/howard-hou/VisualRWKV.git
cd VisualRWKV-v7/v7.00
2. Install dependencies
Create a conda environment and install the necessary packages.
conda create -n visualrwkv python=3.10 -y
conda activate visualrwkv
pip install --upgrade pip # Enable PEP 660 support
# Install dependencies:
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install pytorch-lightning==1.9.5 deepspeed==0.7.0 wandb ninja
# For best performance, use the following:
pip install torch --upgrade --extra-index-url https://download.pytorch.org/whl/cu126
pip install pytorch-lightning==1.9.5 deepspeed wandb ninja --upgrade
📚 Pre-training and Fine-tuning
Latest stable version is VisualRWKV-v7/v7.00. Please navigate to the VisualRWKV-v7/v7.00 directory for running the code.
VisualRWKV training consists of two stages:
- Pre-training: Using a pretrain dataset to train a projection layer from a frozen pretrained vision encoder to the frozen RWKV.
- Fine-tuning: Using visual instruction data to teach the model to follow visual instructions.
🔥 Pre-training
Download LLaVA-Pretrain Dataset
You can download the LLaVA-Pretrain.
Download RWKV Checkpoints for Pre-training
If you want to pretrain the model yourself, download the following RWKV checkpoints.
| VisualRWKV Version | RWKV 0B1 | RWKV 0B4 | RWKV 1B5 | RWKV 3B | RWKV 7B | | --- | --- | --- | --- |--- | --- | | VisualRWKV-v6 | - | - | RWKV-x060-World-1B6 | RWKV-x060-World-3B | RWKV-x060-World-7B | | VisualRWKV-v700 | RWKV-x070-World-0B1 | RWKV-x070-World-0B4 | RWKV-x070-World-1B5 | - | - |
Pre-training Command
To pretrain the VisualRWKV-v7.0 model (example for using 4 GPUs with a 1B5 RWKV model): please refer to pretrain script
🔧 Visual Instruction Tuning
Prepare Data
Refer to the LLaVA project for visual instruction data.
Fine-tuning Command
To fine-tune the VisualRWKV-v7.0 model, please refer to fine-tune script
