# SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation
💡 Your ⭐ star means a lot to us and helps support the continuous development of this project!
## 📰 News

- 2026.03.17: This repo is released. 🔥🔥🔥

## Demo
<p align="center"> <a href="assets/demo.mp4"> <img src="assets/demo.gif" alt="SparkVSR demo"> </a> </p>

**Abstract:** Video Super-Resolution (VSR) aims to restore high-quality video frames from low-resolution (LR) inputs, yet most existing VSR approaches behave like black boxes at inference time: users cannot reliably correct unexpected artifacts and can only accept whatever the model produces. In this paper, we propose a novel interactive VSR framework, dubbed SparkVSR, that makes sparse keyframes a simple and expressive control signal. Specifically, users first super-resolve (and optionally edit) a small set of keyframes using any off-the-shelf image super-resolution (ISR) model; SparkVSR then propagates the keyframe priors to the entire video sequence while remaining grounded in the motion of the original LR video. Concretely, we introduce a keyframe-conditioned, two-stage latent-to-pixel training pipeline that fuses LR video latents with sparsely encoded HR keyframe latents to learn robust cross-space propagation and refine perceptual details. At inference time, SparkVSR supports flexible keyframe selection (manual specification, codec I-frame extraction, or random sampling) and a reference-free guidance mechanism that continuously balances keyframe adherence against blind restoration, ensuring robust performance even when reference keyframes are absent or imperfect. Experiments on multiple VSR benchmarks demonstrate improved temporal consistency and strong restoration quality, surpassing baselines by up to 24.6%, 21.8%, and 5.6% on CLIP-IQA, DOVER, and MUSIQ, respectively, enabling controllable, keyframe-driven video super-resolution. Moreover, SparkVSR is a generic interactive, keyframe-conditioned video processing framework: it can be applied out of the box to unseen tasks such as old-film restoration and video style transfer.
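To make the keyframe-selection modes mentioned above concrete, here is a minimal, hypothetical sketch of index selection. The function name `select_keyframes` and the `stride` parameter are illustrative and not part of the released code; codec I-frame extraction would additionally require a tool such as `ffprobe` and is omitted here.

```python
import random

def select_keyframes(num_frames, mode="uniform", stride=8, indices=None, seed=0):
    """Return sorted keyframe indices for one clip (illustrative only)."""
    if mode == "manual":
        # user-specified frame indices, e.g. the frames they super-resolved by hand
        return sorted(set(indices or [0]))
    if mode == "uniform":
        # one keyframe every `stride` frames, always including frame 0
        return list(range(0, num_frames, stride))
    if mode == "random":
        # random sampling with roughly the same keyframe budget
        rng = random.Random(seed)
        k = max(1, num_frames // stride)
        return sorted(rng.sample(range(num_frames), k))
    raise ValueError(f"unknown mode: {mode}")

print(select_keyframes(32, mode="uniform", stride=8))  # → [0, 8, 16, 24]
```

Whichever mode is used, the selected frames are the only ones a user needs to touch; the rest of the video is filled in by propagation.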
## Inference Pipeline

<p align="center"> <img src="assets/inference_pipeline.png"> </p>

## Training Pipeline

<p align="center"> <img src="assets/training_pipeline.png"> </p>

## 🔖 TODO
- ✅ Release inference code.
- ✅ Release pre-trained models.
- ✅ Release training code.
- ✅ Release project page.
- ⬜ Release ComfyUI.
## ⚙️ Dependencies
- Python 3.10+
- PyTorch >= 2.5.0
- Diffusers
- Other dependencies (see `requirements.txt`)
```shell
# Clone the github repo and go to the directory
git clone https://github.com/taco-group/SparkVSR
cd SparkVSR

# Create and activate conda environment
conda create -n sparkvsr python=3.10
conda activate sparkvsr

# Install all required dependencies
pip install torch==2.5.0 torchvision==0.20.0 torchaudio==2.5.0 --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txt
```
The installation commands may need to be adjusted for your platform, CUDA version, and desired PyTorch version. Please check the official PyTorch previous-versions page for more options.
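As a quick sanity check after setting up the environment, a small script like the following (hypothetical, not part of the repo) can confirm the Python version and whether PyTorch is importable before you run anything heavier:

```python
import sys
import importlib.util

def check_env(min_python=(3, 10)):
    """Return a list of human-readable problems with the environment (empty = OK)."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # Only report torch as missing; exact version pinning is handled by pip above.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch not installed (see the pip command above)")
    return problems

for p in check_env():
    print("WARNING:", p)
```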
## 📖 Contents

- [Datasets](#datasets)
- [Models](#models)
- [Training](#training)
<a name="datasets"></a>

## 📁 Datasets

### 🗳️ Train Datasets
Our model is trained on the same datasets as DOVE: HQ-VSR and DIV2K-HR. All datasets should be placed in the `datasets/train/` directory.
| Dataset  | Type  | # Videos / Images | Download      |
| -------- | ----- | ----------------- | ------------- |
| HQ-VSR   | Video | 2,055             | Google Drive  |
| DIV2K-HR | Image | 800               | Official Link |
All datasets should follow this structure:
```
datasets/
└── train/
    ├── HQ-VSR/
    └── DIV2K_train_HR/
```
### 🗳️ Test Datasets
We use several real-world and synthetic test datasets for evaluation. All datasets follow a consistent directory structure:
| Dataset | Type       | # Videos | Average Frames | Download     |
| :------ | :--------: | :------: | :------------: | :----------: |
| UDM10   | Synthetic  | 10       | 32             | Google Drive |
| SPMCS   | Synthetic  | 30       | 32             | Google Drive |
| YouHQ40 | Synthetic  | 40       | 32             | Google Drive |
| RealVSR | Real-world | 50       | 50             | Google Drive |
| MovieLQ | Old-movie  | 10       | 192            | Google Drive |
Make sure the path (`datasets/test/`) is correct before running inference.
The directory structure is as follows:
```
datasets/
└── test/
    └── [DatasetName]/
        ├── GT/        # Ground Truth: folder of high-quality frames (one per clip)
        ├── GT-Video/  # Ground Truth (video version): lossless MKV format
        ├── LQ/        # Low-quality input: folder of degraded frames (one per clip)
        └── LQ-Video/  # Low-quality input (video version): lossless MKV format
```
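Since inference expects this exact layout, a small helper like the one below (hypothetical, not shipped with the repo) can catch missing sub-folders before a run; treat the report as advisory, since some datasets may legitimately omit a folder:

```python
from pathlib import Path

EXPECTED_SUBDIRS = ["GT", "GT-Video", "LQ", "LQ-Video"]

def missing_subdirs(dataset_root):
    """Return the expected sub-folders absent under datasets/test/<DatasetName>/."""
    root = Path(dataset_root)
    return [d for d in EXPECTED_SUBDIRS if not (root / d).is_dir()]

# Report problems for every test dataset found on disk
test_root = Path("datasets/test")
if test_root.is_dir():
    for ds in sorted(test_root.iterdir()):
        if ds.is_dir() and missing_subdirs(ds):
            print(f"{ds.name}: missing {missing_subdirs(ds)}")
```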
### 📊 Dataset Preparation (Path Lists)
Before training or testing, you need to generate `.txt` files containing the relative paths of all valid video and image files in your dataset directories. These text lists act as the index for the dataloader during training and inference. Run the following commands:
```shell
# 🔹 Train datasets
python finetune/scripts/prepare_dataset.py --dir datasets/train/HQ-VSR
python finetune/scripts/prepare_dataset.py --dir datasets/train/DIV2K_train_HR

# 🔹 Test datasets
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/GT-Video
python finetune/scripts/prepare_dataset.py --dir datasets/test/UDM10/LQ-Video
# (Repeat the above for other test datasets as needed)
```
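If you want to see what these lists contain, the following is a rough, hypothetical re-implementation of the idea: walk a directory and write the relative paths of media files to a sibling `.txt` file. The real `prepare_dataset.py` may use different extensions, filtering, or output naming.

```python
from pathlib import Path

MEDIA_EXTS = {".mp4", ".mkv", ".png", ".jpg", ".jpeg"}

def write_path_list(data_dir, out_txt=None):
    """Write one relative media path per line; return the list of paths."""
    root = Path(data_dir)
    # default output: a .txt file next to the dataset directory, same name
    out = Path(out_txt) if out_txt else root.parent / f"{root.name}.txt"
    paths = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.suffix.lower() in MEDIA_EXTS
    )
    out.write_text("\n".join(paths) + "\n")
    return paths
```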
<a name="models"></a>

## 📦 Models
Our model is built upon the CogVideoX1.5-5B-I2V base model. We provide pretrained weights for SparkVSR at different training stages.
| Model Name          | Description                        | HuggingFace                 |
| :------------------ | :--------------------------------: | :-------------------------: |
| CogVideoX1.5-5B-I2V | Base model used for initialization | zai-org/CogVideoX1.5-5B-I2V |
| SparkVSR (Stage-1)  | SparkVSR Stage-1 trained weights   | JiongzeYu/SparkVSR-S1       |
| SparkVSR (Stage-2)  | SparkVSR Stage-2 final weights     | JiongzeYu/SparkVSR          |
💡 Placement of models:

- Place the base model (`CogVideoX1.5-5B-I2V`) into the `pretrained_weights/` folder.
- Place the downloaded SparkVSR weights (Stage-1 and Stage-2) into the `checkpoints/` folder.
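The table and placement note together map each HuggingFace repo to a local folder. A tiny helper like this (hypothetical; the repo's loaders may expect slightly different sub-folder names) makes the expected layout explicit, and its result can be passed as `local_dir` to `huggingface_hub.snapshot_download`:

```python
from pathlib import Path

# repo id -> parent folder, per the placement note above
MODEL_PARENT = {
    "zai-org/CogVideoX1.5-5B-I2V": "pretrained_weights",
    "JiongzeYu/SparkVSR-S1": "checkpoints",
    "JiongzeYu/SparkVSR": "checkpoints",
}

def local_dir(repo_id, root="."):
    """Expected local directory for a given HuggingFace repo id."""
    return str(Path(root) / MODEL_PARENT[repo_id] / repo_id.split("/")[-1])

print(local_dir("JiongzeYu/SparkVSR"))  # e.g. checkpoints/SparkVSR on POSIX
```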
<a name="training"></a>

## 🔧 Training

Note: Training requires 4×A100 GPUs.

⚠️ Important: The Stage-1 weight is the intermediate result of our first training stage and is trained only in latent space. We release it mainly for training-time validation and comparison. The Stage-2 model is the final SparkVSR model.
- 🔹 **Stage-1 (Latent-Space): Keyframe-Conditioned Adaptation.** Enter the `finetune/` directory and start training:

  ```shell
  cd finetune/
  bash sparkvsr_train_s1_ref.sh
  ```

  This stage adapts the base model to VSR by learning to fuse LR video latents with sparse HR keyframe latents for robust cross-space propagation.

- 🔹 **Stage-2 (Pixel-Space): Detail Refinement.** First, convert the Stage-1 checkpoint into a loadable SFT weight format:

  ```shell
  python scripts/prepare_sft_ckpt.py --checkpoint_dir ../checkpoint/SparkVSR-s1/checkpoint-10000
  ```

  (Adjust the path and step number to match your actual training output.)

  You can skip Stage-1 by downloading our SparkVSR Stage-1 weight as the starting point for Stage-2.

  Then, run the second-stage fine-tuning:

  ```shell
  bash sparkvsr_train_s2_ref.sh
  ```

  This stage refines perceptual details in pixel space, ensuring adherence to provided keyframes while maintaining strong no-reference blind SR capability when keyframes are absent or imperfect.

- Finally, convert the Stage-2 checkpoint for inference:

  ```shell
  python scripts/prepare_sft_c
  ```
