
VRT: A Video Restoration Transformer

Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, Luc Van Gool

Computer Vision Lab, ETH Zurich & Meta Inc.


arxiv | supplementary | pretrained models | visual results


This repository is the official PyTorch implementation of "VRT: A Video Restoration Transformer" (arxiv, supp, pretrained models, visual results). VRT achieves state-of-the-art performance in

  • video SR (REDS, Vimeo90K, Vid4, UDM10) :heart_eyes: +0.33~0.51 dB :heart_eyes:
  • video deblurring (GoPro, DVD, REDS) :heart_eyes: +1.47~2.15 dB :heart_eyes:
  • video denoising (DAVIS, Set8) :heart_eyes: +1.56~2.16 dB :heart_eyes:
  • video frame interpolation (Vimeo90K, UCF101, DAVIS) :heart_eyes: +0.28~0.45 dB :heart_eyes:
  • space-time video SR (Vimeo90K, Vid4) :heart_eyes: +0.26~1.03 dB :heart_eyes:
<!-- <p align="center"> <a href="https://github.com/JingyunLiang/VRT/releases"> <img width=40% src="assets/teaser_vsr.gif"/> <img width=40% src="assets/teaser_vdb.gif"/> <img width=40% src="assets/teaser_vdn.gif"/> <img width=40% src="assets/teaser_vfi.gif"/> <img width=40% src="assets/teaser_stvsr.gif"/> </a> </p> -->

(Animated example results for video SR, deblurring, denoising, frame interpolation, and space-time video SR.)

:rocket: :rocket: :rocket: News:

| Topic | Title |
|:---:|:---:|
| transformer-based image restoration | SwinIR: Image Restoration Using Swin Transformer :fire: |
| real-world image SR | Designing a Practical Degradation Model for Deep Blind Image Super-Resolution, ICCV2021 |
| normalizing flow-based image SR and image rescaling | Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling, ICCV2021 |
| blind image SR | Mutual Affine Network for Spatially Variant Kernel Estimation in Blind Image Super-Resolution, ICCV2021 |
| blind image SR | Flow-based Kernel Prior with Application to Blind Super-Resolution, CVPR2021 |


Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Unlike single image restoration, video restoration generally requires utilizing temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this with either a sliding-window strategy or a recurrent architecture, which are respectively restricted to frame-by-frame restoration or lack long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self-attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. In addition, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on three tasks, including video super-resolution, video deblurring and video denoising, demonstrate that VRT outperforms the state-of-the-art methods by large margins (up to 2.16 dB) on nine benchmark datasets.
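The clip partitioning and shifting described above can be sketched in a few lines. The toy function below is an illustration only (the name, the NumPy setting, and the clip size of 2 are assumptions, not the repository's API): alternate layers roll the sequence before partitioning, so clip boundaries move and information can flow across clips.

```python
import numpy as np

def partition_clips(frames, clip_size=2, shift=0):
    """Split a frame sequence into non-overlapping clips.

    frames: array of shape (T, ...). A nonzero `shift` rolls the sequence
    before partitioning, so that alternate layers see clip boundaries at
    different positions, enabling cross-clip interaction (the shifted-clip
    idea behind TMSA).
    """
    T = frames.shape[0]
    assert T % clip_size == 0, "sequence length must be divisible by clip size"
    rolled = np.roll(frames, -shift, axis=0)
    return rolled.reshape(T // clip_size, clip_size, *frames.shape[1:])

frames = np.arange(8)  # 8 frame indices standing in for frame features
print(partition_clips(frames, 2, shift=0).tolist())  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(partition_clips(frames, 2, shift=1).tolist())  # [[1, 2], [3, 4], [5, 6], [7, 0]]
```

With shift 0, frames 1 and 2 never share a clip; with shift 1 they do, which is how stacking shifted and unshifted layers propagates information along the whole sequence.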

<p align="center"> <img width="800" src="assets/framework.jpeg"> </p>

Contents

  1. Requirements
  2. Quick Testing
  3. Training
  4. Results
  5. Citation
  6. License and Acknowledgement

Requirements

  • Python 3.8, PyTorch >= 1.9.1
  • Requirements: see requirements.txt
  • Platforms: Ubuntu 18.04, cuda-11.1
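A typical environment setup matching the requirements above might look like the following (the environment name is arbitrary; check requirements.txt for the exact package list):

```shell
# Create an isolated environment with the stated Python version,
# then install the repository's pinned dependencies.
conda create -n vrt python=3.8 -y
conda activate vrt
pip install -r requirements.txt
```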

Quick Testing

The following commands will download the pretrained models and test datasets automatically (except the Vimeo-90K testing set). If you run out of memory, try reducing --tile at the expense of slightly decreased performance.
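The memory/quality trade-off behind --tile comes from processing the input in overlapping patches instead of all at once. A minimal 1-D sketch of the idea, with a placeholder "model" and illustrative tile sizes (not the repository's implementation):

```python
import numpy as np

def tiled_apply(x, fn, tile=128, overlap=20):
    """Apply fn to overlapping tiles of a 1-D signal x and average the overlaps.

    A smaller `tile` lowers peak memory, since fn sees less data at once,
    but gives fn less context, which can slightly reduce restoration quality.
    """
    out = np.zeros_like(x, dtype=float)
    weight = np.zeros_like(x, dtype=float)
    step = tile - overlap
    starts = list(range(0, max(len(x) - tile, 0) + 1, step))
    if starts[-1] + tile < len(x):
        starts.append(len(x) - tile)  # final tile covers the tail
    for s in starts:
        out[s:s + tile] += fn(x[s:s + tile])
        weight[s:s + tile] += 1.0
    return out / weight  # blend overlapping regions by averaging

x = np.arange(300, dtype=float)
y = tiled_apply(x, lambda t: t * 2.0)  # identical to applying fn globally here
```

In VRT the same principle applies in three dimensions (frames, height, width), which is why --tile takes three values.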

You can also try it on Colab <a href="https://colab.research.google.com/gist/JingyunLiang/deb335792768ad9eb73854a8efca4fe0#file-vrt-demo-on-video-restoration-ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="google colab logo"></a>.
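For local testing, an invocation along the following lines is typical; the task name, dataset folders, and tile values here are illustrative and should be checked against the options of the repository's main_test_vrt.py:

```shell
# Example: video SR on REDS with spatio-temporal tiling to bound GPU memory.
# --tile takes (frames, height, width); smaller values use less memory.
python main_test_vrt.py --task 001_VRT_videosr_bi_REDS_6frames \
    --folder_lq testsets/REDS4/sharp_bicubic \
    --tile 40 128 128 --tile_overlap 2 20 20
```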
