# VideoPrism: A Foundational Visual Encoder for Video Understanding

Official repository for "VideoPrism: A Foundational Visual Encoder for Video Understanding" (ICML 2024).
VideoPrism is a general-purpose video encoder designed to handle a wide spectrum of video understanding tasks, including classification, retrieval, localization, captioning, and question answering. It is pre-trained on a massive and diverse dataset: 1 billion image-text pairs from WebLI, 36 million high-quality video-text pairs, and 582 million video clips with noisy or machine-generated parallel text (subject to data wipeout). The pre-training approach is designed for this hybrid data, learning both from video-text pairs and from the videos themselves. VideoPrism is easy to adapt to new video understanding tasks and achieves state-of-the-art performance on 31 out of 33 public video understanding benchmarks with a single frozen model.
This repository releases the model weight checkpoints and hosts JAX/Flax utility functions for checkpoint loading and model inference.
## Updates

- [Mar-13-26]: Added video classification fine-tuning with the frozen backbone [Colab notebook]. :fire::fire:
- [Jul-16-25]: Released VideoPrism video-text encoders for cross-modal retrieval [Colab notebook]. :fire::fire:
- [Jun-15-25]: Added models to [Hugging Face].
- [Jun-05-25]: Added video encoder demo [Colab notebook].
- [Jun-03-25]: Released VideoPrism video encoders (Base and Large) [Blog] [Paper]. :fire::fire:
## TODOs
- [ ] Add PyTorch model support.
## Getting started

You will need Python 3.9 or later. Download the code from GitHub and run:

```shell
$ git clone https://github.com/google-deepmind/videoprism.git
$ cd videoprism
$ pip install .
```
Get started with the following example code for model checkpoint loading and inference, or use the Colab notebook for video encoders / Colab notebook for video-text encoders:
```python
import jax
from videoprism import models as vp

# Video encoders.
model_name = 'videoprism_public_v1_base'  # configuration name
flax_model = vp.get_model(model_name)
loaded_state = vp.load_pretrained_weights(model_name)

@jax.jit
def forward_fn(inputs):
  return flax_model.apply(loaded_state, inputs, train=False)

video_inputs = ...  # Shape = [batch_size, num_frames, height, width, 3].
outputs, _ = forward_fn(video_inputs)  # Shape = [batch_size, num_tokens, feature_channels].

# Video-text encoders.
model_name = 'videoprism_lvt_public_v1_base'  # configuration name
flax_model = vp.get_model(model_name)
loaded_state = vp.load_pretrained_weights(model_name)
text_tokenizer = vp.load_text_tokenizer('c4_en')

@jax.jit
def forward_fn(inputs, text_token_ids, text_token_paddings, train=False):
  return flax_model.apply(
      loaded_state,
      inputs,
      text_token_ids,
      text_token_paddings,
      train=train,
  )

video_inputs = ...  # Shape = [batch_size, num_frames, height, width, 3].
text_queries = ...  # A list of input text queries.
text_ids, text_paddings = vp.tokenize_texts(text_tokenizer, text_queries)
video_embeddings, text_embeddings, _ = forward_fn(
    video_inputs, text_ids, text_paddings)  # Shape = [batch_size, feature_channels].
```
## Video classification example
We provide a Colab notebook for video classification to show how to fine-tune VideoPrism for video classification by keeping the pre-trained backbone frozen and training only a lightweight attention-pooler + projection head.
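The pooler-plus-head design can be sketched in plain NumPy. Note this is an illustrative sketch, not the notebook's actual code: the function and parameter names below are hypothetical, and it shows only the forward pass of a single learned query attending over frozen backbone tokens, followed by a linear projection to class logits.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool_classify(tokens, query, w_proj, b_proj):
    """Pool frozen backbone tokens with one learned query, then classify.

    tokens: [batch, num_tokens, d]  frozen VideoPrism features
    query:  [d]                     learned pooling query
    w_proj: [d, num_classes], b_proj: [num_classes]  projection head
    """
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)        # [batch, num_tokens]
    attn = softmax(scores, axis=-1)             # attention weights over tokens
    pooled = (attn[..., None] * tokens).sum(1)  # [batch, d] pooled feature
    return pooled @ w_proj + b_proj             # [batch, num_classes] logits

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4096, 768)).astype(np.float32)
logits = attention_pool_classify(
    tokens, rng.normal(size=768), rng.normal(size=(768, 10)), np.zeros(10))
print(logits.shape)  # (2, 10)
```

Only `query`, `w_proj`, and `b_proj` would receive gradients during fine-tuning; the backbone that produces `tokens` stays frozen.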
## Released models
We release the following model variants:
| Model Name | Configuration Name | Model Type | Backbone | #Params | File Size | Checkpoint |
| -------- | -------- | ------- | :-------: | :-------: | :-------: | :-------: |
| VideoPrism-B | videoprism_public_v1_base | Video encoder | ViT-B | 114M | 458MB | link |
| VideoPrism-L | videoprism_public_v1_large | Video encoder | ViT-L | 354M | 1.42GB | link |
| VideoPrism-LvT-B | videoprism_lvt_public_v1_base | Video-text encoders | ViT-B | 248M | 991MB | link |
| VideoPrism-LvT-L | videoprism_lvt_public_v1_large | Video-text encoders | ViT-L | 580M | 2.30GB | link |
Video encoders take videos with shape `(batch_size, num_frames, 288, 288, 3)`
as inputs and output embeddings with shape
`(batch_size, num_frames * 16 * 16, feature_channels)`, which can be reshaped
into `(batch_size, num_frames, 16, 16, feature_channels)` for spatiotemporal
representations. During model training, `num_frames` is set to 16 for
VideoPrism-B and 8 for VideoPrism-L. Both models are expected to work
with arbitrary `num_frames` by interpolating the temporal positional embeddings.
The RGB values of input videos should be normalized to [0.0, 1.0].
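As a concrete illustration of the shapes above (assuming a 16-frame clip and the Base model's 768 feature channels; the arrays are placeholders, not real model outputs):

```python
import numpy as np

batch_size, num_frames, channels = 2, 16, 768

# Raw frames typically arrive as uint8 RGB; scale into [0.0, 1.0] first.
raw_video = np.random.randint(
    0, 256, size=(batch_size, num_frames, 288, 288, 3), dtype=np.uint8)
video_inputs = raw_video.astype(np.float32) / 255.0

# The encoder emits one token per (frame, 16x16 spatial patch) position,
# so the flat token sequence reshapes into a spatiotemporal grid.
outputs = np.zeros((batch_size, num_frames * 16 * 16, channels))
spatiotemporal = outputs.reshape(batch_size, num_frames, 16, 16, channels)
print(spatiotemporal.shape)  # (2, 16, 16, 16, 768)
```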
In video-text models, both the video and text encoders produce global embeddings
with shape `(batch_size, feature_channels)`, whose similarities can be measured
with cosine similarity. We use the `c4_en` SentencePiece model for text
tokenization. During inference, the embedding calculation for either modality
can be skipped by passing `None` as the corresponding input.
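For retrieval, the global embeddings are typically L2-normalized and compared pairwise; a minimal sketch (embedding contents below are random placeholders):

```python
import numpy as np

def cosine_similarity_matrix(video_emb, text_emb):
    """Pairwise cosine similarity between video and text embeddings.

    video_emb: [num_videos, d], text_emb: [num_texts, d]
    returns:   [num_videos, num_texts]
    """
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    return v @ t.T

rng = np.random.default_rng(0)
sims = cosine_similarity_matrix(
    rng.normal(size=(3, 768)), rng.normal(size=(5, 768)))
print(sims.shape)                # (3, 5)
best_text = sims.argmax(axis=1)  # top-1 retrieved text per video
```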
## Results with frozen backbones

"Public" denotes models released in this repository. "Paper" and "Prior SOTA" denote our models and the previous best-performing models reported in the paper, respectively. The public models perform slightly worse than the paper models because they were pre-trained on different image-text data, subject to data policies.
### Video-focused tasks (VideoGLUE)

| Models | K400 | MiT | SSv2 | D48 | Charades | ActivityNet | AVA | AVA-K |
| -------- | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| VideoPrism-B (public) | 82.9 | 39.7 | 62.2 | 64.3 | 43.5 | 36.5 | 28.3 | 30.8 |
| VideoPrism-L (public) | 85.0 | 43.3 | 64.6 | 67.6 | 53.2 | 37.0 | 32.4 | 34.5 |
| VideoPrism-B (paper) | 84.2 | 40.8 | 63.6 | 67.4 | 40.4 | 36.6 | 30.6 | 31.8 |
| VideoPrism-g (paper) | 87.2 | 45.5 | 68.5 | 71.3 | 62.3 | 37.8 | 36.2 | 37.3 |
| Prior SOTA (B) | 77.1 | 34.0 | 58.2 | 55.6 | 33.3 | 35.8 | 21.1 | 25.9 |
| Prior SOTA (L+) | 82.8 | 40.3 | 67.4 | 69.6 | 39.9 | 36.7 | 24.4 | 26.2 |
### Zero-shot video-text retrieval

| Models | MSRVTT-1K (v2t) | MSRVTT-1K (t2v) | VATEX (v2t) | VATEX (t2v) | ActivityNet (v2t) | ActivityNet (t2v) |
| -------- | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| VideoPrism-LvT-B (public) | 49.8 | 50.1 | 73.1 | 56.2 | 47.9 | 48.8 |
| VideoPrism-LvT-L (public) | 50.6 | 50.1 | 75.0 | 57.2 | 49.1 | 51.3 |
| VideoPrism-LvT-B (paper) | 50.2 | 51.4 | 76.2 | 57.7 | 47.9 | 49.6 |
| VideoPrism-LvT-g (paper) | 51.7 | 52.7 | 77.1 | 62.5 | 50.3 | 52.7 |
| Prior SOTA (B) | - | 34.0 | - | - | - | 30.6 |
| Prior SOTA (L+) | 45.4 | 43.9 | 73.6 | 53.2 | 40.7 | 42.8 |
### Zero-shot video classification

| Models | K400 | SSv2 (Temporal) | SSv2 (Events) | NExT-QA (Hard) | Charades | Charades (STA) |
| -------- | :-------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| VideoPrism-LvT-B (public) | 69.2 | 14.6 | 11.3 | 31.1 | 26.9 | 48.6 |
| VideoPrism-LvT-L (public) | 72.4 | 18.0 | 12.4 | 32.1 | 32.4 | 50.2 |
| VideoPrism-LvT-B (paper) |