VideoSAUR
This is the code release for the paper Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities (NeurIPS 2023), by Andrii Zadaianchuk, Maximilian Seitzer and Georg Martius.
- Webpage: https://martius-lab.github.io/videosaur
- Arxiv: https://arxiv.org/abs/2306.04829
- OpenReview: https://openreview.net/forum?id=t1jLRFvBqm

Summary
Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted domains. Recently, it was shown that the reconstruction of pre-trained self-supervised features leads to object-centric representations on unconstrained real-world image datasets. Building on this approach, we propose a novel way to use such pre-trained features in the form of a temporal feature similarity loss. This loss encodes semantic and temporal correlations between image patches and is a natural way to introduce a motion bias for object discovery. We demonstrate that this loss leads to state-of-the-art performance on the challenging synthetic MOVi datasets. When used in combination with the feature reconstruction loss, our model is the first object-centric video model that scales to unconstrained video datasets such as YouTube-VIS.
Usage
Setup
First, set up the Python environment. We use Poetry for this:
poetry install
Then you can run a test configuration to check that everything works:
poetry run python -m videosaur.train tests/configs/test_dummy_image.yml
Second, to download the datasets used in this work, follow the instructions in data/README.md.
By default, datasets are expected to be contained in the folder ./data.
You can change this to the actual folder your data is in by setting the environment variable VIDEOSAUR_DATA_PATH, or by running train.py with the --data-dir option.
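As a rough mental model, the data directory resolution can be sketched as follows. This is a minimal illustration, not the actual code from train.py; in particular, the precedence of --data-dir over the environment variable is an assumption here:

```python
import os

def resolve_data_dir(cli_data_dir=None, default="./data"):
    """Pick the dataset folder: an explicit --data-dir value wins,
    then the VIDEOSAUR_DATA_PATH environment variable, then ./data."""
    if cli_data_dir is not None:
        return cli_data_dir
    return os.environ.get("VIDEOSAUR_DATA_PATH", default)
```

So if neither the option nor the environment variable is set, datasets are looked up under ./data.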
Training
Run one of the configurations in configs/videosaur, for example:
poetry run python -m videosaur.train configs/videosaur/movi_c.yml
The results are stored in a folder created under the log root folder (by default ./logs, changeable with the --log-dir argument).
If you want to continue training from a previous run, you can use the --continue argument, as in the following command:
poetry run python -m videosaur.train --continue <path_to_log_dir_or_checkpoint_file> configs/videosaur/movi_c.yml
Inference
If you want to run one of the released checkpoints (see below) on your own video, you can use the inference script with the corresponding config file:
poetry run python -m videosaur.inference --config configs/inference/movi_c.yml
In the released config, change checkpoint: path/to/videosaur-movi-c.ckpt to the actual path of your checkpoint.
For other video formats, you may need to modify the corresponding transformations in the build_inference_transform function.
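To give a feel for what such a transform does, here is a minimal sketch of a video preprocessing step in plain PyTorch. The function name, frame layout (T, H, W, C uint8), and the ImageNet normalization constants are assumptions for illustration; the real build_inference_transform in this repo may differ:

```python
import torch
import torch.nn.functional as F

def make_inference_transform(size=224):
    """Return a transform that resizes uint8 frames of shape (T, H, W, C)
    to (T, C, size, size) floats, normalized with ImageNet statistics.
    Hypothetical sketch; not the repo's actual build_inference_transform."""
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

    def transform(frames):
        x = frames.permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)
        x = F.interpolate(x, size=(size, size), mode="bilinear",
                          align_corners=False)
        return (x - mean) / std

    return transform
```

Adapting to a new video format then mostly means changing how frames are decoded into this tensor layout and which resolution the model expects.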
Results
VideoSAUR
We list the results you should roughly be able to obtain with the configs included in this repository:
| Dataset     | Model Variant    | Video ARI | Video mBO | Config                     | Checkpoint Link |
|-------------|------------------|-----------|-----------|----------------------------|-----------------|
| MOVi-C      | ViT-B/8, DINO    | 64.8      | 38.9      | videosaur/movi_c.yml       | Checkpoint      |
| MOVi-E      | ViT-B/8, DINO    | 73.9      | 35.6      | videosaur/movi_e.yml       | Checkpoint      |
| YT-VIS 2021 | ViT-B/16, DINO   | 39.5      | 29.1      | videosaur/ytvis.yml        | Checkpoint      |
| YT-VIS 2021 | ViT-B/14, DINOv2 | 39.7      | 35.6      | videosaur/ytvis_dinov2.yml | Checkpoint      |
DINOSAUR
We also include a configuration for the DINOSAUR model from our previous paper Bridging the gap to real-world object-centric learning. This configuration yields improved results compared to the DINOSAUR model in the original paper (mainly due to using DINOv2 pre-trained features). Note that there might be minor differences in the metrics, as the numbers here are computed for 224x224 masks, compared to 320x320 masks in the DINOSAUR paper.
| Dataset | Model Variant    | Image ARI | Image mBO | Config                          | Checkpoint |
|---------|------------------|-----------|-----------|---------------------------------|------------|
| COCO    | ViT-B/14, DINOv2 | 45.6      | 29.6      | dinosaur/coco_base14_dinov2.yml | Checkpoint |
Citation
If you make use of this repository, please use the following BibTeX entry to cite us:
@inproceedings{zadaianchuk2023objectcentric,
title={Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities},
author={Zadaianchuk, Andrii and Seitzer, Maximilian and Martius, Georg},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS 2023)},
year={2023},
}
License
This codebase is released under the MIT license. Some parts of the codebase were adapted from other codebases. A comment was added to the code where this is the case. Those parts are governed by their respective licenses.