# RepLAI

Self-supervised algorithm for learning representations from egocentric video data. Code is tested on EPIC-Kitchens-100 and Ego4D in PyTorch. (NeurIPS 2022)
## Learning State-Aware Visual Representations from Audible Interactions (NeurIPS 2022)

Code release for the paper "Learning State-Aware Visual Representations from Audible Interactions". This repo contains the PyTorch implementation and pre-trained models.

Authors: Himangi Mittal, Pedro Morgado, Unnat Jain, Abhinav Gupta

[Arxiv (Paper + Supplementary)](https://arxiv.org/abs/2209.13583)
## Introduction

We propose a self-supervised algorithm to learn representations from egocentric video data. Recently, significant efforts have been made to capture humans interacting with their own environments as they go about their daily activities. As a result, several large egocentric datasets of interaction-rich multi-modal data have emerged. However, learning representations from these videos can be challenging. First, given the uncurated nature of long-form continuous videos, learning effective representations requires focusing on the moments in time when interactions take place. Second, visual representations of daily activities should be sensitive to changes in the state of the environment, yet current successful multi-modal learning frameworks encourage representation invariance over time. To address these challenges, we leverage audio signals to identify moments of likely interactions, which are conducive to better learning. We also propose a novel self-supervised objective that learns from audible state changes caused by interactions. We validate these contributions extensively on two large-scale egocentric datasets, EPIC-Kitchens-100 and the recently released Ego4D, and show improvements on several downstream tasks, including action recognition, long-term action anticipation, and object state change classification.

For more details, please refer to our paper.
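The "moments of likely interaction" (MoI) are found by looking for peaks in the audio signal. As a rough illustration of the idea (not the paper's implementation, which operates on log-mel spectrograms; window size, peak count, and spacing below are illustrative), a minimal NumPy sketch that picks the loudest non-overlapping short-time-energy windows:

```python
import numpy as np

def moments_of_interaction(audio, sr, win=0.05, top_k=3, min_gap=0.5):
    """Return timestamps (seconds) of the top-k short-time-energy peaks.

    Hypothetical sketch of MoI sampling: short-time energy stands in for
    the spectrogram-based peak detection used in the paper.
    """
    hop = int(win * sr)
    n = len(audio) // hop
    energy = np.array([np.sum(audio[i * hop:(i + 1) * hop] ** 2) for i in range(n)])
    picked = []
    for idx in np.argsort(energy)[::-1]:          # loudest windows first
        t = idx * win
        if all(abs(t - p) >= min_gap for p in picked):  # keep peaks apart
            picked.append(t)
        if len(picked) == top_k:
            break
    return sorted(picked)

# toy signal: silence with two bursts at ~1.0 s and ~2.5 s
sr = 16000
audio = np.zeros(sr * 4)
audio[sr:sr + 800] = 1.0
audio[int(2.5 * sr):int(2.5 * sr) + 800] = 1.0
print(moments_of_interaction(audio, sr, top_k=2))
```

Sampling training clips around such peaks is what lets the objective focus on interaction-rich moments rather than uniformly sampled video.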

## Citation

If you find our work useful in your research, please cite:

```bibtex
@article{mittal2022learning,
  title={Learning State-Aware Visual Representations from Audible Interactions},
  author={Mittal, Himangi and Morgado, Pedro and Jain, Unnat and Gupta, Abhinav},
  journal={arXiv preprint arXiv:2209.13583},
  year={2022}
}
```
## Installation

(a) Clone the repository:

```bash
git clone https://github.com/HimangiM/RepLAI.git
```

(b) Install dependencies by setting up a conda environment:

```bash
conda env create -f environment.yml
```
## Self-Supervised Training

### Training on EPIC-KITCHENS-100

Run the following command. Change `logging.name` to create a new log directory, `environment.data_dir` to set the logs/dataset root path, and `data.args.base_path` to set the dataset path.

```bash
sh commands/command_train_epic_kitchens.sh
```
### Training on Ego4D

```bash
python main.py --config-name default_ssrl -m logging.name=train_replai \
  data=ego4D_10s_audiopeak backbone/video=avid_r2plus1d_18 backbone/audio=avid_spec_cnn_9 \
  optim.batch_size=512 environment.slurm=False environment.world_size=-1 \
  environment.multiprocessing_distributed=False environment.distributed=True \
  environment.dist_url=env:// environment.rank=-1 environment.ngpu=8 environment.workers=96 \
  environment.data_dir=./experiments/ logging.save_freq=10 optim.epochs=100 optim.args.lr=0.05 \
  criterion.args.clr_coeff=0.5 criterion.args.aot_coeff=0.5 data.args.base_path=ego4d_data/ \
  data.args.delta_non_overlap=0.1 optim.use_lr_scheduler=True optim.lr_scheduler_args.max_lr=0.05 \
  optim.lr_scheduler_args.total_steps=100 backbone.video.args.pretrained=True \
  backbone.audio.args.pretrained=True
```
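The `criterion.args.clr_coeff` and `criterion.args.aot_coeff` overrides weight the two self-supervised objectives against each other (the audio-visual contrastive term and the audible-state-change term). A hedged NumPy sketch of how such coefficients typically combine an InfoNCE-style contrastive loss; the function names and the plain InfoNCE form are illustrative, not the repo's criterion:

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.07):
    """Toy InfoNCE between paired embeddings (row i of z_a matches row i of z_b)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # positives on the diagonal

def combined_loss(l_clr, l_astc, clr_coeff=0.5, aot_coeff=0.5):
    # Mirrors the clr_coeff / aot_coeff overrides: a weighted sum of the two terms.
    return clr_coeff * l_clr + aot_coeff * l_astc

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))                  # toy video embeddings
a = v + 0.01 * rng.normal(size=(4, 8))       # nearly matched audio embeddings
loss = combined_loss(info_nce(v, a), info_nce(a, v))
print(float(loss))   # small, since matched pairs dominate the softmax
```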
## Evaluation

### Evaluation on EPIC-KITCHENS-100

Run the following command to evaluate the downstream task of action recognition using a linear classifier. Change `logging.name` to load the pre-trained model, `environment.data_dir` to set the logs/dataset root path, and `logging.suffix` to create new logs for evaluations.

```bash
sh commands/command_eval_epic_kitchens.sh
```
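The linear-classifier protocol trains only a linear layer on top of frozen backbone features. A minimal NumPy sketch of such a linear probe (softmax regression on pre-extracted features; the actual evaluation uses the PyTorch training loop in this repo, and the toy data here is synthetic):

```python
import numpy as np

def train_linear_probe(feats, labels, n_cls, lr=0.5, epochs=200):
    """Softmax regression on frozen features: the backbone is never updated."""
    n, d = feats.shape
    W = np.zeros((d, n_cls))
    onehot = np.eye(n_cls)[labels]
    for _ in range(epochs):
        logits = feats @ W
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        W -= lr * feats.T @ (probs - onehot) / n       # cross-entropy gradient step
    return W

rng = np.random.default_rng(0)
# two well-separated clusters standing in for frozen video embeddings
feats = np.vstack([rng.normal(-2, 1, (50, 16)), rng.normal(2, 1, (50, 16))])
labels = np.array([0] * 50 + [1] * 50)
W = train_linear_probe(feats, labels, n_cls=2)
acc = np.mean((feats @ W).argmax(1) == labels)
print(acc)
```

Probe accuracy on frozen features is a standard proxy for representation quality, which is what the Top1/Top5 numbers in the tables below report.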
### Evaluation on Ego4D

#### State change classification and point-of-no-return temporal localization

```bash
cd hands-and-objects/state-change-localization-classification/i3d-resnet50
python train.py --cfg configs/ssrl_keyframe_loc_release1-v2_main-experiment.yaml \
  --extra-args MISC.OUTPUT_DIR ./log/outputs/state-change-localization-classification/run2 \
  MISC.NUM_GPUS 1 MISC.NUM_SHARDS 8 DATA_LOADER.NUM_WORKERS 4 TRAIN.BATCH_SIZE 32 \
  SOLVER.ACCELERATOR ddp SOLVER.BASE_LR 0.0001 MODEL.PRETRAINED /checkpoints/replai
```
#### Action recognition

```bash
cd ego4d-forecasting
tools/long_term_anticipation/ego4d_recognition.sh ${PRETRAINED_DIR}
```
#### Long-term anticipation

```bash
cd ego4d-forecasting
tools/long_term_anticipation/ego4d_forecasting.sh ${PRETRAINED_DIR}
```
## Pre-trained models

We provide checkpoints for pre-trained models.

### EPIC-KITCHENS-100

| Method | Top1 Acc (Verb) | Top1 Acc (Noun) | Top5 Acc (Verb) | Top5 Acc (Noun) | Model |
|--------|-----------------|-----------------|-----------------|-----------------|-------|
| RepLAI w/o AStC | 29.29 | 9.67 | 73.33 | 29.54 | url |
| RepLAI w/o MoI | 28.71 | 8.33 | 73.17 | 27.29 | url |
| RepLAI (scratch) | 25.75 | 8.12 | 71.25 | 27.29 | url |
| RepLAI | 31.71 | 11.25 | 73.54 | 30.54 | url |
### Ego4D

| Method | StCC: Acc | AR: Top1 Acc (Verb) | AR: Top1 Acc (Noun) | LTA: ED@(Z=20) (Verb) | LTA: ED@(Z=20) (Noun) | PNR: Err | Model |
|--------|-----------|---------------------|---------------------|-----------------------|-----------------------|----------|-------|
| RepLAI w/o AStC | 63.60 | 21.1 | 13.5 | 0.774 | 0.853 | 0.795 | url |
| RepLAI w/o MoI | 62.90 | 19.8 | 11.2 | 0.792 | 0.868 | 0.801 | url |
| RepLAI (scratch) | 66.20 | 22.2 | 14.1 | 0.760 | 0.840 | 0.775 | url |
| RepLAI | 66.30 | 22.5 | 14.7 | 0.755 | 0.834 | 0.772 | url |

StCC: State Change Classification; AR: Action Recognition; LTA: Long-Term Anticipation; PNR: Point-of-no-return temporal localization.