COMODO
Official Repo for COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
Baiyu Chen<sup>1,2</sup>, Wilson Wongso<sup>1,2</sup>, Zechen Li<sup>1</sup>, Yonchanok Khaokaew<sup>1,2</sup>, Hao Xue<sup>1,2</sup>, and Flora Salim<sup>1,2</sup>
<sup>1</sup> School of Computer Science and Engineering, University of New South Wales, Sydney, Australia<br/> <sup>2</sup> ARC Centre of Excellence for Automated Decision-Making and Society (ADM+S)
<p align="center"> <img src="assets/logo.png" height="100"> </p>

🌟 Overview
COMODO is an open-source framework for cross-modal video-to-IMU distillation for efficient egocentric human activity recognition.
🔑 The key features of COMODO:
- Self-supervised Cross-modal Knowledge Transfer: We propose COMODO, a cross-modal self-supervised distillation framework that leverages pretrained video and time-series models, enabling label-free knowledge transfer from a stronger modality with richer training data (video) to a weaker modality with limited data (IMU).
- A Self-supervised and Effective Cross-modal Queuing Mechanism: We introduce a cross-modal FIFO queue that maintains video embeddings as a stable and diverse reference distribution for IMU feature distillation, extending the instance-queue distribution-learning approach from single modality to cross-modality (see the sketch after this list).
- Teacher-Student Model Agnostic: COMODO supports diverse video and time-series pretrained models, enabling flexible teacher-student configurations and future integration with stronger foundation models.
- Cross-dataset Generalization: We demonstrate that COMODO maintains superior performance even when evaluated on unseen datasets, outperforming fully supervised models and highlighting its robustness and generalizability for egocentric HAR tasks.
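To make the queuing mechanism concrete, here is a minimal PyTorch-style sketch of a cross-modal FIFO queue. All names, sizes, and details are illustrative assumptions, not the repository's actual API:

```python
import torch

class CrossModalQueue:
    """Illustrative FIFO queue holding recent teacher (video) embeddings.

    The queue serves as a stable, diverse reference distribution against
    which student (IMU) features are compared during distillation.
    """

    def __init__(self, dim: int = 256, size: int = 4096):
        self.size = size
        self.ptr = 0
        # Initialize with random unit vectors so early batches are well-defined.
        self.queue = torch.nn.functional.normalize(torch.randn(size, dim), dim=1)

    @torch.no_grad()
    def enqueue(self, video_emb: torch.Tensor) -> None:
        """Insert a batch of video embeddings, overwriting the oldest entries."""
        n = video_emb.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.size  # wrap-around indices
        self.queue[idx] = torch.nn.functional.normalize(video_emb, dim=1)
        self.ptr = (self.ptr + n) % self.size
```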
📂 Data & Results
All experimental results and ablation study findings can be found in the /results folder.
The /dataset folder contains the train, val, and test splits for each dataset, along with our preprocessing scripts. Specifically, ego4d_subset_ids.txt is a subset of all available IMU-containing IDs, which we obtained by applying the official Ego4D filter from their website. This represents the complete subset of data that we can access.
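For reference, here is a minimal sketch of how such an ID list can be applied to restrict an Ego4D metadata listing to the released subset. It assumes one video ID per line; the metadata filename and field names are illustrative assumptions:

```python
import json

# Read the released subset of IMU-containing video IDs (one ID per line).
with open("dataset/ego4d_subset_ids.txt") as f:
    subset_ids = {line.strip() for line in f if line.strip()}

# Hypothetical Ego4D metadata file; adjust the path and keys to your setup.
with open("ego4d_metadata.json") as f:
    metadata = json.load(f)

subset = [v for v in metadata["videos"] if v["video_uid"] in subset_ids]
print(f"{len(subset)} of {len(metadata['videos'])} videos are in the IMU subset")
```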
🚀 Getting started
Cross-modal Self-supervised Distillation
To run self-supervised video-to-IMU distillation, use the following command:
Note:
[ ] denotes optional parameters.
Currently supported pretrained models:
- Time-series models: MOMENT, Mantis
- Video models: VideoMAE, TimeSformer
Other pretrained models can be used with minor modifications to the code.
```bash
python train.py \
    --video_ckpt "facebook/timesformer-base-finetuned-k400" \
    --imu_ckpt "paris-noah/Mantis-8M" \
    --dataset_path "DATASET_PATH" \
    --encoded_video_path "ENCODED_VIDEO_PATH" \
    --anchor_video_path "ANCHOR_VIDEO_PATH" \
    [--queue_size QUEUE_SIZE] \
    [--student_temp STUDENT_TEMP] \
    [--teacher_temp TEACHER_TEMP] \
    [--learning_rate LR] \
    [--num_epochs EPOCH] \
    [--batch_size BS] \
    [--num_clips 0] \
    [--seed SEED] \
    [--mlp_hidden_dim MLP_HIDDEN_DIM] \
    [--mlp_output_dim MLP_OUTPUT_DIM] \
    [--reduction "concat"] \
    [--is_raw true]
```
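The --student_temp and --teacher_temp flags correspond to the temperatures in the distillation objective. Below is a hedged sketch of how such temperatures typically enter a queue-based distribution-matching loss; it follows the standard instance-queue formulation, and the repository's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def distillation_loss(imu_emb, video_emb, queue, student_temp=0.1, teacher_temp=0.05):
    """Cross-entropy between teacher and student similarity distributions
    over a shared queue of video embeddings (all shapes illustrative).

    imu_emb:   (B, D) student features from the IMU encoder
    video_emb: (B, D) teacher features from the frozen video encoder
    queue:     (K, D) FIFO queue of past video embeddings
    """
    imu_emb = F.normalize(imu_emb, dim=1)
    video_emb = F.normalize(video_emb, dim=1)

    # Similarity of each sample to every queue entry: (B, K)
    student_logits = imu_emb @ queue.t() / student_temp
    teacher_logits = video_emb @ queue.t() / teacher_temp

    # The teacher defines the target distribution; no gradients flow through it.
    teacher_probs = F.softmax(teacher_logits, dim=1).detach()
    return torch.sum(-teacher_probs * F.log_softmax(student_logits, dim=1), dim=1).mean()
```

A lower teacher temperature sharpens the target distribution, which is the usual reason these two temperatures are exposed as separate knobs.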
Unsupervised Representation Learning Evaluation
We evaluate the learned IMU representations in an unsupervised manner (see Section 3.2 of our paper): we train a Support Vector Machine (SVM) on the extracted IMU features and report classification accuracy on the test set. Run the following command to start the evaluation:
```bash
python unsupervised_rep_test.py \
    --imu_ckpt "AutonLab/MOMENT-1-small" \
    --model_path "MODEL_WEIGHT_PATH" \
    --dataset_path "DATASET_PATH"
```
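This is the standard protocol for evaluating frozen representations: fit a classifier on train-split features, then score on the test split. A minimal scikit-learn sketch, with feature extraction left abstract and all filenames and hyperparameters as illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# X_*: IMU embeddings extracted by the frozen student encoder; y_*: activity labels.
# The .npy filenames are placeholders for however you cache extracted features.
X_train, y_train = np.load("train_feats.npy"), np.load("train_labels.npy")
X_test, y_test = np.load("test_feats.npy"), np.load("test_labels.npy")

clf = SVC(C=1.0, kernel="rbf")  # hyperparameters are assumptions, not the paper's
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```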
🌍 Related Works & Baselines
There's a lot of outstanding work on time series and human activity recognition! Here's an incomplete list; check out Table 1 in our paper for comparisons with these studies on IMU-based human activity recognition:
- MOMENT: A Family of Open Time-series Foundation Models [Paper, Code, Hugging Face]
- Mantis: Lightweight Calibrated Foundation Model for User-Friendly Time Series Classification [Paper, Code, Hugging Face]
- TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis [Paper, Code]
- DLinear: Are Transformers Effective for Time Series Forecasting? [Paper, Code]
- Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [Paper, Code]
- IMU2CLIP: Language-grounded Motion Sensor Translation with Multimodal Contrastive Learning [Paper, Code]
Citation
If you find this repository useful for your research, please consider citing our paper:
```bibtex
@article{chen2025comodo,
  title={COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition},
  author={Chen, Baiyu and Wongso, Wilson and Li, Zechen and Khaokaew, Yonchanok and Xue, Hao and Salim, Flora},
  journal={arXiv preprint arXiv:2503.07259},
  year={2025}
}
```
📩 Contact
If you have any questions or suggestions, feel free to contact Baiyu (Breeze) at breeze.chen(at)unsw(dot)edu(dot)au.
