VidHOI
Official implementation of "ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos" (ACM ICMRW 2021)
Human-Object Interaction in Videos (ST-HOI/VidHOI)
<div align="center">ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos<br> [Paper] [Slides] [Video]<br><br> Meng-Jiun Chiou<sup>1</sup>, Chun-Yu Liao<sup>2</sup>, Li-Wei Wang<sup>2</sup>, Roger Zimmermann<sup>1</sup> and Jiashi Feng<sup>1</sup><br> <sup>1</sup>National University of Singapore <sup>2</sup>ASUS Intelligent Cloud Services<br><br> appears at ACM ICMR 2021 Workshop on Intelligent Cross-Data Analysis and Retrieval
</div> <div align="center"> <img src="figs/motivation.jpg" width="500"><br> </div>ST-HOI is a strong, spatial-temporal-aware human-object interaction (HOI) detection baseline. To take into account accurate spatial-temporal information, ST-HOI exploits trajectory-based features including correctly-localized visual features, spatial-temporal masking pose features and trajectory features.
VidHOI is one of the first large-scale video-based HOI detection benchmarks. Note that, in contrast to action detection datasets such as AVA/Kinetics, the interacting objects are explicitly annotated in VidHOI. We sampled and transformed video HOIs (i.e., image HOIs in continuous frames) from an existing video dataset, VidOR.
<div align="center"> <img src="figs/VidHOI_comparison.png"> </div> <div align="center"> <img src="figs/vidhoi_predicate_freq.jpg" width="800"> </div>

Note that each experiment was performed with eight NVIDIA Tesla V100 GPUs with 32 GB of memory each. Before running the training commands, make sure your GPUs have enough memory; otherwise, you may need to reduce the batch size accordingly. In contrast, the validation commands use only one GPU with less than 4 GB of GPU memory, as we evaluate with a batch size of 1.
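As a rough illustration of reducing the batch size, the sketch below keeps the per-GPU batch constant and scales the total with the number of GPUs. The reference values (a batch of 128 across 8 GPUs) come from the video baseline training command in this README; the helper name `scaled_batch_size` and the linear-scaling rule are assumptions, not part of the repository. GPUs with less than 32 GB of memory may need a smaller per-GPU batch still.

```python
def scaled_batch_size(num_gpus, ref_batch=128, ref_gpus=8):
    """Keep the per-GPU batch constant when fewer GPUs are available.

    Hypothetical helper: the linear rule is an assumption, not repository code.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    per_gpu = ref_batch // ref_gpus  # samples per GPU in the reference setup
    return per_gpu * num_gpus


# Print the override tokens one might append to a training command.
for n in (1, 2, 4, 8):
    print("NUM_GPUS", n, "TRAIN.BATCH_SIZE", scaled_batch_size(n))
```

The printed pairs follow the `KEY VALUE` override style used by the commands in the Experiments section.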
To-dos
- [ ] Automate evaluation process (instead of using vidor_eval.ipynb)
- [ ] Clean visualization tools
Installation
- Create a conda environment
conda create -n vidhoi python=3.6 scipy numpy
conda activate vidhoi
- Install PyTorch 1.4.0 and torchvision 0.5.0 following the official installation guide
- Install other requirements via
pip install -r requirements.txt
(Note: remove torch/torchvision/torchaudio and other mismatched package requirements before proceeding!)
- Install SlowFast and detectron2 following the instructions in OLD_README.md. (Note: skip the step of cloning a detectron2_repo/ from FacebookResearch. Install our provided detectron2_repo/ in this repository.)
Download VidHOI Benchmark
Please refer to Section 4.1 of our paper for more detail about the proposed benchmark.
First, download the original VidOR dataset and annotations from the official website and unzip them to $ROOT/slowfast/dataset/vidor-github. To download the VidHOI (i.e., HOI-specific) annotations, refer to the files under the same folder in this repository; for larger files, download them from here.
Files
- Sampled frame lists
    - frame_lists/train.csv
    - frame_lists/val.csv
- Human/object frame-wise annotations for training/validation
    - train_frame_annots.json
    - val_frame_annots.json
- Human/object trajectories for training/validation
    - train_trajectories.json
    - val_trajectories.json
- For removing testing frames with missing predicted boxes (during evaluation with precomputed boxes; details below)
    - val_instances_predictions_train_small_vidor_with_pseudo_labels.pth
One then needs to extract frames from VidOR videos using $ROOT/slowfast/dataset/vidor-github/extract_vidor_frames.sh.
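Before extracting frames or starting a run, it can help to confirm that the annotation files listed above are in place. The sketch below is a hypothetical helper (`missing_files` is not part of the repository), and it assumes every file sits under the dataset folder at the relative path shown in the list.

```python
from pathlib import Path

# Relative paths copied from the file list above. Their placement directly
# under the dataset folder is an assumption.
EXPECTED_FILES = [
    "frame_lists/train.csv",
    "frame_lists/val.csv",
    "train_frame_annots.json",
    "val_frame_annots.json",
    "train_trajectories.json",
    "val_trajectories.json",
    "val_instances_predictions_train_small_vidor_with_pseudo_labels.pth",
]


def missing_files(dataset_dir):
    """Return the expected files that are not present under dataset_dir."""
    root = Path(dataset_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

For example, `missing_files("slowfast/dataset/vidor-github")` returns an empty list once everything has been downloaded and unzipped.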
Notes (Important!)
Because the ST-HOI baselines evaluated with predicted trajectories (during validation) lack bounding boxes for some validation frames, we make their results comparable with those obtained using ground truth boxes by removing the testing frames for which the trajectory generation model predicted no bounding box; that is, we evaluate all baselines only on testing frames with at least one predicted box. This results in 168 fewer testing examples (22,967 -> 22,808 frames). Moreover, for models with the Spatial-Temporal Masking Pose Module, a further 1,050 of the 22,808 testing frames cannot be used because our human pose estimation model does not output any valid predicted human pose for them. For fair comparison, we evaluate only on the final 21,758 frames. This is done by changing the default value of VIDOR.TEST_PREDICT_BOX_LISTS from val_frame_annots.json to val_instances_predictions_train_small_vidor_with_pseudo_labels.pth. To validate models on all 22,967 frames (with ground truth trajectories), pass
VIDOR.TEST_PREDICT_BOX_LISTS val_frame_annots.json VIDOR.TEST_GT_LESS_TO_ALIGN_NONGT False
to configs when starting a validation session.
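The two evaluation settings differ only in the config overrides appended to the validation command. A minimal sketch, assuming a hypothetical helper `box_list_overrides` (not part of the repository) that returns the extra `KEY VALUE` tokens, followed by the frame bookkeeping from the note above:

```python
def box_list_overrides(all_frames):
    """Return extra KEY VALUE tokens for a validation command.

    Hypothetical helper, not repository code.
    """
    if all_frames:
        # Evaluate on all 22,967 frames with ground truth trajectories.
        return [
            "VIDOR.TEST_PREDICT_BOX_LISTS", "val_frame_annots.json",
            "VIDOR.TEST_GT_LESS_TO_ALIGN_NONGT", "False",
        ]
    # Otherwise rely on the defaults, which use the filtered frame set.
    return []


# Frame bookkeeping for models with the pose module, using the counts
# stated in the note above.
frames_with_boxes = 22808
frames_without_pose = 1050
frames_evaluated = frames_with_boxes - frames_without_pose
print(frames_evaluated)  # 21758
```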
Download ST-HOI Baselines
Files
To reproduce the results of the ST-HOI baselines, please download the essential files from here and put them (after unzipping, if applicable) into the same folder (vidor-github) as above.
Note that if you only want to test with ground truth trajectories, you only need to download human_poses.zip!
- det_val_trajectories.json: detected trajectories (validation split)
- det_val_frame_annots.json: detected frame-wise annotations (validation split)
- Human poses
    - human_poses.zip: generated human poses using ground truth boxes/trajectories
    - human_poses_detected-bboxes.zip: generated human poses using detected boxes/trajectories
    - vidor_training_3d_human_poses_from_VIBE.pkl: (optional) 3D human poses generated with VIBE (training split)
    - vidor_validation_3d_human_poses_from_VIBE.pkl: (optional) 3D human poses generated with VIBE (validation split)
- detection_results.zip: raw detected box results (optional, as they have been transformed into det_val_trajectories.json and det_val_frame_annots.json)
- vidvrd-mff.zip: (optional) the top-1 solution in the Relation Understanding in Videos ACM MM 2019 Grand Challenge, which includes the detected human/object trajectories used in our project. This zip file is the same as the file here.
Note that for the Detection results in Table 2, we evaluate the models (trained with ground truth boxes/trajectories) on detected boxes/trajectories. That's why we only need detected boxes/trajectories for VidHOI validation split.
Checkpoints
[Aug. 21, 2022] The file is now unavailable due to limited cloud space.
~~Trained models are provided for performance verification purpose without running training, and only 1 GPU is used during validation. Download the checkpoints from here and extract them under $ROOT/checkpoints/.~~
~~- checkpoints.zip: Final trained models' weights~~
Performance Validation
To make it easier to verify the models' performance, we have uploaded the output json files of the 2D/3D baselines and ST-HOI models (evaluated with ground truth boxes) here (under the output folder). One may directly download these files and refer to vidor_eval.ipynb for evaluation and visualization.
Experiments
First, rename the folder vidor-github under $ROOT/slowfast/dataset to vidor before running any command. The following commands use ground truth (GT) trajectories (Oracle mode) by default. To use detected trajectories, refer to the NONGT version of each model.
Second, rename the paths in defaults.py: specifically, search for aicsvidhoi1 and replace the matched paths with yours.
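To locate the paths that need changing, one option is to list every line of defaults.py that mentions aicsvidhoi1 before editing. A minimal sketch, assuming a hypothetical helper `find_placeholder_lines` (not part of the repository); the location of defaults.py within your checkout is left to the caller:

```python
from pathlib import Path


def find_placeholder_lines(path, placeholder="aicsvidhoi1"):
    """Return (line_number, text) pairs for lines that mention the placeholder.

    Hypothetical helper, not repository code.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [
        (number, line.strip())
        for number, line in enumerate(lines, start=1)
        if placeholder in line
    ]
```

Calling it on your copy of defaults.py gives the line numbers to edit; an empty result means no hard-coded paths remain.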
To check each model's final performance, including mAP, use vidor_eval.ipynb (TODO: write an automatic evaluation script).
Image Baseline (2D Model)
- Training: Run
python tools/run_net_vidor.py --cfg configs/vidor/BASELINE_32x2_R50_SHORT_SCRATCH_EVAL_GT.yaml DATA.PATH_TO_DATA_DIR slowfast/datasets/vidor NUM_GPUS 1 DATA_LOADER.NUM_WORKERS 0 TRAIN.BATCH_SIZE 128 TEST.BATCH_SIZE 1 LOG_MODEL_INFO False
- Validation: Run
python tools/run_net_vidor.py --cfg configs/vidor/BASELINE_32x2_R50_SHORT_SCRATCH_EVAL_GT.yaml DATA.PATH_TO_DATA_DIR slowfast/datasets/vidor NUM_GPUS 1 DATA_LOADER.NUM_WORKERS 0 TEST.BATCH_SIZE 1 LOG_MODEL_INFO False TRAIN.ENABLE False TEST.CHECKPOINT_FILE_PATH ./checkpoints/BASELINE_32x2_R50_SHORT_SCRATCH_EVAL_GT/checkpoint_epoch_00020.pyth TRAIN.CHECKPOINT_TYPE pytorch VIDOR.TEST_DEBUG False
- NON-GT version:
BASELINE_32x2_R50_SHORT_SCRATCH_EVAL_NONGT
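Switching between the GT and NONGT variants amounts to changing the config name in the command. The sketch below assembles a validation command from a config name, mirroring the options of the validation command above. It assumes the NONGT config lives in the same configs/vidor/ folder with a .yaml extension, which this README does not confirm; `validation_command` is a hypothetical helper, and the checkpoint path is left to the caller.

```python
import shlex


def validation_command(config_name, checkpoint_path,
                       data_dir="slowfast/datasets/vidor"):
    """Assemble a validation command as a list of tokens.

    Hypothetical helper; the config location is an assumption.
    """
    return [
        "python", "tools/run_net_vidor.py",
        "--cfg", "configs/vidor/" + config_name + ".yaml",
        "DATA.PATH_TO_DATA_DIR", data_dir,
        "NUM_GPUS", "1",
        "DATA_LOADER.NUM_WORKERS", "0",
        "TEST.BATCH_SIZE", "1",
        "LOG_MODEL_INFO", "False",
        "TRAIN.ENABLE", "False",
        "TEST.CHECKPOINT_FILE_PATH", checkpoint_path,
        "TRAIN.CHECKPOINT_TYPE", "pytorch",
        "VIDOR.TEST_DEBUG", "False",
    ]


cmd = validation_command(
    "BASELINE_32x2_R50_SHORT_SCRATCH_EVAL_NONGT",
    "path/to/checkpoint.pyth",  # placeholder, replace with a real checkpoint
)
# Quote each token so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(token) for token in cmd))
```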
Video Baseline (3D Model)
- Training: Run
python tools/run_net_vidor.py --cfg configs/vidor/SLOWFAST_32x2_R50_SHORT_SCRATCH_EVAL_GT.yaml DATA.PATH_TO_DATA_DIR slowfast/datasets/vidor NUM_GPUS 8 DATA_LOADER.NUM_WORKERS 0 TRAIN.BATCH_SIZE 128 TEST.BATCH_SIZE 1 LOG_MODEL_INFO False
- Validation: Run
python tools/run_net_vidor.py --cfg configs/vidor/SLOWFAST_32x2_R50_SHORT_SCRATCH_EVAL_GT.
