ECVA
Official repository of the paper "Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly"
Overview
We develop ECVA, a new benchmark for causation understanding of video anomaly. ECVA is the first large-scale benchmark focused on the causation of video anomalies. Compared with existing datasets, it is more comprehensive and more challenging, with higher-quality annotations. This work extends "Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly" (CVPR 2024).
Table of Contents
- Installation
- Benchmark and Evaluation Metric
- Training Dataset Preparation
- Model Training
- Inference
- Acknowledgement
- License
Installation
To install and set up the environment, follow these steps:
git clone git@github.com:Dulpy/ECVA.git
cd ECVA
pip install -r requirements.txt
Benchmark and Evaluation Metric
- We develop ECVA, a new benchmark for causation understanding of video anomaly. ECVA is the first large-scale benchmark focused on the causation of video anomalies. Compared with existing datasets, it is more comprehensive and more challenging, with higher-quality annotations.
- The ECVA dataset contains 2240 video clips and 6720 question-answer pairs. The videos total 88.16 hours, with an average of 8460 frames per video; frames are extracted from the original videos at 60 FPS. The videos cover a wide range of domains.
- You can download the original video data from this link: Download Original Video Data
- The ECVA video data and its annotations can be found at https://www.modelscope.cn/datasets/gouchenyi/ECVA/files
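As a quick sanity check, the statistics quoted above can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated in this section; the variable names are illustrative.

```python
# Figures quoted in the dataset description above.
NUM_CLIPS = 2240
NUM_QA_PAIRS = 6720
TOTAL_HOURS = 88.16
AVG_FRAMES = 8460
FPS = 60

# Average number of question-answer pairs per clip.
qa_per_clip = NUM_QA_PAIRS / NUM_CLIPS

# Mean clip length in seconds, derived two independent ways.
avg_seconds_from_total = TOTAL_HOURS * 3600 / NUM_CLIPS
avg_seconds_from_frames = AVG_FRAMES / FPS

print(f"QA pairs per clip: {qa_per_clip:.0f}")
print(f"Mean clip length (from total hours): {avg_seconds_from_total:.1f} s")
print(f"Mean clip length (from frame count): {avg_seconds_from_frames:.1f} s")
```

Both estimates give a mean clip length of roughly 141 seconds, so the reported duration, frame count, and frame rate are mutually consistent, and the question-answer pairs average three per clip.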

The proposed evaluation metric
The proposed evaluation metric measures model performance through the following three aspects.
- Basic Reasoning: We use the GPT model to assess whether the candidate answers cover all key phrases, and to rate the answers on their logical coherence.
- Consistency: We use GPT as a binary judge to score the candidate answers.
- Hallucination: We remove key frames from the video, feed it to the VLMs, and observe how consistent the model's responses are with and without the key frames.
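The key-frame removal behind the Hallucination aspect can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: how ECVA marks key frames is not described here, so the inclusive `(start, end)` frame-index pair and the function name `drop_key_frames` are assumptions.

```python
from typing import List, Tuple


def drop_key_frames(num_frames: int, key_segment: Tuple[int, int]) -> List[int]:
    """Return the indices of frames that lie outside the key segment.

    key_segment is an inclusive (start, end) pair of frame indices
    (an assumed representation, chosen for illustration only).
    """
    start, end = key_segment
    if not 0 <= start <= end < num_frames:
        raise ValueError("key_segment must lie inside the video")
    return [i for i in range(num_frames) if i < start or i > end]


# A 10-frame clip whose key frames are 3..6 keeps frames 0-2 and 7-9.
kept = drop_key_frames(10, (3, 6))
print(kept)  # [0, 1, 2, 7, 8, 9]
```

The model would then be queried once on the full clip and once on the kept frames, and the two responses compared.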

Evaluate your results on ECVA
1. Reformat your results
For each video response, you need to organize it into the following format:
[
{
"video_file": "00001.mp4",
"prompt": "Give a detailed description of the anomalous segment in the video. Please remember to describe the details of the incident",
"output": "your model's response to this prompt",
"task_type": "Description",
"human_expert_answer": "The standard answer for the task"
}
]
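One way to produce a file in this format is sketched below. The five keys come from the format above; the helper names `make_record` and `validate`, and the output file name mentioned in the final comment, are hypothetical.

```python
import json

# Keys taken from the result format shown above.
REQUIRED_KEYS = ("video_file", "prompt", "output", "task_type", "human_expert_answer")


def make_record(video_file, prompt, output, task_type, human_expert_answer):
    """Build one result entry (hypothetical helper, not part of the repository)."""
    return {
        "video_file": video_file,
        "prompt": prompt,
        "output": output,
        "task_type": task_type,
        "human_expert_answer": human_expert_answer,
    }


def validate(records):
    """Raise ValueError if any entry is missing a required key."""
    for idx, rec in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in rec]
        if missing:
            raise ValueError(f"record {idx} is missing keys: {missing}")


records = [
    make_record(
        "00001.mp4",
        "Give a detailed description of the anomalous segment in the video. "
        "Please remember to describe the details of the incident",
        "your model's response to this prompt",
        "Description",
        "The standard answer for the task",
    )
]
validate(records)

# json.dumps emits double-quoted strings, so the result is valid JSON.
serialized = json.dumps(records, indent=2, ensure_ascii=False)
print(serialized)
# To save it, write `serialized` to a file of your choice, e.g. results.json (assumed name).
```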
2. Evaluate your results
Prepare the model's answers and our benchmark answers, then use the script here to score them with GPT assistance. Because GPT is used in the evaluation, you need to fill in your own API key in the relevant configuration file.
Evaluate your results on traditional metrics
1. Reformat your results as shown above
2. Evaluate your results
Prepare the model's answers and our benchmark answers, then use the script here to evaluate them using BLEU, ROUGE, BLEURT and UniEval.
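To give a feel for what an overlap metric such as ROUGE measures, here is a simplified ROUGE-L F1 computed over lowercased whitespace tokens. It is a self-contained illustration under those assumptions, not the evaluation script referenced above, and its scores will not necessarily match an official ROUGE implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat", "the dog sat")
print(f"{score:.3f}")  # 0.667
```

In practice the model's `output` field would be the candidate and `human_expert_answer` the reference.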
Training Dataset Preparation
We introduce a novel video large language model named Anomaly Shield (AnomShield), designed to address the three challenges presented by ECVA. To prepare the training data, re-organize the annotated video/image SFT data into the following format and place the image/video data under ECVA/datasets/pretraining/ and ECVA/datasets/videosft/.
[
{
"id": 0,
"video": "images/xxx.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat are the colors of the bus in the image?"
},
{
"from": "gpt",
"value": "The bus in the image is white and red."
},
...
]
},
{
"id": 1,
"video": "videos/xxx.mp4",
"conversations": [
{
"from": "human",
"value": "<video>\nWhat are the main activities that take place in the video?"
},
{
"from": "gpt",
"value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
},
...
]
},
...
]
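Entries in this format can be generated programmatically. The sketch below mirrors the keys shown in the example above; the helper name `make_sample`, the choice of media token by file extension, and placing the token only on the first human turn are assumptions made for illustration.

```python
import json


def make_sample(sample_id, media_path, turns):
    """Build one SFT entry from a list of (question, answer) pairs.

    Hypothetical helper: the media token is chosen from the file extension
    and is prepended to the first human turn only (both are assumptions).
    """
    token = "<video>" if media_path.endswith(".mp4") else "<image>"
    conversations = []
    for i, (question, answer) in enumerate(turns):
        prefix = f"{token}\n" if i == 0 else ""
        conversations.append({"from": "human", "value": prefix + question})
        conversations.append({"from": "gpt", "value": answer})
    return {"id": sample_id, "video": media_path, "conversations": conversations}


samples = [
    make_sample(
        0,
        "images/xxx.jpg",
        [("What are the colors of the bus in the image?",
          "The bus in the image is white and red.")],
    ),
    make_sample(
        1,
        "videos/xxx.mp4",
        [("What are the main activities that take place in the video?",
          "A man prepares camera equipment.")],
    ),
]

print(json.dumps(samples, indent=2))
```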
Model Training
1. Prepare CLIP and Mistral Weights
- For the vision encoder, like most multimodal large models, AnomShield uses the CLIP series. You can download the pre-trained weights from openai/clip-vit-large-patch14.
- For the base model, we use the Mistral series to analyze the video content and provide reliable, accurate answers. You can download the pre-trained weights from mistralai/Mistral-7B-Instruct-v0.2.
2. Pretrain Command
cd ECVA/scripts/vllava/mamba/
./pretrain.sh
3. Video SFT Command
cd ECVA/scripts/vllava/mamba/
./finetune.sh
Inference
Video/Image Inference. We have inherited the inference code from VideoLLaMA2. Refer to inference.ipynb to run model inference; you need to prepare the relevant model weights according to the instructions in the script.
cd ECVA/
jupyter notebook inference.ipynb
Acknowledgement
The codebase of ECVA is adapted from VideoLLaMA2. We are grateful for the foundational work done by the VideoLLaMA2 team, which has significantly contributed to the development of this project.
License
Cite
If you find our work useful for your research, please consider citing.
@article{du2024exploring,
title={Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly},
author={Du, Hang and Nan, Guoshun and Qian, Jiawen and Wu, Wangchenhui and Deng, Wendi and Mu, Hanqing and Chen, Zhenyan and Mao, Pengxuan and Tao, Xiaofeng and Liu, Jun},
journal={arXiv preprint arXiv:2412.07183},
year={2024}
}
