ECVA

Official repository of the paper "Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly"


Overview

We develop ECVA, a new benchmark for causation understanding of video anomalies. ECVA is the first large-scale benchmark focused on the causation of video anomalies. Compared with existing datasets, it is more comprehensive and more challenging, with much higher-quality annotations. This work is an extension of "Uncovering What, Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly" (CVPR 2024).

Installation
Benchmark and Evaluation Metric
Training Dataset Preparation
Model Training
Inference
Acknowledgement
License

Installation

To install and set up the environment, follow these steps:

git clone git@github.com:Dulpy/ECVA.git
cd ECVA
pip install -r requirements.txt

Benchmark and Evaluation Metric

Our ECVA dataset contains 2240 video clips and 6720 question-answer pairs. The videos total 88.16 hours, with an average of 8460 frames per video. Frames are extracted from the original videos at 60 FPS. The videos cover a wide range of domains.
You can download the original video data from this link: Download Original Video Data
The ECVA video data, along with its annotations, can be found at https://www.modelscope.cn/datasets/gouchenyi/ECVA/files
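As a quick sanity check on the figures quoted above, the arithmetic below derives the number of question-answer pairs per clip and the mean clip length. It is only a sketch using the numbers stated in this README; the variable names are illustrative and nothing here comes from the repository's code.

```python
# Figures quoted in the dataset description above.
num_clips = 2240
num_qa_pairs = 6720
total_hours = 88.16
reported_mean_frames = 8460
fps = 60

qa_per_clip = num_qa_pairs / num_clips              # question-answer pairs per clip
mean_clip_seconds = total_hours * 3600 / num_clips  # mean duration from total length
seconds_from_frames = reported_mean_frames / fps    # mean duration implied by frame count

print(f"QA pairs per clip: {qa_per_clip:.0f}")                             # 3
print(f"Mean clip length from total duration: {mean_clip_seconds:.1f} s")  # 141.7 s
print(f"Mean clip length from frame count: {seconds_from_frames:.1f} s")   # 141.0 s
```

The two duration estimates agree to within about a second, so the reported totals are mutually consistent at 60 FPS.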

The proposed evaluation metric

The proposed evaluation metric measures model performance along the following three aspects.

Basic Reasoning

In the Basic Reasoning part, we use a GPT model to assess whether the candidate answer covers all key phrases, and to rate the answer on its logical coherence.

Consistency

For the Consistency evaluation, we use GPT's binary judgment to score the candidate answers.

Hallucination

For the Hallucination part, we remove key frames from the video, feed it to the VLM, and compare the model's responses with and without the key frames to see how consistent they are.
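The key-frame removal step of the Hallucination check can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `drop_key_frames`, the inclusive `(start, end)` span format, and the example indices are all assumptions. How ECVA selects key frames and how the two responses are compared is not specified here.

```python
from typing import List, Sequence, Tuple


def drop_key_frames(
    frame_indices: Sequence[int],
    key_spans: Sequence[Tuple[int, int]],
) -> List[int]:
    """Return the frame indices that fall outside every key span.

    Each span is an inclusive (start, end) pair of frame indices.
    """
    def is_key(idx: int) -> bool:
        return any(start <= idx <= end for start, end in key_spans)

    return [idx for idx in frame_indices if not is_key(idx)]


# Hypothetical example: a 10-frame clip whose key frames are 3 through 5.
all_frames = list(range(10))
ablated = drop_key_frames(all_frames, key_spans=[(3, 5)])
print(ablated)  # [0, 1, 2, 6, 7, 8, 9]
```

The model would then be queried once with `all_frames` and once with `ablated`, and the two responses compared.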

Evaluate your results on ECVA

1. Reformat your results

Organize each video response into the following format:

[
  {
    "video_file": "00001.mp4",
    "prompt": "Give a detailed description of the anomalous segment in the video. Please remember to describe the details of the incident",
    "output": "your model's response to this prompt",
    "task_type": "Description",
    "human_expert_answer": "The standard answer for the task"
  }
]
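One way to produce a file in this format is to build the entries as Python dictionaries and let the `json` module handle quoting. The sketch below assumes only the five field names shown above; the helper names `make_result` and `validate_results` and the placeholder strings are hypothetical and not part of the ECVA scripts.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("video_file", "prompt", "output", "task_type", "human_expert_answer")


def make_result(video_file: str, prompt: str, output: str,
                task_type: str, human_expert_answer: str) -> Dict[str, str]:
    """Build one result entry with the five fields listed above."""
    return {
        "video_file": video_file,
        "prompt": prompt,
        "output": output,
        "task_type": task_type,
        "human_expert_answer": human_expert_answer,
    }


def validate_results(results: List[Dict[str, str]]) -> None:
    """Raise ValueError if any entry is missing a required field."""
    for i, entry in enumerate(results):
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f"entry {i} is missing fields: {missing}")


results = [
    make_result(
        video_file="00001.mp4",
        prompt="Give a detailed description of the anomalous segment in the video.",
        output="placeholder model's response",
        task_type="Description",
        human_expert_answer="placeholder reference answer",
    )
]
validate_results(results)

# json.dumps emits double-quoted keys and strings, so an apostrophe
# inside a response needs no special handling.
serialized = json.dumps(results, indent=2, ensure_ascii=False)
print(serialized)
```

Writing `serialized` to a file gives a list that any standard JSON parser can read back.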

2. Evaluate your results

Prepare the model's answers and our benchmark answers, then use the script here to score them with GPT assistance. Because GPT is used in the evaluation, you need to fill in your own API key in the relevant configuration file.

Evaluate your results on traditional metrics

1. Reformat your results as shown above

2. Evaluate your results

Prepare the model's answers and our benchmark answers, then use the script here to evaluate them using BLEU, ROUGE, BLEURT, and UniEval.
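To show what one of these overlap metrics measures, here is a minimal pure-Python ROUGE-L sketch based on the longest common subsequence of tokens. It is an illustration only and not the repository's evaluation script: it uses lowercased whitespace tokens, no stemming, and a plain F1 rather than a weighted F-measure, so its scores will not match an official ROUGE implementation exactly. The example sentences are made up.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (no stemming)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "a man breaks the car window",
    "a man smashes the car window with a hammer",
)
print(round(score, 3))  # 0.667
```

Here the common subsequence is "a man the car window" (5 tokens), giving precision 5/6 and recall 5/9.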

Training Dataset Preparation

We introduce a novel video large language model named Anomaly Shield (AnomShield), which is designed to address the three challenges presented by ECVA. You can re-organize the annotated video/image SFT data according to the following format and place the image/video data in ECVA/datasets/pretraining/ and ECVA/datasets/videosft/.

[
    {
        "id": 0,
        "video": "images/xxx.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "from": "gpt",
                "value": "The bus in the image is white and red."
            },
            ...
        ]
    },
    {
        "id": 1,
        "video": "videos/xxx.mp4",
        "conversations": [
            {
                "from": "human",
                "value": "<video>\nWhat are the main activities that take place in the video?"
            },
            {
                "from": "gpt",
                "value": "The main activities that take place in the video are the preparation of camera equipment by a man, a group of men riding a helicopter, and a man sailing a boat through the water."
            },
            ...
        ]
    },
    ...
]
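Records in this format can be generated programmatically. The sketch below is one possible helper, under stated assumptions: the name `make_sft_record`, the set of image extensions, and the choice of `<image>` versus `<video>` tag by file extension are illustrative and not taken from the ECVA code. The second answer is shortened from the sample above.

```python
import json
from pathlib import Path
from typing import Any, Dict

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed list of image extensions


def make_sft_record(idx: int, media_path: str, question: str, answer: str) -> Dict[str, Any]:
    """Build one single-turn record in the format shown above."""
    is_image = Path(media_path).suffix.lower() in IMAGE_SUFFIXES
    tag = "<image>" if is_image else "<video>"
    return {
        "id": idx,
        "video": media_path,  # the sample above uses the "video" key for images too
        "conversations": [
            {"from": "human", "value": f"{tag}\n{question}"},
            {"from": "gpt", "value": answer},
        ],
    }


records = [
    make_sft_record(0, "images/xxx.jpg",
                    "What are the colors of the bus in the image?",
                    "The bus in the image is white and red."),
    make_sft_record(1, "videos/xxx.mp4",
                    "What are the main activities that take place in the video?",
                    "A man prepares camera equipment."),
]
print(json.dumps(records, indent=4))
```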

Model Training

1. Prepare CLIP and Mistral Weights

For the vision encoder, AnomShield uses the CLIP series, as most multimodal large models do. You can download the related pre-trained weights from openai/clip-vit-large-patch14.
For the base model, we utilize the powerful Mistral series to help analyze the video content and provide reliable, accurate answers. You can download the related pre-trained weights from mistralai/Mistral-7B-Instruct-v0.2.

2. Pretrain Command

cd ECVA/scripts/vllava/mamba/
./pretrain.sh

3. Video SFT Command

cd ECVA/scripts/vllava/mamba/
./finetune.sh

Inference

Video/Image Inference. We have inherited the inference code from VideoLLaMA2. You can refer to inference.ipynb to run model inference; prepare the relevant model weights first, following the instructions in the script.


cd ECVA/
jupyter notebook inference.ipynb

Acknowledgement

The codebase of ECVA is adapted from VideoLLaMA2. We are grateful for the foundational work done by the VideoLLaMA2 team, which has significantly contributed to the development of this project.

License

Cite

If you find our work useful for your research, please consider citing.


@article{du2024exploring,
  title={Exploring What Why and How: A Multifaceted Benchmark for Causation Understanding of Video Anomaly},
  author={Du, Hang and Nan, Guoshun and Qian, Jiawen and Wu, Wangchenhui and Deng, Wendi and Mu, Hanqing and Chen, Zhenyan and Mao, Pengxuan and Tao, Xiaofeng and Liu, Jun},
  journal={arXiv preprint arXiv:2412.07183},
  year={2024}
}
