Skyra

[CVPR2026] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

<h1 align="center"> Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning </h1> <a href="https://arxiv.org/abs/2512.15693" style="margin-right: 10px;"> <img src="https://img.shields.io/badge/arXiv-2512.15693-b31b1b.svg?logo=arXiv"> </a> <a href="https://huggingface.co/collections/JoeLeelyf/skyra" style="margin-right: 10px;"> <img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Skyra-ffd21e"> </a> <a href="https://joeleelyf.github.io/Skyra/" style="margin-right: 10px;"> <img src="https://img.shields.io/badge/Project-Skyra-black?logo=github"> </a>

Abstract: The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, providing valuable insights for advancing explainable AI-generated video detection.

🌟 Introduction

🎯 Core Capabilities

Unlike traditional binary detectors or general MLLMs, Skyra focuses on Grounded Artifact Reasoning:

Artifact Perception: Identifies subtle visual anomalies (e.g., Physics Violation, Texture Jittering).
Spatio-Temporal Grounding: Pinpoints exact timestamps and bounding boxes where artifacts occur.
Explanatory Reasoning: Provides detailed Chain-of-Thought (CoT) explanations for why a video is Real or Fake.

🧩 Hierarchical Artifact Taxonomy

We define a comprehensive taxonomy to categorize AI generation errors, dividing them into Low-level Forgery (e.g., texture/color anomalies) and Violation of Laws (e.g., physical inconsistencies).

📊 Dataset: ViF-CoT-4K

ViF-CoT-4K is constructed to address the lack of detailed artifact annotations in existing datasets.

Scale: ~4,000 videos, including high-quality samples from Sora-2, Wan2.1, Kling, and more.
Annotation: Fine-grained labels including artifact type, textual explanation, timestamps, and bounding boxes.
Real-Fake Pairs: Generated videos are semantically aligned with real counterparts to prevent shortcut learning.
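The annotation fields listed above can be pictured as a small record type. The following is a minimal sketch only: the class names, field names, and the pixel-coordinate `(x1, y1, x2, y2)` box convention are assumptions made for illustration and are not taken from the released dataset schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ArtifactAnnotation:
    """One annotated artifact. All field names here are hypothetical."""
    artifact_type: str   # e.g. "Physics Violation" or "Texture Jittering"
    explanation: str     # free-text description of the artifact
    start_sec: float     # time at which the artifact becomes visible
    end_sec: float       # time at which the artifact stops being visible
    bbox: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)

    def validate(self) -> None:
        if self.start_sec > self.end_sec:
            raise ValueError("start_sec must not exceed end_sec")
        x1, y1, x2, y2 = self.bbox
        if x1 >= x2 or y1 >= y2:
            raise ValueError("bbox must satisfy x1 < x2 and y1 < y2")


@dataclass
class VideoSample:
    """A video with its Real/Fake label and any annotated artifacts."""
    video_id: str
    label: str  # "Real" or "Fake"
    artifacts: List[ArtifactAnnotation] = field(default_factory=list)

    def validate(self) -> None:
        if self.label not in ("Real", "Fake"):
            raise ValueError("label must be 'Real' or 'Fake'")
        for artifact in self.artifacts:
            artifact.validate()


sample = VideoSample(
    video_id="example-0",  # placeholder id, not a real dataset entry
    label="Fake",
    artifacts=[
        ArtifactAnnotation(
            artifact_type="Texture Jittering",
            explanation="Surface texture flickers between frames.",
            start_sec=1.0,
            end_sec=2.5,
            bbox=(10.0, 20.0, 110.0, 140.0),
        )
    ],
)
sample.validate()
```

Keeping timestamps and boxes on each artifact, rather than on the video, is one way to let a single video carry several grounded artifacts.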

🚀 Methodology

Skyra employs a Two-Stage Training Strategy to achieve interpretable detection:

Cold-Start Initialization (SFT): Fine-tuning Qwen2.5-VL on ViF-CoT-4K to endow the model with basic detection and explanation capabilities.
Reinforcement Learning (RL): Utilizing Group Relative Policy Optimization (GRPO) with an Asymmetric Reward design. This encourages the model to actively explore artifacts while strictly supervising classification accuracy.
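To make the RL stage more concrete, here is a minimal sketch of the group-relative advantage step that GRPO is built around: each sampled response is scored, and its reward is normalised against the other responses in the same group. The reward function `toy_asymmetric_reward` and its values are purely hypothetical placeholders showing one possible reading of "asymmetric"; the reward actually used is the one in the repository's `ladm.py` and may look quite different.

```python
from statistics import mean, pstdev
from typing import List


def toy_asymmetric_reward(pred: str, label: str) -> float:
    """Hypothetical reward table; NOT the design implemented in ladm.py."""
    if pred == label:
        # Assumption: a correct "Fake" verdict earns more than a correct
        # "Real" verdict, nudging the policy to keep searching for artifacts.
        return 1.0 if label == "Fake" else 0.5
    # Assumption: any misclassification receives the same fixed penalty.
    return -1.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled verdicts for a single fake video.
group_preds = ["Fake", "Fake", "Real", "Fake"]
group_rewards = [toy_asymmetric_reward(p, "Fake") for p in group_preds]
group_advantages = group_relative_advantages(group_rewards)
```

Because advantages are relative to the group, a group in which every response earns the same reward yields advantages of zero.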

📈 Experimental Results

Skyra achieves state-of-the-art performance, significantly outperforming binary detectors (e.g., DeMamba, NSG-VD) and general MLLMs (e.g., GPT-4o, Gemini).

ViF-Bench: Skyra achieves 91.02% Accuracy, surpassing the second-best method by a large margin.

🛠️ Usage

Requirements

SFT Stage: follow LLaMA-Factory for environment setup.
RL Stage: follow verl for environment setup.
Inference: follow Qwen2.5-VL for a quick start and vLLM for deployment.

Data Preparation

Training data: Download and prepare the ViF-CoT-4K dataset from here.
Evaluation data: Download the evaluation datasets (e.g., ViF-Bench) from here, then update the paths in test_index.json to point to your local directory. The test_index.json file should use the following format:

{
    "Real": [
        "path_to_parsed_frames_dir/Real/gdymHI9S6gM-0",
        ...
    ],
    "LTX-Video-13B-T": [
        "path_to_parsed_frames_dir/Fake/LTX-Video-13B-T/gdymHI9S6gM-0",
        ...
    ],
    ...
}
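If the parsed frames are already on disk, the index can be generated rather than written by hand. The sketch below assumes the directory layout implied by the example paths (`<root>/Real/<video_id>` and `<root>/Fake/<generator>/<video_id>`); the helper names `build_test_index` and `write_test_index` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_test_index(frames_root: str) -> Dict[str, List[str]]:
    """Map "Real" and each generator name to its list of frame directories."""
    root = Path(frames_root)
    index: Dict[str, List[str]] = {}

    real_dir = root / "Real"
    if real_dir.is_dir():
        index["Real"] = sorted(str(p) for p in real_dir.iterdir() if p.is_dir())

    fake_dir = root / "Fake"
    if fake_dir.is_dir():
        # Assumed layout: one sub-directory per generator under Fake/.
        for gen_dir in sorted(fake_dir.iterdir()):
            if gen_dir.is_dir():
                index[gen_dir.name] = sorted(
                    str(p) for p in gen_dir.iterdir() if p.is_dir()
                )
    return index


def write_test_index(frames_root: str, out_path: str = "test_index.json") -> None:
    """Write the index as JSON in the format shown above."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(build_test_index(frames_root), f, indent=4)
```

Calling `write_test_index("path_to_parsed_frames_dir")` would then produce a file with one key for `Real` and one key per generator directory.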

Supervised Fine-Tuning (SFT)

We use LLaMA-Factory for SFT. Set up the dataset config following the instructions in the LLaMA-Factory repository, then start training:

cd train/LLaMA-Factory
bash train.sh

Reinforcement Learning (RL)

We use verl for RL training with GRPO. The adapted reward design is provided in train/verl/verl/utils/reward_score/ladm.py.

Evaluation

Evaluation scripts are provided in the eval/ directory. Evaluation runs in two steps:

Inference: Run inference to get model predictions and explanations, and save the results to a JSON file.

cd eval
bash scripts/Skyra/inference.sh
# or
python inference.py \
    --index_json /path_to/test_index.json \
    --model_path /path_to/Skyra-SFT \
    --model_name Skyra-SFT \
    --save_dir results/Skyra

Evaluation: Compare the model predictions against the ground truth and compute metrics.

cd eval
bash scripts/Skyra/eval.sh
# or
python eval.py \
    --json_file_path results/Skyra/Skyra-SFT_predictions.json

End-to-End Inference

We provide a self-contained end-to-end inference pipeline in eval/inference_end2end/. It integrates video preprocessing and vLLM model loading, and supports both single-video inference and ViF-Bench batch evaluation.

File Structure

| File | Description |
|------|-------------|
| serve.sh | Launch the vLLM inference server |
| server.py | FastAPI server with endpoints: /v1/analyze, /v1/analyze_batch, /v1/analyze_frames |
| model_engine.py | vLLM model loading, prompt construction, and inference |
| video_processor.py | Video frame extraction (16 uniformly-sampled frames, short side resized to 256px) |
| infer.py | CLI inference tool (single video / batch, local / server mode) |
| eval_vifbench.py | ViF-Bench evaluation with paired metrics (Balanced ACC, Recall, F1) |
| run_eval_vifbench.sh | One-click script to evaluate Skyra-SFT and Skyra-RL on ViF-Bench |
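The preprocessing described for video_processor.py (16 uniformly-sampled frames, short side resized to 256px) can be sketched with two small helpers. This is an illustrative approximation: the function names are hypothetical, and the segment-midpoint sampling rule and the rounding used here are assumptions, so the actual script may pick slightly different frames or sizes.

```python
from typing import List, Tuple


def uniform_frame_indices(total_frames: int, num_samples: int = 16) -> List[int]:
    """Pick num_samples frame indices spread evenly across the video."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # Assumption: take the midpoint of each of num_samples equal segments.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_samples))
        for i in range(num_samples)
    ]


def short_side_resize(width: int, height: int, short_side: int = 256) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter side scaled to short_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = short_side / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


indices = uniform_frame_indices(160)       # 16 indices over a 160-frame clip
new_size = short_side_resize(1920, 1080)   # landscape 1080p input
```

With this rule, clips shorter than 16 frames simply repeat some indices rather than failing.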

1. Start the Inference Server

cd eval/inference_end2end

# Default: 4 GPUs, port 8000
MODEL_PATH=/path/to/Skyra-SFT bash serve.sh

# Custom configuration
CUDA_VISIBLE_DEVICES=0,1 TP_SIZE=2 PORT=8001 MODEL_PATH=/path/to/Skyra-SFT bash serve.sh

2. Single / Batch Video Inference

# Single video via server
python infer.py --server_url http://localhost:8000 --input /path/to/video.mp4

# Directory of videos
python infer.py --server_url http://localhost:8000 --input /path/to/videos/ --output results.json

# Local mode (no server needed, loads vLLM directly)
python infer.py --local --model_path /path/to/Skyra-SFT --input /path/to/video.mp4

3. ViF-Bench Evaluation

# Quick start: evaluate both Skyra-SFT and Skyra-RL
bash run_eval_vifbench.sh all    # or "sft" / "rl" for individual models

# Via server
python eval_vifbench.py \
    --bench_root /path/to/ViF-Bench \
    --server_url http://localhost:8000 \
    --save_dir results/

# Local mode
python eval_vifbench.py \
    --bench_root /path/to/ViF-Bench \
    --local --model_path /path/to/Skyra-SFT \
    --save_dir results/

The evaluation script supports checkpoint resuming: re-run the same command to continue from where it left off. Results include per-model paired metrics (Balanced ACC, Recall, F1), saved in both JSON and CSV formats.
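For reference, the three reported metrics can be computed from Real/Fake verdicts as in the sketch below. It uses the standard textbook definitions with "Fake" as the positive class; the function name `binary_metrics` is hypothetical, and how eval_vifbench.py pairs real and fake videos per generator is not reproduced here.

```python
from typing import Dict, Sequence


def binary_metrics(
    labels: Sequence[str], preds: Sequence[str], positive: str = "Fake"
) -> Dict[str, float]:
    """Balanced accuracy, recall, and F1 with `positive` as the positive class."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    pairs = list(zip(labels, preds))
    tp = sum(l == positive and p == positive for l, p in pairs)
    fn = sum(l == positive and p != positive for l, p in pairs)
    fp = sum(l != positive and p == positive for l, p in pairs)
    tn = sum(l != positive and p != positive for l, p in pairs)

    recall = tp / (tp + fn) if tp + fn else 0.0          # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "balanced_acc": (recall + specificity) / 2,
        "recall": recall,
        "f1": f1,
    }


# Toy paired set: four fake videos and their four real counterparts.
toy_labels = ["Fake"] * 4 + ["Real"] * 4
toy_preds = ["Fake", "Fake", "Fake", "Real", "Real", "Real", "Real", "Fake"]
toy_metrics = binary_metrics(toy_labels, toy_preds)
```

Balanced accuracy averages the per-class hit rates, so it stays informative when the numbers of real and fake videos differ.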

⚖️ License

The ViF-CoT-4K dataset and Skyra model weights are released under the CC BY 4.0 license. Users must also adhere to the terms of the source datasets (Kinetics-400, Panda-70M, HD-VILA-100M).

📍 Citation

If you find Skyra or ViF-CoT-4K useful, please cite our paper:

@article{li2025skyra,
  title={Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning},
  author={Li, Yifei and Zheng, Wenzhao and Zhang, Yanran and Sun, Runze and Zheng, Yu and Chen, Lei and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2512.15693},
  year={2025}
}
