Skyra
[CVPR2026] Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
Abstract: The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, providing valuable insights for advancing explainable AI-generated video detection.
🌟 Introduction
🎯 Core Capabilities
Unlike traditional binary detectors or general MLLMs, Skyra focuses on Grounded Artifact Reasoning:
- Artifact Perception: Identifies subtle visual anomalies (e.g., Physics Violation, Texture Jittering).
- Spatio-Temporal Grounding: Pinpoints exact timestamps and bounding boxes where artifacts occur.
- Explanatory Reasoning: Provides detailed Chain-of-Thought (CoT) explanations for why a video is Real or Fake.
🧩 Hierarchical Artifact Taxonomy
We define a comprehensive taxonomy to categorize AI generation errors, dividing them into Low-level Forgery (e.g., texture/color anomalies) and Violation of Laws (e.g., physical inconsistencies).
<p align="center"> <img src="static/images/taxonomy.png" alt="Taxonomy of Artifacts" width="60%"> </p>
📊 Dataset: ViF-CoT-4K
ViF-CoT-4K is constructed to address the lack of detailed artifact annotations in existing datasets.
- Scale: ~4,000 videos, including high-quality samples from Sora-2, Wan2.1, Kling, and more.
- Annotation: Fine-grained labels including artifact type, textual explanation, timestamps, and bounding boxes.
- Real-Fake Pairs: Generated videos are semantically aligned with real counterparts to prevent shortcut learning.
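One way to picture a single annotated sample is sketched below. The class and field names (`ArtifactAnnotation`, `VideoRecord`, `start_sec`, `bbox`, and so on) are hypothetical and chosen only for illustration; the released dataset files may use a different schema, units, and coordinate convention.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ArtifactAnnotation:
    """One annotated artifact (hypothetical schema, not the released format)."""
    artifact_type: str                       # e.g. "Texture Jittering"
    explanation: str                         # free-text explanation
    start_sec: float                         # assumed unit: seconds
    end_sec: float
    bbox: Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)

    def __post_init__(self) -> None:
        # Basic sanity checks on the temporal span and the box corners.
        if self.end_sec < self.start_sec:
            raise ValueError("end_sec must not precede start_sec")
        x1, y1, x2, y2 = self.bbox
        if x2 <= x1 or y2 <= y1:
            raise ValueError("bbox must satisfy x1 < x2 and y1 < y2")


@dataclass
class VideoRecord:
    """A video with its label and zero or more artifact annotations."""
    video_id: str
    label: str                               # "Real" or "Fake"
    artifacts: List[ArtifactAnnotation] = field(default_factory=list)


# Illustrative values only; they are not taken from the dataset.
example = VideoRecord(
    video_id="sample-0",
    label="Fake",
    artifacts=[
        ArtifactAnnotation(
            artifact_type="Texture Jittering",
            explanation="Surface texture flickers between frames.",
            start_sec=1.0,
            end_sec=2.5,
            bbox=(10.0, 20.0, 110.0, 140.0),
        )
    ],
)
```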
🚀 Methodology
Skyra employs a Two-Stage Training Strategy to achieve interpretable detection:
- Cold-Start Initialization (SFT): Fine-tuning Qwen2.5-VL on ViF-CoT-4K to endow the model with basic detection and explanation capabilities.
- Reinforcement Learning (RL): Utilizing Group Relative Policy Optimization (GRPO) with an Asymmetric Reward design. This encourages the model to actively explore artifacts while strictly supervising classification accuracy.
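The group-relative part of GRPO can be sketched in a few lines: several responses are sampled for the same video, each receives a scalar reward, and rewards are normalized within the group to obtain advantages. The `asymmetric_reward` function below is purely an assumption made for illustration. Its values and even the direction of the asymmetry are placeholders and are not taken from the paper or from the repository's reward code. The population standard deviation is used here for simplicity; implementations differ on this detail.

```python
from statistics import mean, pstdev
from typing import List


def asymmetric_reward(pred: str, truth: str) -> float:
    """Hypothetical asymmetric reward (placeholder values, not Skyra's)."""
    if pred == truth:
        return 1.0
    if truth == "Fake":
        return -1.0   # assumed: missing a generated video costs more
    return -0.5       # assumed: a false alarm on a real video costs less


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one video whose ground truth is "Fake".
group_preds = ["Fake", "Fake", "Real", "Fake"]
group_rewards = [asymmetric_reward(p, "Fake") for p in group_preds]
group_adv = group_relative_advantages(group_rewards)
```

Responses that score above the group mean receive positive advantages and the rest receive negative ones; when every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.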
📈 Experimental Results
Skyra achieves state-of-the-art performance, significantly outperforming binary detectors (e.g., DeMamba, NSG-VD) and general MLLMs (e.g., GPT-4o, Gemini).
<p align="center"> <img src="static/images/performance.png" alt="Radar Chart Performance" width="45%"> </p>ViF-Bench: Skyra achieves 91.02% Accuracy, surpassing the second-best method by a large margin.
🛠️ Usage
Requirements
- SFT Stage: follow LLaMA-Factory for environment setup.
- RL Stage: follow verl for environment setup.
- Inference: follow Qwen2.5-VL for quick start and vLLM for deployment.
Data Preparation
- Training data: Download and prepare the ViF-CoT-4K dataset from here.
- Evaluation data: Download the evaluation datasets (e.g., ViF-Bench) from here, then update the paths in test_index.json to point to your local directory. The test_index.json file should have the following format:
{
"Real": [
"path_to_parsed_frames_dir/Real/gdymHI9S6gM-0",
...
],
"LTX-Video-13B-T": [
"path_to_parsed_frames_dir/Fake/LTX-Video-13B-T/gdymHI9S6gM-0",
...
],
...
}
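If the parsed frames follow the layout implied by the example paths (a `Real/` folder and a `Fake/<generator>/` folder, each holding one sub-directory per video), the index can be generated rather than written by hand. This is a minimal sketch under that assumption; `build_test_index` and `write_test_index` are hypothetical helpers and are not part of the repository.

```python
import json
from pathlib import Path
from typing import Dict, List


def build_test_index(frames_root: str) -> Dict[str, List[str]]:
    """Map "Real" and each generator name to its sorted frame directories."""
    root = Path(frames_root)
    index: Dict[str, List[str]] = {}

    real_dir = root / "Real"
    if real_dir.is_dir():
        index["Real"] = sorted(str(p) for p in real_dir.iterdir() if p.is_dir())

    fake_dir = root / "Fake"
    if fake_dir.is_dir():
        # One key per generator sub-folder, e.g. "LTX-Video-13B-T".
        for gen_dir in sorted(fake_dir.iterdir()):
            if gen_dir.is_dir():
                index[gen_dir.name] = sorted(
                    str(p) for p in gen_dir.iterdir() if p.is_dir()
                )
    return index


def write_test_index(frames_root: str, out_path: str = "test_index.json") -> Dict[str, List[str]]:
    """Build the index and save it as JSON at out_path."""
    index = build_test_index(frames_root)
    Path(out_path).write_text(json.dumps(index, indent=2))
    return index
```

Calling `write_test_index("path_to_parsed_frames_dir")` would then write a file shaped like the example above, with one key per generator folder found.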
Supervised Fine-Tuning (SFT)
We use LLaMA-Factory for SFT. You can start training after setting up the dataset config following the instructions in the LLaMA-Factory repository.
cd train/LLaMA-Factory
bash train.sh
Reinforcement Learning (RL)
We use verl for RL training with GRPO. The adapted reward design is provided in train/verl/verl/utils/reward_score/ladm.py.
Evaluation
Evaluation scripts are provided in the eval/ directory. You can run them as follows:
- inference: Run inference to get model predictions and explanations, and save the results in a JSON file.
cd eval
bash scripts/Skyra/inference.sh
# or
python inference.py \
--index_json /path_to/test_index.json \
--model_path /path_to/Skyra-SFT \
--model_name Skyra-SFT \
--save_dir results/Skyra
- evaluation: Evaluate the model predictions against ground truth and compute metrics.
cd eval
bash scripts/Skyra/eval.sh
# or
python eval.py \
--json_file_path results/Skyra/Skyra-SFT_predictions.json
End-to-End Inference
We provide a self-contained end-to-end inference pipeline in eval/inference_end2end/, which integrates video preprocessing and vLLM model loading, and supports both single-video inference and ViF-Bench batch evaluation.
File Structure
| File | Description |
|------|-------------|
| serve.sh | Launch the vLLM inference server |
| server.py | FastAPI server with endpoints: /v1/analyze, /v1/analyze_batch, /v1/analyze_frames |
| model_engine.py | vLLM model loading, prompt construction, and inference |
| video_processor.py | Video frame extraction (16 uniformly-sampled frames, short side resized to 256px) |
| infer.py | CLI inference tool (single video / batch, local / server mode) |
| eval_vifbench.py | ViF-Bench evaluation with paired metrics (Balanced ACC, Recall, F1) |
| run_eval_vifbench.sh | One-click script to evaluate Skyra-SFT and Skyra-RL on ViF-Bench |
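The preprocessing described for video_processor.py (16 uniformly sampled frames, short side resized to 256px) comes down to two small computations, sketched below. The function names are hypothetical, and the choices of segment midpoints and rounding are assumptions; the actual script may sample and round differently.

```python
from typing import List, Tuple


def uniform_frame_indices(total_frames: int, num_frames: int = 16) -> List[int]:
    """Pick the midpoint frame of each of num_frames equal-length segments."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / num_frames))
        for i in range(num_frames)
    ]


def short_side_resize(width: int, height: int, short_side: int = 256) -> Tuple[int, int]:
    """Scale (width, height) so the shorter side equals short_side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = short_side / min(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For a 160-frame clip this selects frames 5, 15, ..., 155, and a 1920x1080 frame is resized to 455x256. Clips shorter than 16 frames yield repeated indices rather than an error.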
1. Start the Inference Server
cd eval/inference_end2end
# Default: 4 GPUs, port 8000
MODEL_PATH=/path/to/Skyra-SFT bash serve.sh
# Custom configuration
CUDA_VISIBLE_DEVICES=0,1 TP_SIZE=2 PORT=8001 MODEL_PATH=/path/to/Skyra-SFT bash serve.sh
2. Single / Batch Video Inference
# Single video via server
python infer.py --server_url http://localhost:8000 --input /path/to/video.mp4
# Directory of videos
python infer.py --server_url http://localhost:8000 --input /path/to/videos/ --output results.json
# Local mode (no server needed, loads vLLM directly)
python infer.py --local --model_path /path/to/Skyra-SFT --input /path/to/video.mp4
3. ViF-Bench Evaluation
# Quick start: evaluate both Skyra-SFT and Skyra-RL
bash run_eval_vifbench.sh all # or "sft" / "rl" for individual models
# Via server
python eval_vifbench.py \
--bench_root /path/to/ViF-Bench \
--server_url http://localhost:8000 \
--save_dir results/
# Local mode
python eval_vifbench.py \
--bench_root /path/to/ViF-Bench \
--local --model_path /path/to/Skyra-SFT \
--save_dir results/
The evaluation script supports checkpoint resuming: re-run the same command to continue from where it left off. Results include per-model paired metrics (Balanced ACC, Recall, F1) saved in both JSON and CSV formats.
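For reference, the three reported metrics can be computed from the predictions on one generator's fake videos and their paired real videos. The sketch below uses the standard definitions and assumes "Fake" is the positive class; `paired_metrics` is a hypothetical helper and may not match how eval_vifbench.py aggregates results.

```python
from typing import Dict, List


def paired_metrics(real_preds: List[str], fake_preds: List[str]) -> Dict[str, float]:
    """Balanced accuracy, recall, and F1, treating "Fake" as positive."""
    tn = sum(p == "Real" for p in real_preds)   # real videos kept as real
    fp = len(real_preds) - tn                   # real videos flagged as fake
    tp = sum(p == "Fake" for p in fake_preds)   # fake videos detected
    fn = len(fake_preds) - tp                   # fake videos missed

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "balanced_acc": (recall + specificity) / 2,
        "recall": recall,
        "f1": f1,
    }
```

Balanced accuracy averages the recall on fake videos with the recall on real videos, so it is not inflated when one class dominates the evaluation set.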
⚖️ License
The ViF-CoT-4K dataset and Skyra model weights are released under the CC BY 4.0 license. Users must adhere to the terms of source datasets (Kinetics-400, Panda-70M, HD-VILA-100M).
📍 Citation
If you find Skyra or ViF-CoT-4K useful, please cite our paper:
@article{li2025skyra,
title={Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning},
author={Li, Yifei and Zheng, Wenzhao and Zhang, Yanran and Sun, Runze and Zheng, Yu and Chen, Lei and Zhou, Jie and Lu, Jiwen},
journal={arXiv preprint arXiv:2512.15693},
year={2025}
}
