VideoAgent

"VideoAgent: All-in-One Agentic Framework for Video Understanding, Editing, and Remaking"

<div align="center"> <img src='./assets/logo_new.png' width=40%/> <!-- # Open Agentic Video Intelligence --> <br>

🌟 Comprehensive Video Intelligence: <br> An All-in-One Framework for Understanding, Editing, and Generation

<a href='https://space.bilibili.com/3546868449544308'><img src="https://img.shields.io/badge/bilibili-00A1D6.svg?logo=bilibili&logoColor=white" /></a>  <a href='https://www.youtube.com/@AI-Creator-is-here'><img src='https://badges.aleen42.com/src/youtube.svg' /></a>  <br> <a href="./Communication.md"><img src="https://img.shields.io/badge/💬Feishu-Group-07c160?style=for-the-badge&logoColor=white&labelColor=1a1a2e"></a> <a href="./Communication.md"><img src="https://img.shields.io/badge/WeChat-Group-07c160?style=for-the-badge&logo=wechat&logoColor=white&labelColor=1a1a2e"></a>

</div> <div align="center">

English | 简体中文

</div>

📹 Demo Video

<div> <a href="https://www.youtube.com/watch?v=JZkXO1NG2Ok" target='_blank'><img src="assets/overview.png" width="100%"></a> </div>

In this video, we demonstrate how to use VideoAgent to:

  • Clearly articulate user requirements
  • Achieve intent analysis and autonomous tool use & planning
  • Create multi-modal products, including detailed workflows
  • Generate a video overview fully automatically

🚀 Key Features

🧠 - Understanding Video Content<br> Enable in-depth analysis, summarization, and insight extraction from video media with advanced multi-modal intelligence capabilities.

✂️ - Editing Video Clips<br> Provide intuitive tools for assembling, clipping, and reconfiguring content with seamless workflow integration.

🎨 - Remaking Creative Videos<br> Utilize generative technologies to produce new, imaginative video content through AI-powered creative assistance.

🔧 - Multi-Modal Agentic Framework<br> Deliver comprehensive video intelligence through an integrated framework that combines multiple AI modalities for enhanced performance.

🚀 - Seamless Natural Language Experience<br> Transform video interaction and creation through pure conversational AI - no complex interfaces or technical expertise required, just natural dialogue with VideoAgent.

graph TB
    A[🎬 VideoAgent Framework] --> B[🧠 Video Understanding & Summarization]
    A --> C[✂️ Video Editing]
    A --> D[🎨 Video Remaking]
    
    B --> B1[Video Q&A]
    B --> B2[Video Summarization]
    
    C --> C1[Movie Edits]
    C --> C2[Commentary Video]
    C --> C3[Video Overview]
    
    D --> D1[Meme Videos]
    D --> D2[Music Videos]
    D --> D3[Cross-Cultural Comedy]
</div> <div align="center"> <table> <tr> <th align="center"> </th> <th align="center">VideoAgent</th> <th align="center">Director</th> <th align="center">Funclip</th> <th align="center">NarratoAI</th> <th align="center">NotebookLM</th> </tr> <tr> <td align="center">Beat-synced Edits</td> <td align="center">✅</td> <td align="center">✅</td> <td align="center">✅</td> <td align="center">—</td> <td align="center">—</td> </tr> <tr> <td align="center">Storytelling Video</td> <td align="center">✅</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> </tr> <tr> <td align="center">Video Overview</td> <td align="center">✅</td> <td align="center">✅</td> <td align="center">✅</td> <td align="center">✅</td> <td align="center">✅</td> </tr> <tr> <td align="center">Meme Video Remaking</td> <td align="center">✅</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> </tr> <tr> <td align="center">Song Remixes</td> <td align="center">✅</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> </tr> <tr> <td align="center">Cross-lingual Adaptations</td> <td align="center">✅</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> </tr> <tr> <td align="center">Video Q&A</td> <td align="center">✅</td> <td align="center">✅</td> <td align="center">—</td> <td align="center">—</td> <td align="center">✅</td> </tr> <tr> <td align="center">Sound Effects Tools</td> <td align="center">✅</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> <td align="center">—</td> </tr> </table> </div>

🔥 Why VideoAgent?

| 🧠 Easy-to-Use | 🚀 Boundless Creativity | 🎨 High-Quality |
|:---:|:---:|:---:|
| One-Prompt Video Creation | Create From Any Ideas | Human-Quality Video Production |
| Transform your ideas into professional videos | Workflow generation for your unique ideas | Deliver videos that meet professional standards |


🌟 System Overview

Our system introduces three key innovations for automated video processing. Intent Analysis captures both explicit and implicit sub-intents beyond user commands. Autonomous Tool Use & Planning employs graph-powered workflow generation with adaptive feedback loops for automated agent orchestration. Multi-Modal Understanding transforms raw input into semantically aligned visual queries for enhanced retrieval.

🧠 Intent Analysis

  • 🔍 VideoAgent intelligently decomposes user instructions into both explicit and implicit sub-intents, capturing nuanced requirements that users may not explicitly state. This advanced parsing ensures comprehensive understanding of user goals beyond surface-level commands.

  • 🎯 Through an intent-to-agent mapping mechanism, the system identifies precisely which capabilities within the multi-agent framework are needed. This targeted approach enables efficient activation of relevant system components while avoiding unnecessary computational overhead for optimal task execution.
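The intent-to-agent mapping described above can be pictured as a lookup from sub-intent labels to agents. The following is a minimal sketch, not VideoAgent's actual implementation: the intent labels, the agent names, and the `select_agents` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical registry: sub-intent label -> agent name.
# These names are illustrative and are not taken from the VideoAgent codebase.
INTENT_TO_AGENT = {
    "summarize": "SummaryAgent",
    "retrieve_clips": "RetrievalAgent",
    "add_voiceover": "VoiceAgent",
    "sync_to_beat": "BeatSyncAgent",
}


@dataclass(frozen=True)
class SubIntent:
    label: str
    explicit: bool  # True if the user stated it, False if it was inferred


def select_agents(sub_intents):
    """Return the agents needed for the given sub-intents.

    Agents appear once each, in first-seen order. Sub-intents with no
    registered agent are skipped, so only relevant agents are activated.
    """
    selected = []
    for intent in sub_intents:
        agent = INTENT_TO_AGENT.get(intent.label)
        if agent is not None and agent not in selected:
            selected.append(agent)
    return selected


intents = [
    SubIntent("retrieve_clips", explicit=True),
    SubIntent("sync_to_beat", explicit=True),
    SubIntent("retrieve_clips", explicit=False),  # inferred duplicate
]
agents = select_agents(intents)
```

Deduplicating at selection time is one simple way to avoid activating the same agent twice when an explicit and an implicit sub-intent point to the same capability.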

🔧 Autonomous Tool Use & Planning

  • ⚙️ A graph-powered framework automatically translates user intents into executable workflows. The system dynamically selects appropriate agents and constructs optimal execution sequences. Nodes represent tool capabilities while edges define workflow connections for complex video tasks.

  • 🔄 Adaptive feedback loops continuously refine the planning process through two-step self-evaluation. This ensures robust automated decision-making and seamless execution. The system self-corrects and optimizes performance throughout the entire task lifecycle.
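To make the graph idea concrete, the sketch below models a workflow as a dependency graph and derives an execution order with a topological sort from Python's standard library (`graphlib`, Python 3.9+). It is an illustration under assumptions: the node names and the `plan` helper are hypothetical, and VideoAgent's own planner may construct and order its graph differently.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow graph: each node is a tool capability, and the set
# it maps to lists the nodes that must finish before it can run.
workflow = {
    "transcribe": set(),
    "detect_beats": set(),
    "retrieve_clips": {"transcribe"},
    "assemble_video": {"retrieve_clips", "detect_beats"},
}


def plan(graph):
    """Return one valid execution order for the workflow graph.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    which would indicate an invalid workflow.
    """
    return list(TopologicalSorter(graph).static_order())


order = plan(workflow)
```

Any order returned this way runs every tool only after its prerequisites, which is the minimum property an executable workflow needs.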

🎬 Multi-Modal Understanding

  • 📋 The Storyboard Agent transforms raw user input into optimized visual queries. It first analyzes pre-captioned video material banks to understand available resources. This foundational analysis ensures the system knows exactly what content is accessible for query processing.

  • 💡 The agent then decomposes user input into fine-grained sub-queries that are both visually and semantically aligned. This sophisticated breakdown enables enhanced video retrieval by matching user intentions with the most relevant visual content in the database.
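The retrieval step can be illustrated with a toy example that matches each sub-query against a bank of pre-captioned clips. This sketch swaps in plain word-overlap (Jaccard) similarity as a stand-in for whatever visual and semantic matching the Storyboard Agent actually uses; the clip ids, captions, and function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(sub_queries, caption_bank):
    """For each sub-query, return the id of the best-matching caption."""
    results = []
    for query in sub_queries:
        q_tokens = tokens(query)
        best_id = max(
            caption_bank,
            key=lambda cid: jaccard(q_tokens, tokens(caption_bank[cid])),
        )
        results.append(best_id)
    return results


# Hypothetical pre-captioned material bank.
caption_bank = {
    "clip_001": "a man rides a horse across a desert",
    "clip_002": "two women talk in a kitchen",
    "clip_003": "a city skyline at night with traffic",
}
# Fine-grained sub-queries, as if decomposed from one user request.
sub_queries = ["horse riding in the desert", "night city traffic"]
matches = retrieve(sub_queries, caption_bank)
```

Splitting a request into short, visually concrete sub-queries keeps each match focused on one shot, which is the motivation for the decomposition step described above.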

<div align="center"> <img src='./assets/framework.jpg' /><br> </div>

🔧 Evaluation

We conduct extensive experiments across multiple dimensions to validate the effectiveness of VideoAgent in addressing key challenges.

Boundless Creativity via Workflow Construction

To evaluate VideoAgent's creativity through automatic workflow construction, we compared five broadly applicable agents across three backbone models. VideoAgent significantly outperforms the baselines on the Audio and Video datasets, which shows the benefit of graph-structured guidance and of self-reflection driven by dedicated self-evaluation feedback. VideoAgent also performs better and more stably under the Claude 3.7 backbone than under GPT-4o and DeepSeek-V3, whereas the baseline methods fluctuate across backbones. These results highlight VideoAgent's ability to automatically construct diverse and effective workflows that adapt to varied user requirements, with more capable LLMs achieving deeper comprehension and providing more robust solutions for complex graph-based tasks.

<div align="center"> <img src='./assets/eval1_audio_new.png' /><br> <img src='./assets/eval1_video_new.png' /><br> </div>

Superior Multimodal Understanding

To validate our multimodal understanding capabilities, we conducted text-to-video retrieval experiments using shuffled caption queries. The evaluation uses three metrics to assess the model's ability to retrieve the corresponding visual content:

  • Recall measures the model's ability to correctly reorder shuffled video clips by comparing retrieved clip midpoints against ground truth positions.
  • The Embedding Matching-based score assesses coarse-grained alignment between generated videos and high-level caption summaries.
  • Intersection over Union (IoU) quantifies temporal alignment accuracy at the clip level by computing the ratio of temporal overlap to total coverage between retrieved and ground truth intervals.

The experimental results show that our approach retrieves more accurate video segments, demonstrating precise multimodal understanding.
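As a worked illustration of two of these metrics, the sketch below computes temporal IoU between a retrieved interval and its ground truth, and a midpoint-based recall. The exact matching rule used in the evaluation is not given here, so the rule "a retrieved clip counts as a hit when its midpoint falls inside the paired ground truth interval" is an assumption, as are the function names.

```python
def temporal_iou(pred, truth):
    """Temporal IoU of two (start, end) intervals given in seconds."""
    (ps, pe), (ts, te) = pred, truth
    overlap = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - overlap
    return overlap / union if union > 0 else 0.0


def midpoint_recall(preds, truths):
    """Fraction of ground truth intervals hit by the paired retrieved clip.

    Assumed rule: a hit means the retrieved clip's midpoint lies inside
    the ground truth interval at the same position in the list.
    """
    if not truths:
        return 0.0
    hits = sum(
        1
        for (ps, pe), (ts, te) in zip(preds, truths)
        if ts <= (ps + pe) / 2 <= te
    )
    return hits / len(truths)


preds = [(10.0, 20.0), (40.0, 50.0)]
truths = [(15.0, 25.0), (60.0, 70.0)]
iou_first = temporal_iou(preds[0], truths[0])   # 5 s overlap over 15 s union
recall = midpoint_recall(preds, truths)         # one of two midpoints hits
```

For the first pair the overlap is 5 seconds and the union is 15 seconds, so the IoU is one third; the second pair does not overlap at all, so only one of the two midpoints counts as a hit.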

<div align="center"> <img src='./assets/eva2.png' /><br> </div>

More Iterations, Better Performance

We investigate VideoAgent's iterative refinement capabilities by analyzing the impact of reflection rounds on performance. Through comprehensive hyperparameter experiments on workflow composition across two datasets using three LLM backbones, we demonstrate VideoAgent's notable self-improvement ability.
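The role of reflection rounds can be sketched as a bounded evaluate-and-revise loop. This is a toy model under assumptions, not the project's code: `refine`, `evaluate`, and `revise` are hypothetical names, and the scoring function simply counts how many required steps a workflow contains.

```python
def refine(workflow, evaluate, revise, max_rounds=3, target=1.0):
    """Run up to max_rounds reflection rounds, keeping the best workflow.

    Returns the best workflow and the list of scores observed, starting
    with the score of the initial workflow.
    """
    best, best_score = workflow, evaluate(workflow)
    history = [best_score]
    for _ in range(max_rounds):
        if best_score >= target:
            break  # good enough, stop reflecting early
        candidate = revise(best)
        score = evaluate(candidate)
        history.append(score)
        if score > best_score:
            best, best_score = candidate, score
    return best, history


# Toy task: a workflow is scored by the share of required steps it contains.
REQUIRED = ["transcribe", "retrieve_clips", "assemble_video"]


def evaluate(workflow):
    return sum(step in workflow for step in REQUIRED) / len(REQUIRED)


def revise(workflow):
    """Add the first missing required step, if any."""
    missing = [step for step in REQUIRED if step not in workflow]
    return workflow + missing[:1]


best, history = refine(["transcribe"], evaluate, revise, max_rounds=3)
```

In this toy setting each round repairs one gap, so allowing more rounds yields a more complete workflow until the target score is reached, which mirrors the trend the section describes.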
