
CogAgent: An open-sourced end-to-end VLM-based GUI Agent

中文文档 (Chinese documentation)

  • 🔥 🆕 December 2024: We open-sourced the latest version of the CogAgent-9B-20241220 model. Compared to the previous version of CogAgent, CogAgent-9B-20241220 features significant improvements in GUI perception, reasoning accuracy, action space completeness, task universality, and generalization. It supports bilingual (Chinese and English) interaction through both screen captures and natural language.

  • 🏆 June 2024: CogAgent was accepted by CVPR 2024 and recognized as a conference Highlight (top 3%).

  • December 2023: We open-sourced the first GUI Agent: CogAgent (with the former repository available here) and published the corresponding paper: 📖 CogAgent Paper.

Model Introduction

| Model | Model Download Links | Technical Documentation | Online Demo |
|:---:|:---|:---|:---|
| cogagent-9b-20241220 | 🤗 HuggingFace<br>🤖 ModelScope<br>🟣 WiseModel<br>🧩 Modelers (Ascend) | 📄 Official Technical Blog<br>📘 Practical Guide (Chinese) | 🤗 HuggingFace Space<br>🤖 ModelScope Space<br>🧩 Modelers Space (Ascend) |

Model Overview

The CogAgent-9B-20241220 model is built on GLM-4V-9B, a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, it achieves significant advances in GUI perception, reasoning prediction accuracy, action space completeness, and cross-task generalization. The model supports bilingual (Chinese and English) interaction via both screenshots and natural-language input, and this version has already been applied in ZhipuAI's GLM-PC product. We hope this release helps researchers and developers advance the research and applications of GUI agents based on vision-language models.

Capability Demonstrations

The CogAgent-9b-20241220 model achieves state-of-the-art results across multiple platforms and categories in GUI Agent tasks and GUI Grounding benchmarks. In the CogAgent-9b-20241220 Technical Blog, we compared it against API-based commercial models (GPT-4o-20240806, Claude-3.5-Sonnet), commercial API + GUI Grounding models (GPT-4o + UGround, GPT-4o + OS-ATLAS), and open-source GUI Agent models (Qwen2-VL, ShowUI, SeeClick). CogAgent leads in GUI localization (ScreenSpot), single-step operations (OmniAct), our in-house Chinese step-wise benchmark (CogAgentBench-basic-cn), and multi-step operations (OSWorld), trailing only Claude-3.5-Sonnet, which specializes in Computer Use, and GPT-4o combined with external GUI Grounding models on OSWorld.

<div style="display: flex; flex-direction: column; width: 100%; align-items: center; margin-top: 20px;"> <div style="text-align: center; margin-bottom: 20px; width: 100%; max-width: 600px; height: auto;"> <video src="https://github.com/user-attachments/assets/4d39fe6a-d460-427c-a930-b7cbe0d082f5" width="100%" height="auto" controls autoplay loop></video> <p>CogAgent wishes you a Merry Christmas! Let the large model automatically send Christmas greetings to your friends.</p> </div> <div style="text-align: center; width: 100%; max-width: 600px; height: auto;"> <video src="https://github.com/user-attachments/assets/87f00f97-1c4f-4152-b7c0-d145742cb910" width="100%" height="auto" controls autoplay loop></video> <p>Want to open an issue? Let CogAgent help you send an email.</p> </div> </div>


Inference and Fine-tuning Costs

  • The model requires at least 29GB of VRAM for inference at BF16 precision. INT4 inference is not recommended due to significant performance loss; VRAM usage is about 8GB for INT4 and about 15GB for INT8. In the inference/cli_demo.py file, the relevant lines are commented out; you can uncomment them to enable INT4 or INT8 inference (a hedged loading sketch follows this list). This option is only supported on NVIDIA devices.
  • All GPU references above refer to A100 or H100 GPUs. For other devices, you need to calculate the required GPU/CPU memory accordingly.
  • During SFT (Supervised Fine-Tuning), this codebase freezes the Vision Encoder, uses a batch size of 1, and trains on 8 * A100 GPUs. The total input length is 2048 tokens, of which images account for 1600. This codebase cannot perform SFT fine-tuning without freezing the Vision Encoder.
    For LoRA fine-tuning, the Vision Encoder is not frozen; the batch size is 1, on 1 * A100 GPU, with the same 2048 total input tokens (including the 1600 image tokens). In this setup, SFT fine-tuning requires at least 60GB of GPU memory per GPU (with 8 GPUs), while LoRA fine-tuning requires at least 70GB of GPU memory on a single GPU (cannot be split).
  • SFT fine-tuning has not been tested on Ascend devices; we have tested them only on the Atlas800 training server cluster. You need to modify the inference code accordingly, following the loading mechanism described in the Ascend device download link.
  • The online demo links do not support controlling your computer; they only let you view the model's inference results. We recommend deploying the model locally.
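As a rough illustration of the quantization option mentioned above, the sketch below loads the model through transformers with bitsandbytes INT4 quantization. This is not the exact code commented out in inference/cli_demo.py, and the HuggingFace model id shown is an assumption; adjust both to your deployment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "THUDM/cogagent-9b-20241220"  # assumed HuggingFace model id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# INT4 quantized loading via bitsandbytes (NVIDIA only). Use
# BitsAndBytesConfig(load_in_8bit=True) for INT8 instead, or drop
# `quantization_config` entirely to load at BF16 (~29GB VRAM).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```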

Model Inputs and Outputs

cogagent-9b-20241220 is an agent-type execution model rather than a conversational model. It does not support continuous dialogue, but it does support a continuous execution history: a new conversation session must be started for each step, with the past history passed back to the model. The workflow of CogAgent is illustrated as follows:

<div align="center"> <img src=assets/cogagent_workflow_en.png width=90% /> </div>
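To make this per-step workflow concrete, here is a minimal, illustrative driver loop. It is a sketch only: `capture_screenshot`, `query_model`, and `execute_action` are hypothetical callables standing in for your own inference backend and OS automation layer, and "WIN" is just an example platform tag.

```python
from typing import Callable, List, Tuple

def run_episode(task: str,
                capture_screenshot: Callable[[], object],
                query_model: Callable[[object, str], Tuple[str, str]],
                execute_action: Callable[[str], None],
                max_steps: int = 10) -> None:
    """Drive CogAgent step by step. Each step is a *fresh* session that
    replays the full history; the model itself keeps no dialogue state."""
    history: List[Tuple[str, str]] = []  # (grounded_op_func, action) per past step
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        # Build the query exactly as in the "User Input" section below.
        history_str = "\nHistory steps: " + "".join(
            f"\n{i}. {op}\t{act}" for i, (op, act) in enumerate(history)
        )
        query = (f"{task}{history_str}\n"
                 "(Platform: WIN)\n"
                 "(Answer in Action-Operation-Sensitive format.)\n")
        grounded_op, action = query_model(screenshot, query)
        execute_action(grounded_op)
        history.append((grounded_op, action))
```

Injecting the backend as callables keeps the loop independent of how the model is actually deployed, whether locally or behind a remote endpoint.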

To achieve optimal GUI Agent performance, we have adopted a strict input-output format. Below is how users should format their inputs and feed them to the model, and how to interpret the model’s responses.

User Input

You can refer to app/client.py#L115 for constructing user input prompts. A minimal example of user input concatenation code is shown below:


```python
current_platform = identify_os()  # "Mac" or "WIN" or "Mobile". Pay attention to case sensitivity.
platform_str = f"(Platform: {current_platform})\n"
format_str = "(Answer in Action-Operation-Sensitive format.)\n"  # Other return formats can replace "Action-Operation-Sensitive".

# The original snippet is truncated past this point; the history list names,
# the loop body, and the final concatenation below are reconstructed.
history_str = "\nHistory steps: "
for index, (grounded_op_func, action) in enumerate(zip(history_steps, history_actions)):
    history_str += f"\n{index}. {grounded_op_func}\t{action}"

query = f"{task}{history_str}\n{platform_str}{format_str}"
```
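The snippet calls identify_os(), which is not shown above. A minimal stdlib-based sketch follows; the mapping to the "Mac"/"WIN"/"Mobile" tags is an assumption based on the comment in the snippet, and mobile detection is left as a deployment-specific fallback.

```python
import platform

def identify_os() -> str:
    """Map the local OS to the platform tag CogAgent expects.

    Note the exact capitalization: "Mac", "WIN", or "Mobile".
    """
    system = platform.system()
    if system == "Darwin":
        return "Mac"
    if system == "Windows":
        return "WIN"
    # Android or other touch targets would report "Mobile"; detecting
    # them reliably depends on your deployment, so this is a fallback.
    return "Mobile"
```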
