# CogAgent: An open-sourced VLM-based GUI Agent
- 🔥 🆕 December 2024: We open-sourced the latest version of the model, CogAgent-9B-20241220. Compared to the previous version of CogAgent, CogAgent-9B-20241220 features significant improvements in GUI perception, reasoning accuracy, action space completeness, task universality, and generalization. It supports bilingual (Chinese and English) interaction through both screen captures and natural language.
- 🏆 June 2024: CogAgent was accepted by CVPR 2024 and recognized as a conference Highlight (top 3%).
- December 2023: We open-sourced the first GUI Agent, CogAgent (with the former repository available here), and published the corresponding paper: 📖 CogAgent Paper.
## Model Introduction
| Model | Model Download Links | Technical Documentation | Online Demo |
|:---:|:---:|:---:|:---:|
| cogagent-9b-20241220 | 🤗 HuggingFace<br> 🤖 ModelScope <br> 🟣 WiseModel <br>🧩 Modelers (Ascend) | 📄 Official Technical Blog<br/>📘 Practical Guide (Chinese) | 🤗 HuggingFace Space<br/>🤖 ModelScope Space<br/>🧩 Modelers Space (Ascend) |
### Model Overview
The CogAgent-9B-20241220 model is built on GLM-4V-9B, a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, CogAgent-9B-20241220 achieves significant advancements in GUI perception, reasoning accuracy, action space completeness, and cross-task generalizability. The model supports bilingual (Chinese and English) interaction with both screenshot and natural-language input. This version of the CogAgent model has already been applied in ZhipuAI's GLM-PC product. We hope the release of this model can assist researchers and developers in advancing research on and applications of GUI agents based on vision-language models.
### Capability Demonstrations
The CogAgent-9b-20241220 model has achieved state-of-the-art results across multiple platforms and categories in GUI Agent tasks and GUI Grounding Benchmarks. In the CogAgent-9b-20241220 Technical Blog, we compared it against API-based commercial models (GPT-4o-20240806, Claude-3.5-Sonnet), commercial API + GUI Grounding models (GPT-4o + UGround, GPT-4o + OS-ATLAS), and open-source GUI Agent models (Qwen2-VL, ShowUI, SeeClick). The results demonstrate that CogAgent leads in GUI localization (Screenspot), single-step operations (OmniAct), the Chinese step-wise in-house benchmark (CogAgentBench-basic-cn), and multi-step operations (OSWorld), with only a slight disadvantage in OSWorld compared to Claude-3.5-Sonnet, which specializes in Computer Use, and GPT-4o combined with external GUI Grounding models.
<div style="display: flex; flex-direction: column; width: 100%; align-items: center; margin-top: 20px;"> <div style="text-align: center; margin-bottom: 20px; width: 100%; max-width: 600px; height: auto;"> <video src="https://github.com/user-attachments/assets/4d39fe6a-d460-427c-a930-b7cbe0d082f5" width="100%" height="auto" controls autoplay loop></video> <p>CogAgent wishes you a Merry Christmas! Let the large model automatically send Christmas greetings to your friends.</p> </div> <div style="text-align: center; width: 100%; max-width: 600px; height: auto;"> <video src="https://github.com/user-attachments/assets/87f00f97-1c4f-4152-b7c0-d145742cb910" width="100%" height="auto" controls autoplay loop></video> <p>Want to open an issue? Let CogAgent help you send an email.</p> </div> </div>

## Table of Contents
- CogAgent
## Inference and Fine-tuning Costs
- The model requires at least 29GB of VRAM for inference at `BF16` precision. Using `INT4` precision for inference is not recommended due to significant performance loss. The VRAM usage for `INT4` inference is about 8GB, while for `INT8` inference it is about 15GB. In the `inference/cli_demo.py` file, we have commented out these two lines; you can uncomment them to use `INT4` or `INT8` inference. This solution is only supported on NVIDIA devices.
- All GPU references above refer to A100 or H100 GPUs. For other devices, you need to calculate the required GPU/CPU memory accordingly.
- During SFT (Supervised Fine-Tuning), this codebase freezes the `Vision Encoder`, uses a batch size of 1, and trains on `8 * A100` GPUs. The total input tokens (including images, which account for `1600` tokens) add up to 2048 tokens. This codebase cannot conduct SFT fine-tuning without freezing the `Vision Encoder`. For LoRA fine-tuning, the `Vision Encoder` is not frozen; the batch size is 1, using `1 * A100` GPU. The total input tokens (including the `1600` image tokens) also amount to 2048 tokens. In the above setup, SFT fine-tuning requires at least `60GB` of GPU memory per GPU (with 8 GPUs), while LoRA fine-tuning requires at least `70GB` of GPU memory on a single GPU (which cannot be split).
- Ascend devices have not been tested for SFT fine-tuning. We have only tested them on the `Atlas800` training server cluster, and you need to modify the inference code accordingly based on the loading mechanism described in the Ascend device download link.
- The online demo link does not support controlling computers; it only allows you to view the model's inference results. We recommend deploying the model locally.
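As a rough sanity check on the figures above, the weight footprint of a ~9B-parameter model can be estimated from bytes per parameter. The helper below is an illustrative back-of-the-envelope calculation, not part of this codebase; the measured totals above are higher because they also include the vision tower, activations, and the KV cache:

```python
def weight_memory_gb(n_params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1e9 params * bytes/param ~= GB)."""
    return n_params_billion * bytes_per_param

# Rough weight-only footprints for a ~9B-parameter language backbone:
bf16_gb = weight_memory_gb(9, 2)    # BF16: 2 bytes/param -> 18.0 GB
int8_gb = weight_memory_gb(9, 1)    # INT8: 1 byte/param  ->  9.0 GB
int4_gb = weight_memory_gb(9, 0.5)  # INT4: 0.5 byte/param -> 4.5 GB
```

These weight-only estimates are consistent with the reported end-to-end requirements (29GB/15GB/8GB) once runtime overhead is added.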
## Model Inputs and Outputs
cogagent-9b-20241220 is an agent-style execution model rather than a conversational model. It does not support continuous dialogue, but it does support a continuous execution history: each step starts a new conversation session, and the past history is provided to the model as part of the prompt. The workflow of CogAgent is illustrated as follows:
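The per-step workflow can be sketched as a control loop: capture a screenshot, rebuild the prompt from scratch with the accumulated history, query the model, execute the returned operation, and append the result to the history for the next step. The sketch below uses purely hypothetical helper names (`capture_screenshot`, `build_prompt`, `call_model`, `execute` are stubs, not functions from this repository):

```python
# Illustrative stubs; a real client would capture the screen, call the model
# server, and drive the mouse/keyboard. All helper names are hypothetical.
def capture_screenshot():
    return b"<png bytes>"

def build_prompt(task, history):
    hist = "".join(f"\n{i}. {act}\t{op}" for i, (op, act) in enumerate(history))
    return f"Task: {task}\nHistory steps: {hist}"

def call_model(prompt, screenshot):
    return "click the search box", "CLICK(box=[[0,0,1,1]])"

def execute(grounded_op):
    pass  # would dispatch the grounded operation to the OS

def run_agent(task, max_steps=3):
    """Each step opens a *fresh* session: the prompt is rebuilt from scratch
    with the accumulated history, since the model keeps no dialogue state."""
    history = []  # (grounded_op, action) pairs from previous steps
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        prompt = build_prompt(task, history)  # new session every step
        action, op = call_model(prompt, screenshot)
        execute(op)
        history.append((op, action))  # replayed in the next prompt
    return history

steps = run_agent("open an issue on GitHub")
```

The key design point is that state lives entirely in the replayed history string, not in the model.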
To achieve optimal GUI Agent performance, we have adopted a strict input-output format. Below is how users should format their inputs and feed them to the model, and how to interpret the model’s responses.
### User Input
You can refer to app/client.py#L115 for constructing user input prompts. A minimal example of user input concatenation code is shown below:
```python
current_platform = identify_os()  # "Mac" or "WIN" or "Mobile". Pay attention to case sensitivity.
platform_str = f"(Platform: {current_platform})\n"
format_str = "(Answer in Action-Operation-Sensitive format.)\n"  # You can replace "Action-Operation-Sensitive" with other supported formats

# Replay the execution history (pairs of past grounded operations and action
# descriptions). The source snippet is truncated here, so the loop body and the
# final concatenation below are reconstructed; see app/client.py for the
# authoritative version.
history_str = "\nHistory steps: "
for index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):
    history_str += f"\n{index}. {action}\t{grounded_op_func}"

query = f"Task: {task}{history_str}\n{platform_str}{format_str}"
```
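A reply in the Action-Operation-Sensitive format can then be split into its labeled sections. Below is a minimal parsing sketch, assuming the reply uses `Action:` and `Grounded Operation:` labels; the example response text is illustrative, not real model output (see the official technical blog for the authoritative output grammar):

```python
import re

# Hypothetical example response in the assumed labeled format:
response = (
    "Action: Click the search box at the top of the page.\n"
    "Grounded Operation: CLICK(box=[[100,200,300,240]], element_info='search box')"
)

def parse_response(text):
    """Split a response into its labeled sections (labels are assumptions)."""
    sections = {}
    for label in ("Action", "Grounded Operation"):
        m = re.search(rf"{label}:\s*(.*)", text)
        if m:
            sections[label] = m.group(1).strip()
    return sections

parsed = parse_response(response)
```

Splitting on explicit labels keeps the action description (for the history string) separate from the grounded operation (for execution).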