CogVLM & CogAgent

🌟 Jump to detailed introduction: Introduction to CogVLM, 🆕 Introduction to CogAgent

📔 For more detailed usage information, please refer to: CogVLM & CogAgent's technical documentation (in Chinese)

<table>
<tr>
<td>
<h2> CogVLM </h2>
📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a>
CogVLM is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, supporting image understanding and multi-turn dialogue at a resolution of 490*490. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC.
</td>
<td>
<h2> CogAgent </h2>
📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents</a>
CogAgent is an open-source visual language model built on CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, supporting image understanding at a resolution of 1120*1120. On top of the capabilities of CogVLM, it also has GUI image Agent capabilities. CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. It significantly surpasses existing models on GUI operation datasets including AITW and Mind2Web.
</td>
</tr>
<tr>
<td colspan="2" align="center">
🌐 Web Demo for CogVLM2: <a href="http://36.103.203.44:7861">this link</a>
</td>
</tr>
</table>

Release

🔥🔥🔥 News: 2024/5/20: We released the next-generation model, CogVLM2, which is based on llama3-8b and is on par with (or better than) GPT-4V in most cases! DOWNLOAD and TRY!
🔥🔥 News: 2024/4/5: CogAgent was selected as a CVPR 2024 Highlight!
🔥 News: 2023/12/26: We have released the CogVLM-SFT-311K dataset, which contains over 150,000 pieces of data that were used only for training CogVLM v1.0. You are welcome to follow and use it.
News: 2023/12/18: New Web UI Launched! We have launched a new web UI based on Streamlit, in which users can painlessly talk to CogVLM and CogAgent for a better user experience.
News: 2023/12/15: CogAgent Officially Launched! CogAgent is an image understanding model developed based on CogVLM. It features visual-based GUI Agent capabilities and has further enhancements in image understanding. It supports image input with a resolution of 1120*1120, and possesses multiple abilities including multi-turn dialogue with images, GUI Agent, Grounding, and more.
News: 2023/12/8: We have updated the checkpoint of cogvlm-grounding-generalist to cogvlm-grounding-generalist-v1.1, which was trained with image augmentation and is therefore more robust. See details.
News: 2023/12/7: CogVLM supports 4-bit quantization now! You can run inference with just 11GB of GPU memory!
News: 2023/11/20: We have updated the checkpoint of cogvlm-chat to cogvlm-chat-v1.1, unified the versions of chat and VQA, and refreshed the SOTA on various datasets. See details.
News: 2023/11/20: We released cogvlm-chat, cogvlm-grounding-generalist/base, and cogvlm-base-490/224 on 🤗Huggingface. You can now run inference with transformers in a few lines of code!
2023/10/27: CogVLM bilingual version is available online! Welcome to try it out!
2023/10/5: CogVLM-17B released.

Get Started

Option 1: Inference Using Web Demo.

Click here to enter the CogVLM2 Demo.

If you need to use Agent and Grounding functions, please refer to Cookbook - Task Prompts

Option 2: Deploy CogVLM / CogAgent by yourself

We support two interfaces for model inference: CLI and web demo. If you want to use the model in your Python code, it is easy to modify the CLI scripts for your case.

First, we need to install the dependencies.

# CUDA >= 11.8
pip install -r requirements.txt
python -m spacy download en_core_web_sm

All code for inference is located under the basic_demo/ directory. Please switch to this directory first before proceeding with further operations.

Situation 2.1 CLI (SAT version)

Run CLI demo via:

# CogAgent
python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16  --stream_chat
python cli_demo_sat.py --from_pretrained cogagent-vqa --version chat_old --bf16  --stream_chat

# CogVLM
python cli_demo_sat.py --from_pretrained cogvlm-chat --version chat_old --bf16  --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-grounding-generalist --version base --bf16  --stream_chat

The program will automatically download the sat model and interact in the command line. You can generate replies by entering instructions and pressing enter. Enter clear to clear the conversation history and stop to stop the program.
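
The interaction described above can be sketched as a small loop. This is a simplified illustration, not the code in cli_demo_sat.py: the `generate` callback, the prompt strings, and the history format are all assumptions, and image input is omitted.

```python
def chat_loop(generate, read=input, write=print):
    """Minimal command-line chat loop with `clear` and `stop` commands.

    `generate(query, history)` is a hypothetical callback standing in
    for the model; it is not part of the repository's API.
    """
    history = []  # list of (query, reply) pairs
    while True:
        query = read("USER: ").strip()
        if query == "stop":
            # End the program.
            break
        if query == "clear":
            # Forget all previous turns and start a fresh conversation.
            history.clear()
            write("History cleared.")
            continue
        reply = generate(query, history)
        history.append((query, reply))
        write(f"ASSISTANT: {reply}")
    return history


# Non-interactive demo: scripted inputs and an echo "model" (both hypothetical).
scripted = iter(["hi", "clear", "hello", "stop"])
final_history = chat_loop(
    generate=lambda query, history: f"echo: {query}",
    read=lambda prompt: next(scripted),
)
```

Passing `read` and `write` as parameters keeps the loop testable without a terminal; with the defaults it reads from standard input.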

We also support model-parallel inference, which splits the model across multiple (2/4/8) GPUs. In the following command, --nproc-per-node=[n] controls the number of GPUs used.

torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16
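
If you launch the demo from a script, a small helper can assemble the same command for a given GPU count. The sketch below is illustrative only: `build_torchrun_command` is a hypothetical helper, and the check against 2/4/8 simply mirrors the GPU counts listed above rather than a limit verified against the code.

```python
import shlex

# GPU counts mentioned in this README for model-parallel inference.
ALLOWED_GPU_COUNTS = (2, 4, 8)


def build_torchrun_command(n_gpus, model="cogagent-chat", version="chat"):
    """Return the torchrun argument list for model-parallel CLI inference."""
    if n_gpus not in ALLOWED_GPU_COUNTS:
        raise ValueError(f"n_gpus must be one of {ALLOWED_GPU_COUNTS}, got {n_gpus}")
    return [
        "torchrun", "--standalone", "--nnodes=1",
        f"--nproc-per-node={n_gpus}",
        "cli_demo_sat.py",
        "--from_pretrained", model,
        "--version", version,
        "--bf16",
    ]


# Print the command as a single shell-quoted line.
print(shlex.join(build_torchrun_command(4)))
```

The returned list can be passed directly to `subprocess.run` from the basic_demo/ directory.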

If you want to manually download the weights, you can replace the path after --from_pretrained with the model path.

Our model supports SAT's 4-bit quantization and 8-bit quantization. You can change --bf16 to --fp16, or --fp16 --quant 4, or --fp16 --quant 8.

For example

python cli_demo_sat.py --from_pretrained cogagent-chat --fp16 --quant 8 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-chat-v1.1 --fp16 --quant 4 --stream_chat
# In the SAT version, --quant should be used with --fp16
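
The flag combinations above can be captured in a small validation helper. This is a sketch under the rules stated in this section (bf16, fp16, or fp16 with 4-bit or 8-bit quantization); `sat_precision_flags` is a hypothetical name and is not part of the repository.

```python
def sat_precision_flags(precision="bf16", quant=None):
    """Return the precision/quantization flags for cli_demo_sat.py.

    Encodes the rule from this README: in the SAT version,
    --quant should be used together with --fp16.
    """
    if precision not in ("bf16", "fp16"):
        raise ValueError("precision must be 'bf16' or 'fp16'")
    if quant is None:
        return [f"--{precision}"]
    if quant not in (4, 8):
        raise ValueError("quant must be 4 or 8")
    if precision != "fp16":
        raise ValueError("--quant should be used with --fp16")
    return ["--fp16", "--quant", str(quant)]


print(sat_precision_flags("fp16", quant=8))
```

Checking the combination before launching avoids starting a model download with flags that the SAT version is not documented to support.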

The program provides the following hyperparameters to control the generation process:

usage: cli_demo_sat.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K] [--temperature TEMPERATURE]

optional arguments:
-h, --help            show this help message and exit
--max_length MAX_LENGTH
                        max length of the total sequence
--top_p TOP_P         top p for nucleus sampling
--top_k TOP_K         top k for top k sampling
--temperature TEMPERATURE
                        temperature for sampling
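
To make the roles of --temperature, --top_k and --top_p concrete, the sketch below applies them to a toy list of logits in pure Python. It is a generic illustration of these sampling techniques, not the repository's implementation: the function name, the order of the steps, and the convention that `top_k=0` disables top-k filtering are assumptions. --max_length is not covered here.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p.

    Steps: scale logits by 1/temperature, keep the top_k most likely
    tokens (0 keeps all), then keep the smallest prefix whose cumulative
    probability reaches top_p, and renormalise what is left.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]

    # Token ids ranked from most to least likely.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Nucleus (top-p) filtering over the tokens that survived top-k.
    total = sum(weights[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    mass = sum(weights[i] for i in kept)
    return {i: weights[i] / mass for i in kept}


dist = filter_distribution([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
print(dist)
```

A token would then be drawn from the returned distribution, for example with `random.choices`. Lower temperatures concentrate probability on the top tokens, while smaller `top_k` or `top_p` values shrink the candidate set.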

Click here to view the correspondence between different models and the --version parameter.

Situation 2.2 CLI (Huggingface version)

Run CLI demo via:

# CogAgent
python cli_demo_hf.py --from_pretrained THUDM/cogagent-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogagent-vqa-hf --bf16

# CogVLM
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-grounding-generalist-hf --bf16

If you want to manually download the weights, you can replace the path after --from_pretrained with the model path.
You can change --bf16 to --fp16, or --quant 4. For example, our model supports Huggingface's 4-bit quantization:
```
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --quant 4
```
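
As a rough guide to why 4-bit quantization lowers the memory requirement, the arithmetic below estimates the memory taken by the weights alone at different precisions. It is a back-of-the-envelope sketch that assumes all 17 billion parameters of CogVLM-17B are stored at the same precision; real usage is higher because of activations, any layers left unquantized, and framework overhead.

```python
def weight_memory_gb(n_params_billion, bits_per_param):
    """Approximate weight memory in GB (10^9 bytes), ignoring all overhead."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# CogVLM-17B: 10B visual + 7B language parameters, as stated in this README.
for label, bits in (("bf16/fp16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: {weight_memory_gb(17, bits):.1f} GB")
```

Under these assumptions the 4-bit weights come to about 8.5 GB, which is consistent with the 11GB figure quoted in the news section once runtime overhead is added.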

Situation 2.3 Web Demo

We also offer a local web demo based on Gradio. First, install Gradio by running: pip install gradio. Then download and enter this repository and run web_demo.py. See the next section for detailed usage:

python web_demo.py --from_pretrained cogagent-chat --version chat --bf16
python web_demo.py --from_pretrained cogagent-vqa --version chat_old --bf16
python web_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat_old --bf16
python web_demo.py --fr
