# CogVLM & CogAgent

A state-of-the-art open-source visual language model | multimodal pretrained model
🌟 Jump to detailed introduction: Introduction to CogVLM, 🆕 Introduction to CogAgent
📔 For more detailed usage information, please refer to: CogVLM & CogAgent's technical documentation (in Chinese)
<table>
  <tr>
    <td>
      <h2> CogVLM </h2>
      <p>📖 Paper: <a href="https://arxiv.org/abs/2311.03079">CogVLM: Visual Expert for Pretrained Language Models</a></p>
      <p><b>CogVLM</b> is a powerful open-source visual language model (VLM). CogVLM-17B has 10 billion visual parameters and 7 billion language parameters, <b>supporting image understanding and multi-turn dialogue with a resolution of 490*490</b>.</p>
      <p><b>CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks</b>, including NoCaps, Flickr30K captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.</p>
    </td>
    <td>
      <h2> CogAgent </h2>
      <p>📖 Paper: <a href="https://arxiv.org/abs/2312.08914">CogAgent: A Visual Language Model for GUI Agents</a></p>
      <p><b>CogAgent</b> is an open-source visual language model improved upon CogVLM. CogAgent-18B has 11 billion visual parameters and 7 billion language parameters, <b>supporting image understanding at a resolution of 1120*1120</b>. <b>On top of the capabilities of CogVLM, it further possesses GUI image Agent capabilities</b>.</p>
      <p><b>CogAgent-18B achieves state-of-the-art generalist performance on 9 classic cross-modal benchmarks</b>, including VQAv2, OK-VQA, TextVQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. <b>It significantly surpasses existing models on GUI operation datasets</b> including AITW and Mind2Web.</p>
    </td>
  </tr>
  <tr>
    <td colspan="2" align="center">
      <p>🌐 Web Demo for CogVLM2: <a href="http://36.103.203.44:7861">this link</a></p>
    </td>
  </tr>
</table>
## Release

- 🔥🔥🔥 **News**: `2024/5/20`: We released the next-generation model, **CogVLM2**, which is based on llama3-8b and is on par with (or better than) GPT-4V in most cases! Download and try it!
- 🔥🔥 **News**: `2024/4/5`: CogAgent was selected as a CVPR 2024 Highlight!
- 🔥 **News**: `2023/12/26`: We have released the CogVLM-SFT-311K dataset, which contains over 150,000 samples used in the training of CogVLM v1.0 only. You are welcome to follow and use it.
- **News**: `2023/12/18`: New Web UI launched! We have launched a new web UI based on Streamlit; users can painlessly talk to CogVLM and CogAgent in our UI for a better user experience.
- **News**: `2023/12/15`: CogAgent officially launched! CogAgent is an image understanding model developed on top of CogVLM. It features visual GUI Agent capabilities and further enhancements in image understanding. It supports image input at a resolution of 1120*1120, and possesses multiple abilities including multi-turn dialogue with images, GUI Agent, grounding, and more.
- **News**: `2023/12/8`: We have updated the checkpoint of cogvlm-grounding-generalist to cogvlm-grounding-generalist-v1.1, trained with image augmentation and therefore more robust. See details.
- **News**: `2023/12/7`: CogVLM now supports 4-bit quantization! You can run inference with just 11 GB of GPU memory!
- **News**: `2023/11/20`: We have updated the checkpoint of cogvlm-chat to cogvlm-chat-v1.1, unified the versions of chat and VQA, and refreshed the SOTA on various datasets. See details.
- **News**: `2023/11/20`: We released cogvlm-chat, cogvlm-grounding-generalist/base, and cogvlm-base-490/224 on 🤗 Hugging Face. You can now run inference with transformers in a few lines of code!
- **News**: `2023/10/27`: The CogVLM bilingual version is available online! Welcome to try it out!
- **News**: `2023/10/5`: CogVLM-17B released.
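The 11 GB figure quoted for 4-bit inference follows from back-of-envelope arithmetic: 17 billion parameters at 4 bits each occupy about 8.5 GB of weights, with activations, KV cache, and CUDA context adding the rest. A rough sketch (the flat 2.5 GB overhead term is an assumption for illustration, not a measured value):

```python
def estimated_vram_gb(n_params: float, bits_per_param: int, overhead_gb: float = 2.5) -> float:
    """Back-of-envelope VRAM estimate: weight bytes plus a flat
    overhead term (assumed) for activations, KV cache, and CUDA context."""
    weight_gb = n_params * bits_per_param / 8 / 1e9
    return weight_gb + overhead_gb

# CogVLM-17B at different precisions
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{estimated_vram_gb(17e9, bits):.1f} GB")
```

With these assumptions, 4-bit comes out to roughly 11 GB, consistent with the figure above; real usage also depends on sequence length and image resolution.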
## Get Started

### Option 1: Inference Using the Web Demo

- Click here to enter the CogVLM2 Demo.

If you need to use the Agent and Grounding functions, please refer to Cookbook - Task Prompts.

### Option 2: Deploy CogVLM / CogAgent by Yourself
We support two GUIs for model inference: a CLI and a web demo. If you want to use the models in your own Python code, it is easy to modify the CLI scripts for your use case.

First, install the dependencies.
```shell
# CUDA >= 11.8
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
All code for inference is located under the `basic_demo/` directory. Please switch to this directory first before proceeding with further operations.
#### Situation 2.1 CLI (SAT version)
Run the CLI demo via:

```shell
# CogAgent
python cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogagent-vqa --version chat_old --bf16 --stream_chat

# CogVLM
python cli_demo_sat.py --from_pretrained cogvlm-chat --version chat_old --bf16 --stream_chat
python cli_demo_sat.py --from_pretrained cogvlm-grounding-generalist --version base --bf16 --stream_chat
```
The program will automatically download the SAT model and start an interactive session in the command line. You can generate replies by entering instructions and pressing Enter. Enter `clear` to clear the conversation history and `stop` to stop the program.
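The `clear` / `stop` handling amounts to a small read-eval loop. A hypothetical sketch of that control flow (the function and names are illustrative, not the actual `cli_demo_sat.py` code):

```python
def run_cli(inputs, generate):
    """Minimal sketch of the CLI control flow: 'stop' ends the loop,
    'clear' resets the dialogue history, anything else is a query."""
    history = []
    replies = []
    for line in inputs:
        if line.strip() == "stop":
            break
        if line.strip() == "clear":
            history.clear()
            continue
        reply = generate(line, history)   # stand-in for model inference
        history.append((line, reply))
        replies.append(reply)
    return replies, history

# Example with a dummy generator that echoes the query
replies, history = run_cli(
    ["hello", "clear", "describe the image", "stop", "ignored"],
    generate=lambda q, h: f"echo: {q}",
)
print(replies)        # ['echo: hello', 'echo: describe the image']
print(len(history))   # 1 (history was cleared after the first turn)
```

The real demo streams tokens as they are generated, but the turn-level bookkeeping follows the same pattern.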
We also support model-parallel inference, which splits the model across multiple (2/4/8) GPUs. `--nproc-per-node=[n]` in the following command controls the number of GPUs used:

```shell
torchrun --standalone --nnodes=1 --nproc-per-node=2 cli_demo_sat.py --from_pretrained cogagent-chat --version chat --bf16
```
- If you want to manually download the weights, you can replace the path after `--from_pretrained` with the local model path.

- Our model supports SAT's 4-bit and 8-bit quantization. You can change `--bf16` to `--fp16`, or `--fp16 --quant 4`, or `--fp16 --quant 8`. For example:

  ```shell
  python cli_demo_sat.py --from_pretrained cogagent-chat --fp16 --quant 8 --stream_chat
  python cli_demo_sat.py --from_pretrained cogvlm-chat-v1.1 --fp16 --quant 4 --stream_chat
  # In the SAT version, --quant should be used with --fp16
  ```

- The program provides the following hyperparameters to control the generation process:

  ```
  usage: cli_demo_sat.py [-h] [--max_length MAX_LENGTH] [--top_p TOP_P] [--top_k TOP_K] [--temperature TEMPERATURE]

  optional arguments:
    -h, --help            show this help message and exit
    --max_length MAX_LENGTH
                          max length of the total sequence
    --top_p TOP_P         top p for nucleus sampling
    --top_k TOP_K         top k for top k sampling
    --temperature TEMPERATURE
                          temperature for sampling
  ```

- Click here to view the correspondence between different models and the `--version` parameter.
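The generation flags above correspond to a standard `argparse` definition. A sketch of how such flags are typically wired (the default values here are illustrative assumptions, not the repository's actual defaults):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Generation hyperparameters as listed in the usage text.
    Default values are illustrative, not the repo's actual defaults."""
    parser = argparse.ArgumentParser(prog="cli_demo_sat.py")
    parser.add_argument("--max_length", type=int, default=2048,
                        help="max length of the total sequence")
    parser.add_argument("--top_p", type=float, default=0.4,
                        help="top p for nucleus sampling")
    parser.add_argument("--top_k", type=int, default=1,
                        help="top k for top k sampling")
    parser.add_argument("--temperature", type=float, default=0.8,
                        help="temperature for sampling")
    return parser

args = build_parser().parse_args(["--top_p", "0.9", "--temperature", "0.2"])
print(args.top_p, args.temperature)  # 0.9 0.2
```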
#### Situation 2.2 CLI (Huggingface version)
Run the CLI demo via:

```shell
# CogAgent
python cli_demo_hf.py --from_pretrained THUDM/cogagent-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogagent-vqa-hf --bf16

# CogVLM
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --bf16
python cli_demo_hf.py --from_pretrained THUDM/cogvlm-grounding-generalist-hf --bf16
```
- If you want to manually download the weights, you can replace the path after `--from_pretrained` with the local model path.

- You can change `--bf16` to `--fp16`, or `--quant 4`. For example, our model supports Huggingface's 4-bit quantization:

  ```shell
  python cli_demo_hf.py --from_pretrained THUDM/cogvlm-chat-hf --quant 4
  ```
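Internally, precision flags like these translate into loading arguments for `from_pretrained`. A hypothetical helper sketching that mapping (the kwarg names `torch_dtype`, `load_in_4bit`, `load_in_8bit`, and `trust_remote_code` follow the Hugging Face API; the helper itself is not part of this repository, and dtypes are strings here to keep the sketch torch-free):

```python
from typing import Optional

def loading_kwargs(bf16: bool = False, fp16: bool = False,
                   quant: Optional[int] = None) -> dict:
    """Map the demo's precision flags onto from_pretrained-style kwargs.
    Real code would pass torch.bfloat16 / torch.float16 instead of strings."""
    kwargs = {"trust_remote_code": True, "low_cpu_mem_usage": True}
    if quant == 4:
        kwargs["load_in_4bit"] = True    # Huggingface 4-bit quantization
    elif quant == 8:
        kwargs["load_in_8bit"] = True
    elif bf16:
        kwargs["torch_dtype"] = "bfloat16"
    elif fp16:
        kwargs["torch_dtype"] = "float16"
    return kwargs

print(loading_kwargs(quant=4))
```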
#### Situation 2.3 Web Demo
We also offer a local web demo based on Gradio. First, install Gradio by running `pip install gradio`. Then download and enter this repository and run `web_demo.py`. See the next section for detailed usage:
```shell
python web_demo.py --from_pretrained cogagent-chat --version chat --bf16
python web_demo.py --from_pretrained cogagent-vqa --version chat_old --bf16
python web_demo.py --from_pretrained cogvlm-chat-v1.1 --version chat_old --bf16
python web_demo.py --fr
```
