CogVLM2
GPT4V-level open-source multi-modal model based on Llama3-8B
CogVLM2 & CogVLM2-Video
<div align="center"> <img src=resources/logo.svg width="40%"/> </div> <p align="center"> 👋 Join our <a href="resources/WECHAT.md" target="_blank">Wechat</a> · 💡Try CogVLM2 <a href="http://cogvlm2-online.cogviewai.cn:7861/" target="_blank">Online</a> 💡Try CogVLM2-Video <a href="http://cogvlm2-online.cogviewai.cn:7868/" target="_blank">Online</a> </p> <p align="center"> 📍Experience the larger-scale CogVLM model on the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">ZhipuAI Open Platform</a>. </p>
Recent updates
- 🔥 News: 2024/8/30: The CogVLM2 paper has been published on arXiv.
- 🔥 News: 2024/7/12: We have released the CogVLM2-Video online web demo. You are welcome to try it.
- 🔥 News: 2024/7/8: We released CogVLM2-Video, the video understanding version of the CogVLM2 model. By extracting keyframes, it can interpret continuous images. The model supports videos of up to 1 minute. See more in our blog.
- 🔥 News: 2024/6/8: We released the CogVLM2 TGI weights, a version of the model that can be served with TGI. See the inference code here.
- 🔥 News: 2024/6/5: We released GLM-4V-9B, which uses the same data and training recipes as CogVLM2 but with GLM-9B as the language backbone. We removed the visual experts to reduce the model size to 13B. More details are available in the GLM-4 repo.
- 🔥 News: 2024/5/24: We released the Int4 version of the model, which requires only 16GB of video memory for inference. You can also run an on-the-fly int4 version by passing `--quant 4`.
- 🔥 News: 2024/5/20: We released the next-generation model CogVLM2, which is based on llama3-8b and is on par with (or better than) GPT-4V in most cases. You are welcome to download it!
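As a rough sanity check on the 16GB figure quoted for the Int4 model, the sketch below estimates weight storage at different precisions. It assumes about 19B parameters (inferred from the "19B" suffix in the model names) and counts weights only; activations, the KV cache, quantization scales, and framework overhead are not included. The function name `weight_memory_gib` is hypothetical and is not part of the CogVLM2 codebase.

```python
# Back-of-envelope estimate of weight memory only. Activations, KV cache,
# quantization scales, and framework overhead are NOT included.

# Assumption: ~19B parameters, inferred from the "19B" model name suffix.
PARAMS = 19e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Return approximate weight storage in GiB for the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: ~{weight_memory_gib(PARAMS, precision):.1f} GiB")
```

Under these assumptions the int4 weights come to roughly 9 GiB, which leaves headroom within 16GB for everything the estimate ignores. This is an estimate, not a measurement of the released model.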
Model introduction
We have launched the new-generation CogVLM2 series and open-sourced two models based on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 series offers the following improvements:
- Significant improvements in many benchmarks such as TextVQA, DocVQA.
- Support 8K content length.
- Support image resolution up to 1344 * 1344.
- Provide an open source model version that supports both Chinese and English.
You can see the details of the CogVLM2 family of open source models in the table below:
| Model Name | cogvlm2-llama3-chat-19B | cogvlm2-llama3-chinese-chat-19B | cogvlm2-video-llama3-chat | cogvlm2-video-llama3-base |
|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Base Model | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct |
| Language | English | Chinese, English | English | English |
| Task | Image Understanding, Multi-turn Dialogue Model | Image Understanding, Multi-turn Dialogue Model | Video Understanding, Single-turn Dialogue Model | Video Understanding, Base Model, No Dialogue |
| Model Link | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope | 🤗 Huggingface 🤖 ModelScope |
| Experience Link | 📙 Official Page | 📙 Official Page 🤖 ModelScope | 📙 Official Page 🤖 ModelScope | / |
| Int4 Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | 🤗 Huggingface 🤖 ModelScope 💫 Wise Model | / | / |
| Text Length | 8K | 8K
