LAVIS
LAVIS - A One-stop Library for Language-Vision Intelligence
LAVIS - A Library for Language-Vision Intelligence
What's New: 🎉
- [Model Release] November 2023, released implementation of X-InstructBLIP <br>
Paper, Project Page, Website
A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.
- [Model Release] July 2023, released implementation of BLIP-Diffusion <br> Paper, Project Page, Website
A text-to-image generation model that trains 20x faster than DreamBooth. It also supports zero-shot subject-driven generation and editing.
- [Model Release] May 2023, released implementation of InstructBLIP <br> Paper, Project Page
A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.
- [Model Release] Jan 2023, released implementation of BLIP-2 <br>
Paper, Project Page
A generic and efficient pre-training strategy that builds on existing pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and establishes a new state of the art on zero-shot captioning (121.6 CIDEr on NoCaps vs the previous best of 113.2). Equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks new zero-shot instructed vision-to-language generation capabilities for a variety of applications!
- Jan 2023, LAVIS is now available on PyPI for installation!
- [Model Release] Dec 2022, released implementation of Img2LLM-VQA (CVPR 2023, "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models", by Jiaxian Guo et al.) <br>
Paper, Project Page
A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3) while requiring no end-to-end training!
- [Model Release] Oct 2022, released implementation of PNP-VQA (EMNLP Findings 2022, "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", by Anthony T.M.H. et al.) <br>
Paper, Project Page
A modular zero-shot VQA framework that requires no training of the pretrained language models (PLMs) and achieves SoTA zero-shot VQA performance.
Technical Report and Citing LAVIS
You can find more details in our technical report.
If you're using LAVIS in your research or applications, please cite it using this BibTeX:
@inproceedings{li-etal-2023-lavis,
title = "{LAVIS}: A One-stop Library for Language-Vision Intelligence",
author = "Li, Dongxu and
Li, Junnan and
Le, Hung and
Wang, Guangsen and
Savarese, Silvio and
Hoi, Steven C.H.",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-demo.3",
pages = "31--41",
abstract = "We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.",
}
Table of Contents
- Introduction
- Installation
- Getting Started
- Jupyter Notebook Examples
- Resources and Tools
- Documentations
- Ethical and Responsible Use
- Technical Report and Citing LAVIS
- License
Introduction
LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. It aims to give engineers and researchers a one-stop solution for rapidly developing models for their specific multimodal scenarios and benchmarking them across standard and customized datasets. It features a unified interface design to access:
- 10+ tasks (retrieval, captioning, visual question answering, multimodal classification, etc.);
- 20+ datasets (COCO, Flickr, NoCaps, Conceptual Captions, SBU, etc.);
- 30+ pretrained weights of state-of-the-art foundation language-vision models and their task-specific adaptations, including ALBEF, BLIP, ALPRO, CLIP.
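As a rough illustration of what the unified interface looks like in practice, the sketch below captions a single image with a pretrained BLIP model. It assumes LAVIS, PyTorch and Pillow are installed; `load_model_and_preprocess` and the `blip_caption` / `base_coco` identifiers follow LAVIS's published usage examples rather than anything stated in this section, and the wrapper name `caption_image` is hypothetical.

```python
def caption_image(image_path, device="cpu"):
    """Return a list of captions for one image using a pretrained BLIP model."""
    # Imports are deferred so this function can be defined even when
    # LAVIS and Pillow are not installed.
    from PIL import Image
    from lavis.models import load_model_and_preprocess

    # `name` selects the architecture, `model_type` the checkpoint.
    # Both values are assumptions taken from LAVIS's usage examples.
    model, vis_processors, _ = load_model_and_preprocess(
        name="blip_caption",
        model_type="base_coco",
        is_eval=True,
        device=device,
    )

    # Preprocess the image with the evaluation-time visual processor
    # and add a batch dimension.
    raw_image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

    # The model expects a dict of inputs keyed by modality.
    return model.generate({"image": image})
```

Calling `caption_image("path/to/image.jpg")` downloads the checkpoint on first use, so the first call may take a while. Swapping `name` and `model_type` for another entry in the model zoo is meant to leave the rest of the code unchanged.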
Key features of LAVIS include:
- Unified and Modular Interface: makes it easy to leverage and repurpose existing modules (datasets, models, preprocessors) and to add new ones.
- Easy Off-the-shelf Inference and Feature Extraction: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.
- Reproducible Model Zoo and Training Recipes: easily replicate and extend state-of-the-art models on existing and new tasks.
- Dataset Zoo and Automatic Downloading Tools: preparing the many language-vision datasets can be a hassle. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.
The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.
| Tasks | Supported Models | Supported Datasets |
| :--------------------------------------: | :----------------------: | :----------------------------------------: |
| Image-text Pre-training | ALBEF, BLIP | COCO, VisualGenome, SBU ConceptualCaptions |
| Image-text Retrieval | ALBEF,
