
LLMGA

This project is the official implementation of 'LLMGA: Multimodal Large Language Model-based Generation Assistant' (ECCV2024 Oral).

Install / Use

/learn @JIA-Lab-research/LLMGA
About this skill

Quality Score: 0/100
Category: Design
Supported Platforms: Universal
README

<p align="center" width="10%"> <img src="imgs/logo.png" style="width: 30%" align=center> </p>

LLMGA: Multimodal Large Language Model-based Generation Assistant (ECCV2024 Oral)

Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, and Jiaya Jia

<a href="https://llmga.github.io/"><img src="https://img.shields.io/badge/Project-Page-Green"></a> <a href="https://arxiv.org/pdf/2311.16500.pdf"><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/binxia'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/datasets/binxia/LLMGA-datasetv2/tree/main'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>

News

New Version (Accepted by ECCV2024):

  • [x] [2024.07.06] The finetuned SD15 models have been released, including SD15-T2I and SD15-inpainting. Notably, our SD15-T2I model can also be used for the instruction-based editing of LLMGA. (A usage sketch for the inpainting model follows this list.)
  • [x] [2024.07.06] The finetuned SDXL models have been released, including SDXL-T2I and SDXL-inpainting.
  • [x] [2024.07.06] The pre-trained models that further support Chinese (obtained by further fine-tuning on mixed Chinese and English data) have been released, including llmga-cn-vicuna 7b, llmga-cn-llama3 8b, llmga-cn-gemma 2b, and llmga-cn-qwen2 0.5b.
  • [x] [2024.07.06] We have released the new version of LLMGA's training datasets, including texts and images.
  • [x] [2024.07.05] The pre-trained models have been released, including llmga-vicuna 7b, llmga-mistral 7b, llmga-llama3 8b, llmga-qwen2 0.5b, llmga-qwen2 1.5b, llmga-qwen2 7b, llmga-phi3 3b, and llmga-gemma 2b.
  • [x] [2024.07.05] The code has been updated.
  • [x] [2024.07.04] I am organizing and uploading the new version of the LLMGA code, dataset, and models. I will post a status update once this process is complete; please allow a few days. Notably, in this new version we build LLMGA on different base LLMs, such as Llama2 7b, Mistral 7b, Llama3 8b, Qwen2 0.5b, Qwen2 1.5b, Qwen2 7b, Phi3 3b, and Gemma 2b. They differ in performance, model size, and commercial licensing, so there is always one that fits your usage scenario.
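
As a rough illustration of how one of these finetuned checkpoints would be driven once downloaded, here is a minimal sketch using the standard diffusers inpainting pipeline; the Hugging Face repo ID and file names are assumptions for illustration, not confirmed release names:

# Hedged sketch: running a finetuned SD15 inpainting checkpoint through the
# standard diffusers pipeline. The repo ID below is a placeholder; check
# https://huggingface.co/binxia for the actual model names.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "binxia/llmga-sd15-inpainting",  # placeholder repo ID
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("scene.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("L").resize((512, 512))  # white = region to repaint

result = pipe(
    prompt="a wooden bench under a blossoming tree",
    image=init_image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("inpainted.png")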

Old Version:

  • [x] [2023.12.20] We release LLMGA's [training datasets].
  • [x] [2023.12.20] We release the gradio codes of LLMGA7b-SDXL-T2I.
  • [x] [2023.12.08] We release LLMGA7b-SDXL-T2I [demo].
  • [x] [2023.11.30] We have released the code for DiffRIR. It can effectively eliminate differences in brightness, contrast, and texture between generated and preserved regions in inpainting and outpainting. Considering its applicability to projects beyond LLMGA, we have open-sourced it on GitHub.
  • [x] [2023.11.29] The models are released at [Huggingface].
  • [x] [2023.11.29] The training and inference code is released.
  • [x] [2023.11.29] We will upload all models, code, and data within a week and further refine this project.
  • [x] [2023.11.28] GitHub repo is created.

Abstract: In this paper, we introduce a Multimodal Large Language Model-based Generation Assistant (LLMGA), leveraging the vast reservoir of knowledge and proficiency in reasoning, comprehension, and response inherent in Large Language Models (LLMs) to assist users in image generation and editing. Diverging from existing approaches where Multimodal Large Language Models (MLLMs) generate fixed-size embeddings to control Stable Diffusion (SD), our LLMGA provides a detailed language generation prompt for precise control over SD. This not only augments LLM context understanding but also reduces noise in generation prompts, yields images with more intricate and precise content, and elevates the interpretability of the network. To this end, we curate a comprehensive dataset comprising prompt refinement, similar image generation, inpainting & outpainting, and instruction-based editing. Moreover, we propose a two-stage training scheme. In the first stage, we train the MLLM to grasp the properties of image generation and editing, enabling it to generate detailed prompts. In the second stage, we optimize SD to align with the MLLM's generation prompts. Additionally, we propose a reference-based restoration network to alleviate texture, brightness, and contrast disparities between generated and preserved regions during inpainting and outpainting. Extensive results show that LLMGA has promising generation and editing capabilities and can enable more flexible and expansive applications in an interactive manner.
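
The central mechanism — the MLLM emitting a detailed language prompt that then drives SD — can be sketched with off-the-shelf diffusers components. A minimal sketch, assuming a generic SD15 checkpoint; the refine_prompt stub merely stands in for the finetuned MLLM:

# Minimal sketch of the LLMGA control flow: an (M)LLM expands a terse user
# request into a detailed generation prompt, which then drives a standard
# Stable Diffusion pipeline. refine_prompt() is an illustrative stub, not
# the released LLMGA model.
import torch
from diffusers import StableDiffusionPipeline

def refine_prompt(user_request: str) -> str:
    # In LLMGA, the finetuned MLLM produces this detailed prompt.
    return (
        f"{user_request}, highly detailed, coherent composition, "
        "natural lighting, sharp focus"
    )

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any SD15-compatible checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(refine_prompt("a logo for a coffee shop"), num_inference_steps=30).images[0]
image.save("llmga_sketch.png")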


Why do you need LLMGA?

  • [x] Generation Assistant. As a unified system, LLMGA can generate and edit images through methods such as Text-to-Image (T2I), inpainting, outpainting, and instruction-based editing via conversational interaction with users. By leveraging LLMGA's extensive knowledge and understanding of image design, users can easily produce and revise images until they are highly satisfied.
  • [x] Design Expert. LLMGA incorporates an extensive array of image design data, offering deep insights for a wide range of design tasks, including logo creation, game character design, poster design, T-shirt design, infographic design, and more.
  • [x] Illustration Generation. LLMGA can interactively generate story illustrations based on user-input story snippets.
  • [x] Picture Book Generation. With a single user's instruction, LLMGA can generate an interwoven storybook of text and illustrations.
  • [x] Multilingual Support. Through the multilingual adaptation of LLMGA, the T2I and editing models can generate content from Chinese-language instructions.
  • [x] Flexible Expansion. LLMGA offers enhanced flexibility by integrating with external plugins like ControlNet, enabling a wider range of functionalities.
  • [x] To be continued ......
<div align=center> <img width="100%" src="imgs/github_poster1.png"/> </div> <div align=center> <img width="100%" src="imgs/demo1.png"/> </div> <div align=center> <img width="100%" src="imgs/demo2.png"/> </div>

Contents

  • TODO
  • Install
  • Model
  • Preparation

TODO

  • [x] Support gradio demo.
  • [ ] Support more generation models.

Install

Please follow the instructions below to install the required packages.

  1. Clone this repository
git clone https://github.com/dvlab-research/LLMGA.git
  2. Install packages
conda create -n llmga python=3.9 -y
conda activate llmga
cd LLMGA
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
cd ./llmga/diffusers
pip install .
  3. Install additional packages for training
pip install -e ".[train]"
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install datasets
pip install albumentations
pip install ninja
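
After these steps, a quick sanity check can confirm the environment before moving on. A minimal sketch, assuming the install pulled in transformers as a dependency; it only verifies imports and GPU visibility:

# Environment sanity check: verifies core packages import and a CUDA device
# is visible. Purely illustrative; adjust to your setup.
import torch
import transformers
import diffusers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("diffusers:", diffusers.__version__)
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))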

Model

<div align=center> <img width="100%" src="imgs/method.png"/> </div>

Preparation

Training Dataset

We provide the datasets for LLMGA training.

Please download the LLMGA datasets and the LLaVA pretrain datasets.

In addition, download the LLaVA-1.5 instruction-tuning annotations llava_v1_5_mix665k.json, together with the images from its constituent datasets.
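
For the annotation file, a hedged download sketch using huggingface_hub is shown below; the repo ID is our assumption about where the file is hosted, so verify it on the Hub before relying on it:

# Hedged sketch: fetching the LLaVA instruction-tuning annotations with
# huggingface_hub. The repo ID is an assumption; verify it on the Hub first.
from huggingface_hub import hf_hub_download

json_path = hf_hub_download(
    repo_id="liuhaotian/LLaVA-Instruct-150K",  # assumed host repo
    filename="llava_v1_5_mix665k.json",
    repo_type="dataset",
    local_dir="./data",
)
print("saved to:", json_path)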

Please organize the downloaded data as described in Structure.

The MLP Projector Pretrained Weights

We recommend that users download the pretrained MLP projector weights and put them in ./checkpoints following Structure.
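
A small sketch to confirm the projector weights are in place before training; the filename is hypothetical, so substitute whatever the downloaded weights are actually called:

# Hedged sketch: verify the pretrained MLP projector weights landed in
# ./checkpoints. The filename is hypothetical; adjust to the actual download.
import os
import torch

ckpt_path = "./checkpoints/mm_projector.bin"  # hypothetical filename
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    print(f"loaded {len(state)} entries from {ckpt_path}")
else:
    print(f"missing: {ckpt_path} -- download the projector weights first")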

View on GitHub

GitHub Stars: 398
Forks: 25
Updated: 10d ago

Languages

Python

Security Score

100/100 (audited on Mar 20, 2026; no findings)