FluxText

[TMM] Implementation of "FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing"

Implementation of FLUX-Text

FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing

Rui Lan<sup>1</sup>, Yancheng Bai<sup>1</sup>, Xu Duan<sup>1</sup>, Mingxing Li<sup>1</sup>, Dongyang Jin<sup>1</sup>, Ryan Xu<sup>1</sup>, Dong Nie<sup>2</sup>, Lei Sun<sup>1</sup>, Xiangxiang Chu<sup>1</sup> <br> <sup>1</sup>Alibaba Group <sup>2</sup>University of North Carolina at Chapel Hill

📖 Overview

Motivation: Scene text editing aims to modify or add text in images while keeping the newly generated text faithful and visually coherent with the background. The main challenge is editing multi-line text across diverse text attributes (e.g., fonts, sizes, and styles), languages (e.g., English, Chinese), and visual scenarios (e.g., posters, advertising, gaming).
Contribution: We propose FLUX-Text, a novel text editing framework for editing multi-line text in complex visual scenes. By incorporating a lightweight Condition Injection LoRA module, a regional text perceptual loss, and a two-stage training strategy, we achieve significant improvements on both Chinese and English benchmarks. <img src='assets/method.png'>

News

2025-07-16: 🔥 Updated the ComfyUI node. We have decoupled the FLUX-Text node so that it can be combined with more of the basic nodes. Because node computation differs in ComfyUI, set min_length to 512 in the code if you need more consistent results.

<div align="center"> <table> <tr> <td><img src="assets/comfyui2.png" alt="workflow/FLUX-Text-Basic-Workflow.json" width="400"/></td> </tr> <tr> <td align="center">workflow/FLUX-Text-Basic-Workflow.json</td> </tr> </table> </div>

2025-07-13: 🔥 The training code has been updated. The code now supports multi-scale training.
2025-07-13: 🔥 Updated the low-VRAM version of the Gradio demo, which currently requires 25GB of VRAM to run. We look forward to more efficient, lower-memory solutions from the community.
2025-07-08: 🔥 ComfyUI Node is supported! You can now build a workflow based on FLUX-Text for editing posters. It is worth setting up a workflow that automatically enhances the service information and service scope shown on product images. In addition, using the first and last frames makes it possible to create video data with text effects. Thanks to community work, FLUX-Text has been run on 8GB of VRAM.

<div align="center"> <table> <tr> <td><img src="assets/comfyui.png" alt="workflow/FLUX-Text-Workflow.json" width="400"/></td> </tr> <tr> <td align="center">workflow/FLUX-Text-Workflow.json</td> </tr> </table> </div> <div align="center"> <table> <tr> <td><img src="assets/ori_img1.png" alt="assets/ori_img1.png" width="200"/></td> <td><img src="assets/new_img1.png" alt="assets/new_img1.png" width="200"/></td> <td><img src="assets/ori_img2.png" alt="assets/ori_img2.png" width="200"/></td> <td><img src="assets/new_img2.png" alt="assets/new_img2.png" width="200"/></td> </tr> <tr> <td align="center">original image</td> <td align="center">edited image</td> <td align="center">original image</td> <td align="center">edited image</td> </tr> </table> </div> <div align="center"> <table> <tr> <td><img src="assets/video_end1.png" alt="assets/video_end1.png" width="400"/></td> <td><img src="assets/video1.gif" alt="assets/video1.gif" width="400"/></td> </tr> <tr> <td><img src="assets/video_end2.png" alt="assets/video_end2.png" width="400"/></td> <td><img src="assets/video2.gif" alt="assets/video2.gif" width="400"/></td> </tr> <tr> <td align="center">last frame</td> <td align="center">video</td> </tr> </table> </div>

2025-07-04: 🔥 We have released a Gradio demo! You can now try out FLUX-Text.

<div align="center"> <table> <tr> <td><img src="assets/gradio_1.png" alt="Example 1" width="400"/></td> <td><img src="assets/gradio_2.png" alt="Example 2" width="400"/></td> </tr> <tr> <td align="center">Example 1</td> <td align="center">Example 2</td> </tr> </table> </div>

2025-07-03: 🔥 We have released our pre-trained checkpoints on Hugging Face! You can now try out FLUX-Text with the official weights.
2025-06-26: ⭐️ Inference and evaluation code are released. Once we have ensured that everything is functioning correctly, the new model will be merged into this repository.

Todo List

- [x] Inference code
- [x] Pre-trained weights
- [x] Gradio demo
- [x] ComfyUI
- [x] Training code

🛠️ Installation

We recommend using Python 3.10 and PyTorch with CUDA support. To set up the environment:

# Create a new conda environment
conda create -n flux_text python=3.10
conda activate flux_text

# Install other dependencies
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
pip install Pillow==9.5.0
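
After installation, a quick sanity check can confirm that the environment looks as expected. The following is a minimal sketch, not part of the repository: the helper names `check_environment` and `cuda_available` are hypothetical, and the package list is an assumption based on the imports used in the quick-start example and the commands above.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "flash_attn", "PIL", "yaml", "safetensors")):
    """Report the Python version and whether each package can be found.

    The default package list is an assumption drawn from this README,
    not an official dependency list.
    """
    report = {"python": "%d.%d" % sys.version_info[:2]}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


def cuda_available():
    """Return True/False if torch is installed, or None if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch  # imported lazily so the check works without torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
    print(f"cuda available: {cuda_available()}")
```

If any entry prints `False`, re-running the corresponding `pip install` command above is the first thing to try.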

🤗 Model Introduction

FLUX-Text is an open-source scene text editing model. It can be used for editing posters, emotions, and more. The table below lists the text editing models we currently offer, along with their basic information.

<table style="border-collapse: collapse; width: 100%;"> <tr> <th style="text-align: center;">Model Name</th> <th style="text-align: center;">Image Resolution</th> <th style="text-align: center;">Memory Usage</th> <th style="text-align: center;">English Sen.Acc</th> <th style="text-align: center;">Chinese Sen.Acc</th> <th style="text-align: center;">Download Link</th> </tr> <tr> <th style="text-align: center;">FLUX-Text-512</th> <th style="text-align: center;">512*512</th> <th style="text-align: center;">34G</th> <th style="text-align: center;">0.8419</th> <th style="text-align: center;">0.7132</th> <th style="text-align: center;"><a href="https://huggingface.co/GD-ML/FLUX-Text/tree/main/model_512">🤗 HuggingFace</a></th> </tr> <tr> <th style="text-align: center;">FLUX-Text</th> <th style="text-align: center;">Multi Resolution</th> <th style="text-align: center;">34G for (512*512)</th> <th style="text-align: center;">0.8228</th> <th style="text-align: center;">0.7161</th> <th style="text-align: center;"><a href="https://huggingface.co/GD-ML/FLUX-Text/tree/main/model_multisize">🤗 HuggingFace</a></th> </tr> </table>
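
Both checkpoints live in subfolders of the same Hugging Face repository, so it can be convenient to download only the one you need. The sketch below assumes the `huggingface_hub` package is installed; the repository ID and subfolder names are taken from the download links in the table, while the helper names and the `checkpoints` target directory are hypothetical.

```python
REPO_ID = "GD-ML/FLUX-Text"

# Subfolder names as they appear in the download links of the table above.
SUBFOLDERS = {
    "FLUX-Text-512": "model_512",
    "FLUX-Text": "model_multisize",
}


def checkpoint_patterns(model_name):
    """Return the file patterns that select one checkpoint's subfolder."""
    if model_name not in SUBFOLDERS:
        raise ValueError(f"unknown model: {model_name!r}")
    return [f"{SUBFOLDERS[model_name]}/*"]


def download_checkpoint(model_name, local_dir="checkpoints"):
    """Download a single checkpoint subfolder and return the local path.

    `local_dir` is a placeholder; point it wherever you keep weights.
    """
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=checkpoint_patterns(model_name),
        local_dir=local_dir,
    )
```

Calling `download_checkpoint("FLUX-Text-512")` would fetch only the files under `model_512`, leaving the multi-resolution weights untouched.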

🔥 ComfyUI

<details close> <summary> Installing via GitHub </summary>

First, install and set up ComfyUI, and then follow these steps:

Clone the FluxText repository:

git clone https://github.com/AMAP-ML/FluxText.git

Install FluxText:

cd FluxText && pip install -r requirements.txt

Integrate FluxText Comfy Nodes with ComfyUI:

Symbolic Link (Recommended):

ln -s $(pwd)/ComfyUI-fluxtext path/to/ComfyUI/custom_nodes/

Copy Directory:

cp -r ComfyUI-fluxtext path/to/ComfyUI/custom_nodes/

</details>

🔥 Quick Start

Here's a basic example of using FLUX-Text:

import numpy as np
from PIL import Image
import torch
import yaml

from src.flux.condition import Condition
from src.flux.generate_fill import generate_fill
from src.train.model import OminiModelFIll
from safetensors.torch import load_file

# Set these to your config YAML and LoRA weights file before running.
config_path = ""
lora_path = ""
with open(config_path, "r") as f:
    config = yaml.safe_load(f)

# Build the model wrapper from the values in the config.
model = OminiModelFIll(
    flux_pipe_id=config["flux_path"],
    lora_config=config["train"]["lora_config"],
    device="cuda",
    dtype=getattr(torch, config["dtype"]),
    optimizer_config=config["train"]["optimizer"],
    model_config=config.get("model", {}),
    gradient_checkpointing=True,
    byt5_encoder_config=None,
)

# Load the LoRA weights and rename the keys to match the transformer's parameter names.
state_dict = load_file(lora_path)
state_dict_new = {
    x.replace('lora_A', 'lora_A.default')
     .replace('lora_B', 'lora_B.default')
     .replace('transformer.', ''): v
    for x, v in state_dict.items()
}
model.transformer.load_state_dict(state_dict_new, strict=False)
pipe = model.flux_pipe

prompt = "lepto college of education, the written materials on the picture: LESOTHO , COLLEGE OF , RE BONA LESELI LESEL , EDUCATION ."
# Load the hint, the source image, and the condition image at 512x512.
hint = Image.open("assets/hint.png").resize((512, 512)).convert('RGB')
img = Image.open("assets/hint_imgs.jpg").resize((512, 512))
condition_img = Image.open("assets/hint_imgs_word.png").resize((512, 512)).convert('RGB')
# Scale the hint to [0, 1]; invert the condition image, then scale it to [0, 1].
hint = np.array(hint) / 255
condition_img = np.array(condition_img)
condition_img = (255 - condition_img) / 255
condition_img = [condition_img, hint, img]
position_del
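
The key renaming applied to the LoRA state dict in the example above can be tried in isolation. This is a minimal sketch in plain Python: the helper name `remap_lora_keys` and the example key names are hypothetical, chosen only to show what the chained `str.replace` calls do, and are not taken from the released checkpoint.

```python
def remap_lora_keys(state_dict):
    """Apply the same key renaming as the quick-start snippet.

    Appends ".default" after "lora_A" / "lora_B" and removes every
    occurrence of "transformer." from each key.
    """
    return {
        key.replace("lora_A", "lora_A.default")
           .replace("lora_B", "lora_B.default")
           .replace("transformer.", ""): value
        for key, value in state_dict.items()
    }


# Hypothetical key names, used only to illustrate the string rewriting.
example = {
    "transformer.transformer_blocks.0.attn.to_q.lora_A.weight": "A",
    "transformer.transformer_blocks.0.attn.to_q.lora_B.weight": "B",
}

remapped = remap_lora_keys(example)
for key in remapped:
    print(key)
```

Because `str.replace` rewrites every occurrence, running the renaming twice would produce keys such as `lora_A.default.default`. Since the example loads with `strict=False`, mismatched keys are skipped silently; inspecting the `missing_keys` and `unexpected_keys` fields returned by `load_state_dict` is one way to confirm that the weights were actually matched.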
