JoyType: A Robust Design for Multilingual Visual Text Creation
<a href='https://jdh-algo.github.io/JoyType'><img src='https://img.shields.io/badge/Page-JoyType-orange'></a> <a href='https://github.com/jdh-algo/JoyType.git'><img src='https://img.shields.io/badge/Code-Github-blue'></a> <a href='https://huggingface.co/spaces/jdh-algo/JoyType'><img src='https://img.shields.io/badge/Demo-HuggingFace-yellow'></a>

News
[2024.07.01] - Inference code is now available.
[2024.07.01] - Our Hugging Face online demo is available here!
[2024.06.30] - Our online demo is available here!
TODOS
- [x] Release online demo
- [ ] Release our latest checkpoint
- [ ] Release model and training code
- [ ] Support JoyType in ComfyUI
- [ ] Release our research paper
Methodology
The figure illustrates the overall framework of our method, covering data collection, the training pipeline, and the inference pipeline.

In the data collection phase, we leveraged the open-source CapOnImage2M dataset, selecting a subset of 1M images. For each selected image, we employed a visual language model (e.g., CogVLM) to generate a textual description, thereby obtaining a prompt associated with the image. We then applied the Canny algorithm to extract edges from the text regions within each image, producing a canny map.

The training pipeline comprises three primary components: the latent diffusion module, the Font ControlNet module, and the loss design module. During training, the raw image, canny map, and prompt are fed into the Variational Autoencoder (VAE), Font ControlNet, and text encoder, respectively. The loss function is split into two parts: one in latent space and one in pixel space. In latent space, we use the loss $L_{LDM}$ associated with Latent Diffusion Models, as defined in the source paper. The latent features are then decoded back into images via the VAE decoder. In pixel space, the text regions of both the predicted and the ground-truth images are cropped and passed independently through an OCR model. We extract the convolutional-layer features from the OCR model and compute the Mean Squared Error (MSE) between the features of each layer, which constitutes the loss $L_{ocr}$.

During inference, the image prompt, the textual content, and the regions designated for text generation are fed into the text encoder and Font ControlNet, respectively. The final image is then generated by the VAE decoder.
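The canny-map construction can be sketched as restricting an edge map to the annotated text regions. The sketch below uses a simple gradient-magnitude detector as a NumPy-only stand-in for the Canny algorithm, and assumes text regions are given as `(x, y, w, h)` boxes; JoyType's actual preprocessing (thresholds, region format) may differ.

```python
import numpy as np

def edge_map(gray):
    """Gradient-magnitude edges (a stand-in for the Canny detector)."""
    gy, gx = np.gradient(gray.astype(float))
    return np.hypot(gx, gy)

def masked_edge_map(gray, text_boxes):
    """Keep edges only inside the annotated text regions.

    `text_boxes` is assumed to be a list of (x, y, w, h) tuples; this
    format is illustrative, not JoyType's actual annotation schema.
    """
    edges = edge_map(gray)
    mask = np.zeros_like(edges)
    for (x, y, w, h) in text_boxes:
        mask[y:y + h, x:x + w] = 1.0  # unmask the text region
    return edges * mask
```

In the real pipeline these masked edge maps are the conditioning input to Font ControlNet.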


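The $L_{ocr}$ term described above can be sketched as a mean of per-layer MSEs between OCR features of the predicted and ground-truth text crops. In the real method the features come from a pretrained OCR model's convolutional layers; here a toy two-layer "backbone" (fixed random conv filters plus ReLU) stands in for it, so only the shape of the loss computation is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for OCR conv layers: two fixed 3x3 filters.
FILTERS = [rng.standard_normal((3, 3)) for _ in range(2)]

def conv2d_valid(x, k):
    """Naive 'valid' 2-D cross-correlation, sufficient for this sketch."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def backbone_features(crop):
    """Collect the feature map after each conv+ReLU layer."""
    feats, x = [], crop
    for k in FILTERS:
        x = np.maximum(conv2d_valid(x, k), 0.0)
        feats.append(x)
    return feats

def ocr_feature_loss(pred_crop, gt_crop):
    """L_ocr: mean of per-layer MSEs between the two crops' features."""
    pf = backbone_features(pred_crop)
    gf = backbone_features(gt_crop)
    return float(np.mean([np.mean((p - g) ** 2) for p, g in zip(pf, gf)]))
```

Because the loss compares intermediate features rather than raw pixels, it rewards text renderings that the OCR model "reads" the same way as the ground truth.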
Installation
# Initialize a conda environment
conda create -n joytype python=3.9
conda activate joytype
# Clone joytype repo
git clone ...
cd JoyType
# Install requirements
pip install -r requirements.txt
Inference
[Recommended]: We have already released a demo on JDHealth and Hugging Face!
You can run inference with:
python infer.py --prompt "a card" --input_yaml examples/test.yaml --img_name test
- `prompt` corresponds to the text description of the image you want to generate
- `input_yaml` corresponds to the layout information of the texts in the generated image
- `img_name` corresponds to the file name of the generated image
You can see more arguments by:
python infer.py --help
Please note that the model is pulled from Hugging Face by default. If you want to load it locally, please pre-download the model from here and set the `--load_path` argument.
Gallery
