Joycaption

JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.

Generate Convert Improve

Install / Use

/learn @fpgaminer/Joycaption

About this skill

Quality Score

0/100

README

JoyCaption

JoyCaption is an open, free, and uncensored captioning Visual Language Model (VLM).

Try the Demo on HuggingFace | Download the Current Model on Hugging Face | Latest Release Post | Data

What is JoyCaption?

JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.

Key Features:

Free and Open: Released for free, open weights, no restrictions, and just like bigASP, will come with training scripts and lots of juicy details on how it gets built.
Uncensored: Equal coverage of SFW and NSFW concepts. No "cylindrical shaped object with a white substance coming out on it" here.
Diversity: All are welcome here. Do you like digital art? Photoreal? Anime? Furry? JoyCaption is for everyone. Pains are being taken to ensure broad coverage of image styles, content, ethnicity, gender, orientation, etc.
Minimal Filtering: JoyCaption is trained on large swathes of images so that it can understand almost all aspects of our world. almost. Illegal content will never be tolerated in JoyCaption's training.

Motivation

Automated descriptive captions enable the training and finetuning of diffusion models on a wider range of images, since trainers are no longer required to either find images with already associated text or write the descriptions themselves. They also improve the quality of generations produced by Text-to-Image models trained on them (ref: DALL-E 3 paper). But to-date, the community has been stuck with ChatGPT, which is expensive and heavily censored; or alternative models, like CogVLM, which are weaker than ChatGPT and have abysmal performance outside of the SFW domain.

I'm building JoyCaption to help fill this gap by performing near or on-par with GPT4o in captioning images, while being free, unrestricted, and open.

Using JoyCaption

Demo

To see JoyCaption in action, check out the demo on HuggingFace Spaces.

Installation

To use JoyCaption locally, you can download the model from Hugging Face and integrate it into your existing workflows.

VRAM Requirements

At its native datatype of bfloat16, JoyCaption needs about 17GB of VRAM for the model, so it runs comfortably on 24GB and up GPUs. If you need a lighterweight version, it can be quantized to 8-bit or 4-bit (https://github.com/fpgaminer/joycaption/issues/3#issuecomment-2870217672). The ComfyUI node (https://github.com/fpgaminer/joycaption_comfyui/) supports this natively.

Example Usage

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration


IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "fancyfeast/llama-joycaption-beta-one-hf-llava"


# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model into the first GPU
processor = AutoProcessor.from_pretrained(MODEL_NAME)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
	# Load image
	image = Image.open(IMAGE_PATH)

	# Build the conversation
	convo = [
		{
			"role": "system",
			"content": "You are a helpful image captioner.",
		},
		{
			"role": "user",
			"content": PROMPT,
		},
	]

	# Format the conversation
	# WARNING: HF's handling of chat's on Llava models is very fragile.  This specific combination of processor.apply_chat_template(), and processor() works
	# but if using other combinations always inspect the final input_ids to ensure they are correct.  Often times you will end up with multiple <bos> tokens
	# if not careful, which can make the model perform poorly.
	convo_string = processor.apply_chat_template(convo, tokenize = False, add_generation_prompt = True)
	assert isinstance(convo_string, str)

	# Process the inputs
	inputs = processor(text=[convo_string], images=[image], return_tensors="pt").to('cuda')
	inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)

	# Generate the captions
	generate_ids = llava_model.generate(
		**inputs,
		max_new_tokens=512,
		do_sample=True,
		suppress_tokens=None,
		use_cache=True,
		temperature=0.6,
		top_k=None,
		top_p=0.9,
	)[0]

	# Trim off the prompt
	generate_ids = generate_ids[inputs['input_ids'].shape[1]:]

	# Decode the caption
	caption = processor.tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
	caption = caption.strip()
	print(caption)

How to Prompt JoyCaption

JoyCaption Beta One offers multiple modes of caption generation to suit different needs. Descriptive Caption and Straightforward are the most useful, with the other modes being interesting but a little less stable. The HuggingFace demo has a nice interface for selecting the output mode and extra options, and it outputs the prompt it used. Otherwise, here are all the prompts that JoyCaption Beta One understands:

Descriptive Caption: Writes descriptive captions for the image, either in a formal or casual tone.
- Examples:
  - "Write a long detailed description for this image."
  - "Write a detailed description for this image in {word_count} words or less."
  - "Write a {length} detailed description for this image."
  - "Write a descriptive caption for this image in a casual tone."
  - "Write a descriptive caption for this image in a casual tone within {word_count} words."
  - "Write a {length} descriptive caption for this image in a casual tone."
- Note: Casual tone can be a bit weird and needs more work, often falling back on a style that is akin to a robot pretending to be "hip" with the youth.
Straightforward Caption: A more concise, objective style than Descriptive.
- Examples:
  - "Write a straightforward caption for this image. Begin with the main subject and medium. Mention pivotal elements—people, objects, scenery—using confident, definite language. Focus on concrete details like color, shape, texture, and spatial relationships. Show how elements interact. Omit mood and speculative wording. If text is present, quote it exactly. Note any watermarks, signatures, or compression artifacts. Never mention what's absent, resolution, or unobservable details. Vary your sentence structure and keep the description concise, without starting with “This image is…” or similar phrasing."
  - "Write a straightforward caption for this image within {word_count} words. Begin with the main subject and medium. Mention pivotal elements—people, objects, scenery—using confident, definite language. Focus on concrete details like color, shape, texture, and spatial relationships. Show how elements interact. Omit mood and speculative wording. If text is present, quote it exactly. Note any watermarks, signatures, or compression artifacts. Never mention what's absent, resolution, or unobservable details. Vary your sentence structure and keep the description concise, without starting with “This image is…” or similar phrasing."
  - "Write a {length} straightforward caption for this image. Begin with the main subject and medium. Mention pivotal elements—people, objects, scenery—using confident, definite language. Focus on concrete details like color, shape, texture, and spatial relationships. Show how elements interact. Omit mood and speculative wording. If text is present, quote it exactly. Note any watermarks, signatures, or compression artifacts. Never mention what's absent, resolution, or unobservable details. Vary your sentence structure and keep the description concise, without starting with “This image is…” or similar phrasing."
Stable Diffusion Prompt: Tries to mimic how users typically write Stable Diffusion prompts, with a mixture of natural language and booru-like tags.
- Examples:
  - "Output a stable diffusion prompt that is indistinguishable from a real stable diffusion prompt."
  - "Output a stable diffusion prompt that is indistinguishable from a real stable diffusion prompt. {word_count} words or less."
  - "Output a {length} stable diffusion prompt that is indistinguishable from a real stable diffusion prompt."
- Note: This mode is more stable than it was in Alpha Two, but can still glitch out ~3% of the time.
MidJourney: Similar to Training Prompt mode but more like MidJourney prompts.
- Examples:
  - "Write a MidJourney prompt for this image."
  - "Write a MidJourney prompt for this image within {word_count} words."
  - "Write a {length} MidJourney prompt for this image."
- Note: This mode is still a work in progress and somewhat unstable, occasionally glitching out into a repetition loop (due to limitations of stock Llama 3.1). Use with caution.
Danbooru tag list: Writes a list of Danbooru tags for the image.
- Examples:
  - "Generate only comma-separated Danbooru tags (lowercase_underscores). Strict

Related Skills

node-connect

340.2k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

84.1k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

340.2k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

84.1k

Commit, push, and open a PR