Magi
Generate a transcript for your favourite Manga: Detect manga characters, text blocks and panels. Order panels. Cluster characters. Match texts to their speakers. Perform OCR.
Magi, The Manga Whisperer
TODO
- [ ] Release PopCaptions dataset
Magiv1
- The model is available at 🤗 HuggingFace Model Hub.
- Try it out for yourself using this 🤗 HuggingFace Spaces Demo (no GPU, so slow).
- Basic model usage is provided below. Inspect this file for more info.
v1 Usage
```python
from transformers import AutoModel
import numpy as np
from PIL import Image
import torch

images = [
    "path_to_image1.jpg",
    "path_to_image2.png",
]

def read_image_as_np_array(image_path):
    with open(image_path, "rb") as file:
        image = Image.open(file).convert("L").convert("RGB")
        image = np.array(image)
    return image

images = [read_image_as_np_array(image) for image in images]

model = AutoModel.from_pretrained("ragavsachdeva/magi", trust_remote_code=True).cuda()
with torch.no_grad():
    results = model.predict_detections_and_associations(images)
    text_bboxes_for_all_images = [x["texts"] for x in results]
    ocr_results = model.predict_ocr(images, text_bboxes_for_all_images)

for i in range(len(images)):
    model.visualise_single_image_prediction(images[i], results[i], filename=f"image_{i}.png")
    model.generate_transcript_for_single_image(results[i], ocr_results[i], filename=f"transcript_{i}.txt")
```
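The snippet above reads the `"texts"` key of each per-page result to feed OCR. As a hedged sketch of simple post-processing, here is how one might tally detected text blocks per page; the mock dictionary shape below is an assumption based only on the key used in the snippet, not a documented schema:

```python
# Mock per-page results, shaped like the dicts consumed above:
# each entry has a "texts" value holding a list of text bounding boxes.
results = [
    {"texts": [[10, 10, 50, 30], [60, 80, 120, 110]]},  # page with 2 text blocks
    {"texts": [[5, 5, 40, 25]]},                         # page with 1 text block
]

def count_text_blocks(per_page_results):
    """Return the number of detected text boxes on each page."""
    return [len(page["texts"]) for page in per_page_results]

print(count_text_blocks(results))  # [2, 1]
```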
Magiv2
- The model is available at 🤗 HuggingFace Model Hub.
- Try it out for yourself using this 🤗 HuggingFace Spaces Demo (with GPU, thanks HF Team!).
- Basic model usage is provided below. Inspect this file for more info.
v2 Usage
```python
from transformers import AutoModel
from PIL import Image
import numpy as np
import torch

model = AutoModel.from_pretrained("ragavsachdeva/magiv2", trust_remote_code=True).cuda().eval()

def read_image(path_to_image):
    with open(path_to_image, "rb") as file:
        image = Image.open(file).convert("L").convert("RGB")
        image = np.array(image)
    return image

chapter_pages = ["page1.png", "page2.png", "page3.png", ...]
character_bank = {
    "images": ["char1.png", "char2.png", "char3.png", "char4.png", ...],
    "names": ["Luffy", "Sanji", "Zoro", "Ussop", ...],
}

chapter_pages = [read_image(x) for x in chapter_pages]
character_bank["images"] = [read_image(x) for x in character_bank["images"]]

with torch.no_grad():
    per_page_results = model.do_chapter_wide_prediction(chapter_pages, character_bank, use_tqdm=True, do_ocr=True)

transcript = []
for i, (image, page_result) in enumerate(zip(chapter_pages, per_page_results)):
    model.visualise_single_image_prediction(image, page_result, f"page_{i}.png")
    speaker_name = {
        text_idx: page_result["character_names"][char_idx]
        for text_idx, char_idx in page_result["text_character_associations"]
    }
    for j in range(len(page_result["ocr"])):
        if not page_result["is_essential_text"][j]:
            continue
        name = speaker_name.get(j, "unsure")
        transcript.append(f"<{name}>: {page_result['ocr'][j]}")

with open("transcript.txt", "w") as fh:
    for line in transcript:
        fh.write(line + "\n")
```
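Each transcript line produced above follows a `<speaker>: text` format, with `unsure` as the fallback name. A small follow-on sketch (pure string processing, independent of the model) that groups such lines by speaker:

```python
import re
from collections import defaultdict

# Sample lines in the "<speaker>: text" format produced by the loop above.
transcript = [
    "<Luffy>: I'm gonna be King of the Pirates!",
    "<unsure>: Who said that?",
    "<Luffy>: I did!",
]

def group_by_speaker(lines):
    """Group transcript lines of the form '<name>: text' by speaker name."""
    grouped = defaultdict(list)
    for line in lines:
        match = re.match(r"<(.+?)>:\s*(.*)", line)
        if match:
            grouped[match.group(1)].append(match.group(2))
    return dict(grouped)

print(group_by_speaker(transcript))
```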
Magiv3
- The model is available at 🤗 HuggingFace Model Hub.
- Basic model usage is provided below. Inspect this file for more info.
v3 Usage
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model = AutoModelForCausalLM.from_pretrained("ragavsachdeva/magiv3", torch_dtype=torch.float16, trust_remote_code=True).cuda().eval()
processor = AutoProcessor.from_pretrained("ragavsachdeva/magiv3", trust_remote_code=True)

# `images` and `captions` are prepared by the caller.
model.predict_detections_and_associations(images, processor)
model.predict_ocr(images, processor)
model.predict_character_grounding(images, captions, processor)
```
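The `images` and `captions` inputs in the v3 snippet are not defined there. Presumably pages are loaded the same way as in the v1/v2 snippets (grayscale, then back to RGB); a hedged sketch of that loader, with hypothetical placeholder inputs:

```python
import numpy as np
from PIL import Image

def read_image(path_to_image):
    """Load a page as grayscale, then convert back to RGB, as in the v1/v2 snippets."""
    with open(path_to_image, "rb") as file:
        image = Image.open(file).convert("L").convert("RGB")
    return np.array(image)

# Hypothetical inputs -- substitute your own pages and captions:
# images = [read_image(p) for p in ["page1.png", "page2.png"]]
# captions = ["A caption describing the scene on each page."]
```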
Datasets
Disclaimer: In adherence to copyright regulations, we are unable to publicly distribute the manga images that we've collected. The test images, however, are available freely, publicly and officially on Manga Plus by Shueisha.
Other notes
- Request to download Manga109 dataset here.
- Download a large scale dataset from Mangadex using this tool.
- The Manga109 test splits are available here: detection, character clustering. Note that some background characters share the same label even though they are not the same character.
License and Citation
The provided models and datasets are available for academic research purposes only.
@InProceedings{magiv1,
author = {Sachdeva, Ragav and Zisserman, Andrew},
title = {The Manga Whisperer: Automatically Generating Transcriptions for Comics},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {12967-12976}
}
@InProceedings{Sachdeva_2024_ACCV,
author = {Sachdeva, Ragav and Shin, Gyungin and Zisserman, Andrew},
title = {Tails Tell Tales: Chapter-wide Manga Transcriptions with Character Names},
booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
month = {December},
year = {2024},
pages = {2053-2069}
}
@misc{sachdeva2025panelsprosegeneratingliterary,
title={From Panels to Prose: Generating Literary Narratives from Comics},
author={Ragav Sachdeva and Andrew Zisserman},
year={2025},
eprint={2503.23344},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.23344},
}
