<div align="center">

Scene Text Recognition with<br/>Permuted Autoregressive Sequence Models

Apache License 2.0 · arXiv preprint · In Proc. ECCV 2022 · Gradio demo


Darwin Bautista and Rowel Atienza

Electrical and Electronics Engineering Institute<br/> University of the Philippines, Diliman

Method | Sample Results | Getting Started | FAQ | Training | Evaluation | Citation

</div>

Scene Text Recognition (STR) models use language context to be more robust against noisy or corrupted images. Recent approaches like ABINet use a standalone or external Language Model (LM) for prediction refinement. In this work, we show that the external LM—which requires upfront allocation of dedicated compute capacity—is inefficient for STR due to its poor performance vs cost characteristics. We propose a more efficient approach using permuted autoregressive sequence (PARSeq) models. View our ECCV poster and presentation for a brief overview.

PARSeq

NOTE: P-S and P-Ti are shorthands for PARSeq-S and PARSeq-Ti, respectively.

Method tl;dr

Our main insight is that with an ensemble of autoregressive (AR) models, we could unify the current STR decoding methods (context-aware AR and context-free non-AR) and the bidirectional (cloze) refinement model:

<div align="center"><img src=".github/contexts-example.png" alt="Unified STR model" width="75%"/></div>

A single Transformer can realize different models by merely varying its attention mask. With the correct decoder parameterization, it can be trained with Permutation Language Modeling to enable inference for arbitrary output positions given arbitrary subsets of the input context. This arbitrary decoding characteristic results in a unified STR model—PARSeq—capable of context-free and context-aware inference, as well as iterative prediction refinement using bidirectional context without requiring a standalone language model. PARSeq can be considered an ensemble of AR models with shared architecture and weights:
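The mask derivation can be illustrated concretely: under a decoding permutation, the query for position π<sub>t</sub> may attend only to the positions that precede it in the permutation. A minimal sketch of this rule (the helper name and plain-list mask representation are ours for illustration, not the repository's API):

```python
def perm_to_attn_mask(perm):
    """Content-attention mask for one decoding permutation.

    mask[i][j] is True when the query for output position i may attend
    to the context token at position j, i.e. when j precedes i in `perm`.
    """
    order = {pos: t for t, pos in enumerate(perm)}
    n = len(perm)
    return [[order[j] < order[i] for j in range(n)] for i in range(n)]

# The identity permutation recovers the standard left-to-right causal mask,
causal = perm_to_attn_mask([0, 1, 2, 3])
# while the reversed permutation yields right-to-left AR decoding.
reverse = perm_to_attn_mask([3, 2, 1, 0])
```

Training against masks sampled from many permutations is what allows a single set of weights to serve context-aware AR, context-free non-AR, and cloze-style refinement inference at test time.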

NOTE: LayerNorm and Dropout layers are omitted. [B], [E], and [P] stand for the beginning-of-sequence (BOS), end-of-sequence (EOS), and padding tokens, respectively. T = 25 results in 26 distinct position tokens. The position tokens serve both as query vectors and as position embeddings for the input context. For [B], no position embedding is added. Attention masks are generated from the given permutations and are used only for the context-position attention. L<sub>ce</sub> denotes the cross-entropy loss.

Sample Results

<div align="center">

| Input Image | PARSeq-S<sub>A</sub> | ABINet | TRBA | ViTSTR-S | CRNN |
|:--------------------------------------------------------------------------:|:--------------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
| <img src="demo_images/art-01107.jpg" alt="CHEWBACCA" width="128"/> | CHEWBACCA | CHEWBA**GG**A | CHEWBACCA | CHEWBACCA | CHEW**U**ACCA |
| <img src="demo_images/coco-1166773.jpg" alt="Chevron" width="128"/> | Chevro**l** | Chevro_ | Chevro_ | Chevr__ | Chevr__ |
| <img src="demo_images/cute-184.jpg" alt="SALMON" height="128"/> | SALMON | SALMON | SALMON | SALMON | SA_MON |
| <img src="demo_images/ic13_word_256.png" alt="Verbandstoffe" width="128"/> | Verbandst**e**ffe | Verbandst**e**ffe | Verbandst**ell**e | Verbandst**e**ffe | Verbands**le**ffe |
| <img src="demo_images/ic15_word_26.png" alt="Kappa" width="128"/> | Kappa | Kappa | Ka**s**pa | Kappa | Ka**ad**a |
| <img src="demo_images/uber-27491.jpg" alt="3rdAve" height="128"/> | 3rdAve | 3**=-**Ave | 3rdAve | 3rdAve | **Coke** |

NOTE: Bold letters and underscores indicate wrong and missing character predictions, respectively.

</div>

Getting Started

This repository contains the reference implementation of PARSeq and reproduced models (collectively referred to as the Scene Text Recognition Model Hub). See NOTICE for copyright information. The majority of the code is licensed under the Apache License v2.0 (see LICENSE), while the ABINet and CRNN sources are released under the BSD and MIT licenses, respectively (see the corresponding LICENSE files for details).

Demo

An interactive Gradio demo is hosted on Hugging Face; it uses the pretrained weights released in this repository.

Installation

Requires Python >= 3.9 and PyTorch >= 2.0. The default requirements files will install the latest versions of the dependencies (as of February 22, 2024).

```bash
# Use specific platform build. Other PyTorch 2.0 options: cu118, cu121, rocm5.7
platform=cpu
# Generate requirements files for specified PyTorch platform
make torch-${platform}
# Install the project and core + train + test dependencies. Subsets: [dev,train,test,bench,tune]
pip install -r requirements/core.${platform}.txt -e .[train,test]
```

Updating dependency version pins

```bash
pip install pip-tools
make clean-reqs reqs  # Regenerate all the requirements files
```

Datasets

Download the datasets from the following links:

  1. LMDB archives for MJSynth, SynthText, IIIT5k, SVT, SVTP, IC13, IC15, CUTE80, ArT, RCTW17, ReCTS, LSVT, MLT19, COCO-Text, and Uber-Text.
  2. LMDB archives for TextOCR and OpenVINO.
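These archives follow the key layout common to STR LMDB datasets: a `b'num-samples'` entry plus 1-indexed `b'image-%09d'` / `b'label-%09d'` pairs. A minimal reader sketch under that assumption (the `read_sample` helper is illustrative, not part of `strhub`):

```python
def read_sample(txn, index):
    """Fetch one (image_bytes, label) pair from an STR LMDB archive.

    `txn` needs only a .get(key) method (e.g. an lmdb read transaction).
    Keys are assumed 1-indexed: b'image-000000001', b'label-000000001'.
    """
    image_bytes = txn.get(b'image-%09d' % index)
    label = txn.get(b'label-%09d' % index).decode('utf-8')
    return image_bytes, label

# With the real library (assumes `pip install lmdb`):
# env = lmdb.open('path/to/archive', readonly=True, lock=False)
# with env.begin() as txn:
#     n = int(txn.get(b'num-samples'))
#     img_bytes, label = read_sample(txn, 1)
```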

Pretrained Models via Torch Hub

Available models are: abinet, crnn, trba, vitstr, parseq_tiny, parseq_patch16_224, and parseq.

```python
import torch
from PIL import Image
from strhub.data.module import SceneTextDataModule

# Load model and image transforms
parseq = torch.hub.load('baudm/parseq', 'parseq', pretrained=True).eval()
img_transform = SceneTextDataModule.get_transform(parseq.hparams.img_size)

img = Image.open('/path/to/image.png').convert('RGB')
img = img_transform(img).unsqueeze(0)  # model expects a batch: (B, C, H, W)

logits = parseq(img)  # (B, length, num_classes)

# Greedy decoding
pred = logits.softmax(-1)
label, confidence = parseq.tokenizer.decode(pred)
```