SPTS v2: Single-Point Scene Text Spotting
This is the official implementation of SPTS v2: Single-Point Scene Text Spotting. SPTSv2 tackles scene text spotting as an end-to-end sequence prediction task, requires only extremely low-cost single-point annotations, and achieves 19× faster inference speed. Below is the overall architecture of SPTSv2.

Environment
We recommend using Anaconda to manage environments. Run the following commands to install dependencies.
conda create -n sptsv2 python=3.7 -y
conda activate sptsv2
conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 -c pytorch
git clone git@github.com:bytedance/SPTSv2.git
cd SPTSv2
pip install -r requirements.txt
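After installation, it can be useful to confirm that the environment really contains the pinned versions. The following is a minimal sketch, not part of the repository: the helper names `base_version`, `find_mismatches` and `collect_installed` are hypothetical, and the pins are copied from the conda command above (the conda package `pytorch` is imported as `torch`).

```python
import importlib
from typing import Dict, List

# Versions pinned by the conda command above, keyed by import name.
PINNED = {"torch": "1.8.1", "torchvision": "0.9.1", "torchaudio": "0.8.1"}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cu111' from a version string."""
    return version.split("+", 1)[0]


def find_mismatches(installed: Dict[str, str]) -> List[str]:
    """Return one message per package that is missing or differs from its pin."""
    problems = []
    for name, pinned in PINNED.items():
        found = installed.get(name)
        if found is None:
            problems.append("{}: not installed (expected {})".format(name, pinned))
        elif base_version(found) != pinned:
            problems.append("{}: found {} (expected {})".format(name, found, pinned))
    return problems


def collect_installed() -> Dict[str, str]:
    """Import each pinned package and record its __version__ if available."""
    installed = {}
    for name in PINNED:
        try:
            installed[name] = importlib.import_module(name).__version__
        except ImportError:
            pass  # left out, so find_mismatches reports it as not installed
    return installed
```

Inside the activated `sptsv2` environment, `print(find_mismatches(collect_installed()))` should print an empty list when all three pins are satisfied.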
Dataset
- CurvedSynText150k [paper]:
  - Part1 (94,723) Download (15.8G) (Origin, Google, BaiduNetDisk password: 4k3x)
  - Part2 (54,327) Download (9.7G) (Origin, Google, BaiduNetDisk password: a5f5)
- Download (0.4G) (Google, BaiduNetDisk password: 5nhw)
- SCUT-CTW1500 [paper] [source]:
  - Download (0.8G) (Google, BaiduNetDisk password: 82vs)
- MLT [paper]:
  - Download (6.8G) (Origin, Google, BaiduNetDisk password: zqrm)
- Download (0.2G) (Google, BaiduNetDisk password: 5ddh)
- Download (0.1G) (Google, BaiduNetDisk password: wjrh)
- Inverse-Text (images): OneDrive, BaiduNetDisk (password: 6a2n).
Please download and extract the above datasets into the data folder following the file structure below.
data
├─CTW1500
│  ├─annotations
│  │      test_ctw1500_maxlen25.json
│  │      train_ctw1500_maxlen25_v2.json
│  ├─ctwtest_text_image
│  └─ctwtrain_text_image
├─icdar2013
│  │  ic13_test.json
│  │  ic13_train.json
│  ├─test_images
│  └─train_images
├─icdar2015
│  │  ic15_test.json
│  │  ic15_train.json
│  ├─test_images
│  └─train_images
├─inversetext
│  ├─test_images
│  └─test_poly.json
├─mlt2017
│  │  train.json
│  └─MLT_train_images
├─syntext1
│  │  train.json
│  └─syntext_word_eng
├─syntext2
│  │  train.json
│  └─emcs_imgs
└─totaltext
    │  test.json
    │  train.json
    ├─test_images
    └─train_images
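Before training, the extracted datasets can be checked against the layout above. The sketch below is not part of the repository: `EXPECTED` and `missing_entries` are hypothetical names, and the relative paths are transcribed from the tree above.

```python
from pathlib import Path
from typing import List

# Relative paths transcribed from the data tree above.
EXPECTED = [
    "CTW1500/annotations/test_ctw1500_maxlen25.json",
    "CTW1500/annotations/train_ctw1500_maxlen25_v2.json",
    "CTW1500/ctwtest_text_image",
    "CTW1500/ctwtrain_text_image",
    "icdar2013/ic13_test.json",
    "icdar2013/ic13_train.json",
    "icdar2013/test_images",
    "icdar2013/train_images",
    "icdar2015/ic15_test.json",
    "icdar2015/ic15_train.json",
    "icdar2015/test_images",
    "icdar2015/train_images",
    "inversetext/test_images",
    "inversetext/test_poly.json",
    "mlt2017/train.json",
    "mlt2017/MLT_train_images",
    "syntext1/train.json",
    "syntext1/syntext_word_eng",
    "syntext2/train.json",
    "syntext2/emcs_imgs",
    "totaltext/test.json",
    "totaltext/train.json",
    "totaltext/test_images",
    "totaltext/train_images",
]


def missing_entries(data_root: str) -> List[str]:
    """Return the expected files and folders that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Running `print(missing_entries("data"))` from the repository root lists anything that still needs to be downloaded or moved; an empty list means the layout matches. Only the datasets you plan to use need to be present, so trim `EXPECTED` accordingly.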
Train and finetune
Training in the original paper used 16 GPUs (2 nodes, 8 A100 GPUs per node). The instructions below are for a single machine with 8 GPUs; they can be adapted to multi-node training by following the PyTorch Distributed docs.
You can download our pretrained weights from Google Drive or BaiduNetDisk (password: 3pcu), or pretrain the model from scratch using the run.sh file. To finetune, set --resume and --finetune in run.sh.
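When moving from one 8-GPU machine to the 2-node setup used in the paper, each process needs a unique global rank. The sketch below only illustrates the usual convention followed by PyTorch's distributed launchers (global rank = node rank × processes per node + local rank), with one process per GPU assumed. It does not reproduce run.sh, and the function names are hypothetical.

```python
from typing import List, Tuple


def world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of processes when one process drives one GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Global rank of the process using GPU `local_rank` on node `node_rank`."""
    if not 0 <= local_rank < gpus_per_node:
        raise ValueError("local_rank must be in [0, gpus_per_node)")
    return node_rank * gpus_per_node + local_rank


def rank_table(num_nodes: int, gpus_per_node: int) -> List[Tuple[int, int, int]]:
    """List (node_rank, local_rank, global_rank) for every process."""
    return [
        (node, gpu, global_rank(node, gpu, gpus_per_node))
        for node in range(num_nodes)
        for gpu in range(gpus_per_node)
    ]
```

For the paper's configuration, `world_size(2, 8)` is 16 and the first GPU of the second node has `global_rank(1, 0, 8) == 8`; the single-machine setup corresponds to `world_size(1, 8) == 8`.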
Inference and visualization
The trained models are available after finishing the above steps. You can also download the models for the Total-Text, SCUT-CTW1500, ICDAR2013, ICDAR2015 and Inverse-Text datasets from GoogleDrive or BaiduNetDisk (password: 2k2m). Then use test.sh or predict.py to output results and visualizations.

Evaluation
First, download the ground-truth files (GoogleDrive, BaiduNetDisk password: 35tr) and lexicons (GoogleDrive, BaiduNetDisk password: 9eml), and extract them into the evaluation folder.
evaluation
│  eval.py
├─gt
│  ├─gt_ctw1500
│  ├─gt_ic13
│  ├─gt_ic15
│  └─gt_totaltext
└─lexicons
    ├─ctw1500
    ├─ic13
    ├─ic15
    └─totaltext
We provide two evaluation scripts: eval_ic15.py for the ICDAR2015 dataset and eval.py for the other benchmarks. The command for evaluating the inference result on Total-Text is:
python evaluation/eval.py \
--result_path ./output/totaltext_val.json \
# --with_lexicon \ # uncomment this line if you want to evaluate with lexicons.
# --lexicon_type 0 # used for ICDAR2013 and ICDAR2015. 0: Generic; 1: Weak; 2: Strong.
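When evaluating several result files, it may be easier to assemble the command above programmatically. The following is a small sketch under stated assumptions: `build_eval_command` is a hypothetical helper, and it uses only the flags shown in the command above (`--result_path`, `--with_lexicon`, `--lexicon_type`), with `--with_lexicon` treated as a value-less switch as it appears there.

```python
from typing import List, Optional

# Lexicon type codes as documented in the command above.
LEXICON_NAMES = {0: "Generic", 1: "Weak", 2: "Strong"}


def build_eval_command(
    result_path: str,
    with_lexicon: bool = False,
    lexicon_type: Optional[int] = None,
    script: str = "evaluation/eval.py",
) -> List[str]:
    """Build the argument list for the evaluation script."""
    if lexicon_type is not None and lexicon_type not in LEXICON_NAMES:
        raise ValueError("lexicon_type must be 0 (Generic), 1 (Weak) or 2 (Strong)")
    cmd = ["python", script, "--result_path", result_path]
    if with_lexicon:
        cmd.append("--with_lexicon")
    if lexicon_type is not None:
        cmd += ["--lexicon_type", str(lexicon_type)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. For example, `build_eval_command("./output/totaltext_val.json", with_lexicon=True)` corresponds to the Total-Text command with the lexicon line uncommented.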
Performance
The end-to-end recognition performance of SPTSv2 on five public benchmarks is:
| Dataset | Strong | Weak | Generic |
| ------- | ------ | ---- | ------- |
| ICDAR 2013 | 93.9 | 91.8 | 88.6 |
| ICDAR 2015 | 82.3 | 77.7 | 72.6 |

| Dataset | None | Full |
| ------- | ---- | ---- |
| Total-Text | 75.5 | 84.0 |
| inversetext | 63.5 | 74.9 |
| SCUT-CTW1500 | 63.6 | 84.3 |
Citation
@inproceedings{peng2022spts,
title={SPTS: Single-Point Text Spotting},
author={Peng, Dezhi and Wang, Xinyu and Liu, Yuliang and Zhang, Jiaxin and Huang, Mingxin and Lai, Songxuan and Zhu, Shenggao and Li, Jing and Lin, Dahua and Shen, Chunhua and Bai, Xiang and Jin, Lianwen},
booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
year={2022}
}
@article{liu2023spts,
title={SPTS v2: Single-Point Scene Text Spotting},
author={Liu, Yuliang and Zhang, Jiaxin and Peng, Dezhi and Huang, Mingxin and Wang, Xinyu and Tang, Jingqun and Huang, Can and Lin, Dahua and Shen, Chunhua and Bai, Xiang and Jin, Lianwen},
journal={arXiv preprint arXiv:2301.01635},
year={2023}
}
Copyright
This repository can only be used for non-commercial research purposes.
For commercial use, please contact Jiaxin Zhang (zhangjiaxin.zjx1995@bytedance.com).
Acknowledgement
We sincerely thank Stable-Pix2Seq, Pix2Seq, DETR, Swin-Transformer, SPTS and ABCNet for their excellent work.
