HJDataset
A Large Dataset of Historical Japanese Documents with Complex Layouts
HJDataset is a large dataset of historical Japanese documents with complex layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it includes the hierarchical structures and reading orders of the layout elements for advanced analysis.
Download the dataset
All the annotations are available through this link. However, due to copyright restrictions, we cannot release the images in this dataset directly. Please fill out this form to request access, and we will send back the download links.
Organization of the files
After downloading, we suggest organizing the annotations and images as follows:
data/
├── train/
├── test/
├── val/
└── annotations/
    ├── instances_train.json
    └── ....
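As a quick sanity check after arranging the files, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `missing_entries` is hypothetical, and it assumes the dataset root is a folder named `data/` in the current working directory. Only the entries spelled out in the tree are checked.

```python
from pathlib import Path

# Entries copied from the directory tree above; other annotation
# files are not listed there, so they are not checked here.
EXPECTED = (
    "train",
    "test",
    "val",
    "annotations",
    "annotations/instances_train.json",
)


def missing_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    # "data" is an assumed location for the dataset root.
    missing = missing_entries("data")
    print("layout OK" if not missing else f"missing: {missing}")
```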
Environment configuration
You can use the provided conda environment file to set up your environment:
conda env create -f environment.yml
You may run into problems when installing Detectron2. If so, please refer to its official install instructions and Common Installation Issues.
Starter code
We provide some starter code in notebooks/.
1-Dataloader and visualization.ipynb illustrates how to use the dataloader class to load and visualize layout elements in HJDataset.
2-Training Using Detectron2.ipynb shows how to train segmentation models on the dataset using Detectron2.
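For a first look at the annotations without opening the notebooks, the sketch below counts annotations per layout element type using only the standard library. It is not the dataloader class from the repository. It assumes the annotation files follow the COCO instances format (top-level `categories` and `annotations` lists, linked by `category_id`), which the file name `instances_train.json` suggests but this README does not state. The function names are hypothetical.

```python
import json
from collections import Counter


def count_by_category(coco):
    """Count annotations per category name in a COCO-style dict.

    Assumes `coco["categories"]` holds {"id", "name"} entries and each
    item in `coco["annotations"]` carries a matching "category_id".
    """
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    return Counter(names[ann["category_id"]] for ann in coco["annotations"])


def load_counts(path):
    """Read a COCO-style JSON file and return per-category counts."""
    with open(path, encoding="utf-8") as f:
        return count_by_category(json.load(f))


# Example call, assuming the suggested file layout:
# counts = load_counts("data/annotations/instances_train.json")
# for name, n in counts.most_common():
#     print(name, n)
```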
Cite our work
If you find the dataset helpful for your research, please cite our work:
@article{shen2020large,
title={A Large Dataset of Historical Japanese Documents with Complex Layouts},
author={Shen, Zejiang and Zhang, Kaixuan and Dell, Melissa},
journal={arXiv preprint arXiv:2004.08686},
year={2020}
}
