WikipediaAbbreviationData
This data set consists of 24,000 English sentences, extracted from Wikipedia in 2017, annotated to support development of an abbreviation expansion system for text-to-speech synthesis (e.g., a systm tht cn prnounc txt lk ths).
Abbreviation data
This repository provides labeled data for training abbreviation expansion models, as described in:
Gorman, K., Kirov, C., Roark, B., and Sproat, R. 2021. Structured abbreviation in context. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 995-1005.
If you use this data in a publication, we would appreciate it if you cited this paper.
Annotation
Sentences were extracted from English Wikipedia articles, then filtered as described in the paper. Annotators were then asked to introduce abbreviations to the sentences.
Organization
The data, with the original 80%/10%/10% split, can be found in the data directory. The data are text-format Protocol Buffers using the protocol described in abbreviation.proto. To load this data into Python, install the Protocol Buffers compiler protoc, then:
pip install -r requirements.txt
make
Then, see textproto.py.
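The repository's textproto.py remains the reference for loading the data. As a rough illustration of what reading a text-format Protocol Buffer file involves, the sketch below uses `text_format.Parse` from the `protobuf` Python package. The helper name `load_textproto` is hypothetical, and the sketch takes the message object as an argument because the message types defined in abbreviation.proto are not listed here.

```python
from google.protobuf import text_format
from google.protobuf.message import Message


def load_textproto(path: str, message: Message) -> Message:
    """Parses the text-format Protocol Buffer file at `path` into `message`.

    `message` must be an instance of the message type the file contains;
    the parsed fields are merged into it and the same object is returned.
    """
    with open(path, "r", encoding="utf-8") as source:
        text_format.Parse(source.read(), message)
    return message


# Hypothetical usage. `abbreviation_pb2` is the module name protoc generates
# by default for abbreviation.proto; `SomeMessage` is a placeholder for
# whichever top-level message that file defines, and the path is a
# placeholder for one of the files in the data directory:
#
#   import abbreviation_pb2
#   data = load_textproto("data/<file>", abbreviation_pb2.SomeMessage())
```

Passing the message in, rather than hard-coding a type, keeps the sketch independent of the schema; consult abbreviation.proto for the actual message and field names.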
Authors
This data was collected by Kyle Gorman with help from the annotators and Brian Roark, Richard Sproat, Olivia Redfield, Caterina Golner, and Katherine Wang.
License
See LICENSE.
Contributing
See CONTRIBUTING.
Mandatory disclaimer
This is not an official Google product.
