Wikt2pron
A Python toolkit that converts pronunciations in the enwiktionary XML dump to cmudict format
A Wiktionary Pronunciation Collector
Wikt2pron is a Python toolkit that converts pronunciations in the enwiktionary XML dump to cmudict format. It currently supports the IPA and X-SAMPA formats. The project was developed during GSoC 2017 with the CMU Sphinx community.
The collected pronunciation dictionaries and related example models can be downloaded from Dropbox.
Requirements
wikt2pron requires:
- Python 3
- regex
- python-mwxml
- beautifulsoup4
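Before installing, it can help to check which of these dependencies are already importable. The snippet below is a minimal sketch, not part of wikt2pron; the helper `missing_modules` is hypothetical, and the import names (`regex`, `mwxml`, `bs4`) are assumed to correspond to the packages listed above.

```python
from importlib.util import find_spec

# Assumed import names for the listed requirements:
# regex -> "regex", python-mwxml -> "mwxml", beautifulsoup4 -> "bs4".
REQUIRED_MODULES = ("regex", "mwxml", "bs4")


def missing_modules(names=REQUIRED_MODULES):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```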
Installation
# download the latest version
$ git clone https://github.com/abuccts/wikt2pron.git
$ cd wikt2pron
# install and run the tests
$ python setup.py install
$ python setup.py -q test
# build the documentation
$ make -C docs html
Usage
Extract pronunciation from Wiktionary XML dump
First, create an instance of the Wiktionary class:
>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)
Use the example XML dump in pywiktionary/data:
>>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
>>> pron = wikt.extract_IPA(dump_file)
Here's the extracted result:
>>> from pprint import pprint
>>> pprint(pron)
[{'id': 16,
  'pronunciation': {'English': [{'IPA': '/ˈdɪkʃ(ə)n(ə)ɹɪ/',
                                 'X-SAMPA': '/"dIkS(@)n(@)r\\I/',
                                 'lang': 'en'},
                                {'IPA': '/ˈdɪkʃənɛɹi/',
                                 'X-SAMPA': '/"dIkS@nEr\\i/',
                                 'lang': 'en'}]},
  'title': 'dictionary'},
 {'id': 65195,
  'pronunciation': {'English': 'IPA not found.'},
  'title': 'battleship'},
 {'id': 39478,
  'pronunciation': {'English': [{'IPA': '/ˈmɜːdə(ɹ)/',
                                 'X-SAMPA': '/"m3:d@(r\\)/',
                                 'lang': 'en'},
                                {'IPA': '/ˈmɝ.dɚ/',
                                 'X-SAMPA': '/"m3`.d@`/',
                                 'lang': 'en'}]},
  'title': 'murder'},
 {'id': 80141,
  'pronunciation': {'English': [{'IPA': '/ˈdæzəl/',
                                 'X-SAMPA': '/"d{z@l/',
                                 'lang': 'en'}]},
  'title': 'dazzle'}]
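The extracted result is a list of page entries whose per-language value is either a list of pronunciation dicts or the string 'IPA not found.'. As a minimal sketch of post-processing that structure, the hypothetical helper `flatten_pronunciations` below (not part of pywiktionary) turns it into flat (title, language, pronunciation) rows and skips entries without a pronunciation. The sample data is copied from the output above.

```python
def flatten_pronunciations(entries, key="X-SAMPA"):
    """Flatten extracted entries into (title, language, pronunciation) rows.

    Languages whose value is a plain string (for example 'IPA not found.')
    are skipped, since they carry no pronunciation dicts.
    """
    rows = []
    for entry in entries:
        for language, prons in entry["pronunciation"].items():
            if isinstance(prons, str):
                continue
            for pron in prons:
                rows.append((entry["title"], language, pron[key]))
    return rows


# Two entries taken from the extraction output shown above.
sample = [
    {
        "id": 80141,
        "title": "dazzle",
        "pronunciation": {
            "English": [
                {"IPA": "/ˈdæzəl/", "X-SAMPA": '/"d{z@l/', "lang": "en"}
            ]
        },
    },
    {
        "id": 65195,
        "title": "battleship",
        "pronunciation": {"English": "IPA not found."},
    },
]

for row in flatten_pronunciations(sample):
    print(row)
```

Pass key="IPA" to collect the IPA transcriptions instead of the X-SAMPA ones.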
Look up pronunciation for a word
First, create an instance of the Wiktionary class:
>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)
Look up a word using the lookup method:
>>> word = wikt.lookup("present")
The entry for the word "present" is at https://en.wiktionary.org/wiki/present, and here is the lookup result:
>>> from pprint import pprint
>>> pprint(word)
{'Catalan': 'IPA not found.',
 'Danish': [{'IPA': '/prɛsanɡ/', 'X-SAMPA': '/prEsang/', 'lang': 'da'},
            {'IPA': '[pʰʁ̥ɛˈsɑŋ]', 'X-SAMPA': '[p_hR_0E"sAN]', 'lang': 'da'}],
 'English': [{'IPA': '/ˈpɹɛzənt/', 'X-SAMPA': '/"pr\\Ez@nt/', 'lang': 'en'},
             {'IPA': '/pɹɪˈzɛnt/', 'X-SAMPA': '/pr\\I"zEnt/', 'lang': 'en'},
             {'IPA': '/pɹəˈzɛnt/', 'X-SAMPA': '/pr\\@"zEnt/', 'lang': 'en'}],
 'Ladin': 'IPA not found.',
 'Middle French': 'IPA not found.',
 'Old French': 'IPA not found.',
 'Swedish': [{'IPA': '/preˈsent/', 'X-SAMPA': '/pre"sent/', 'lang': 'sv'}]}
To look up a word in a specific language, specify the lang parameter:
>>> wikt = Wiktionary(lang="English", XSAMPA=True)
>>> word = wikt.lookup("read")
>>> pprint(word)
[{'IPA': '/ɹiːd/', 'X-SAMPA': '/r\\i:d/', 'lang': 'en'},
 {'IPA': '/ɹɛd/', 'X-SAMPA': '/r\\Ed/', 'lang': 'en'}]
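As the two examples show, lookup returns a dict keyed by language when no lang is given and a plain list when lang is set. Code that consumes both can normalise them to one shape first. The helper `normalize_lookup` below is a hypothetical sketch, not part of pywiktionary; it also assumes that a missing pronunciation is always reported as a string such as 'IPA not found.'.

```python
def normalize_lookup(result, lang=None):
    """Return {language: [pronunciation dicts]} for either lookup shape.

    `result` is either a dict keyed by language (no lang given) or the
    value for a single language, in which case `lang` names that language.
    String values (assumed to mean "not found") become empty lists.
    """
    if isinstance(result, dict):
        items = result.items()
    else:
        items = [(lang, result)]
    return {
        language: prons if isinstance(prons, list) else []
        for language, prons in items
    }


# Values copied from the lookup results shown above.
multi = {
    "Catalan": "IPA not found.",
    "Swedish": [{"IPA": "/preˈsent/", "X-SAMPA": '/pre"sent/', "lang": "sv"}],
}
single = [
    {"IPA": "/ɹiːd/", "X-SAMPA": "/r\\i:d/", "lang": "en"},
    {"IPA": "/ɹɛd/", "X-SAMPA": "/r\\Ed/", "lang": "en"},
]

print(normalize_lookup(multi))
print(normalize_lookup(single, lang="English"))
```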
IPA -> X-SAMPA conversion
>>> from pywiktionary import IPA
>>> IPA_text = "/t͡ʃeɪnd͡ʒ/" # en: [[change]]
>>> XSAMPA_text = IPA.IPA_to_XSAMPA(IPA_text)
>>> XSAMPA_text
"/t__SeInd__Z/"
Citation
If you use wikt2pron in your research and want to cite it, please use the following BibTeX:
@misc{xiong2017wikt2pron,
  title={Wikt2pron: A Wiktionary Pronunciation Collector},
  author={Xiong, Yifan},
  howpublished={\url{https://github.com/abuccts/wikt2pron}},
  year={2017}
}
