wikt2pron

A Wiktionary Pronunciation Collector

Wikt2pron is a Python toolkit that converts pronunciations in the English Wiktionary (enwiktionary) XML dump to cmudict format. It currently supports the IPA and X-SAMPA formats. The project was developed during GSoC 2017 with the CMU Sphinx community.

The collected pronunciation dictionaries and related example models can be downloaded from Dropbox.

Requirements

wikt2pron requires:

Installation

# download the latest version
$ git clone https://github.com/abuccts/wikt2pron.git
$ cd wikt2pron

# install and run the tests
$ python setup.py install
$ python setup.py -q test

# build the HTML documentation
$ make -C docs html

Usage

Extract pronunciation from Wiktionary XML dump

First, create an instance of the Wiktionary class:

>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)

Use the example XML dump in pywiktionary/data:

>>> dump_file = "pywiktionary/data/enwiktionary-test-pages-articles-multistream.xml"
>>> pron = wikt.extract_IPA(dump_file)

Here's the extracted result:

>>> from pprint import pprint
>>> pprint(pron)
[{'id': 16,
  'pronunciation': {'English': [{'IPA': '/ˈdɪkʃ(ə)n(ə)ɹɪ/',
                                 'X-SAMPA': '/"dIkS(@)n(@)r\\I/',
                                 'lang': 'en'},
                                {'IPA': '/ˈdɪkʃənɛɹi/',
                                 'X-SAMPA': '/"dIkS@nEr\\i/',
                                 'lang': 'en'}]},
  'title': 'dictionary'},
 {'id': 65195,
  'pronunciation': {'English': 'IPA not found.'},
  'title': 'battleship'},
 {'id': 39478,
  'pronunciation': {'English': [{'IPA': '/ˈmɜːdə(ɹ)/',
                                 'X-SAMPA': '/"m3:d@(r\\)/',
                                 'lang': 'en'},
                                {'IPA': '/ˈmɝ.dɚ/',
                                 'X-SAMPA': '/"m3`.d@`/',
                                 'lang': 'en'}]},
  'title': 'murder'},
 {'id': 80141,
  'pronunciation': {'English': [{'IPA': '/ˈdæzəl/',
                                 'X-SAMPA': '/"d{z@l/',
                                 'lang': 'en'}]},
  'title': 'dazzle'}]
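
In the result above, each page record maps a language name either to a list of transcription dictionaries or to the string 'IPA not found.'. The sketch below shows one way to flatten such records into a word-to-IPA mapping. The helper name collect_ipa is hypothetical and not part of pywiktionary, and the sample records are abridged copies of the output above.

```python
# Hypothetical helper, not part of pywiktionary: flatten extract_IPA-style
# records into {title: [IPA, ...]} for one language.
def collect_ipa(records, language="English"):
    result = {}
    for record in records:
        entries = record["pronunciation"].get(language)
        # A missing pronunciation is reported as a plain string, so keep lists only.
        if isinstance(entries, list):
            result[record["title"]] = [entry["IPA"] for entry in entries]
    return result


# Sample records in the shape shown above (abridged).
sample = [
    {"id": 65195,
     "pronunciation": {"English": "IPA not found."},
     "title": "battleship"},
    {"id": 80141,
     "pronunciation": {"English": [{"IPA": "/ˈdæzəl/",
                                    "X-SAMPA": '/"d{z@l/',
                                    "lang": "en"}]},
     "title": "dazzle"},
]

print(collect_ipa(sample))  # {'dazzle': ['/ˈdæzəl/']}
```

Pages without a transcription for the chosen language are simply left out of the mapping.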

Look up the pronunciation of a word

First, create an instance of the Wiktionary class:

>>> from pywiktionary import Wiktionary
>>> wikt = Wiktionary(XSAMPA=True)

Look up a word using the lookup method:

>>> word = wikt.lookup("present")

The entry for the word "present" is at https://en.wiktionary.org/wiki/present, and here is the lookup result:

>>> from pprint import pprint
>>> pprint(word)
{'Catalan': 'IPA not found.',
 'Danish': [{'IPA': '/prɛsanɡ/', 'X-SAMPA': '/prEsang/', 'lang': 'da'},
            {'IPA': '[pʰʁ̥ɛˈsɑŋ]', 'X-SAMPA': '[p_hR_0E"sAN]', 'lang': 'da'}],
 'English': [{'IPA': '/ˈpɹɛzənt/', 'X-SAMPA': '/"pr\\Ez@nt/', 'lang': 'en'},
             {'IPA': '/pɹɪˈzɛnt/', 'X-SAMPA': '/pr\\I"zEnt/', 'lang': 'en'},
             {'IPA': '/pɹəˈzɛnt/', 'X-SAMPA': '/pr\\@"zEnt/', 'lang': 'en'}],
 'Ladin': 'IPA not found.',
 'Middle French': 'IPA not found.',
 'Old French': 'IPA not found.',
 'Swedish': [{'IPA': '/preˈsent/', 'X-SAMPA': '/pre"sent/', 'lang': 'sv'}]}

To look up a word in a specific language, specify the lang parameter:

>>> wikt = Wiktionary(lang="English", XSAMPA=True)
>>> word = wikt.lookup("read")
>>> pprint(word)
[{'IPA': '/ɹiːd/', 'X-SAMPA': '/r\\i:d/', 'lang': 'en'},
 {'IPA': '/ɹɛd/', 'X-SAMPA': '/r\\Ed/', 'lang': 'en'}]
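
The two lookup results above have different shapes: a dictionary keyed by language name when no lang is given, and a plain list when lang is specified. A minimal sketch of normalising both into one flat list is shown below. The helper flatten_lookup is hypothetical, and treating a bare 'IPA not found.' string as an empty result is an assumption based on the outputs shown here.

```python
# Hypothetical helper, not part of pywiktionary: normalise a lookup()-style
# result into a flat list of transcription dicts.
NOT_FOUND = "IPA not found."  # sentinel string seen in the outputs above


def flatten_lookup(result):
    if isinstance(result, str):   # assumed: a bare sentinel means no entries
        return []
    if isinstance(result, list):  # lang was specified: already a list
        return list(result)
    flat = []                     # no lang: dict keyed by language name
    for entries in result.values():
        if isinstance(entries, list):
            flat.extend(entries)
    return flat


# Abridged copies of the results shown above.
all_langs = {
    "Catalan": NOT_FOUND,
    "Swedish": [{"IPA": "/preˈsent/", "X-SAMPA": '/pre"sent/', "lang": "sv"}],
}
one_lang = [{"IPA": "/ɹiːd/", "X-SAMPA": "/r\\i:d/", "lang": "en"}]

print([entry["lang"] for entry in flatten_lookup(all_langs)])  # ['sv']
print([entry["IPA"] for entry in flatten_lookup(one_lang)])    # ['/ɹiːd/']
```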

IPA -> X-SAMPA conversion

>>> from pywiktionary import IPA
>>> IPA_text = "/t͡ʃeɪnd͡ʒ/" # en: [[change]]
>>> XSAMPA_text = IPA.IPA_to_XSAMPA(IPA_text)
>>> XSAMPA_text
'/t__SeInd__Z/'
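
To illustrate the general idea of table-driven symbol conversion, here is a toy sketch. It is not how pywiktionary implements IPA_to_XSAMPA: the table TOY_TABLE and the function toy_ipa_to_xsampa are hypothetical, and the table covers only five symbols whose IPA and X-SAMPA pairs appear in the outputs earlier in this README.

```python
# Toy symbol table, NOT the table pywiktionary uses: it covers only the
# symbols needed for the two words below, with pairs taken from the
# README outputs.
TOY_TABLE = {
    "ˈ": '"',    # primary stress
    "æ": "{",
    "ə": "@",
    "ɹ": "r\\",  # a single backslash in the resulting string
    "ː": ":",    # length mark
}


def toy_ipa_to_xsampa(text):
    # Replace each known IPA symbol; pass everything else through unchanged.
    return "".join(TOY_TABLE.get(ch, ch) for ch in text)


print(toy_ipa_to_xsampa("/ˈdæzəl/"))  # /"d{z@l/
print(toy_ipa_to_xsampa("/ɹiːd/"))    # /r\i:d/
```

A per-character lookup like this cannot handle symbols that span several code points, such as the tie bar in the "change" example, so a fuller converter would need to match longer sequences first.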

Citation

If you use wikt2pron in your research and want to cite it, please use the following BibTeX:

@misc{xiong2017wikt2pron,
  title={Wikt2pron: A Wiktionary Pronunciation Collector},
  author={Xiong, Yifan},
  howpublished={\url{https://github.com/abuccts/wikt2pron}},
  year={2017}
}
