Taibun
Taiwanese Hokkien Transliterator and Tokeniser
<!-- PROJECT SHIELDS -->[![Contributions][contributions-badge]][contributions] [![Live Demo][demo-badge]][demo] [![Tests][tests-badge]][tests] [![Release][release-badge]][release] [![Licence][licence-badge]][licence] [![LinkedIn][linkedin-badge]][linkedin] [![Downloads][downloads-badge]][pypi]
It provides methods for customising transliteration and for retrieving information about Taiwanese Hokkien pronunciation.<br /> It also includes a word tokeniser for Taiwanese Hokkien.
[Report Bug][bug] • [PyPI][pypi]
</div><!-- TABLE OF CONTENTS --> <details open> <summary>Table of Contents</summary> <ol> <li><a href="#versions">Versions</a></li> <li><a href="#install">Install</a></li> <li> <a href="#usage">Usage</a> <ul> <li> <a href="#converter">Converter</a> <ul> <li><a href="#system">System</a></li> <li><a href="#dialect">Dialect</a></li> <li><a href="#format">Format</a></li> <li><a href="#delimiter">Delimiter</a></li> <li><a href="#sandhi">Sandhi</a></li> <li><a href="#punctuation">Punctuation</a></li> <li><a href="#convert-non-cjk">Convert non-CJK</a></li> </ul> </li> <li> <a href="#tokeniser">Tokeniser</a> <ul> <li><a href="#keep-original">Keep original</a></li> </ul> </li> <li><a href="#other-functions">Other Functions</a></li> </ul> </li> <li><a href="#example">Example</a></li> <li><a href="#data">Data</a></li> <li><a href="#acknowledgements">Acknowledgements</a></li> <li><a href="#licence">Licence</a></li> </ol> </details> <!-- OTHER VERSIONS -->
Versions
[![JavaScript Version][js-badge]][js-link]
<!-- INSTALL -->
Install
Taibun can be installed from [PyPI][pypi]:
$ pip install taibun
<!-- USAGE -->
Usage
Converter
The `Converter` class transliterates Chinese characters into the chosen transliteration system, using the parameters specified by the developer. It works for both Traditional and Simplified characters.
# Constructor
c = Converter(system, dialect, format, delimiter, sandhi, punctuation, convert_non_cjk)
# Transliterate Chinese characters
c.get(input)
System
`system` String - system of transliteration.

- `Tailo` (default) - [Tâi-uân Lô-má-jī Phing-im Hong-àn][tailo-wiki]
- `POJ` - [Pe̍h-ōe-jī][poj-wiki]
- `Zhuyin` - [Taiwanese Phonetic Symbols][zhuyin-wiki]
- `TLPA` - [Taiwanese Language Phonetic Alphabet][tlpa-wiki]
- `Pingyim` - [Bbánlám Uē Pìngyīm Hōng'àn][pingyim-wiki]
- `Tongiong` - [Daī-ghî Tōng-iōng Pīng-im][tongiong-wiki]
- `IPA` - [International Phonetic Alphabet][ipa-wiki]

| text | Tailo   | POJ     | Zhuyin      | TLPA      | Pingyim | Tongiong | IPA         |
| ---- | ------- | ------- | ----------- | --------- | ------- | -------- | ----------- |
| 台灣 | Tâi-uân | Tâi-oân | ㄉㄞˊ ㄨㄢˊ | Tai5 uan5 | Dáiwán  | Tāi-uǎn  | Tai²⁵ uan²⁵ |
Dialect
`dialect` String - preferred pronunciation.

- `south` (default) - [Zhangzhou][zhangzhou-wiki]-leaning pronunciation
- `north` - [Quanzhou][quanzhou-wiki]-leaning pronunciation
- `singapore` - Quanzhou-leaning pronunciation with [Singaporean characteristics][singapore-wiki]

| text           | south                       | north                       | singapore                  |
| -------------- | --------------------------- | --------------------------- | -------------------------- |
| 五月節我啉咖啡 | Gōo-gue̍h-tseh guá lim ka-pi | Gōo-ge̍h-tsueh guá lim ka-pi | Gōo-ge̍h-tsueh uá lim ko-pi |
Format
`format` String - format in which tones are represented in the converted sentence.

- `mark` (default) - uses diacritics for each syllable. Not available for TLPA
- `number` - adds a number representing the tone at the end of each syllable
- `strip` - removes any tone marking

| text | mark    | number    | strip   |
| ---- | ------- | --------- | ------- |
| 台灣 | Tâi-uân | Tai5-uan5 | Tai-uan |
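
The relationship between the three formats can be sketched with Python's standard library alone. This is an illustration for Tâi-lô text only, not Taibun's implementation: `strip_tones` and `number_tones` are hypothetical helper names, and the mark-to-number mapping assumes the usual Tâi-lô convention (unmarked syllables are tone 4 when they end in `p`, `t`, `k` or `h`, and tone 1 otherwise).

```python
import re
import unicodedata

# Combining diacritics used as Tâi-lô tone marks (assumed convention).
TONE_MARKS = {
    "\u0301": "2",  # acute
    "\u0300": "3",  # grave
    "\u0302": "5",  # circumflex
    "\u0304": "7",  # macron
    "\u030d": "8",  # vertical line above
}


def strip_tones(text):
    """Remove tone diacritics, like format='strip'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", stripped)


def _number_syllable(match):
    syllable = match.group(0)
    tone = next((TONE_MARKS[ch] for ch in syllable if ch in TONE_MARKS), None)
    base = "".join(ch for ch in syllable if ch not in TONE_MARKS)
    if tone is None:
        # Unmarked syllable: checked finals are tone 4, everything else tone 1.
        tone = "4" if base[-1].lower() in "ptkh" else "1"
    return base + tone


def number_tones(text):
    """Replace tone diacritics with trailing tone numbers, like format='number'."""
    decomposed = unicodedata.normalize("NFD", text)
    # A syllable is a run of ASCII letters and combining marks.
    numbered = re.sub(r"[A-Za-z\u0300-\u036f]+", _number_syllable, decomposed)
    return unicodedata.normalize("NFC", numbered)


print(strip_tones("Tâi-uân"))   # Tai-uan
print(number_tones("Tâi-uân"))  # Tai5-uan5
```

Decomposing to NFD first means the same code handles both precomposed letters (such as `â`) and letters that only exist with a combining mark (such as `i̍`).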
Delimiter
`delimiter` String - sets the delimiter character that will be placed between the syllables of a word.
Default value depends on the chosen system:

- `'-'` - for `Tailo`, `POJ`, `Tongiong`
- `''` - for `Pingyim`
- `' '` - for `Zhuyin`, `TLPA`, `IPA`

| text | '-'     | ''     | ' '     |
| ---- | ------- | ------ | ------- |
| 台灣 | Tâi-uân | Tâiuân | Tâi uân |
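
The per-system defaults listed above amount to a small lookup table. The sketch below only illustrates that behaviour and is not taken from Taibun's source; `DEFAULT_DELIMITER` and `join_syllables` are hypothetical names.

```python
# Default delimiter per system, as listed in this section.
DEFAULT_DELIMITER = {
    "Tailo": "-",
    "POJ": "-",
    "Tongiong": "-",
    "Pingyim": "",
    "Zhuyin": " ",
    "TLPA": " ",
    "IPA": " ",
}


def join_syllables(syllables, system="Tailo", delimiter=None):
    """Join the syllables of one word, falling back to the system default."""
    if delimiter is None:
        delimiter = DEFAULT_DELIMITER[system]
    return delimiter.join(syllables)


print(join_syllables(["Tai5", "uan5"], system="TLPA"))  # Tai5 uan5
print(join_syllables(["Tai", "uan"], delimiter=""))     # Taiuan
```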
Sandhi
`sandhi` String - applies the [sandhi rules of Taiwanese Hokkien][sandhi-wiki].
Since it's difficult to encode all sandhi rules, Taibun provides multiple modes for sandhi conversion to allow for customised sandhi handling.

- `none` - doesn't perform any tone sandhi
- `auto` - closest approximation to full correct tone sandhi of Taiwanese, with proper sandhi of pronouns, suffixes, and words with 仔
- `exc_last` - changes tone for every syllable except for the last one
- `incl_last` - changes tone for every syllable including the last one

Default value depends on the chosen system:

- `auto` - for `Tongiong`
- `none` - for `Tailo`, `POJ`, `Zhuyin`, `TLPA`, `Pingyim`, `IPA`

| text             | none                    | auto                   | exc_last               | incl_last              |
| ---------------- | ----------------------- | ---------------------- | ---------------------- | ---------------------- |
| 這是你的茶桌仔無 | Tse sī lí ê tê-toh-á bô | Tse sì li ē tē-to-á bô | Tsē sì li ē tē-tó-a bô | Tsē sì li ē tē-tó-a bō |
Sandhi rules also change depending on the dialect chosen.

| text | no sandhi | south   | north / singapore |
| ---- | --------- | ------- | ----------------- |
| 台灣 | Tâi-uân   | Tāi-uân | Tài-uân           |
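
The `exc_last` and `incl_last` modes, and the dialect difference for tone 5, can be sketched on number-format syllables. The tone changes below are the commonly described Taiwanese sandhi pattern and are an assumption here, not Taibun's actual rule data; `sandhi_syllable` and `apply_sandhi` are hypothetical names, and the `auto` mode is deliberately not covered.

```python
# Commonly described citation-tone -> sandhi-tone changes (assumption).
BASE_RULES = {"1": "7", "7": "3", "3": "2", "2": "1"}
# Tone 5 differs by dialect, matching the 台灣 table above.
TONE_5 = {"south": "7", "north": "3", "singapore": "3"}


def sandhi_syllable(syllable, dialect="south"):
    """Apply tone sandhi to one syllable written as letters + tone number."""
    body, tone = syllable[:-1], syllable[-1]
    if tone == "5":
        return body + TONE_5[dialect]
    if tone in ("4", "8"):
        if body.endswith("h"):
            # Finals in -h lose the glottal stop: 4 -> 2, 8 -> 3.
            return body[:-1] + ("2" if tone == "4" else "3")
        # Finals in -p/-t/-k swap tones 4 and 8.
        return body + ("8" if tone == "4" else "4")
    return body + BASE_RULES.get(tone, tone)


def apply_sandhi(syllables, mode="exc_last", dialect="south"):
    """Apply 'none', 'exc_last' or 'incl_last' to a list of syllables."""
    if mode == "none" or not syllables:
        return list(syllables)
    if mode == "incl_last":
        return [sandhi_syllable(s, dialect) for s in syllables]
    if mode == "exc_last":
        changed = [sandhi_syllable(s, dialect) for s in syllables[:-1]]
        return changed + [syllables[-1]]
    raise ValueError(f"unsupported mode: {mode}")


sentence = ["tse1", "si7", "li2", "e5", "te5", "toh4", "a2", "bo5"]
print(apply_sandhi(sentence, mode="exc_last"))
# ['tse7', 'si3', 'li1', 'e7', 'te7', 'to2', 'a1', 'bo5']
```

The output corresponds to the `exc_last` column of the table above once the numbers are read back as diacritics (`tse7` is Tsē, `to2` is tó, and so on).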
Punctuation
`punctuation` String - defines how punctuation and sentence-initial capitalisation are handled.

- `format` (default) - converts Chinese-style punctuation to Latin-style punctuation and capitalises words at the beginning of each sentence
- `none` - preserves Chinese-style punctuation and doesn't capitalise words at the beginning of new sentences

| text | format | none |
| ---- | ------ | ---- |
| 這是臺南,簡稱「南」(白話字:Tâi-lâm;注音符號:ㄊㄞˊ ㄋㄢˊ,國語:Táinán)。 | Tse sī Tâi-lâm, kán-tshing "lâm" (Pe̍h-uē-jī: Tâi-lâm; tsù-im hû-hō: ㄊㄞˊ ㄋㄢˊ, kok-gí: Táinán). | tse sī Tâi-lâm,kán-tshing「lâm」(Pe̍h-uē-jī:Tâi-lâm;tsù-im hû-hō:ㄊㄞˊ ㄋㄢˊ,kok-gí:Táinán)。 |
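
As a rough illustration of what the `format` mode does, the sketch below maps a handful of full-width punctuation marks to their Latin counterparts and capitalises sentence-initial letters. It is a simplified stand-in, not Taibun's implementation: `format_punctuation` is a hypothetical name, the mapping is an assumed subset, and quotation marks and brackets are not handled.

```python
import re

# Assumed subset of Chinese-style punctuation and its Latin-style replacement.
PUNCTUATION = {
    "\uff0c": ", ",  # full-width comma
    "\u3001": ", ",  # ideographic comma
    "\u3002": ". ",  # ideographic full stop
    "\uff01": "! ",  # full-width exclamation mark
    "\uff1f": "? ",  # full-width question mark
    "\uff1a": ": ",  # full-width colon
    "\uff1b": "; ",  # full-width semicolon
}


def format_punctuation(text):
    """Convert punctuation and capitalise the first letter of each sentence."""
    for chinese, latin in PUNCTUATION.items():
        text = text.replace(chinese, latin)
    # Collapse the extra spaces introduced by the replacements.
    text = re.sub(r"\s+", " ", text).strip()
    # Capitalise at the start of the text and after ., ! or ?.
    return re.sub(
        r"(^|[.!?] )(\w)",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(format_punctuation("l\u00ed h\u00f3\u3002gu\u00e1 tsin h\u00f3\u3002"))
# Lí hó. Guá tsin hó.
```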
Convert non-CJK
`convert_non_cjk` Boolean - defines whether or not to convert non-Chinese words. Can be used to convert Tailo to another romanisation system.

- `True` - convert non-Chinese character words
- `False` (default) - convert only Chinese character words

| text      | False                   | True                    |
| --------- | ----------------------- | ----------------------- |
| 我食pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ pháng | ㆣㄨㄚˋ ㄐㄧㄚㆷ˙ ㄆㄤˋ |
Tokeniser
The `Tokeniser` class performs [NLTK wordpunct_tokenize][nltk-tokenize]-like tokenisation of a Taiwanese Hokkien sentence.
# Constructor
t = Tokeniser(keep_original)
# Tokenise Taiwanese Hokkien sentence
t.tokenise(input)
Keep original
`keep_original` Boolean - defines whether the original characters of the input are retained.

- `True` (default) - preserve original characters
- `False` - replace original characters with characters defined in the dataset

| text         | True                 | False                |
| ------------ | -------------------- | -------------------- |
| 臺灣火鸡肉饭 | ['臺灣', '火鸡肉饭'] | ['台灣', '火雞肉飯'] |
Other Functions
Handy functions for NLP tasks in Taiwanese Hokkien.
`to_traditional` converts input to the Traditional Chinese characters used in the dataset. It also accounts for variant forms of Traditional Chinese characters.
`to_simplified` converts input to Simplified Chinese characters.
`is_cjk` checks whether the input string consists entirely of Chinese characters.
to_traditional(input)
to_simplified(input)
is_cjk(input)
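
To make the "consists entirely of Chinese characters" check concrete, here is a standalone sketch based on Unicode code point ranges. It is not Taibun's `is_cjk`: the function name `is_cjk_sketch`, the chosen ranges, and the treatment of the empty string as `False` are all assumptions for illustration.

```python
# Unicode blocks treated as "Chinese characters" in this sketch (assumed set).
CJK_RANGES = (
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def is_cjk_sketch(text):
    """Return True only if every character falls inside one of CJK_RANGES."""
    if not text:
        return False
    return all(
        any(low <= ord(ch) <= high for low, high in CJK_RANGES)
        for ch in text
    )


print(is_cjk_sketch("台灣"))       # True
print(is_cjk_sketch("我食pháng"))  # False
```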
<!-- EXAMPLE -->
Example
# Converter
from taibun import Converter
## System
c = Converter() # Tailo system default
c.get('先生講,學生恬恬聽。')
>> Sian-sinn kóng, ha̍k-sing tiām-tiām thiann.
c = Converter(system='Zhuyin')
c.get('先生講,學生恬恬聽。')
>> ㄒㄧㄢ ㄒㆪ ㄍㆲˋ, ㄏㄚㆶ˙ ㄒㄧㄥ ㄉㄧㆰ˫ ㄉㄧㆰ˫ ㄊㄧㆩ.
## Dialect
c = Converter() # south dialect default
c.get("我欲用箸食魚")
>> Guá beh īng tī tsia̍h hî
c = Converter(dialect='north')
c.get("我欲用箸食魚")
>> Guá bueh īng tū tsia̍h hû
c = Converter(dialect='singapore')
c.get("我欲用箸食魚")
>> Uá bueh ēng tū tsia̍h hû
## Format
c = Converter() # for Tailo, mark by default
c.get("生日快樂")
>> Senn-ji̍t khuài-lo̍k
c = Converter(format='number')
c.get("生日快樂")
>> Senn1-jit8 khuai3-lok8
c = Converter(format='strip')
c.get("生日快樂")
>> Senn-jit khuai-lok
## Delimiter
c = Converter(delimiter='')
c.get("先生講,學生恬恬聽。")
>> Siansinn kóng, ha̍ksing tiāmtiām thiann.
c = Converter(system='Pingyim', delimiter='-')
c.get("先生講,學生恬恬聽。")
>> Siān-snī gǒng, hág-sīng diâm-diâm tinā.
## Sandhi
c = Converter() # for Tailo, sandhi none by default
c.get("這是你的茶桌仔無")
>> Tse sī lí ê tê-toh-á bô
c = Converter(sandhi='auto')
c.get("這是你的茶桌仔無")
>> Tse sì li ē tē-to-á bô
c = Converter(sandhi='exc_last')
c.get("這是你的茶桌仔無")
>> T
