MyWord

syllable, word and phrase segmenter for Burmese (Myanmar language)

Generate Convert Improve

Install / Use

/learn @ye-kyaw-thu/MyWord

About this skill

Quality Score

0/100

README

𝗆𝔂𝕎◐ℝ𝗗 Segmentation Tool

syllable, word, sub_word and phrase segmenter for Burmese (Myanmar language)

Introduction
- Features
Rule: Syllable Segmentation with Regular Expression
Syllable Segmentation with "myWord" Segmentation Tool
Theory: Word Segmentation with Viterbi Algorithm
Building Unigram, Bigram Dictionaries for Word Unit
Word Segmentation with "myWord" Segmentation Tool
Theory: Unsupervised Phrase Segmentation with NPMI
Phrase Segmentation with "myWord" Segmentation Tool
Command-line Help
Introduction to "npmi_train" Option
Dictionaries for Word and Phrase Segmentation
- for Word Segmentation
- for Phrase Segmentation
Files and Folder Information
Evaluation of myWord for "Word Segmentation"
- Closed Test
- Open Test
Commands of myWord Segmentation Tool
Contributors
To Do
License
Citation
Reference

Introduction

မြန်မာစာ စာလုံးတွေကို ပေါ့ပေါ့ပါးပါးနဲ့ မြန်မြန်ဆန်ဆန် ဖြတ်ပေးနိုင်ပြီး၊ library တွေ အများကြီးကိုလည်း မှီမနေပဲ Developer တွေက လွယ်လွယ်ကူကူ embedding လုပ်နိုင်ပြီးတော့ ကိုယ့်ဒေတာနဲ့ကိုယ်လည်း extend လုပ်နိုင်တဲ့ word segmentation tool က ဒီနေ့အထိ မရှိသေးဘူးလို့ နားလည်ထားတယ်။ အဲဒီ ကွက်လပ်ကိုဖြည့်နိုင်ဖို့ရည်ရွယ်ပြီးတော့ myWord ကို R&D လုပ်ခဲ့ပြီး release လုပ်ပေးလိုက်ပါတယ်။

myWord Segmentation Tool ကို သုံးပြီးတော့ မြန်မာစာကြောင်းတွေကို "syllable unit", "sub_word", "word unit", "phrase unit" တွေ အဖြစ် ဖြတ်ပေးတဲ့ ပရိုဂရမ်ပါ။ NLP preprocessing/post-editing အလုပ်တွေ၊ မြန်မာစာနဲ့ ပတ်သက်တဲ့ ဒေတာတွေကို စာလုံးဖြတ်ပြီး model ဆောက်ဖို့အတွက် အသုံးဝင်ပါလိမ့်မယ်။

Features:

myWord Segmentation Tool က အဓိက လုပ်ပေးနိုင်တဲ့ အချက်တွေကိုတော့ အင်္ဂလိပ်လိုပဲ ချရေးပေးလိုက်တယ်။

Written with Python programming (so... you can hack easily)
Used unigram, bigram dictionaries built with "manually segmented twelve million words" training corpus (myWord Corpus Ver. 1.0)
Yes, myWord supports "syllable", "sub_word", "word" and "phrase" segmentation
Of course, you can train or build unigram, bigram dictionaries with your segmented corpus
By default, running word segmentation with Viterbi Algorithm
By default, running phrase segmentation with NPMI (Normalized Pointwise Mutual Information) Algorithm
Shared Burmese unigram, bigram dictionaries with MIT License

Rule: Syllable Segmentation with Regular Expression

မြန်မာစာအတွက် syllable segmentation က အရေးကြီးတဲ့ word segmentation unit တစ်ခုပါ။ အထူးသဖြင့် ဒေတာက ကောင်းကောင်းမရှိတာကြောင့်ရော၊ ငြိမ်တဲ့ word segmenter က မရှိတာကြောင့်ရော Machine Translation သုတေသနမှာဆိုရင် syllable segmentation ဖြတ်ပြီးတော့ ဘာသာပြန်တာက word segmentation လုပ်ပြီး training လုပ်တာထက်တောင် ရလဒ်တွေက ပိုကောင်းနိုင်ကြောင်းကို စာတမ်းတွေရေးပြီးလည်း သက်သေပြခဲ့ပြီးပါပြီ။ myWord Segmentation Tool မှာလည်း syllable breaking လုပ်ပေးတဲ့ option ကိုထည့်ထားပါတယ်။

Syllable breaking ကိုလည်း Finite State Model ဆောက်ပြီးဖြတ်တာမျိုး၊ syllable list အဘိဓာန်ဆောက်ပြီး ဖြတ်တာမျိုး စသည်ဖြင့် approach အမျိုးမျိုးနဲ့ သွားလို့ရပေမဲ့ 2014 လောက်မှာ propose လုပ်ခဲ့တဲ့ sylbreak (Link: https://github.com/ye-kyaw-thu/sylbreak) ထဲက Regular Expression (RE) ကိုပဲ သုံးထားပါတယ်။ ဘာကြောင့်လဲဆိုရင် Unicode နဲ့ စာရိုက်ထားတဲ့ မြန်မာစာတွေအတွက်က RE တစ်ကြောင်းတည်းနဲ့ လှလှပပ အလုပ်လုပ် ပေးလို့ပါ။ ပြီးတော့ NLP အလုပ်တွေ အများကြီးအတွက်လည်း လက်ရှိ syllable breaking RE သတ်မှတ်ချက်နဲ့တင် အဆင်ပြေလို့ပါ။ Python code နဲ့ပဲ အလွယ်ရှင်းပြရရင်တော့ အောက်ပါအတိုင်း ဗျည်း (က-အ)၊ အင်္ဂလိပ်စာလုံးနဲ့ အင်္ဂလိပ်ဂဏန်း (a-z,A-Z,0-9)၊ တခြားစာလုံး (ဣဤဥဦတို့လို သရတွေ၊ မြန်မာဂဏန်း၊ သင်္ကေတတချို့)၊ ပါဌ်ဆင့် ဆင့်တဲ့ Unicode သင်္ကေတ နဲ့ အသတ်အက္ခရာ စုစုပေါင်း variable ငါးခုကို သတ်မှတ်လိုက်ပြီးရင် ((?<!" + ssSymbol + r")["+ myConsonant + r"](?![" + aThat + ssSymbol + r"])" + r"|[" + enChar + otherChar + r"]) ဆိုတဲ့ RE ကို pass လုပ်ပေးလိုက်ယုံပါပဲ။ ဒီ RE က ပေးထားတဲ့ Rule ကတော့ ပါဌ်ဆင့် ဆင့်တဲ့ သင်္ကေတ နောက်က လိုက်တဲ့ ဗျည်း မဟုတ်ရင်၊ ပြီးတော့ အသတ်နဲ့တွဲနေတဲ့ ဗျည်း မဟုတ်ရင် အဲဒီဗျည်း စာလုံးရဲ့ ရှေ့မှာ break point လုပ်ပါ။ သို့မဟုတ် အင်္ဂလိပ်စာလုံးတို့ otherChar အနေနဲ့ သတ်မှတ်ထားတဲ့ စာလုံးတွေ ဆိုရင်လည်း အဲဒီစာလုံးတွေရဲ့ ရှေ့မှာ break ထည့်ပါလို့ သတ်မှတ်ထားတာ ဖြစ်ပါတယ်။

myConsonant = r"က-အ"
enChar = r"a-zA-Z0-9"
otherChar = r"ဣဤဥဦဧဩဪဿ၌၍၏၀-၉၊။!-/:-@[-`{-~\s"
ssSymbol = r'္'
aThat = r'်'

#Regular expression pattern for Myanmar syllable breaking
#*** a consonant not after a subscript symbol AND a consonant is not followed by a-That character or a subscript symbol
BreakPattern = re.compile(r"((?<!" + ssSymbol + r")["+ myConsonant + r"](?![" + aThat + ssSymbol + r"])" + r"|[" + enChar + otherChar + r"])", re.UNICODE)

⚠️ မှတ်ချက်။ ။ ဒီနေရာမှာ သတ်မှတ်ထားတဲ့ syllable (မြန်မာလို ဝဏ္ဏ လို့လည်း ခေါ်ပါတယ်) ဆိုတာက ဘာသာဗေဒ အရ ကြည့်မယ်ဆိုရင် မလိုက်နာတဲ့အပိုင်းတစ်ခုရှိပါတယ်။ အဲဒါက "တက္ကသိုလ်" လို ပါဌ်ဆင့် စာလုံးတွေဆိုရင် "တက်" "က" "သိုလ်" ဆိုပြီး မဖြတ်ပဲနဲ့ "တက္က" နဲ့ "သိုလ်" ဆိုပြီး syllable နှစ်လုံးအဖြစ်ပဲ break လုပ်ချသွားတာမျိုးပါ။ အဲဒီလိုလုပ်တာက NLP task တွေအတွက် ပိုပြီးအဆင်ပြေလို့ပါ။ post-editing လို အလုပ်တွေကို စဉ်းစားစရာ မလိုအပ်လို့ ပိုကောင်းတာမို့ပါ။ Downstream application တွေပေါ်ကို မူတည်ပြီးတော့ လက်ရှိ RE Rule ကို ပြင်တာ၊ ဖြည့်စွက်တာမျိုးက RE နားလည်တဲ့သူအတွက်က ကြိုက်သလို update လုပ်သွားကြပါ။ ⚠️

Syllable Segmentation with "myWord" Segmentation Tool

input file က အောက်ပါအတိုင်းရှိတယ်လို့ ဆိုကြပါစို့...

$ cat test.txt 
ကျွန်တော်ကသုတေသနသမားပါ။
နေ့ရောညရောမြန်မာစာနဲ့ကွန်ပျူတာနဲ့ပဲအလုပ် များ ပါ တယ်
မင်းကကောဘာအလုပ်လုပ်တာလဲ။
ပြောပြပါအုံး
ကောဖီလည်းထပ်သောက်ချင်ရင်ပြောကွာ
မန္တလေးမှာဒေါ်အောင်ဆန်းစုကြည်မိန့်ခွန်းပြောမယ်တဲ့။

syllable segmentation လုပ်ဖို့အတွက်က command line argument ကို syllable လို့ ပေးပြီးနောက် <input-file> နဲ့ <output-file> တွေရဲ့ နာမည်တွေကို ရိုက်ထည့်ပေးလိုက်ယုံပါပဲ။

$ python ./myword.py syllable ./test.txt ./test.syllable
$ cat ./test.syllable 
ကျွန် တော် က သု တေ သ န သ မား ပါ ။
နေ့ ရော ည ရော မြန် မာ စာ နဲ့ ကွန် ပျူ တာ နဲ့ ပဲ အ လုပ် များ ပါ တယ်
မင်း က ကော ဘာ အ လုပ် လုပ် တာ လဲ ။
ပြော ပြ ပါ အုံး
ကော ဖီ လည်း ထပ် သောက် ချင် ရင် ပြော ကွာ
မန္တ လေး မှာ ဒေါ် အောင် ဆန်း စု ကြည် မိန့် ခွန်း ပြော မယ် တဲ့ ။

Theory: Word Segmentation with Viterbi Algorithm

လက်ရှိ ဗားရှင်း myWord ရဲ့ word segmentation လုပ်ပုံကတော့ လက်နဲ့ စာလုံးတစ်လုံးချင်းစီ သေသေချာချာ ဖြတ်ထားတဲ့ စာကြောင်းရေ ၅သိန်းကျော်၊ စာလုံးရေ ၁၂သန်းကျော်ရှိတဲ့ training corpus ကနေ ပထမဆုံး ngram အဘိဓာန်နှစ်ခု (unigram, bigram) ကို ကြိုတင်တည်ဆောက်ထားပြီးတော့ Viterbi algorithm နဲ့ decoding လုပ်တဲ့ approach ကို အသုံးပြုထားပါတယ်။ နည်းနည်း အသေးစိတ် ဖြည့်ပြောရရင် input ဝင်လာတဲ့ စာကြောင်းရဲ့ character တစ်လုံးချင်းစီကို bigram, unigram မော်ဒယ်နှစ်ခုနဲ့ score လုပ်သွားပြီးတော့၊ အဖြစ်နိုင်ဆုံး စာလုံးတွဲ ကို Viterbi algorithm ကိုသုံးပြီးတော့ ရွေးတဲ့ နည်းလမ်းပါ။ Score လုပ်တဲ့အခါမှာ အရင်ဆုံး bigram အဘိဓာန်မှာ ရှိရင် အဲဒီ Probability ကို ယူပြီးတော့၊ bigram အဘိဓာန်မှာ အဲဒီ စာလုံးတွဲက ရှာမတွေ့ရင်တော့ unigram အဘိဓာန်က probability ကို ယူပါတယ်။

Viterbi Algorithm က 1967 မှာ Andrew J. Viterbi ဆိုတဲ့ အင်ဂျင်နီယာက propose လုပ်ခဲ့တဲ့ algorithm ပါ။ ဒီ algorithm က Dynamic Programming Algorithm အုပ်စုထဲမှာ ပါပြီးတော့ Automatic Speech Recognition (ASR), Data recording, Part-Of-Speech (POS) tagging, DNA sequencing, Space Communication စသည်ဖြင့် ဧရိယာမျိုးစုံမှာ အသုံးပြုနေကြပါတယ်။ ကျွန်တော်တို့ နေ့စဉ် အသုံးပြုနေတဲ့ မိုဘိုင်းဖုန်းဆက်သွယ်မှုတွေမှာလည်း Viterbi algorithm ကို သုံးထားပါတယ်။ အဲဒါကြောင့် လွန်ခဲ့တဲ့ နှစ်ပေါင်း ၅၀ အတွင်းမှာ တကယ်ကို နာမည်ကြီးတဲ့ algorithm တစ်ခုဖြစ်ပြီး လက်တွေ့ သိထားရင်လည်း ကိုယ့်အလုပ်တွေအတွက် အသုံးဝင်တာမို့ ကွန်ပျူတာသမားတွေ၊ အင်ဂျင်နီယာတွေအနေနဲ့က မသိမဖြစ်လို့ကို ပြောရပါလိမ့်မယ်။ တကယ်တမ်း အသေးစိတ် လေ့လာချင်တဲ့ သူတွေက သူနဲ့ ဆက်စပ်နေတဲ့ Markov Process, Markov Property, Markov Chain, Hidden Markov Model (HMM) တို့ကို အရင်လေ့လာပါလို့ အကြံပေးချင်ပါတယ်။ ဥပမာ တစ်ခုအနေနဲ့ Viterbi ရဲ့ အလုပ်လုပ်ပုံကို ပြောရရင်... ဆိုကြပါစို့ ကျွန်တော်တို့မှာ HMM POS tagging လုပ်ပေးမဲ့ HMM မော်ဒယ်က ဆောက်ထားပြီးသားဆိုရင် စာကြောင်း အသစ်တစ်ကြောင်းကို အဲဒီ HMM မော်ဒယ်ကို pass လုပ်ပြီးတော့ POS tagging လုပ်ခိုင်းတဲ့အခါမှာ စာကြောင်း ထဲမှာ ပါတဲ့ စာလုံး အရေအတွက်ပေါ်ကို မူတည်ပြီးတော့ ဖြစ်နိုင်ချေရှိတဲ့ tag sequence တွေက အများကြီးမို့လို့ အဖြစ်နိုင်ဆုံးသော POS tag sequence ကို မော်ဒယ်က predict လုပ်တဲ့ အခါမျိုးမှာ Viterbi ကို သုံးပြီး best path ကို ရှာလို့ ရပါတယ်။ အဲဒီလို မလုပ်ပဲ brute force searching နဲ့ ရှိသမျှ path အကုန်ကိုတွက်ပြီးမှသာ ရွေးမယ်ဆိုပြီး စဉ်းစားရင် ကွန်ပျူတာ memory က မနိုင်ပါဘူး။ ဘာကြောင့်လဲဆိုတော့ ngram အဘိဓာန်အပြင်၊ HMM မှာက Observed state နဲ့ Hidden state ဆိုပြီး နှစ်ပိုင်းပါတာကြောင့် ကျွန်တော်တို့က transition probability (POS-tag တစ်ခုနဲ့ တစ်ခုအကြား) ရော emmission probability (POS-tag နဲ့ word အကြား) ရော ကို တွက်ကြရမှာမို့ စာလုံးအရေအတွက်က များလာတာနဲ့ အမျှ operation က exponentially တိုးလာမှာမို့ပါ။ ကွန်ပျူတာသမားတွေ အများစု သိထားကြတဲ့ Computational Complexity အနေနဲ့ ကြည့်မယ်ဆိုရင် Brute Force algorithm က O(S^T) ဖြစ်ပြီးတော့ Viterbi algorithm က O(T*S^2) လို့ ဖော်ပြလို့ ရပါတယ်။ ဒီနေရာမှာ ပြောနေတဲ့ "S" က HMM network ထဲမှာ ရှိနေတဲ့ state အရေအတွက် (i.e. the number of states) ကို ကိုယ်စားပြုပြီးတော့၊ "T" ကတော့ စာကြောင်း တစ်ကြောင်းမှာ ရှိတဲ့ စာလုံးအရေအတွက် (i.e. the length of the data sequence) ကို ကိုယ်စားပြုတာကြောင့် Viterbi Algorithm က computat

Related Skills

node-connect

354.0k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

112.2k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

354.0k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

354.0k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。