BnUnicodeNormalizer
Bangla Unicode Normalization for word normalization
install
pip install bnunicodenormalizer
usage
initialization and cleaning
# import
from bnunicodenormalizer import Normalizer
from pprint import pprint
# initialize
bnorm=Normalizer()
# normalize
word = 'াটোবাকো'
result=bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)
output
Non-norm:াটোবাকো; Norm:টোবাকো
--------------------------------------------------
{'given': 'াটোবাকো',
 'normalized': 'টোবাকো',
 'ops': [{'after': 'টোবাকো',
          'before': 'াটোবাকো',
          'operation': 'InvalidUnicode'}]}
A call to the normalizer returns a dictionary with the following keys:
- given: the provided text
- normalized: the normalized text (None if the text becomes empty during normalization)
- ops: a list of the operations (dictionaries) that were executed on the given text to produce the normalized text. Each dictionary in ops has:
  - operation: the name of the operation / problem in the given text
  - before: what the text looked like before the specific operation
  - after: what the text looks like after the specific operation
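As a small illustration of this format, the sketch below walks the `ops` list of a result dictionary and reports each step. `describe_result` is a hypothetical helper (it is not part of the library), and the sample dictionary is copied from the output shown earlier instead of being produced by a live call.

```python
def describe_result(result):
    """Return one readable line per operation recorded in `result`.

    Hypothetical helper for illustration; not part of bnunicodenormalizer.
    """
    lines = []
    for step, op in enumerate(result["ops"], start=1):
        lines.append(f"{step}. {op['operation']}: {op['before']!r} -> {op['after']!r}")
    if result["normalized"] is None:
        # the documented signal that nothing was left after normalization
        lines.append("text became empty during normalization")
    return lines


# sample result in the documented format, copied from the output shown earlier
sample = {
    "given": "াটোবাকো",
    "normalized": "টোবাকো",
    "ops": [{"after": "টোবাকো", "before": "াটোবাকো", "operation": "InvalidUnicode"}],
}

for line in describe_result(sample):
    print(line)
```

With a real `Normalizer`, the same helper could be called as `describe_result(bnorm(word))`.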
allowing English text
# initialize without english (default)
norm=Normalizer()
print("without english:",norm("ASD123")["normalized"])
# --> returns None
# initialize with english allowed
norm=Normalizer(allow_english=True)
print("with english:",norm("ASD123")["normalized"])
output
without english: None
with english: ASD123
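Because the normalizer works word by word and returns None for words it rejects, a caller handling whole sentences has to split the text and filter the results. The following is a minimal sketch of that pattern, not library functionality: `normalize_text` is a hypothetical wrapper, and `fake_normalizer` is a stand-in used only so the example runs without the package installed.

```python
def normalize_text(text, normalizer):
    """Normalize `text` word by word and drop words that normalize to None.

    `normalizer` is any callable that takes a word and returns a dict with a
    'normalized' key, for example a bnunicodenormalizer Normalizer instance.
    """
    kept = []
    for word in text.split():
        normalized = normalizer(word)["normalized"]
        if normalized is not None:
            kept.append(normalized)
    return " ".join(kept)


def fake_normalizer(word):
    # stand-in (assumption): rejects ASCII-only words, loosely imitating
    # the documented allow_english=False behaviour
    return {"given": word, "normalized": None if word.isascii() else word, "ops": []}


print(normalize_text("ASD123 টোবাকো", fake_normalizer))
```

Passing `Normalizer()` or `Normalizer(allow_english=True)` in place of `fake_normalizer` would switch between dropping and keeping English words.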
Initialization: Bangla Normalizer
'''
initialize a normalizer
args:
allow_english : allow English letters, numbers and punctuation [default:False]
keep_legacy_symbols : treat legacy symbols as valid unicode [default:False]. The legacy symbols are:
'৺':Isshar
'৻':Ganda
'ঀ':Anji (not '৭')
'ঌ':li
'ৡ':dirgho li
'ঽ':Avagraha
'ৠ':Vocalic Rr (not 'ঋ')
'৲':rupi
'৴':currency numerator 1
'৵':currency numerator 2
'৶':currency numerator 3
'৷':currency numerator 4
'৸':currency numerator one less than the denominator
'৹':Currency Denominator Sixteen
legacy_maps : a dictionary for changing legacy symbols into more commonly used unicode characters
a default legacy map is included in the language class as well:
legacy_maps={'ঀ':'৭',
'ঌ':'৯',
'ৡ':'৯',
'৵':'৯',
'৻':'ৎ',
'ৠ':'ঋ',
'ঽ':'ই'}
pass:
* legacy_maps=None; to apply no mapping (legacy symbols are then kept or removed according to keep_legacy_symbols)
* legacy_maps="default"; to use the default legacy map
* legacy_maps=custom dictionary (type: dict); to map chosen legacy symbols to the symbols you want
    * the keys in the custom dict must be legacy symbols
    * the values in the custom dict must be vowels, consonants, numbers or diacritics
vowels = ['অ', 'আ', 'ই', 'ঈ', 'উ', 'ঊ', 'ঋ', 'এ', 'ঐ', 'ও', 'ঔ']
consonants = ['ক', 'খ', 'গ', 'ঘ', 'ঙ', 'চ', 'ছ','জ', 'ঝ', 'ঞ',
'ট', 'ঠ', 'ড', 'ঢ', 'ণ', 'ত', 'থ', 'দ', 'ধ', 'ন',
'প', 'ফ', 'ব', 'ভ', 'ম', 'য', 'র', 'ল', 'শ', 'ষ',
'স', 'হ','ড়', 'ঢ়', 'য়','ৎ']
numbers = ['০', '১', '২', '৩', '৪', '৫', '৬', '৭', '৮', '৯']
vowel_diacritics = ['া', 'ি', 'ী', 'ু', 'ূ', 'ৃ', 'ে', 'ৈ', 'ো', 'ৌ']
consonant_diacritics = ['ঁ', 'ং', 'ঃ']
> for example, you may want to map 'ঽ' (Avagraha) to 'হ' based on visual similarity
(default:'ই')
** legacy conditions: keep_legacy_symbols and legacy_maps operate as follows
case-1) keep_legacy_symbols=True and legacy_maps=None
: all legacy symbols will be considered valid unicodes. None of them will be changed
case-2) keep_legacy_symbols=True and legacy_maps=valid dictionary example:{'ঀ':'ক'}
: all legacy symbols will be considered valid unicodes. Only 'ঀ' will be changed to 'ক' , others will be untouched
case-3) keep_legacy_symbols=False and legacy_maps=None
: all legacy symbols will be removed
case-4) keep_legacy_symbols=False and legacy_maps=valid dictionary example:{'ঽ':'ই','ৠ':'ঋ'}
: 'ঽ' will be changed to 'ই' and 'ৠ' will be changed to 'ঋ'. All other legacy symbols will be removed
'''
my_legacy_maps={'ঌ':'ই',
'ৡ':'ই',
'৵':'ই',
'ৠ':'ই',
'ঽ':'ই'}
text="৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹"
# case 1
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=None)
print("case-1 normalized text: ",norm(text)["normalized"])
# case 2
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=my_legacy_maps)
print("case-2 normalized text: ",norm(text)["normalized"])
# case 2 with legacy_maps left at its default
norm=Normalizer(keep_legacy_symbols=True)
print("case-2 default normalized text: ",norm(text)["normalized"])
# case 3
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=None)
print("case-3 normalized text: ",norm(text)["normalized"])
# case 4
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=my_legacy_maps)
print("case-4 normalized text: ",norm(text)["normalized"])
# case 4 with legacy_maps left at its default
norm=Normalizer(keep_legacy_symbols=False)
print("case-4 default normalized text: ",norm(text)["normalized"])
output
case-1 normalized text: ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-2 normalized text: ৺,৻,ঀ,ই,ই,ই,ই,৲,৴,ই,৶,৷,৸,৹
case-2 default normalized text: ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-3 normalized text: ,,,,,,,,,,,,,
case-4 normalized text: ,,,ই,ই,ই,ই,,,ই,,,,
case-4 default normalized text: ,,,,,,,,,,,,,
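The four cases can be summarized as a per-character rule: a symbol in the map is replaced, an unmapped legacy symbol is kept only when keep_legacy_symbols is True, and everything else passes through. The sketch below re-implements that rule in plain Python to make the interaction explicit. `apply_legacy_policy` is hypothetical and is not the library's code; it also ignores the `legacy_maps="default"` option and all other normalization steps.

```python
# the 14 legacy symbols, in the order used by the docstring and the test text
LEGACY_ORDER = "৺৻ঀঌৡঽৠ৲৴৵৶৷৸৹"
LEGACY_SYMBOLS = set(LEGACY_ORDER)


def apply_legacy_policy(text, keep_legacy_symbols=False, legacy_maps=None):
    """Hypothetical re-implementation of the four documented cases."""
    legacy_maps = legacy_maps or {}
    out = []
    for ch in text:
        if ch in legacy_maps:
            out.append(legacy_maps[ch])   # mapped symbols are always replaced
        elif ch in LEGACY_SYMBOLS and not keep_legacy_symbols:
            continue                      # unmapped legacy symbols are removed
        else:
            out.append(ch)                # everything else passes through
    return "".join(out)


text = ",".join(LEGACY_ORDER)
my_legacy_maps = {"ঌ": "ই", "ৡ": "ই", "৵": "ই", "ৠ": "ই", "ঽ": "ই"}

for keep in (True, False):
    for maps in (None, my_legacy_maps):
        label = f"keep={keep}, maps={'custom' if maps else None}"
        print(label, "->", apply_legacy_policy(text, keep, maps))
```

For this input the sketch yields the same strings as cases 1 to 4 in the output above, which is the only behaviour it is meant to mirror.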
Operations
- base operations available for all indic languages:
self.word_level_ops={"LegacySymbols"   :self.mapLegacySymbols,
                     "BrokenDiacritics":self.fixBrokenDiacritics}
self.decomp_level_ops={"BrokenNukta"             :self.fixBrokenNukta,
                       "InvalidUnicode"          :self.cleanInvalidUnicodes,
                       "InvalidConnector"        :self.cleanInvalidConnector,
                       "FixDiacritics"           :self.cleanDiacritics,
                       "VowelDiacriticAfterVowel":self.cleanVowelDiacriticComingAfterVowel}
- extensions for Bangla:
self.decomp_level_ops["ToAndHosontoNormalize"] = self.normalizeToandHosonto
# invalid folas
self.decomp_level_ops["NormalizeConjunctsDiacritics"] = self.cleanInvalidConjunctDiacritics
# complex root cleanup
self.decomp_level_ops["ComplexRootNormalization"] = self.convertComplexRoots
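One way to read the dictionaries above is as ordered registries of named operations, each applied in turn and logged when it changes the text. The sketch below illustrates that pattern with two toy operations. `run_ops` and both toy functions are hypothetical; the library's real control flow (for example, how a word is decomposed before the decomp-level operations run) is not shown in this README and may differ.

```python
def strip_leading_marks(text):
    # toy operation: drop vowel signs (aa, i, ii) that start the text with no base letter
    return text.lstrip("\u09be\u09bf\u09c0")


def collapse_double_hosonto(text):
    # toy operation: replace a doubled hosonto (U+09CD) with a single one
    return text.replace("\u09cd\u09cd", "\u09cd")


def run_ops(text, ops):
    """Apply each named operation in order; log only the ones that change the text."""
    given = text
    log = []
    for name, func in ops.items():  # dicts preserve insertion order
        after = func(text)
        if after != text:
            log.append({"operation": name, "before": text, "after": after})
            text = after
    # mirror the documented result format: None when nothing is left
    return {"given": given, "normalized": text or None, "ops": log}


toy_ops = {"LeadingMarks": strip_leading_marks,
           "DoubleHosonto": collapse_double_hosonto}

result = run_ops("\u09be\u0995\u09cd\u09cd\u09a4", toy_ops)
print([op["operation"] for op in result["ops"]])
```

Keeping the operations in a name-to-function dictionary is what lets each entry in `ops` carry a readable `operation` name alongside its `before` and `after` text.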
Normalization Problem Examples
In all examples (a) is the non-normalized form and (b) is the normalized form
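The "breaks as" lists in the examples below are simply the code points of each string. A minimal sketch using only the Python standard library shows how to produce them and why two words that render identically can still compare unequal; `breaks_as` and `describe` are hypothetical helper names.

```python
import unicodedata


def breaks_as(word):
    """Return the code points of `word` as a list, like the 'breaks as' lines."""
    return list(word)


def describe(word):
    """Return (code point, Unicode name) pairs for each character in `word`."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in word]


# the same word typed two ways: e-kar + aa-kar versus the single o-kar
broken = "\u0986\u09b0\u09c7\u09be"
clean = "\u0986\u09b0\u09cb"

print(broken == clean)                                 # False
print(len(breaks_as(broken)), len(breaks_as(clean)))   # 4 3
for code, name in describe(broken):
    print(code, name)
```

Printing the Unicode names alongside the code points makes it easier to spot which mark is out of place when the glyphs look the same on screen.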
- Broken diacritics:
# Example-1:
(a)'আরো'==(b)'আরো' -> False
(a) breaks as:['আ', 'র', 'ে', 'া']
(b) breaks as:['আ', 'র', 'ো']
# Example-2:
(a)পৌঁছে==(b)পৌঁছে -> False
(a) breaks as:['প', 'ে', 'ৗ', 'ঁ', 'ছ', 'ে']
(b) breaks as:['প', 'ৌ', 'ঁ', 'ছ', 'ে']
# Example-3:
(a)সংস্কৄতি==(b)সংস্কৃতি -> False
(a) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৄ', 'ত', 'ি']
(b) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৃ', 'ত', 'ি']
- Nukta Normalization:
# Example-1:
(a)কেন্দ্রীয়==(b)কেন্দ্রীয় -> False
(a) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য', '়']
(b) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য়']
# Example-2:
(a)রযে়ছে==(b)রয়েছে -> False
(a) breaks as:['র', 'য', 'ে', '়', 'ছ', 'ে']
(b) breaks as:['র', 'য়', 'ে', 'ছ', 'ে']
# Example-3:
(a)জ়ন্য==(b)জন্য -> False
(a) breaks as:['জ', '়', 'ন', '্', 'য']
(b) breaks as:['জ', 'ন', '্', 'য']
- Invalid hosonto
# Example-1:
(a)দুই্টি==(b)দুইটি-->False
(a) breaks as ['দ',
