bnUnicodeNormalizer

Bangla Unicode Normalization for word normalization

install

pip install bnunicodenormalizer

useage

initialization and cleaning

# import
from bnunicodenormalizer import Normalizer 
from pprint import pprint
# initialize
bnorm=Normalizer()
# normalize
word = 'াটোবাকো'
result=bnorm(word)
print(f"Non-norm:{word}; Norm:{result['normalized']}")
print("--------------------------------------------------")
pprint(result)

output

Non-norm:াটোবাকো; Norm:টোবাকো
--------------------------------------------------
{'given': 'াটোবাকো',
 'normalized': 'টোবাকো',
 'ops': [{'after': 'টোবাকো',
          'before': 'াটোবাকো',
          'operation': 'InvalidUnicode'}]}

call to the normalizer returns a dictionary in the following format

given = provided text
normalized = normalized text (gives None if during the operation length of the text becomes 0)
ops = list of operations (dictionary) that were executed in given text to create normalized text
each dictionary in ops has:
- operation: the name of the operation / problem in given text
- before : what the text looked like before the specific operation
- after : what the text looks like after the specific operation

allow to use english text

# initialize without english (default)
norm=Normalizer()
print("without english:",norm("ASD123")["normalized"])
# --> returns None
norm=Normalizer(allow_english=True)
print("with english:",norm("ASD123")["normalized"])

output

without english: None
with english: ASD123

Initialization: Bangla Normalizer

'''
    initialize a normalizer
            args:
                allow_english                   :   allow english letters numbers and punctuations [default:False]
                keep_legacy_symbols             :   legacy symbols will be considered as valid unicodes[default:False]
                                                    '৺':Isshar 
                                                    '৻':Ganda
                                                    'ঀ':Anji (not '৭')
                                                    'ঌ':li
                                                    'ৡ':dirgho li
                                                    'ঽ':Avagraha
                                                    'ৠ':Vocalic Rr (not 'ঋ')
                                                    '৲':rupi
                                                    '৴':currency numerator 1
                                                    '৵':currency numerator 2
                                                    '৶':currency numerator 3
                                                    '৷':currency numerator 4
                                                    '৸':currency numerator one less than the denominator
                                                    '৹':Currency Denominator Sixteen
                legacy_maps                     :   a dictionay for changing legacy symbols into a more used  unicode 
                                                    a default legacy map is included in the language class as well,
                                                    legacy_maps={'ঀ':'৭',
                                                                'ঌ':'৯',
                                                                'ৡ':'৯',
                                                                '৵':'৯',
                                                                '৻':'ৎ',
                                                                'ৠ':'ঋ',
                                                                'ঽ':'ই'}
                                            
                                                    pass-   
                                                    * legacy_maps=None; for keeping the legacy symbols as they are
                                                    * legacy_maps="default"; for using the default legacy map
                                                    * legacy_maps=custom dictionary(type-dict) ; which will map your desired legacy symbol to any of symbol you want
                                                        * the keys in the custiom dicts must belong to any of the legacy symbols
                                                        * the values in the custiom dicts must belong to either vowels,consonants,numbers or diacritics  
                                                        vowels         =   ['অ', 'আ', 'ই', 'ঈ', 'উ', 'ঊ', 'ঋ', 'এ', 'ঐ', 'ও', 'ঔ']
                                                        consonants     =   ['ক', 'খ', 'গ', 'ঘ', 'ঙ', 'চ', 'ছ','জ', 'ঝ', 'ঞ', 
                                                                            'ট', 'ঠ', 'ড', 'ঢ', 'ণ', 'ত', 'থ', 'দ', 'ধ', 'ন', 
                                                                            'প', 'ফ', 'ব', 'ভ', 'ম', 'য', 'র', 'ল', 'শ', 'ষ', 
                                                                            'স', 'হ','ড়', 'ঢ়', 'য়','ৎ']    
                                                        numbers        =    ['০', '১', '২', '৩', '৪', '৫', '৬', '৭', '৮', '৯']
                                                        vowel_diacritics       =   ['া', 'ি', 'ী', 'ু', 'ূ', 'ৃ', 'ে', 'ৈ', 'ো', 'ৌ']
                                                        consonant_diacritics   =   ['ঁ', 'ং', 'ঃ']
    
                                                        > for example you may want to map 'ঽ':Avagraha as 'হ' based on visual similiarity 
                                                            (default:'ই')

                ** legacy contions: keep_legacy_symbols and legacy_maps operates as follows 
                    case-1) keep_legacy_symbols=True and legacy_maps=None
                        : all legacy symbols will be considered valid unicodes. None of them will be changed
                    case-2) keep_legacy_symbols=True and legacy_maps=valid dictionary example:{'ঀ':'ক'}
                        : all legacy symbols will be considered valid unicodes. Only 'ঀ' will be changed to 'ক' , others will be untouched
                    case-3) keep_legacy_symbols=False and legacy_maps=None
                        : all legacy symbols will be removed
                    case-4) keep_legacy_symbols=False and legacy_maps=valid dictionary example:{'ঽ':'ই','ৠ':'ঋ'}
                        : 'ঽ' will be changed to 'ই' and 'ৠ' will be changed to 'ঋ'. All other legacy symbols will be removed
'''

my_legacy_maps={'ঌ':'ই',
                'ৡ':'ই',
                '৵':'ই',
                'ৠ':'ই',
                'ঽ':'ই'}
text="৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹"
# case 1
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=None)
print("case-1 normalized text:  ",norm(text)["normalized"])
# case 2
norm=Normalizer(keep_legacy_symbols=True,legacy_maps=my_legacy_maps)
print("case-2 normalized text:  ",norm(text)["normalized"])
# case 2-defalut
norm=Normalizer(keep_legacy_symbols=True)
print("case-2 default normalized text:  ",norm(text)["normalized"])

# case 3
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=None)
print("case-3 normalized text:  ",norm(text)["normalized"])
# case 4
norm=Normalizer(keep_legacy_symbols=False,legacy_maps=my_legacy_maps)
print("case-4 normalized text:  ",norm(text)["normalized"])
# case 4-defalut
norm=Normalizer(keep_legacy_symbols=False)
print("case-4 default normalized text:  ",norm(text)["normalized"])

output

case-1 normalized text:   ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-2 normalized text:   ৺,৻,ঀ,ই,ই,ই,ই,৲,৴,ই,৶,৷,৸,৹
case-2 default normalized text:   ৺,৻,ঀ,ঌ,ৡ,ঽ,ৠ,৲,৴,৵,৶,৷,৸,৹
case-3 normalized text:   ,,,,,,,,,,,,,
case-4 normalized text:   ,,,ই,ই,ই,ই,,,ই,,,,
case-4 default normalized text:   ,,,,,,,,,,,,,

Operations

base operations available for all indic languages:

self.word_level_ops={"LegacySymbols"   :self.mapLegacySymbols,
                    "BrokenDiacritics" :self.fixBrokenDiacritics}

self.decomp_level_ops={"BrokenNukta"             :self.fixBrokenNukta,
                    "InvalidUnicode"             :self.cleanInvalidUnicodes,
                    "InvalidConnector"           :self.cleanInvalidConnector,
                    "FixDiacritics"              :self.cleanDiacritics,
                    "VowelDiacriticAfterVowel"   :self.cleanVowelDiacriticComingAfterVowel}

extensions for bangla

self.decomp_level_ops["ToAndHosontoNormalize"]             =       self.normalizeToandHosonto

# invalid folas 
self.decomp_level_ops["NormalizeConjunctsDiacritics"]      =       self.cleanInvalidConjunctDiacritics

# complex root cleanup 
self.decomp_level_ops["ComplexRootNormalization"]          =       self.convertComplexRoots

Normalization Problem Examples

In all examples (a) is the non-normalized form and (b) is the normalized form

Broken diacritics:

# Example-1: 
(a)'আরো'==(b)'আরো' ->  False 
    (a) breaks as:['আ', 'র', 'ে', 'া']
    (b) breaks as:['আ', 'র', 'ো']
# Example-2:
(a)পৌঁছে==(b)পৌঁছে ->  False
    (a) breaks as:['প', 'ে', 'ৗ', 'ঁ', 'ছ', 'ে']
    (b) breaks as:['প', 'ৌ', 'ঁ', 'ছ', 'ে']
# Example-3:
(a)সংস্কৄতি==(b)সংস্কৃতি ->  False
    (a) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৄ', 'ত', 'ি']
    (b) breaks as:['স', 'ং', 'স', '্', 'ক', 'ৃ', 'ত', 'ি']

Nukta Normalization:

Example-1:
(a)কেন্দ্রীয়==(b)কেন্দ্রীয় ->  False
    (a) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য', '়']
    (b) breaks as:['ক', 'ে', 'ন', '্', 'দ', '্', 'র', 'ী', 'য়']
Example-2:
(a)রযে়ছে==(b)রয়েছে ->  False
    (a) breaks as:['র', 'য', 'ে', '়', 'ছ', 'ে']
    (b) breaks as:['র', 'য়', 'ে', 'ছ', 'ে']
Example-3: 
(a)জ়ন্য==(b)জন্য ->  False
    (a) breaks as:['জ', '়', 'ন', '্', 'য']
    (b) breaks as:['জ', 'ন', '্', 'য']

Invalid hosonto

# Example-1:
(a)দুই্টি==(b)দুইটি-->False
    (a) breaks as ['দ',

BnUnicodeNormalizer

Install / Use

README

bnUnicodeNormalizer

install

useage

Initialization: Bangla Normalizer

Operations

Normalization Problem Examples