LughaatNLP
LughaatNLP: First Urdu language preprocessing library in Pakistan. Tokenization, lemmatization, stop word removal, and normalization for Urdu text. Join us to advance Urdu NLP! #OpenSource #UrduLanguage
Introducing LughaatNLP, the all-in-one Urdu language toolkit designed for NLP tasks in Pakistan. It offers essential features like tokenization, lemmatization, stop word removal, POS tagging, NER, normalization, summarization, and even text-to-speech and speech-to-text capabilities, all crafted exclusively for Urdu. Unlock the power of Urdu NLP effortlessly with LughaatNLP!
<p align="center"> <img src="https://i.imgur.com/6lKyQlo.png" alt="Alt Text" width="500" height="500"> </p>
Documentation
Explore the full potential of LughaatNLP through its detailed documentation, which includes practical usage examples, installation guides, and API references.
Google Colab Notebook
Get hands-on with LughaatNLP using the interactive Google Colab Notebook provided in the documentation. This notebook lets you experiment with the toolkit's functionalities directly in your browser.
PyPI Package
Install LughaatNLP effortlessly via its PyPI page to integrate Urdu NLP capabilities into your Python projects seamlessly.
YouTube Tutorial Series
Accelerate your understanding of LughaatNLP's features with the dedicated YouTube tutorial playlist, offering step-by-step guidance on utilizing the toolkit for various Urdu NLP tasks.
Blogs and Articles
Gain insights into advanced techniques and best practices for mastering Urdu text processing through informative articles like:
- GeeksforGeeks link:
- Medium link:
- LughaatNLP blog:
Features
- Tokenization: Breaks down Urdu text into individual tokens, considering the intricacies of Urdu script and language structure.
- Lemmatization: Converts inflected words into their base or dictionary form, aiding in text analysis and comprehension.
- Stop Word Removal: Eliminates common Urdu stop words to focus on meaningful content during text processing.
- Normalization: Standardizes text by removing diacritics, normalizing character variations, and handling common orthographic variations in Urdu.
- Stemming: Reduces words to their root form, improving text analysis and comprehension in Urdu.
- Spell Checker: Identifies and corrects misspelled words in Urdu text, enhancing text quality and readability.
- Part of Speech Extraction: Tags words with their grammatical categories, enabling advanced syntactic analysis.
- Named Entity Recognition (NER): Identifies and extracts the names of entities such as persons, organizations, and locations.
- Text Summarization: Generates concise summaries of Urdu text, facilitating quick understanding of lengthy content.
- Text-to-Speech Conversion: Converts Urdu text into spoken audio, enabling accessibility and language learning applications.
- Speech-to-Text Conversion: Transcribes spoken Urdu into written text, facilitating voice-based interactions and data entry.
Installation
You can install the LughaatNLP library from PyPI using pip:
pip install lughaatNLP
Alternatively, you can manually install it by downloading and unzipping the provided LughaatNLP.rar file and installing the wheel file using pip:
pip install path_to_wheel_file/LughaatNLP-1.0.2-py3-none-any.whl
Required Packages
The LughaatNLP library requires the following packages:
- python-Levenshtein
- tensorflow
- numpy
- scikit-learn
- scipy
- gtts
- SpeechRecognition
- pydub
You can install these packages using pip:
pip install python-Levenshtein tensorflow numpy scikit-learn scipy gtts SpeechRecognition pydub
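Before importing the library, it can help to confirm that the packages listed above are actually importable. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of LughaatNLP. Note that some PyPI names differ from their import names (for example, scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# PyPI package name -> top-level module name used in "import ..."
REQUIRED_MODULES = {
    "python-Levenshtein": "Levenshtein",
    "tensorflow": "tensorflow",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "gtts": "gtts",
    "SpeechRecognition": "speech_recognition",
    "pydub": "pydub",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the PyPI names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with the pip command above.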
Usage
After installing the library, you can import the necessary functions or classes in your Python script:
# Import the packages
from LughaatNLP import LughaatNLP
from LughaatNLP import POS_urdu
from LughaatNLP import NER_Urdu
from LughaatNLP import TextSummarization
from LughaatNLP import UrduSpeech
# Create one instance of each class
urdu_text_processing = LughaatNLP()
ner_urdu = NER_Urdu()
pos_tagger = POS_urdu()
summarizer = TextSummarization()
speech_urdu = UrduSpeech()
Functions
Normalization
1. normalize_characters(text)
This function normalizes the Urdu characters in the given text. Because the same Urdu letter can be written with several different Unicode characters, the function maps those variants to a single standard form.
Example:
text = "آپ کیسے ہیں؟"
normalized_text = urdu_text_processing.normalize_characters(text)
print(normalized_text) # Output: اپ کیسے ہیں؟
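To make the idea of variant mapping concrete, here is a standalone sketch that does not use LughaatNLP. The three-entry table is an assumption chosen for illustration (common Arabic code points that look like Urdu letters); the table the library uses is not shown in this README and is probably larger.

```python
# Illustrative table only: Arabic look-alike code points -> usual Urdu code points.
VARIANTS = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH  -> ARABIC LETTER HEH GOAL
})


def map_variants(text):
    """Replace each look-alike character with its standard Urdu form."""
    return text.translate(VARIANTS)


sample = "\u0643\u064A\u0627"  # written with Arabic KAF and YEH
print([hex(ord(ch)) for ch in map_variants(sample)])
```

The two strings look almost identical on screen, which is why this kind of mapping matters before searching, comparing, or tokenizing text.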
2. normalize_combine_characters(text)
This function simplifies Urdu characters by combining certain combinations into their correct single forms. In Urdu writing, some characters are made up of multiple parts like ligatures or diacritics. This function finds these combinations in the text and changes them to their single character forms. It ensures consistency and accuracy in how Urdu text is represented.
Example:
text = "اُردو"
normalized_text = urdu_text_processing.normalize_combine_characters(text)
print(normalized_text) # Output: اُردو
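The effect is easiest to see by counting code points. The sketch below uses Unicode NFC normalization from the standard library as a stand-in; whether `normalize_combine_characters` relies on NFC or on its own table is an assumption, and the helper name `compose_characters` is hypothetical.

```python
import unicodedata


def compose_characters(text):
    """Merge base letter + combining mark pairs into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)


decomposed = "\u0627\u0653"  # ALEF followed by combining MADDAH ABOVE
composed = compose_characters(decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u0622")            # True (ALEF WITH MADDA ABOVE)
```

Both forms render the same way, but only the composed form compares equal to text typed with the single-character key.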
3. normalize(text)
This function performs all-in-one normalization on the Urdu text, including character normalization, diacritic removal, punctuation handling, digit conversion, and special character preservation.
Example:
text = "آپ کیسے ہیں؟ میں ۲۳ سال کا ہوں۔"
normalized_text = urdu_text_processing.normalize(text)
print("Normalize all at once together of Urdu: ", normalized_text) # Output: اپ کیسے ہیں ؟ میں 23 سال کا ہوں ۔
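An all-in-one normalizer is essentially a fixed sequence of small steps. The sketch below shows that pattern in plain Python; `chain` and `urdu_digits_to_ascii` are hypothetical helpers, not LughaatNLP functions, and the digit step mirrors the example above, where the Urdu digits ۲۳ appear as 23 in the output.

```python
from functools import reduce

# Extended Arabic-Indic digits (U+06F0..U+06F9) -> ASCII digits
URDU_TO_ASCII = {0x06F0 + i: str(i) for i in range(10)}


def urdu_digits_to_ascii(text):
    """Convert Urdu digits to their ASCII equivalents."""
    return text.translate(URDU_TO_ASCII)


def chain(*steps):
    """Build one function that applies the given steps from left to right."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


normalize_sketch = chain(urdu_digits_to_ascii, str.strip)
print(normalize_sketch("  \u06F2\u06F3 "))  # 23
```

Keeping each step as its own function makes it easy to reorder steps or drop one when a task needs the original digits or diacritics.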
4. remove_diacritics(text)
This function removes diacritics (zabar, zer, pesh) from the Urdu text.
Example:
text = "کِتَاب"
diacritics_removed = urdu_text_processing.remove_diacritics(text)
print("Remove all Diacritic (Zabar - Zer - Pesh): ", diacritics_removed) # Output: کتاب
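For readers who want to see what diacritic removal involves, here is a standalone sketch with a regular expression. The character range is an assumption: it covers the short-vowel and related marks from U+064B to U+0652 plus the superscript alef, which may differ from the set the library removes.

```python
import re

# FATHATAN..SUKUN (U+064B-U+0652) and SUPERSCRIPT ALEF (U+0670)
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Delete combining vowel marks, leaving the base letters untouched."""
    return DIACRITICS.sub("", text)


word = "\u06A9\u0650\u062A\u064E\u0627\u0628"  # kitab with zer and zabar
print(len(word), len(strip_diacritics(word)))  # 6 4
```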
5. punctuations_space(text)
This function removes spaces before punctuation marks and keeps a single space after them (excluding numbers) in the Urdu text.
Example:
text = "کیا آپ کھانا کھانا چاہتے ہیں ؟ میں کھانا کھاؤں گا ۔"
punctuated_text = urdu_text_processing.punctuations_space(text)
print(punctuated_text) # Output: کیا آپ کھانا کھانا چاہتے ہیں؟ میں کھانا کھاؤں گا۔
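The spacing rule shown in the example can be approximated with two regular expressions. This is a sketch under assumptions: the punctuation set (Urdu question mark, full stop, comma, semicolon, and `!`) and the helper name `fix_punctuation_spacing` are illustrative, and a mark followed directly by a digit is left alone.

```python
import re

PUNCT = "\u061F\u06D4\u060C\u061B!"  # Urdu ? . , ; and the ASCII exclamation mark

_SPACE_BEFORE = re.compile(rf"\s+([{PUNCT}])")
_MISSING_AFTER = re.compile(rf"([{PUNCT}])(?=[^\s\d{PUNCT}])")


def fix_punctuation_spacing(text):
    """Remove whitespace before punctuation and add one space after it."""
    text = _SPACE_BEFORE.sub(r"\1", text)
    return _MISSING_AFTER.sub(r"\1 ", text)


print(fix_punctuation_spacing("a \u061Fb"))
```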
6. replace_digits(text)
This function replaces English digits with Urdu digits.
Example:
text = "میں 23 سال کا ہوں۔"
urdu_digits = urdu_text_processing.replace_digits(text)
print("Replace all English digits with Urdu digits, e.g. (2 1 3 1 -> ۲ ۱ ۳ ۱): ", urdu_digits) # Output: میں ۲۳ سال کا ہوں۔
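Digit replacement is a one-to-one character mapping, so it can be sketched with `str.translate`. The helper `to_urdu_digits` is hypothetical, and the sketch assumes that "Urdu digits" means the Extended Arabic-Indic digits U+06F0 to U+06F9, which are the ones used in the examples in this README.

```python
# ASCII digits 0-9 -> Extended Arabic-Indic digits (U+06F0..U+06F9)
ASCII_TO_URDU = {ord(str(i)): chr(0x06F0 + i) for i in range(10)}


def to_urdu_digits(text):
    """Replace every ASCII digit with the matching Urdu digit."""
    return text.translate(ASCII_TO_URDU)


print(to_urdu_digits("2131") == "\u06F2\u06F1\u06F3\u06F1")  # True
```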
7. remove_numbers_urdu(text)
This function removes Urdu numbers from the Urdu text.
Example:
text = "میں 22 ۲۳ سال کا ہوں۔"
no_urdu_numbers = urdu_text_processing.remove_numbers_urdu(text)
print("Remove Urdu numbers from text: ", no_urdu_numbers) # Output: میں 22 سال کا ہوں۔
8. remove_numbers_english(text)
This function removes English numbers from the Urdu text.
Example:
text = "میں ۲۳ 23 سال کا ہوں۔"
no_english_numbers = urdu_text_processing.remove_numbers_english(text)
print("Remove English numbers from text: ", no_english_numbers) # Output: میں ۲۳ سال کا ہوں۔
9. remove_whitespace(text)
This function removes extra whitespaces from the Urdu text.
Example:
text = "میں   گھر   جا   رہا   ہوں۔"
cleaned_text = urdu_text_processing.remove_whitespace(text)
print("Remove All extra space between words", cleaned_text) # Output: میں گھر جا رہا ہوں۔
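In plain Python, the same cleanup is usually done by splitting on whitespace and joining with single spaces. This is a sketch of the general technique, not the library's code, and `collapse_whitespace` is a hypothetical name.

```python
def collapse_whitespace(text):
    """Trim the ends and reduce every run of whitespace to one space."""
    # str.split() with no argument splits on any Unicode whitespace,
    # including tabs and newlines, and drops empty pieces.
    return " ".join(text.split())


print(collapse_whitespace("  a   b \t c\n"))  # a b c
```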
10. preserve_special_characters(text)
This function adds spaces around special characters in the Urdu text to facilitate tokenization.
Example:
text = "میں@پاکستان_سے_ہوں۔"
preserved_text = urdu_text_processing.preserve_special_characters(text)
print("Add a space around every special character so the text tokenizes easily: ", preserved_text) # Output: میں @ پاکستان _ سے _ ہوں ۔
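One way to get this kind of padding is to test each character's Unicode category. The sketch below treats every punctuation (P*) or symbol (S*) character as "special"; that definition and the helper name `pad_special` are assumptions and may not match the library's own list of special characters.

```python
import unicodedata


def pad_special(text):
    """Surround punctuation and symbol characters with single spaces."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch)[0] in "PS":  # P* = punctuation, S* = symbol
            pieces.append(f" {ch} ")
        else:
            pieces.append(ch)
    # Collapse the doubled spaces that padding can create.
    return " ".join("".join(pieces).split())


print(pad_special("a@b_c."))  # a @ b _ c .
```

Because combining diacritics belong to the mark categories (M*), they stay attached to their base letters.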
11. remove_numbers(text)
This function removes both Urdu and English numbers from the Urdu text.
Example:
text = "میں ۲۳ سال کا ہوں اور میری عمر 23 ہے۔"
number_removed = urdu_text_processing.remove_numbers(text)
print("Remove All numbers whether they are Urdu or English: ", number_removed) # Output: میں سال کا ہوں اور میری عمر ہے۔
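The three number-removal functions differ only in which digit range they target, so they can be sketched as one helper with two switches. The helper name `remove_digits` and its `urdu` and `english` flags are hypothetical. The sketch also collapses the gap left behind by a deleted number, which the library may or may not do.

```python
import re

URDU_DIGITS = "\u06F0-\u06F9"  # Extended Arabic-Indic digits
ASCII_DIGITS = "0-9"


def remove_digits(text, urdu=True, english=True):
    """Delete Urdu and/or ASCII digits, then tidy the leftover spaces."""
    ranges = (URDU_DIGITS if urdu else "") + (ASCII_DIGITS if english else "")
    if not ranges:
        return text
    stripped = re.sub(f"[{ranges}]+", "", text)
    return " ".join(stripped.split())


print(remove_digits("a 23 b \u06F2\u06F3 c"))  # a b c
```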
12. remove_english(text)
This function removes English characters from the Urdu text.
Example:
text = "I am learning Urdu."
urdu_only = urdu_text_processing.remove_english(text)
print("Remove All English characters from text: ", urdu_only) # Output: ام لرننگ اردو
13. pure_urdu(text)
This function removes all non-Urdu characters and numbers from the text, leaving only Urdu characters and special characters used in Urdu.
Example:
text = "I ? # & am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
pure_urdu_text = urdu_text_processing.pure_urdu(text)
print(pure_urdu_text) # Output: میں اردو سیکھ رہا ہوں۔
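A rough equivalent can be written with two regular expressions: keep only characters from the Arabic Unicode block (U+0600 to U+06FF), then drop the digits inside that block. Treating that whole block as "Urdu" is an assumption made for this sketch, and `keep_urdu` is a hypothetical helper rather than the library's implementation.

```python
import re

_NOT_ARABIC_BLOCK = re.compile(r"[^\u0600-\u06FF\s]")
_BLOCK_DIGITS = re.compile("[\u0660-\u0669\u06F0-\u06F9]")


def keep_urdu(text):
    """Keep Arabic-block letters and punctuation; drop everything else."""
    text = _NOT_ARABIC_BLOCK.sub("", text)  # Latin letters, ASCII digits, symbols
    text = _BLOCK_DIGITS.sub("", text)      # Arabic-Indic and Urdu digits
    return " ".join(text.split())


result = keep_urdu("I ? am \u0645\u06CC\u06BA\u06D4 123")
print(result == "\u0645\u06CC\u06BA\u06D4")  # True
```

The Urdu full stop (U+06D4) survives because it lies inside the kept block, while the ASCII question mark is removed.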
14. just_urdu(text)
This function removes all non-Urdu characters, numbers, and special characters, including the special characters used in Urdu, leaving only pure Urdu text.
Example:
text = "I am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
