LughaatNLP

LughaatNLP is an all-in-one Urdu language toolkit designed for NLP tasks in Pakistan. It provides tokenization, lemmatization, stemming, stop word removal, spell checking, POS tagging, NER, normalization, summarization, text-to-speech, and speech-to-text, all built specifically for Urdu.

Documentation

Explore the full potential of LughaatNLP through its detailed documentation, which includes practical usage examples, installation guides, and API references.

Google Colab Notebook

Get hands-on with LughaatNLP using the interactive Google Colab Notebook provided in the documentation. This notebook lets you experiment with the toolkit's functionalities directly in your browser.

PyPI Package

Install LughaatNLP effortlessly via its PyPI page to integrate Urdu NLP capabilities into your Python projects seamlessly.

YouTube Tutorial Series

Accelerate your understanding of LughaatNLP's features with the dedicated YouTube tutorial playlist, offering step-by-step guidance on utilizing the toolkit for various Urdu NLP tasks.

Blogs and Articles

Gain insights into advanced techniques and best practices for mastering Urdu text processing through informative articles like:

GeeksforGeeks: LughaatNLP: A Powerful Urdu Language Preprocessing Library
Medium: Introducing LughaatNLP: A Powerful Urdu Language Preprocessing Library
LughaatNLP Blog: Mastering Urdu Text Processing

Features

Tokenization: Breaks down Urdu text into individual tokens, considering the intricacies of Urdu script and language structure.
Lemmatization: Converts inflected words into their base or dictionary form, aiding in text analysis and comprehension.
Stop Word Removal: Eliminates common Urdu stop words to focus on meaningful content during text processing.
Normalization: Standardizes text by removing diacritics, normalizing character variations, and handling common orthographic variations in Urdu.
Stemming: Reduces words to their root form, improving text analysis and comprehension in Urdu.
Spell Checker: Identifies and corrects misspelled words in Urdu text, enhancing text quality and readability.
Part of Speech Extraction: Tags words with their grammatical categories, enabling advanced syntactic analysis.
Named Entity Recognition (NER): Identifies and extracts the names of entities such as persons, organizations, and locations.
Text Summarization: Generates concise summaries of Urdu text, facilitating quick understanding of lengthy content.
Text-to-Speech Conversion: Converts Urdu text into spoken audio, enabling accessibility and language learning applications.
Speech-to-Text Conversion: Transcribes spoken Urdu into written text, facilitating voice-based interactions and data entry.

Installation

You can install the LughaatNLP library from PyPI using pip:

pip install lughaatNLP

Alternatively, you can manually install it by downloading and unzipping the provided LughaatNLP.rar file and installing the wheel file using pip:

pip install path_to_wheel_file/LughaatNLP-1.0.2-py3-none-any.whl

Required Packages

The LughaatNLP library requires the following packages:

python-Levenshtein
tensorflow
numpy
scikit-learn
scipy
gtts
SpeechRecognition
pydub

You can install these packages using pip:

pip install python-Levenshtein tensorflow numpy scikit-learn scipy gtts SpeechRecognition pydub
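
Before importing the toolkit, it can help to confirm that the packages listed above are importable. The sketch below is not part of LughaatNLP: `missing_packages` is a hypothetical helper, and the import names (`Levenshtein`, `sklearn`, `speech_recognition`) are the module names these distributions normally expose, which differ from their pip names.

```python
from importlib.util import find_spec

# pip distribution name -> module name used in `import` statements
REQUIRED = {
    "python-Levenshtein": "Levenshtein",
    "tensorflow": "tensorflow",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "scipy": "scipy",
    "gtts": "gtts",
    "SpeechRecognition": "speech_recognition",
    "pydub": "pydub",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import module cannot be located."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a ready-to-run command for whatever is absent.
        print("pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

`find_spec` only locates a module without importing it, so the check stays fast even for heavy packages such as tensorflow.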

Usage

After installing the library, you can import the necessary functions or classes in your Python script:

# Importing packages
from LughaatNLP import LughaatNLP
from LughaatNLP import POS_urdu
from LughaatNLP import NER_Urdu
from LughaatNLP import TextSummarization
from LughaatNLP import UrduSpeech

# Creating instances
urdu_text_processing = LughaatNLP()
ner_urdu = NER_Urdu()
pos_tagger = POS_urdu()
summarizer = TextSummarization()
speech_urdu = UrduSpeech()

Functions

Normalization

1. `normalize_characters(text)`

This function normalizes the Urdu characters in the given text by mapping incorrect characters to their correct Urdu forms. The same Urdu letter can be encoded with more than one Unicode code point (for example, by an Arabic look-alike), and this function maps such variants to a single standard form.

Example:

text = "آپ کیسے ہیں؟"
normalized_text = urdu_text_processing.normalize_characters(text)
print(normalized_text)  # Output: اپ کیسے ہیں؟
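
To illustrate the idea of mapping variant code points to one standard form, here is a minimal standalone sketch using only the standard library. It does not reproduce the library's own mapping table: `map_to_urdu_characters` and the four entries in `ARABIC_TO_URDU` are assumptions chosen for illustration.

```python
# Arabic code points that often turn up in Urdu text, mapped to the
# code points normally used for Urdu (a small illustrative subset).
ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH          -> ARABIC LETTER HEH GOAL
})


def map_to_urdu_characters(text: str) -> str:
    """Replace each listed Arabic variant with its Urdu counterpart."""
    return text.translate(ARABIC_TO_URDU)


# The word "kya" typed with Arabic KAF and YEH looks the same on screen
# but compares unequal to the Urdu-encoded word until it is mapped.
arabic_typed = "\u0643\u064A\u0627"
urdu_typed = "\u06A9\u06CC\u0627"
print(arabic_typed == urdu_typed)                          # False
print(map_to_urdu_characters(arabic_typed) == urdu_typed)  # True
```

A translation table keeps the mapping in one place, so extending it to further variants only means adding entries.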

2. `normalize_combine_characters(text)`

This function simplifies Urdu characters by combining certain combinations into their correct single forms. In Urdu writing, some characters are made up of multiple parts like ligatures or diacritics. This function finds these combinations in the text and changes them to their single character forms. It ensures consistency and accuracy in how Urdu text is represented.

Example:

text = "اُردو"
normalized_text = urdu_text_processing.normalize_combine_characters(text)
print(normalized_text)  # Output: اُردو
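
Combining a base letter and a following mark into a single code point is what Unicode NFC normalization does wherever a precomposed character exists. The sketch below shows that general technique with the standard `unicodedata` module; whether the library uses NFC or its own table is not stated here, and `combine_characters` is a hypothetical name.

```python
import unicodedata


def combine_characters(text: str) -> str:
    # NFC replaces a base letter followed by a combining mark with the
    # precomposed code point whenever Unicode defines one.
    return unicodedata.normalize("NFC", text)


decomposed = "\u0627\u0653"  # ALEF + MADDAH ABOVE, two code points
composed = combine_characters(decomposed)
print(len(decomposed), len(composed))  # 2 1
print(composed == "\u0622")            # True (ALEF WITH MADDA ABOVE)
```

Marks with no precomposed form, such as a pesh on an alef, stay as separate code points under NFC.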

3. `normalize(text)`

This function performs all-in-one normalization on the Urdu text, including character normalization, diacritic removal, punctuation handling, digit conversion, and special character preservation.

Example:

text = "آپ کیسے ہیں؟ میں ۲۳ سال کا ہوں۔"
normalized_text = urdu_text_processing.normalize(text)
print("Normalize all at once together of Urdu: ", normalized_text)  # Output: اپ کیسے ہیں ؟ میں 23 سال کا ہوں ۔
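
An all-in-one normalizer can be thought of as several small steps applied in a fixed order. The following is a rough sketch of that composition pattern, not the library's implementation: the step functions, their order, and the punctuation set are all assumptions.

```python
import re

URDU_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))
TO_ASCII_DIGITS = str.maketrans(URDU_DIGITS, "0123456789")
DIACRITICS = re.compile("[\u064B-\u0652]")
PUNCTUATION = re.compile("([\u06D4\u061F\u060C\u061B])")


def strip_diacritics(text):
    return DIACRITICS.sub("", text)


def ascii_digits(text):
    return text.translate(TO_ASCII_DIGITS)


def pad_punctuation(text):
    # Surround each Urdu punctuation mark with spaces.
    return PUNCTUATION.sub(r" \1 ", text)


def collapse_spaces(text):
    return " ".join(text.split())


# Order matters: padding adds spaces that the last step cleans up.
STEPS = (strip_diacritics, ascii_digits, pad_punctuation, collapse_spaces)


def normalize_sketch(text):
    for step in STEPS:
        text = step(text)
    return text
```

Keeping the steps in a tuple makes it easy to drop or reorder one of them when a task needs only part of the normalization.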

4. `remove_diacritics(text)`

This function removes diacritics (zabar, zer, pesh) from the Urdu text.

Example:

text = "کِتَاب"
diacritics_removed = urdu_text_processing.remove_diacritics(text)
print("Remove all Diacritic (Zabar - Zer - Pesh): ", diacritics_removed)  # Output: کتاب
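
Diacritic removal can be sketched as deleting a range of combining marks with a regular expression. This is an illustration of the technique only; the exact set of marks the library strips is an assumption, and `strip_diacritics` is a hypothetical name.

```python
import re

# Combining marks U+064B..U+0652 (tanween, zabar, pesh, zer, shadda,
# sukun) plus U+0670 (superscript alef).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    return DIACRITICS.sub("", text)


# "kitab" written with zer and zabar, then with the marks removed.
print(strip_diacritics("\u06A9\u0650\u062A\u064E\u0627\u0628"))  # prints کتاب
```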

5. `punctuations_space(text)`

This function removes spaces before punctuation marks and places a space after them (excluding punctuation inside numbers) in the Urdu text.

Example:

text = "کیا آپ کھانا کھانا چاہتے ہیں ؟ میں کھانا کھاؤں گا  ۔"
punctuated_text = urdu_text_processing.punctuations_space(text)
print(punctuated_text)  # Output: کیا آپ کھانا کھانا چاہتے ہیں؟ میں کھانا کھاؤں گا۔
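
The spacing rule can be sketched with two regular-expression passes. This is an approximation under stated assumptions: the punctuation set below is a small hand-picked one that leaves out `.` and `,`, so decimal numbers are untouched, and `tidy_punctuation_spacing` is a hypothetical name.

```python
import re

PUNCT = "\u06D4\u061F\u060C\u061B!"  # Urdu full stop, question mark, comma, semicolon, and !


def tidy_punctuation_spacing(text: str) -> str:
    # 1) Drop any whitespace sitting directly before a punctuation mark.
    text = re.sub(rf"\s+([{PUNCT}])", r"\1", text)
    # 2) Insert one space after a mark when a word follows it immediately.
    text = re.sub(rf"([{PUNCT}])(?=[^\s{PUNCT}])", r"\1 ", text)
    return text
```

The lookahead in the second pass skips marks that are already followed by a space, by the end of the string, or by another punctuation mark.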

6. `replace_digits(text)`

This function replaces English digits with Urdu digits.

Example:

text = "میں 23 سال کا ہوں۔"
urdu_digits = urdu_text_processing.replace_digits(text)
print("Replace all English digits with Urdu digits (e.g. 2 1 3 1 -> ۲ ۱ ۳ ۱): ", urdu_digits)  # Output: میں ۲۳ سال کا ہوں۔
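
Digit replacement is a one-to-one character mapping, which `str.translate` handles directly. The sketch below assumes Urdu digits are the Extended Arabic-Indic digits U+06F0 to U+06F9; the function names are hypothetical and not taken from the library.

```python
ASCII_DIGITS = "0123456789"
URDU_DIGITS = "".join(chr(0x06F0 + i) for i in range(10))  # U+06F0 .. U+06F9

TO_URDU = str.maketrans(ASCII_DIGITS, URDU_DIGITS)
TO_ASCII = str.maketrans(URDU_DIGITS, ASCII_DIGITS)


def to_urdu_digits(text: str) -> str:
    return text.translate(TO_URDU)


def to_ascii_digits(text: str) -> str:
    return text.translate(TO_ASCII)


print(to_urdu_digits("23") == "\u06F2\u06F3")  # True
print(to_ascii_digits(to_urdu_digits("2131")))  # 2131
```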

7. `remove_numbers_urdu(text)`

This function removes Urdu numbers from the Urdu text.

Example:

text = "میں  22 ۲۳ سال کا ہوں۔"
no_urdu_numbers = urdu_text_processing.remove_numbers_urdu(text)
print("Remove Urdu numbers from text: ", no_urdu_numbers)  # Output: میں 22 سال کا ہوں۔

8. `remove_numbers_english(text)`

This function removes English numbers from the Urdu text.

Example:

text = "میں ۲۳ 23 سال کا ہوں۔"
no_english_numbers = urdu_text_processing.remove_numbers_english(text)
print("Remove English numbers from text: ", no_english_numbers)  # Output: میں ۲۳ سال کا ہوں۔

9. `remove_whitespace(text)`

This function removes extra whitespace from the Urdu text.

Example:

text = "میں   گھر   جا   رہا   ہوں۔"
cleaned_text = urdu_text_processing.remove_whitespace(text)
print("Remove All extra space between words", cleaned_text)  # Output: میں گھر جا رہا ہوں۔
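
Collapsing runs of whitespace needs nothing beyond built-in string methods. A minimal sketch with a hypothetical `collapse_whitespace` helper, which may differ from how the library handles tabs and newlines:

```python
def collapse_whitespace(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading and trailing whitespace, so joining the pieces with
    # a single space leaves exactly one space between words.
    return " ".join(text.split())


print(collapse_whitespace("  a   b \t c\n"))  # a b c
```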

10. `preserve_special_characters(text)`

This function adds spaces around special characters in the Urdu text to facilitate tokenization.

Example:

text = "میں@پاکستان_سے_ہوں۔"
preserved_text = urdu_text_processing.preserve_special_characters(text)
print("make a space between every special character and word so tokenize easily", preserved_text)  # Output: میں @ پاکستان _ سے _ ہوں ۔
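
Padding special characters with spaces lets a plain whitespace split act as a tokenizer. The sketch below illustrates this with an assumed, hand-picked character set; `SPECIAL_CHARS` and `pad_special_characters` are hypothetical and the library's own set may be larger.

```python
import re

# ASCII symbols plus the Urdu full stop, question mark, comma, semicolon.
SPECIAL_CHARS = "@#_$%&*+-=/.,:;!?" + "\u06D4\u061F\u060C\u061B"
SPECIAL = re.compile("([" + re.escape(SPECIAL_CHARS) + "])")


def pad_special_characters(text: str) -> str:
    # Put a space on both sides of every special character, then
    # collapse the doubled spaces this can create.
    padded = SPECIAL.sub(r" \1 ", text)
    return " ".join(padded.split())


tokens = pad_special_characters("a@b_c\u06D4").split()
print(len(tokens))  # 6
```

`re.escape` is used so that characters such as `-` and `.` are treated literally inside the character class.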

11. `remove_numbers(text)`

This function removes both Urdu and English numbers from the Urdu text.

Example:

text = "میں ۲۳ سال کا ہوں اور میری عمر 23 ہے۔"
number_removed = urdu_text_processing.remove_numbers(text)
print("Remove All numbers whether they are Urdu or English: ", number_removed)  # Output: میں سال کا ہوں اور میری عمر ہے۔
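
The three number-removal functions differ only in which digit range they target. A minimal sketch of that idea follows, assuming Urdu numbers are runs of U+06F0 to U+06F9 and English numbers are runs of ASCII digits; `remove_matches` is a hypothetical helper.

```python
import re

ASCII_NUMBER = r"[0-9]+"
URDU_NUMBER = r"[\u06F0-\u06F9]+"


def remove_matches(text: str, *patterns: str) -> str:
    """Delete every match of each pattern, then tidy the spacing."""
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return " ".join(text.split())


sample = "a \u06F2\u06F3 b 23 c"
print(remove_matches(sample, URDU_NUMBER) == "a b 23 c")       # True
print(remove_matches(sample, ASCII_NUMBER, URDU_NUMBER))       # a b c
```

Passing one pattern mirrors the script-specific functions, and passing both mirrors the combined one.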

12. `remove_english(text)`

This function removes English characters from the Urdu text.

Example:

text = "I am learning Urdu."
urdu_only = urdu_text_processing.remove_english(text)
print("Remove All English characters from text: ", urdu_only)  # Output: the input with its English letters stripped out

13. `pure_urdu(text)`

This function removes all non-Urdu characters and numbers from the text, leaving only Urdu characters and special characters used in Urdu.

Example:

text = "I ?  # & am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
pure_urdu_text = urdu_text_processing.pure_urdu(text)
print(pure_urdu_text)  # Output: میں اردو سیکھ رہا ہوں۔
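
Filtering by Unicode range is one way to keep only Urdu-script text. The sketch below keeps characters from the Arabic block (U+0600 to U+06FF), which includes Urdu punctuation, and drops the digits in that block. The range choice and the name `keep_urdu_script` are assumptions, not the library's definition of "Urdu characters".

```python
import re

# Anything that is neither in the Arabic block nor whitespace.
NON_ARABIC_BLOCK = re.compile(r"[^\u0600-\u06FF\s]")
# Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9) digits.
BLOCK_DIGITS = re.compile(r"[\u0660-\u0669\u06F0-\u06F9]")


def keep_urdu_script(text: str) -> str:
    text = NON_ARABIC_BLOCK.sub("", text)  # drops Latin letters, ASCII digits, symbols
    text = BLOCK_DIGITS.sub("", text)      # drops Urdu and Arabic digits
    return " ".join(text.split())          # tidy the spaces left behind


mixed = "I ? # am 123 \u0645\u06CC\u06BA \u06F2 \u0627\u0631\u062F\u0648\u06D4"
print(keep_urdu_script(mixed) == "\u0645\u06CC\u06BA \u0627\u0631\u062F\u0648\u06D4")  # True
```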

14. `just_urdu(text)`

This function removes all non-Urdu characters, numbers, and special characters, including the special characters used in Urdu, leaving only pure Urdu text.

Example:

text = "I am learning Urdu. میں اردو سیکھ رہا ہوں۔ 123"
print(urdu_text_processing.just_urdu(text))
