Introducing LughaatNLP: A Powerful Urdu Language Preprocessing Library

Muhammad Noman
Apr 12, 2024

LughaatNLP is an exciting new open-source Python library designed to streamline natural language processing tasks for the Urdu language. Developed by Muhammad Noman, a student at Iqra University in Pakistan, this comprehensive toolkit offers a wide range of features tailored specifically for Urdu text preprocessing.

In natural language processing (NLP), text preprocessing is a crucial step that significantly affects the performance of downstream tasks such as text classification, sentiment analysis, and information extraction. LughaatNLP aims to simplify this step by providing a suite of tools for Urdu text normalization, tokenization, lemmatization, stop word removal, stemming, spell checking, part-of-speech tagging, and named entity recognition.

Key Features of LughaatNLP:

  1. Tokenization: Accurately split Urdu text into individual words, numbers, and punctuation marks while accounting for the intricacies of the Urdu script and language structure.
  2. Lemmatization: Convert inflected Urdu words to their base or dictionary form, enhancing text analysis and comprehension.
  3. Stop Word Removal: Remove common Urdu stop words to focus on meaningful content during text processing.
  4. Normalization: Standardize Urdu text by removing diacritics, normalizing character variations, handling common orthographic variations, and preserving special characters used in Urdu.
  5. Stemming: Reduce Urdu words to their root or stem form, further improving text analysis and comprehension.
  6. Spell Checking: Identify and correct misspelled words in Urdu text, enhancing text quality and readability.
  7. Part-of-Speech Tagging: Assign grammatical categories (e.g., nouns, verbs, adjectives) to words in Urdu text, facilitating syntactic analysis and understanding of sentence structures.
  8. Named Entity Recognition: Identify and categorize named entities (e.g., persons, organizations, locations) within Urdu text, enabling information extraction and semantic analysis.

Installation and Usage:

LughaatNLP is available on PyPI (https://pypi.org/project/lughaatNLP/) and can be easily installed using pip:

pip install lughaatNLP

Import the library and create the objects used in the examples:

from LughaatNLP import LughaatNLP
from LughaatNLP import POS_urdu
from LughaatNLP import NER_Urdu
urdu_text_processing = LughaatNLP()
ner_urdu = NER_Urdu()
pos_tagger = POS_urdu()

Example 1: All-in-One Normalization

text = "آپ کیسے ہیں؟ میں 23 سال کا ہوں۔"
# urdu_text_processing is the LughaatNLP instance created above
normalized_text = urdu_text_processing.normalize(text)
# output => اپ کیسے ہیں ؟ میں ۲۳ سال کا ہوں ۔
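To make the transformation in Example 1 more concrete, here is a minimal standard-library sketch of two of the steps visible in that output: converting ASCII digits to Urdu digits and putting spaces around punctuation. It is not LughaatNLP's implementation. The function name `sketch_normalize` and the punctuation set are assumptions made for illustration, and other steps such as diacritic removal or character mapping are left out.

```python
import re

# ASCII digits mapped to Extended Arabic-Indic digits (U+06F0 to U+06F9), as used in Urdu.
URDU_DIGITS = "".join(chr(0x06F0 + d) for d in range(10))
DIGIT_MAP = str.maketrans("0123456789", URDU_DIGITS)

# Hypothetical punctuation set: Urdu full stop, question mark, comma, semicolon.
PUNCTUATION = "\u06D4\u061F\u060C\u061B"


def sketch_normalize(text: str) -> str:
    """Illustrative only: digit conversion plus spacing around punctuation."""
    text = text.translate(DIGIT_MAP)
    # Surround each punctuation mark with spaces, then collapse repeated whitespace.
    text = re.sub(f"([{PUNCTUATION}])", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()


print(sketch_normalize("آپ کیسے ہیں؟ میں 23 سال کا ہوں۔"))
```

Unlike the library's output shown above, this sketch leaves the letters themselves untouched; it only covers the digit and spacing changes.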

Example 2: Lemmatization

sentence = "میں کتابیں پڑھتا ہوں۔"
lemmatized_sentence = urdu_text_processing.lemmatize_sentence(sentence)
# output => میں کتاب پڑھنا ہوں۔

Example 3: Stop Word Removal

text = "میں اس کتاب کو پڑھنا چاہتا ہوں۔"
filtered_text = urdu_text_processing.remove_stopwords(text)
# output => کتاب پڑھنا چاہتا ہوں۔
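Conceptually, stop word removal is a membership test against a word list. The sketch below illustrates that idea with plain Python. The three-word `STOP_WORDS` set and the function name are hypothetical and chosen only to reproduce the example sentence; the library's actual stop word list is not shown in this article.

```python
# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"میں", "اس", "کو"}


def sketch_remove_stopwords(text: str) -> str:
    """Drop whitespace-separated tokens that appear in STOP_WORDS."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


print(sketch_remove_stopwords("میں اس کتاب کو پڑھنا چاہتا ہوں۔"))
```

Because the sketch splits on whitespace only, a stop word with punctuation attached would not be removed, which is one reason to normalize or tokenize first.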

Example 4: Spell Checker

sentence = 'سسب سےا بڑاا ملکا ہے'
# Assumption: the spell-checking method is called on the same LughaatNLP instance
corrected_sentence = urdu_text_processing.corrected_sentence_spelling(sentence, 60)
# output => the same sentence with its spelling corrected
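For readers curious how dictionary-based correction can work in principle, the following sketch uses Python's standard `difflib` module to replace each word with its closest entry in a vocabulary. This is a swapped-in technique, not the library's algorithm. The mini-vocabulary, the function name, and the reading of the second argument as a similarity percentage are all assumptions; the article does not document what `60` means.

```python
from difflib import get_close_matches

# Hypothetical mini-vocabulary; a real checker would use a large Urdu word list.
VOCABULARY = ["سب", "سے", "بڑا", "ملک", "ہے"]


def sketch_correct(sentence: str, threshold: int = 60) -> str:
    """Replace each word with its closest vocabulary entry, if similar enough.

    `threshold` is treated as a similarity percentage (0-100). That
    interpretation is an assumption of this sketch.
    """
    corrected = []
    for word in sentence.split():
        matches = get_close_matches(word, VOCABULARY, n=1, cutoff=threshold / 100)
        # Keep the original word when nothing in the vocabulary is close enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(sketch_correct("سسب سےا بڑاا ملکا ہے"))
```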

Example 5: Tokenization

text = "میں پاکستان سے ہوں۔"
tokens = urdu_text_processing.urdu_tokenize(text)
# output => ['میں', 'پاکستان ', 'سے', 'ہوں۔']
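As a point of comparison, a very simple regex tokenizer can be written with the standard library. The sketch below is not how LughaatNLP tokenizes; the pattern and function name are assumptions for illustration. It treats each run of word characters as one token and each punctuation mark as its own token.

```python
import re
from typing import List

# One token per run of word characters, or per single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def sketch_tokenize(text: str) -> List[str]:
    """Illustrative regex tokenizer for whitespace-delimited Urdu text."""
    return TOKEN_PATTERN.findall(text)


print(sketch_tokenize("میں پاکستان سے ہوں۔"))
```

Note that this sketch splits the Urdu full stop into a separate token. Python's `\w` also does not match combining diacritics, so diacritized text would be broken apart unless it is normalized first.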

Example 6: Part-of-Speech Tagging

sentence = "میرے والدین نے میری تعلیم اور تربیت میں بہت محنت کی تاکہ میں اپنی زندگی میں کامیاب ہو سکوں۔"
predicted_pos_tags = pos_tagger.pos_tags_urdu(sentence)

print(predicted_pos_tags)

# output => [{'Word': 'میرے', 'POS_Tag': 'G'},
#  {'Word': 'والدین', 'POS_Tag': 'NN'},
#  {'Word': 'نے', 'POS_Tag': 'P'},
#  {'Word': 'میری', 'POS_Tag': 'G'},
#  {'Word': 'تعلیم', 'POS_Tag': 'NN'},
#  {'Word': 'اور', 'POS_Tag': 'CC'},
#  {'Word': 'تربیت', 'POS_Tag': 'NN'},
#  {'Word': 'میں', 'POS_Tag': 'P'},
#  {'Word': 'بہت', 'POS_Tag': 'ADV'},
#  {'Word': 'محنت', 'POS_Tag': 'NN'},
#  {'Word': 'کی', 'POS_Tag': 'VB'},
#  {'Word': 'تاکہ', 'POS_Tag': 'SC'},
#  {'Word': 'میں', 'POS_Tag': 'P'},
#  {'Word': 'اپنی', 'POS_Tag': 'GR'},
#  {'Word': 'زندگی', 'POS_Tag': 'NN'},
#  {'Word': 'میں', 'POS_Tag': 'P'},
#  {'Word': 'کامیاب', 'POS_Tag': 'ADJ'},
#  {'Word': 'ہو', 'POS_Tag': 'VB'},
#  {'Word': 'سکوں', 'POS_Tag': 'NN'},
#  {'Word': '۔', 'POS_Tag': 'SM'}]
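Since the tagger returns a list of dictionaries, ordinary Python is enough to work with the result. The sketch below groups words by tag using a few entries in the shape of the output above. It assumes the keys are spelled `Word` and `POS_Tag` with no surrounding whitespace; `group_by_tag` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict

# Sample data in the shape of the tagger output (first few entries only).
tagged = [
    {"Word": "میرے", "POS_Tag": "G"},
    {"Word": "والدین", "POS_Tag": "NN"},
    {"Word": "نے", "POS_Tag": "P"},
    {"Word": "تعلیم", "POS_Tag": "NN"},
]


def group_by_tag(tagged_words):
    """Return a dict mapping each POS tag to the words carrying it, in order."""
    groups = defaultdict(list)
    for entry in tagged_words:
        groups[entry["POS_Tag"]].append(entry["Word"])
    return dict(groups)


groups = group_by_tag(tagged)
print(groups.get("NN", []))  # the words tagged as nouns
```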

Example 7: Named Entity Recognition

sentence = "اس کتاب میں پاکستان کی تاریخ بیان کی گئی ہے۔"
word_tag_dict = ner_urdu.ner_tags_urdu(sentence)
print(word_tag_dict)
# output => {'اس': 'O', 'کتاب': 'O', 'میں': 'O', 'پاکستان': 'U-LOCATION', 'کی': 'O', 'تاریخ': 'O', 'بیان': 'O', 'گئی': 'O', 'ہے': 'O', '۔': 'O'}
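To pull only the named entities out of such a word-to-tag dictionary, a small filter is enough. The sketch below assumes tags are either `O` or of the form `<prefix>-<TYPE>`, as in `U-LOCATION`; `extract_entities` is a hypothetical helper, and the meaning of the `U-` prefix is not explained in this article.

```python
def extract_entities(word_tags):
    """Return (word, entity_type) pairs for every word not tagged 'O'.

    Assumes tags look like 'O' or '<prefix>-<TYPE>' (for example 'U-LOCATION').
    """
    entities = []
    for word, tag in word_tags.items():
        tag = tag.strip()  # tolerate stray whitespace around the tag
        if tag != "O":
            # Keep only the part after the first hyphen as the entity type.
            entity_type = tag.split("-", 1)[-1]
            entities.append((word, entity_type))
    return entities


sample = {"اس": "O", "کتاب": "O", "میں": "O", "پاکستان": "U-LOCATION"}
print(extract_entities(sample))
```

One caveat follows from the dictionary format itself: keys are unique, so a word that occurs twice in a sentence (such as 'کی' in the example) appears only once in the result.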

The library provides more functions than the ones shown here. See the documentation for the full list:

(https://drive.google.com/file/d/1guI7VSlrnSQDBk40qcqvEOX_TKbCJd9b/view)

After installation, you can import the necessary functions or classes in your Python script and start processing Urdu text right away. The library provides detailed documentation (https://github.com/MuhammadNoman76/LughaatNLP/) with usage examples and explanations for each function.

Contributing and Future Plans:

LughaatNLP is an open-source project, and contributions from the community are welcome. If you encounter any issues or have suggestions for improvements, you can open an issue on the GitHub repository (https://github.com/MuhammadNoman76/LughaatNLP/) or submit a pull request.

Muhammad Noman has ambitious plans for the future development of LughaatNLP, including adding features such as Urdu language translation, chatbot models, text-to-speech and speech-to-text capabilities, and text summarization. To support the implementation of these features, resources such as servers and GPUs for training are required. Muhammad Noman is currently collecting funds to support the development and maintenance of this library.

Conclusion:

LughaatNLP represents a significant step forward in enabling natural language processing for the Urdu language. By providing a comprehensive set of tools for Urdu text preprocessing, this library aims to facilitate the development of various NLP applications and research projects involving Urdu text data. Whether you’re a researcher, developer, or enthusiast interested in Urdu language processing, LughaatNLP is an invaluable resource worth exploring.

To learn more about LughaatNLP, visit the GitHub repository (https://github.com/MuhammadNoman76/LughaatNLP/) or check out Muhammad Noman’s YouTube playlist (https://www.youtube.com/playlist?list=PL4tcmUwDtJEIHZhAZ3XP9U6ZJzaS4RFbd) dedicated to the library. Don’t hesitate to reach out to Muhammad Noman via email (muhammadnomanshafiq76@gmail.com) or LinkedIn (https://www.linkedin.com/in/muhammad-noman-shafiq-5982b62ab/) with any questions or feedback.
