Natural Language Processing in a nutshell

Naethreepremnath · Published in Python’s Gurus · 4 min read · May 3, 2024

Have you ever wondered how computers can translate text from one language to another? Or predict the next word of the text message you are typing to a friend? Or how a movie review can be classified as positive, negative, or neutral? It is amazing how computers understand human speech and text and respond to it. But have you ever wondered what happens behind the scenes?


All of this is possible because of NLP (Natural Language Processing), a branch of artificial intelligence that gives computers the ability to understand human language, also known as natural language. NLP is not something new; it has existed for more than 50 years. However, it keeps evolving at an exceptional rate, and you too could contribute. But first, let’s understand more about how it works.

How does NLP work?

To understand this, we need to look at the two main phases of NLP: data pre-processing and algorithm development.

  1. Data pre-processing

Data pre-processing prepares raw text so that machines can analyze it. It involves several steps aimed at cleaning, transforming, and organizing the data. Here’s a more detailed explanation of a few commonly used techniques (a short code sketch follows at the end of this section):

  • Tokenization — Tokenization breaks down the text into smaller units known as tokens. This allows the machine to analyze each unit individually. Note that tokens can be words, phrases, or even characters, depending on the specific task or requirements.
  • Stop word removal — Stop words are commonly used words that do not hold significant value, e.g. ‘the’, ‘is’, ‘and’. These common words are removed so that the remaining, more important words can be analyzed by the machine. However, remember that in some cases stop word removal can lead to incorrect conclusions. E.g.: when stop word removal is applied to “Spiderman was not a good movie”, the word ‘not’ is removed, which can make the review appear positive, which is incorrect.
  • Stemming — Stemming removes prefixes and suffixes so that only the base/root form remains. E.g.: analyzing becomes analyze, movies become movie.
  • Lemmatization — Lemmatization is similar to stemming, but it is a bit more complex. A lemma is the dictionary form of a word. Unlike stemming, lemmatization looks at the context of the word and its part of speech (noun, adjective, verb) to produce valid lemmas. E.g.: went becomes go.

NOTE: If you are unsure which to choose, remember that stemming is a simpler and more aggressive approach, often used for tasks where speed is a priority, while lemmatization provides more accurate results at the cost of computational resources and complexity. Keeping this in mind, the choice is yours!

Other than the above, the usual ways of handling missing values and noise (HTML tags, emojis) still apply.

The data has to undergo proper pre-processing; otherwise, the accuracy of the machine’s analysis will be low.
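To make these steps concrete, here is a minimal sketch using NLTK, one popular choice among several (spaCy or TextBlob would work just as well). The example sentence and the printed outputs are only illustrative, and depending on your NLTK version you may need to download extra resources.

```python
# A minimal pre-processing sketch using NLTK (an assumed library choice).
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the tokenizer model, stop word list and WordNet
# (newer NLTK versions may also need "punkt_tab" and "omw-1.4").
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

text = "Spiderman was not a good movie"

# Tokenization: split the sentence into word-level tokens
tokens = word_tokenize(text.lower())
print(tokens)  # ['spiderman', 'was', 'not', 'a', 'good', 'movie']

# Stop word removal: note how 'not' disappears, flipping the apparent sentiment
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['spiderman', 'good', 'movie']

# Stemming: fast suffix stripping; the result is not always a real word
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["movies", "analyzing"]])  # ['movi', 'analyz']

# Lemmatization: slower, but returns valid dictionary forms
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("went", pos="v"))  # 'go'
print(lemmatizer.lemmatize("movies"))         # 'movie' (default POS is noun)
```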

2. Algorithm development

There are two types: rule-based systems and machine learning-based systems. A short sketch contrasting the two follows this list.

  • Rule-based system — Relies on pre-defined rules and patterns to analyze and process data, which makes it transparent and interpretable; however, it may struggle with ambiguity and with adapting to new situations.
  • Machine learning-based system — Uses algorithms that learn from data to automatically identify patterns and make predictions or classifications. This offers flexibility and can handle complex patterns. However, it requires more computational resources, and its inner workings can be difficult to understand due to its complexity.

NOTE: Hybrid approaches that combine both of the above types are also used.
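As a rough illustration of the difference, here is a toy sketch of both approaches applied to sentiment classification. The keyword lists and the tiny training set are made up for illustration, and the second half assumes scikit-learn is installed.

```python
# Toy comparison of a rule-based and a machine learning-based classifier.

# --- Rule-based: hand-written keyword rules, transparent but brittle ---
POSITIVE = {"good", "great", "amazing"}
NEGATIVE = {"bad", "boring", "terrible"}

def rule_based_sentiment(text: str) -> str:
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(rule_based_sentiment("The movie was great"))  # positive
print(rule_based_sentiment("Not great at all"))     # positive (the rules miss negation)

# --- Machine learning-based: learns word patterns from labelled examples ---
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "great movie, loved it",
    "amazing acting and story",
    "boring and way too long",
    "terrible, a waste of time",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["what an amazing story"]))  # ['positive']
```

The rule-based version is easy to inspect but fails as soon as the wording deviates from its keyword lists, while the learned model generalizes only as well as the labelled data it was trained on, which mirrors the trade-off described above.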

Applications of NLP

  • Sentiment analysis — Involves determining the sentiment behind a text. E.g.: movie reviews or customer feedback can be categorized as positive, negative or neutral.
  • Text classification — Data can be classified into different labels. E.g.: emails can be categorized as spam, social media content can be screened to identify posts that violate community guidelines, and news articles can be classified by topic (politics, finance, sports).
  • NER (Named Entity Recognition) — Identifies and extracts important entities such as names of people, organizations, locations, and dates from text (see the sketch after this list).
  • Machine translation — Translates text from one language to another. E.g.: Google Translate.
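For a quick taste of NER in practice, here is a minimal sketch using spaCy and its small English model, which is an assumed setup; the exact entities and labels you get depend on the model you load.

```python
# A minimal NER sketch using spaCy (assumes: pip install spacy
# and python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in California in 1976.")

# Print each recognized entity together with its predicted label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical output: Apple -> ORG, Steve Jobs -> PERSON,
#                 California -> GPE, 1976 -> DATE
```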

I could keep going, because that is how widely NLP is used, but I will stop here, since my objective was to introduce NLP in a nutshell.

In conclusion, Natural Language Processing (NLP) serves as the bridge between human language and machine understanding, revolutionizing the way we interact with technology. From sentiment analysis to machine translation, NLP plays a role in various aspects of our daily lives, making communication more efficient and insightful than ever before. As NLP continues to evolve at a rapid pace, the possibilities for its applications are endless, promising a future where machines comprehend and respond to human language with unprecedented sophistication. So, the next time you marvel at a translated webpage or witness a movie review categorized with precision, remember the remarkable role that NLP plays in shaping our digital world.

Python’s Gurus🚀

Thank you for being a part of the Python’s Gurus community!

Before you go:

  • Be sure to clap 50 times and follow the writer 👏
  • Follow us: Newsletter
  • Do you aspire to become a Guru too? Submit your best article or draft to reach our audience.

