Introduction to Natural Language Processing

Harsh Bardhan Mishra
Published in Analytics Vidhya
5 min read · Mar 8, 2020

Natural Language Processing, commonly abbreviated as NLP, is a specialized branch of Artificial Intelligence and Machine Learning that deals with understanding human natural language. It has opened up new use cases for Artificial Intelligence in real-time applications by expanding what is possible in speech recognition, sentiment analysis, topic segmentation and much more. Today, NLP aims to help machines understand how humans speak, including their vocabulary and intent, so that this understanding can be used in a variety of real-world applications with a primary focus on the user.

Natural Language Processing has long been considered computationally expensive, since human language is neither uniform nor simple, and grammatical errors and improper sentence structure add to the woes of NLP researchers. Today, with a wide variety of libraries and datasets available, getting started with NLP has become much easier, since these resources catalogue everything from grammatical mistakes to word synonyms.

However, Natural Language Processing is a deep and highly specialized area, and one that is still a hotbed of research, especially with neural architectures such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), which are considered far more effective at analyzing and representing human language.

Before moving on to the technicalities of Natural Language Processing, we need to cover some basics. Human text and speech contain a lot of noise: stopwords (very common words such as “the” and “is”), media entities such as URLs, punctuation, slang and more. Before analyzing any text, our first step is therefore to standardize it through text preprocessing, which can be defined as the entire procedure used to make the text noise-free and fit for data analysis.
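As a rough illustration of text preprocessing, here is a minimal sketch that uses only the Python standard library. The `preprocess` function and the tiny `STOPWORDS` set are assumptions made for this example; libraries such as NLTK ship far more complete stopword lists and tokenizers.

```python
import re
import string

# A tiny illustrative stopword list (an assumption for this sketch);
# real libraries ship much longer ones.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "in", "on", "it", "this"}


def preprocess(text):
    """Lowercase the text, strip URLs and punctuation, and drop stopwords."""
    text = text.lower()
    # Remove URLs, one kind of media entity that adds noise.
    text = re.sub(r"https?://\S+", " ", text)
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Naive whitespace tokenization.
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("This is the BEST tutorial on NLP!!! Read it: https://example.com/nlp"))
```

Only the content-bearing words survive; the stopwords, the punctuation and the link are gone.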

After text preprocessing, we move on to normalization. A single word can appear in many surface forms, and treating each form as a separate feature inflates the dimensionality of the data; normalization maps these variants to one representation, which reduces high-dimensional features to a lower-dimensional space. For example, an English text may contain multiple variations of the same word, such as its participle or its past, present and future tense forms. Lexicon normalization tackles this problem by reducing words to their roots, typically through stemming or lemmatization.
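To make the idea of lexicon normalization concrete, here is a deliberately crude suffix-stripping sketch. The `crude_stem` function is hypothetical and much simpler than real stemmers such as the Porter stemmer; it only shows how several forms of a word can collapse into one root.

```python
def crude_stem(word):
    """Strip a few common English suffixes.

    Deliberately simplistic: this is NOT the Porter algorithm, just an
    illustration of mapping word variants to a shared root.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["play", "plays", "played", "playing"]
print({w: crude_stem(w) for w in words})
```

All four forms map to the same root, so a model sees one feature instead of four.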

The data is then shuffled so that its original ordering does not bias the model and so that the training and test portions are representative of the whole dataset. This is followed by one of the most important processes in Natural Language Processing: feature extraction. Since machine learning models only understand numerical data, the text is tokenized, features are extracted from the tokens, and those features are vectorized into a numerical form that the algorithm can work with. Text features can be extracted with various techniques, such as N-grams and TF-IDF.
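The two techniques named above can be sketched in a few lines of plain Python. The helper names `ngrams` and `tf_idf` are assumptions for this example, and the TF-IDF formula used here (term count divided by document length, times the log of the number of documents over the document frequency) is just one common textbook variant; libraries usually add smoothing on top.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute a TF-IDF weight for every term of every tokenized document.

    tf  = count of the term / length of the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
print(ngrams(docs[0], 2))
print(tf_idf(docs)[0])
```

Words that occur in every document, such as “the”, receive a weight of zero, while words that are specific to one document, such as “cat”, receive a positive weight.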

Now that the textual data has been preprocessed and converted into numerical form, it is time to apply a machine learning algorithm. In a supervised learning problem, the most popular use cases are prediction problems, such as the gender classification we are going to discuss in this article. In an unsupervised learning problem, segmentation and clustering can be applied to the vectorized data. Now that we are aware of the basics of Natural Language Processing, let us move on to a popular NLP problem: predicting the gender of a person from their name.

Predicting the Gender of a Person using Machine Learning

In this problem, we are going to predict the gender of a person using Machine Learning and Natural Language Processing. We will use an open-source dataset and Naive Bayes, a popular machine learning algorithm for classification. Let us start:

1. We will first import the necessary packages: NLTK (Natural Language Toolkit), a popular library for Natural Language Processing, and the Naive Bayes classifier from the scikit-learn machine learning library.

2. Next, we will download our dataset from NLTK, which consists of a large number of names labelled as “Male” and “Female”. We will import the dataset in our code and load it into a data frame. The data frame is then randomly shuffled so that the male and female names are mixed before the data is split.
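The loading and shuffling step might look roughly like the sketch below. To keep it runnable without any downloads, it uses a small hand-written sample (an assumption for this example) and a plain list instead of a data frame. With NLTK installed, the real names can be read with `nltk.corpus.names.words("male.txt")` and `nltk.corpus.names.words("female.txt")` after running `nltk.download("names")`.

```python
import random

# Hypothetical hand-written sample standing in for the real dataset.
MALE = ["John", "David", "Peter", "Mark", "Robert", "Kevin"]
FEMALE = ["Anna", "Maria", "Julia", "Sophia", "Emma", "Olivia"]

# Attach a label to every name.
labeled_names = [(name, "male") for name in MALE] + [(name, "female") for name in FEMALE]

# Shuffle so that male and female names are interleaved; the fixed seed
# keeps the order reproducible between runs.
random.Random(42).shuffle(labeled_names)

print(labeled_names[:4])
```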

3. Now we come to one of the most important parts of any machine learning implementation: the train-test split. It divides the dataset so that one part is used for training the model while the other part is used for testing its accuracy. Here we will split our dataset into three parts: train features, devtest features and finally test features. The devtest set is used for error analysis while the model is being developed, which keeps the test set untouched for the final evaluation.
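A three-way split of this kind can be sketched as follows. The `gender_features` extractor, the `three_way_split` helper and the 20% fractions are all assumptions chosen for illustration, and the short list of names is again a hypothetical stand-in for the already-shuffled dataset.

```python
def gender_features(name):
    """Hypothetical feature extractor: last letter and last two letters."""
    name = name.lower()
    return {"last_letter": name[-1], "last_two": name[-2:]}


def three_way_split(items, test_frac=0.2, devtest_frac=0.2):
    """Split an already-shuffled list into train, devtest and test parts."""
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * devtest_frac)
    test = items[:n_test]
    devtest = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, devtest, test


# Hypothetical, already-shuffled sample of (name, label) pairs.
labeled_names = [
    ("Anna", "female"), ("John", "male"), ("Maria", "female"), ("Peter", "male"),
    ("Julia", "female"), ("Mark", "male"), ("Emma", "female"), ("David", "male"),
    ("Sophia", "female"), ("Kevin", "male"),
]

# Turn every name into a feature dictionary before splitting.
featuresets = [(gender_features(name), label) for name, label in labeled_names]
train_set, devtest_set, test_set = three_way_split(featuresets)
print(len(train_set), len(devtest_set), len(test_set))
```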

4. Now we will implement the Naive Bayes machine learning algorithm by passing the training dataset to the algorithm. We will print the accuracy of our model on the test features and the devtest features.
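To show what happens inside this step, the sketch below implements a tiny categorical Naive Bayes by hand instead of calling NLTK or scikit-learn, so that it runs with the standard library alone. The `TinyNaiveBayes` class, the `gender_features` extractor and the toy names are all assumptions for this example. For comparison, NLTK's `nltk.NaiveBayesClassifier.train` accepts the same kind of (features, label) pairs.

```python
import math
from collections import Counter, defaultdict


def gender_features(name):
    """Hypothetical feature extractor: the last letter of the name."""
    return {"last_letter": name[-1].lower()}


class TinyNaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, featuresets):
        self.label_counts = Counter(label for _, label in featuresets)
        # value_counts[(label, feature_name)][value] -> number of occurrences
        self.value_counts = defaultdict(Counter)
        # values[feature_name] -> set of values seen during training
        self.values = defaultdict(set)
        for features, label in featuresets:
            for fname, value in features.items():
                self.value_counts[(label, fname)][value] += 1
                self.values[fname].add(value)
        return self

    def predict(self, features):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for fname, value in features.items():
                seen = self.value_counts[(label, fname)][value]
                # The extra +1 reserves probability mass for unseen values.
                denom = count + len(self.values[fname]) + 1
                score += math.log((seen + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, featuresets):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(model.predict(f) == label for f, label in featuresets)
    return correct / len(featuresets)


def featurize(pairs):
    return [(gender_features(name), label) for name, label in pairs]


# Toy data, far too small for a real model.
train = featurize([("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
                   ("Emma", "female"), ("Sophia", "female"), ("John", "male"),
                   ("Peter", "male"), ("Mark", "male"), ("David", "male"),
                   ("Kevin", "male")])
devtest = featurize([("Sara", "female"), ("Joshua", "male")])
test = featurize([("Olivia", "female"), ("Oliver", "male"),
                  ("Laura", "female"), ("Brian", "male")])

model = TinyNaiveBayes().fit(train)
print("devtest accuracy:", accuracy(model, devtest))
print("test accuracy:", accuracy(model, test))
```

On this toy data the devtest set exposes a weakness of the last-letter feature: “Joshua” ends in “a” and is classified as female. This is exactly the kind of error the devtest set is meant to surface before the final evaluation.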

5. Our model has been developed. Let us test it now:

It seems our model is working fine. With more data, proper feature engineering and more robust machine learning algorithms, we can improve the accuracy of our model, which can then be moved to production. The model can now be saved as a pickle file, which spares us the pain of retraining it the next time we need it.
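Saving and restoring a model with Python's built-in `pickle` module can be sketched as follows. The `model` dictionary and the `gender_model.pkl` file name are hypothetical stand-ins; any picklable trained classifier object can be stored the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: any picklable object works.
model = {
    "algorithm": "naive_bayes",
    "last_letter_counts": {"a": {"female": 5}, "n": {"male": 2}},
}

path = os.path.join(tempfile.gettempdir(), "gender_model.pkl")

# Save the model to disk.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later: load it back instead of retraining.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)

# Clean up the temporary file created by this example.
os.remove(path)
```

Since unpickling can execute arbitrary code, pickle files should only be loaded from sources we trust.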

Conclusion

Natural Language Processing has been used in varied fields, some of which are listed below:

  1. Chatbot Development and Customer Service.
  2. Customer Segmentation and Review Analysis.
  3. Sentiment Analysis and Fake News Detection.
  4. Spam Classification and Text Summarization.

