Chapter 9: Natural Language Processing

So far we have talked about machine learning and deep learning algorithms that can be used in any field. One of the main fields where ML/DL algorithms are used is natural language processing (NLP), so from now on let's talk about NLP.

NLP is a big area, probably bigger than machine learning, because language itself is such a deep subject. We are not going to cover all of it; we will focus on the small area where it meets machine learning and deep learning.

Let's understand natural language processing in that context.

We can break it into 2 sections:

  1. Natural Language Understanding (NLU):

The system should be able to understand the language (parts of speech, context, syntax, semantics, interpretation, etc.).

This can typically be done with the help of machine learning (although there are problems).

It is not very difficult to do and gives good accuracy.

  2. Natural Language Generation (NLG):

The system should be able to respond / generate text (text planning, sentence planning, producing meaningful phrases, etc.).

This can be done with the help of deep learning, as deep understanding is required (although there are problems).

It is much more difficult to do, and the results may not be accurate.

These are some of the applications we focus on:

  1. Text classification and clustering
  2. Information retrieval and extraction
  3. Machine translation (one language to another)
  4. Question answering systems
  5. Spelling and grammar checking
  6. Topic modeling and sentiment analysis
  7. Speech recognition

I will try to explain all of these topics in the following stories. In this story we learn the basic fundamentals of working with text/documents, which are common to many applications.

Note: for now, assume that text, data, document, sentence, and paragraph all mean the same thing.

Each word in a text has a meaning, whereas the text as a whole may or may not have one.

In machine learning we take features, right? Here each unique word is a feature.

Ex :

Text : , , are the features for this input.

First apply tokenization (the text is divided into tokens). We can use open source tools like NLTK to get tokens from the text.

Check out this example

so here we have the repeated twice as tokens but we only take once so the features for this text are → , , , , , .
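To make the tokenization step concrete, here is a minimal sketch using only Python's standard library. The regex tokenizer is a simplified stand-in for NLTK's tokenizer, and the sample sentence is a hypothetical one chosen for illustration, not the example from this story.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def unique_in_order(tokens):
    """Drop repeated tokens while keeping first-seen order."""
    return list(dict.fromkeys(tokens))

# Hypothetical sample sentence (an assumption for this sketch).
text = "The dog chased the cats and the cat chased the dogs"

tokens = tokenize(text)
features = unique_in_order(tokens)

print(tokens)
print(features)  # ['the', 'dog', 'chased', 'cats', 'and', 'cat', 'dogs']
```

Repeated tokens such as "the" and "chased" appear only once in the feature list.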

but wait the words and mean same , these are called inflectional forms. we need to remove these

Removing these inflectional endings is called lemmatization.

so now the features for this text are → , , , , ,.
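Here is a toy sketch of the idea. A real lemmatizer (for example the WordNet-based one in NLTK) looks words up in a full dictionary; the tiny hand-written `LEMMAS` table and the token list below are assumptions made up for illustration.

```python
# Tiny hand-written lemma table (illustrative only, not a real dictionary).
LEMMAS = {
    "cats": "cat",
    "dogs": "dog",
    "chased": "chase",
    "loves": "love",
}

def lemmatize(token):
    """Return the base form if we know it, otherwise the token itself."""
    return LEMMAS.get(token, token)

def lemmatized_features(tokens):
    """Lemmatize every token and keep each base form only once."""
    return list(dict.fromkeys(lemmatize(t) for t in tokens))

# Hypothetical unique tokens from a sample sentence.
tokens = ["the", "dog", "chased", "cats", "and", "cat", "dogs"]

print(lemmatized_features(tokens))  # ['the', 'dog', 'chase', 'cat', 'and']
```

After mapping "cats" to "cat" and "dogs" to "dog", the duplicates collapse and the feature list gets shorter.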

we can even think deep and say the word is similar to the word

There is a concept called stemming.

So if we apply stemming, then

so now the features for this text are → , , , , ,.
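A rough sketch of stemming is below. This is a naive suffix stripper written for illustration, not the Porter algorithm that libraries like NLTK ship; the suffix list and the minimum stem length of 3 are arbitrary assumptions.

```python
def naive_stem(token, suffixes=("ing", "ed", "es", "s")):
    """Chop the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

for word in ["played", "playing", "plays", "programming", "programs"]:
    print(word, "->", naive_stem(word))
# played -> play
# playing -> play
# plays -> play
# programming -> programm
# programs -> program
```

Notice that a stem does not have to be a real word ("programm"), which is the main difference from lemmatization.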

There are some words that occur very frequently in every language and don't carry much meaning; these words are called stop words.

The stop words in English are

So stop words should be removed from our text

Note: we convert the text to lower case before tokenization to avoid duplicates.

the final features for this text are → , ,
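Stop-word removal can be sketched like this. The `STOP_WORDS` set here is a small hand-picked sample (an assumption for this sketch); libraries such as NLTK provide much longer lists. The sample sentence is also hypothetical.

```python
# Small hand-picked stop-word sample; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "are",
              "i", "me", "my", "you", "it", "of", "to", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stop_words]

# Lowercase first so that "The" and "the" are treated the same.
text = "The dog chased the cats and the cat chased the dogs"
tokens = text.lower().split()

print(remove_stop_words(tokens))
# ['dog', 'chased', 'cats', 'cat', 'chased', 'dogs']
```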

That’s understandable and cool.

Here we started with clean text, so we could work on it directly and generate the features. In the real world we don't get clean data; we always get raw data that contains a lot of unwanted things (symbols, links, hashtags, numbers, extra spaces, etc.).

We need to clean the data; this process can be called normalization.

Let's take a tweet from Twitter, since it is real-world data, and apply normalization.

This text contains a bit of noise such as punctuation, a link, etc. Check the images below for comparison.

Text features comparison

So how can the original tweet be normalized?

We remove the unwanted things in the data using regular expressions.

These are a small set of statements; there can be many more. Depending on the data, we need to normalize the text to a certain extent.
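As a sketch of regex-based normalization, here are a few substitutions using Python's `re` module. The tweet is a made-up example (not the tweet from this story), and the exact patterns are assumptions; which ones you need depends on your data.

```python
import re

def normalize(text):
    """Lowercase the text and strip links, mentions, hashtags, digits, symbols."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)               # digits, punctuation, symbols
    text = re.sub(r"\s+", " ", text).strip()            # extra whitespace
    return text

# Made-up tweet for illustration.
tweet = "Loving #NLP!!! Read more at https://example.com/post?id=42 :) @someone 100%"

print(normalize(tweet))  # loving read more at
```

Here hashtags are dropped entirely; if the hashtag words matter for your task, remove only the `#` symbol instead.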

Let’s take a toy dataset which has 2 training examples (documents/texts)

1 → I love programming

2 → Programming also loves me

After normalization, lemmatization, and stop-word removal,

Our features are going to be "love", "programming", and "also". We can store them in a file and call it a or a

Okay, we got the features from the text. The feature values are words.

We cannot feed words to the computer / model / ML algorithm; the feature values must be numbers.

So we take the count of every word in every document as its value.

Note: these counts are computed after preprocessing the documents.

So instead of feeding "I love programming" (or "love programming" after preprocessing), we feed [1 1 0] as a vector.

I call this process

x1 x2 x3
1  1  0 --> just like our previous ml data and we can do what ever
1  1  1     we wanna do.

here is the code for this

We just converted each document into a vector.
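The conversion for the toy dataset can be sketched in plain Python as follows. The token lists are what the two documents look like after preprocessing, and the feature order ("love", "programming", "also") is assumed so that it matches the x1 x2 x3 table above.

```python
from collections import Counter

# The two toy documents after normalization, lemmatization,
# and stop-word removal.
docs = [
    ["love", "programming"],          # "I love programming"
    ["programming", "also", "love"],  # "Programming also loves me"
]

# Feature order assumed to match x1, x2, x3.
vocabulary = ["love", "programming", "also"]

def count_vector(tokens, vocabulary):
    """Count how many times each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vectors)  # [[1, 1, 0], [1, 1, 1]]
```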

The 6th cell is count vectorization; it gives the count of how many times each word appeared in the entire dataset.

We can achieve the same results using scikit-learn's CountVectorizer.
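A minimal sketch with scikit-learn, assuming it is installed and that the documents are already preprocessed strings. Note that `CountVectorizer` sorts its vocabulary alphabetically, so the columns come out in a different order than x1 x2 x3.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents after preprocessing (joined back into strings).
docs = ["love programming", "programming also love"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Recover the column order from the learned vocabulary mapping.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
matrix = X.toarray().tolist()
totals = X.toarray().sum(axis=0).tolist()  # counts over the entire dataset

print(vocab)   # ['also', 'love', 'programming']
print(matrix)  # [[0, 1, 1], [1, 1, 1]]
print(totals)  # [1, 2, 2]
```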

The most appeared words called

Hope you understand the code.

There are still a lot of things to learn and point out here. In the next story I will cover the remaining topics, which are TF-IDF and word2vec.

Let me know your thoughts/suggestions/questions.

That's it for this story. See you next time; until then, Seeeeeee Yaaaaaaaa!

Full code is on my .

Deep Math Machine

This is all about machine learning and deep learning (Topics cover Math,Theory and Programming)

Written by Madhu Sanjeevi (Mady)

Writes about Technology (AI, ML, DL) | Writes about Human Mind and Computer Mind. interested in ||Programming || Science || Psychology || NeuroScience || Math
