Chapter 9: Natural Language Processing

So far we have talked about machine learning and deep learning algorithms that can be used in any field. One of the main fields where ML/DL algorithms are applied is natural language processing (NLP), so from now on let's talk about NLP.


NLP is a big area, probably bigger than machine learning, because the concept of language is really deep. We are not going to cover it completely; instead we focus on the smaller area where it meets machine learning and deep learning.


Let's understand natural language processing in that space.

Natural language processing

The main goal here is to make the computer understand language as we do, and to make the computer respond as we do.

We can break that into two sections:

  1. Natural language understanding:

The system should be able to understand the language (parts of speech, context, syntax, semantics, interpretation, etc.).

This can typically be done with the help of machine learning (although problems remain).

It is not very difficult to do and gives good accuracy.

  2. Natural language generation:

The system should be able to respond / generate text (text planning, sentence planning, producing meaningful phrases, etc.).

This can be done with the help of deep learning, as deeper understanding is required (although problems remain).

It is much more difficult to do, and the results may not be accurate.

So where do we use ML in NLP?

These are a couple of the applications we will focus on:

  1. Text classification and clustering
  2. Information retrieval and extraction
  3. Machine translation(one language to another)
  4. Question answering systems
  5. Spelling and grammar checking
  6. Topic modeling and sentiment analysis
  7. Speech recognition

I will try to explain and complete all these topics in the following stories. In this story we learn the basic fundamentals of text/document processing, which are common to many applications.

Note: For now, assume that text, data, document, sentence, and paragraph all mean the same thing.

What is a text?

A text is a set of words written sequentially.

Each word in the text has a meaning, while the text as a whole may or may not have a meaning.

In machine learning we take features, right? So here each unique word is a feature.

Ex:

Text: I love programming

I, love, programming are the features for this input.

How do we derive the features?

First, apply tokenization (the text is divided into tokens). We can use open-source tools like NLTK to get the tokens from a text.


Check out this example:

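As a rough sketch of this step, here is a simple regex-based tokenizer (NLTK's `word_tokenize` handles punctuation, contractions, and so on more carefully):

```python
import re

# The example sentence from above.
text = "I love programming and programming also loves me"

# Lower-case first, then split into alphabetic tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Keep each token only once, preserving order, to get the features.
features = list(dict.fromkeys(tokens))

print(tokens)
print(features)
```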

So here we have "programming" repeated twice as a token, but we take it only once, so the features for this text are → I, love, programming, and, also, loves, me.

But wait: the words "love" and "loves" mean the same thing; these are called inflectional forms. We need to reduce them.

Removing these inflectional endings is called lemmatization.

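A minimal sketch of lemmatization, using a tiny lookup table of inflected forms purely for illustration (in practice you would use something like NLTK's WordNetLemmatizer, which needs the WordNet corpus downloaded):

```python
# Toy lookup table of inflected forms -> lemmas (illustrative only).
LEMMA_TABLE = {"loves": "love", "loved": "love", "loving": "love"}

def lemmatize(token):
    # Fall back to the token itself if it has no entry in the table.
    return LEMMA_TABLE.get(token, token)

tokens = ["i", "love", "programming", "and", "also", "loves", "me"]
lemmas = [lemmatize(t) for t in tokens]

# Deduplicate while preserving order to get the feature list.
features = list(dict.fromkeys(lemmas))
print(features)
```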

So now the features for this text are → I, love, programming, and, also, me.

We can even go a step further and say the word "programming" is similar to the word "program".

For that there is a concept called stemming.


So if we apply stemming:

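Here is a toy stemmer that handles just our example; a real stemmer such as NLTK's PorterStemmer applies a full set of suffix-stripping rules:

```python
def toy_stem(word):
    # Strip a trailing "ing" and collapse a doubled final consonant,
    # e.g. "programming" -> "programm" -> "program".
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word

features = ["i", "love", "programming", "and", "also", "me"]
stemmed = [toy_stem(w) for w in features]
print(stemmed)
```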

So now the features for this text are → I, love, program, and, also, me.

There are a number of words that occur very frequently in every language and don't carry much meaning; these words are called stop words.

The stop words in English include words like "the", "is", "a", "and", "i", and "me".

So stop words should be removed from our text.

Note: we convert the text into lower case before tokenization to avoid duplicates.

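Filtering stop words is a simple set-membership check. The set below is just a small sample; real lists (for example NLTK's stopwords corpus) contain well over a hundred entries:

```python
# A tiny sample of English stop words (illustrative, not a full list).
STOP_WORDS = {"i", "me", "my", "the", "a", "is", "and"}

features = ["i", "love", "program", "and", "also", "me"]
final_features = [w for w in features if w not in STOP_WORDS]
print(final_features)
```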

The final features for this text are → love, program, also.

That’s understandable and cool.


Here we got a clean text, so we could work on it and generate the features directly. But in the real world we don't get clean data; we usually get raw data that has a lot of unwanted things (symbols, links, hashtags, numbers, extra spaces, etc.).

We need to clean the data; this process can be called data normalization.

Let's take a tweet from Twitter, since it is real-world data, and apply tweet normalization.


This kind of text contains a bit of noise, like punctuation marks, a link, and so on; comparing the features before and after normalization makes the difference clear.

So how can the original tweet be normalized?

We remove the unwanted parts of the data using regular expressions.

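A sketch of such a normalizer is below; the example tweet is hypothetical, and the exact patterns always depend on the data at hand:

```python
import re

def normalize_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove links
    text = re.sub(r"[@#]\w+", "", text)       # remove mentions and hashtags
    text = re.sub(r"[^a-z\s]", "", text)      # remove punctuation and digits
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra whitespace
    return text

# Hypothetical example tweet, for illustration only.
tweet = "I love #programming! Check this out: https://example.com @friend"
print(normalize_tweet(tweet))
```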

These are just a small set of substitutions; there can be many more. Depending on the data, we normalize the text to the extent needed.


Let's take a toy dataset which has two training examples (documents/texts):

1 → I love programming

2 → Programming also loves me


After normalization, lemmatization, and stop-word removal, our features are going to be love, programming, also. We can store them in a file and call it a dictionary or a lexicon.

Okay, we got the features from the text. The feature values are words.

We cannot feed words to the computer / model / ML algorithm; the feature values must be numbers.

So we take the count of every word in every document as its value.

Note: these counts are computed after normalizing the documents.

So instead of feeding "I love programming" (or "love programming" after normalization), we feed [1 1 0] as a vector.

I call this process document vectorization.

        X
x1  x2  x3
 1   1   0    --> just like our previous ML data, and we can do
 1   1   1        whatever we wanna do with it.
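The counting itself can be sketched in a few lines; the `vectorize` helper below is just an illustration of the idea:

```python
# The two documents after normalization, and our lexicon in a fixed order.
docs = ["love programming", "programming also love"]
vocab = ["love", "programming", "also"]

def vectorize(doc, vocab):
    # Count how many times each vocabulary word appears in the document.
    tokens = doc.split()
    return [tokens.count(word) for word in vocab]

vectors = [vectorize(d, vocab) for d in docs]
print(vectors)
```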


We just converted each document into a vector.

Count vectorization gives the count of how many times a word appeared in the entire dataset.

We can achieve the same results using scikit-learn's CountVectorizer.


The most frequently appearing words are called the top bag of words.

Hope you understand the code.

There is still a lot to learn and point out here; in the next story I will cover the remaining topics, TF-IDF and word2vec.

Let me know your thoughts/suggestions/questions.

That's it for this story, we will see next time. Until then Seeeeeee Yaaaaaaaa!

Full code is on my Github.

Deep Math Machine learning.ai

This is all about machine learning and deep learning (Topics cover Math, Theory and Programming)

Madhu Sanjeevi ( Mady )

Written by

Writes about Technology (AI, ML, DL) | interested in ||Programming || Science || Psychology || Math https://www.linkedin.com/in/madhusanjeeviai
