Natural Language Processing (NLP) Basics:
NLP is the field that deals with the ability of computers to understand, analyze, manipulate, and potentially generate human language. By human language we mean any language used for daily communication, e.g. Nepali, English, Spanish, and many more.
In general, when we write the word ‘Natural’, Python does not know what it means; it just sees a collection of seven characters. NLP is the field that makes computers understand what ‘Natural’ actually means; manipulation and generation of natural language come later.
NLTK:
It stands for Natural Language Toolkit.
NLTK is an essential tool for handling natural language processing tasks in Python. It is widely used because it gives a jump start on building any NLP project, providing many basic tools that can be combined to accomplish the project goal.
Installing NLTK on Linux (Ubuntu):
Here, I assume Python is already installed on the system.
Then run: pip install nltk
To check, open Python and try: import nltk
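For example, a quick sanity check in the Python interpreter (the version number will vary with your install):

```python
# Importing nltk should succeed if the install worked.
import nltk
print(nltk.__version__)
```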
We use a Jupyter notebook for the code in this section.
Now we check whether all the packages are installed: running nltk.download() in the notebook pops up the NLTK downloader window.
If everything is installed, we are ready to go; if not, select ‘all’ and download it.
We can also inspect what the package provides with dir(nltk) in our notebook.
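A minimal sketch of both checks in a notebook cell; downloading only ‘stopwords’ is an illustrative alternative to downloading everything:

```python
import nltk

# Opens the interactive downloader window described above;
# select "all" there if packages are missing.
# nltk.download()

# Or fetch a single package non-interactively, e.g. the stopword lists:
nltk.download('stopwords')

# Inspect what the nltk module exposes:
print(dir(nltk)[:10])  # first ten names
```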
Reading in text data:
Text data will typically be semi-structured or unstructured.
What does unstructured mean?
Unstructured data is raw data with no delimiters and no indication of rows.
Now let’s get to the code. Here I used the ‘SMSSpamCollection.tsv’ dataset, read in with pandas as sketched below.
If header is not set to None, the first row is automatically treated as the header, so setting it is compulsory here. The sep parameter tells pandas how the text is separated.
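A minimal sketch of reading the dataset with pandas; the column names ‘label’ and ‘body_text’ are my assumption about the two-column SMSSpamCollection format:

```python
import pandas as pd

# header=None stops pandas from treating the first data row as a header;
# sep='\t' says the columns are tab-separated (.tsv).
data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label', 'body_text']  # assumed names for the two columns
print(data.head())
```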
Regular Expression:
A regular expression, or regex for short, is a text string used to describe a certain search pattern.
E.g. (a short code sketch follows these examples):
- If ‘nlp’ is the expression, it searches the text for ‘nlp’ and reports a match if found. Say the text is ‘I love nlp’; then it returns ‘nlp’.
- If ‘[j-q]’ is the expression, it returns individual letters between ‘j’ and ‘q’ within the text. Say the text is ‘nlp’; then it returns ‘n’, ‘l’, ‘p’.
- If ‘[j-q]+’ is the expression, one or more consecutive such letters are matched together, i.e. the whole ‘nlp’ in the case above.
- If ‘[0-9]+’ is the expression, it matches runs of digits between 0 and 9, say ‘2019’.
- If ‘[j-q0-9]+’ is the expression, it combines both of the above: it matches ‘nlp2020’ as one sequence if there is no space, and finds two sequences, ‘nlp’ and ‘2020’, if there is a space.
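Here is a small sketch of those five patterns with Python’s re.findall (the sample strings are my own):

```python
import re

print(re.findall('nlp', 'I love nlp'))         # ['nlp']
print(re.findall('[j-q]', 'nlp'))              # ['n', 'l', 'p']
print(re.findall('[j-q]+', 'nlp'))             # ['nlp']
print(re.findall('[0-9]+', 'the year 2019'))   # ['2019']
print(re.findall('[j-q0-9]+', 'nlp2020'))      # ['nlp2020']
print(re.findall('[j-q0-9]+', 'nlp 2020'))     # ['nlp', '2020']
```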
The above are just five examples, but there are infinitely many patterns we could come up with, and regexes give us the power and flexibility to search for almost any pattern one can imagine.
Regexes are particularly useful when dealing with text data, since it is usually unstructured and we can use them to create some structure within the data. They can be used to:
- Identify whitespace between words and tokens.
- Tell Python how to split up certain text.
- Identify delimiters between columns.
Using Regular Expressions:
Python’s re package is the most commonly used regex resource. More details can be found at https://docs.python.org/3/library/re.html.
In the sketch below we create three different texts.
The first text is separated by single spaces, so for splitting we can simply use the regular expression ‘\s’, which matches a single whitespace character: the text splits wherever one is found.
The second text is separated by multiple spaces, so ‘\s’ won’t work. Instead we use ‘\s+’, which matches one or more whitespace characters. Still, that fails for the third text.
Since the third text contains many special characters, we use ‘\W+’, which splits whenever it encounters one or more non-word characters (anything other than letters, digits, and underscore).
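A minimal sketch of the three splits; the example strings are stand-ins for the ones in the original screenshots:

```python
import re

text1 = 'This is a made up string to test 2 different regex methods'
text2 = 'This      is a made up     string to test 2    different regex methods'
text3 = 'This-is-a-made/up.string*to>>>>test----2"""""""different~regex-methods'

print(re.split(r'\s', text1))   # single whitespace: works for text1
print(re.split(r'\s+', text2))  # one or more whitespace: works for text2, but not text3
print(re.split(r'\W+', text3))  # one or more non-word characters: works for text3
```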
Machine Learning Pipeline:
A pipeline is the series of steps to be completed to reach a specific goal. The following is the pipeline for NLP:
- Raw text: the model can’t yet identify words.
- Tokenize: tell the model what to look at.
- Clean text: remove stop words and punctuation, apply stemming, etc.
- Vectorize: the model still sees only strings, so we convert them to numeric values to feed to the algorithm.
- Machine learning algorithm: fit/train the model.
NOTE: steps 2 and 3 are also known as pre-processing steps. A small sketch of the overall flow follows.
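To make the flow concrete, here is a hedged sketch on two toy messages; scikit-learn’s CountVectorizer is an illustrative vectorizer choice, not something this pipeline prescribes:

```python
from sklearn.feature_extraction.text import CountVectorizer

raw_texts = ['I love nlp!!!', 'Free prize, call now!!!']  # step 1: raw text

# Steps 2-3 (tokenizing and cleaning) happen inside CountVectorizer's default
# preprocessing here; the sections below show them done explicitly.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(raw_texts)    # step 4: vectorize

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # numeric matrix for step 5 (fit/train a model)
```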
Pre-processing:
Cleaning up the text data is necessary to highlight attributes that you’re going to want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps:
- Remove punctuation
- Tokenization
- Remove stopwords
- Lemmatize/Stem
The first three steps are covered in this blog, as they’re implemented in pretty much any text cleaning pipeline. Lemmatizing and stemming are covered in the next blog: they’re helpful but not critical, and not everyone does them.
Remove Punctuation:
We know that in human language punctuation doesn’t carry much of the meaning of a sentence. So, for our convenience, we remove it.
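A minimal sketch of punctuation removal using Python’s built-in string.punctuation (the function name is illustrative):

```python
import string

def remove_punct(text):
    # Keep every character that is not a punctuation symbol.
    return ''.join(ch for ch in text if ch not in string.punctuation)

print(remove_punct('I love nlp!!!'))  # 'I love nlp'
```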
Tokenization:
Here, we split the text into individual words and place them in a list so that the machine can learn from each word.
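A minimal sketch of tokenization using the same regex idea as above, splitting on non-word characters (the function name is illustrative):

```python
import re

def tokenize(text):
    # Lowercase, then split on one or more non-word characters.
    return re.split(r'\W+', text.lower())

print(tokenize('I love NLP'))  # ['i', 'love', 'nlp']
```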
Removing Stopwords:
Stopwords are the most common words, the ones that don’t affect the essential meaning of a sentence. Removing them makes learning easier, faster, and more convenient for the machine.
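A minimal sketch of stopword removal with NLTK’s English stopword list (the function name is illustrative):

```python
import nltk
nltk.download('stopwords')  # only needed once
from nltk.corpus import stopwords

stop_words = stopwords.words('english')

def remove_stopwords(tokens):
    # Drop tokens that appear in the stopword list.
    return [word for word in tokens if word not in stop_words]

print(remove_stopwords(['i', 'love', 'nlp']))  # ['love', 'nlp'] ('i' is a stopword)
```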