Text Analysis and Classification 101

All you need to know to start rocking text analysis with Python

Muturi David
Analytics Vidhya
8 min read · Apr 25, 2020


We are living in an interesting era. Organizations are evolving to leverage data in their operations. Most, though not all, are moving away from intuitive decision making and embracing data-informed decisions. Industries are waking up to the reality that data is the new fuel they need to power their operations and market strategies for immense profitability. McKinsey, the global consultancy giant, indicates that “data-driven organizations are 23 times more likely to acquire customers, six times as likely to retain those customers, and 19 times as likely to be profitable as a result.” I have attended several data summits and interacted with several business executives, and I have noted them sporadically throw around quotes like “In God we trust, all others bring data”, “data is the new oil”, “Without big data, you are blind and deaf and in the middle of a freeway”, “Data beats emotions”, “Above all else, show the data” and many more. The point is, we are no longer in the era where data science was an enigma. The big differentiator in the market today is the mastery of operations and customer data. Whoever understands customer data is better positioned to win customers’ hearts.

In your quest to become a data scientist in this era of rapidly evolving industries, fluency in handling both structured and unstructured data is both mandatory and inevitable. I love to think of data as a highway and data science as the automobile that gets me to the destination: solving a business problem. The more conversant you are with the roads, the smoother and faster your journey will be. Remember, the key goal of data science is to construct the means for extracting business-focused insights from data. It goes without saying, fancy data always beats fancy algorithms.

What’s the difference between structured and unstructured data?

Structured data is the most common type of data. As you can deduce from the name, structured data is well-formatted and highly organized. This data type conforms to a tabular format with clear, labelled columns and rows. Often, the term structured is used to refer to what is commonly known as quantitative data. Working with structured data and running analytical algorithms on it is very straightforward, and there are thousands, if not millions, of free learning resources online.

On the other hand, unstructured data is information that is not neatly organized or formatted. Wikipedia defines unstructured data as information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well. Common examples of unstructured data include PDFs, Word files, audio and video files, and NoSQL databases.

This blog is going to introduce you to some valuable analytical and machine learning techniques you need to master for a smooth introduction to text analysis. Have Fun, Enjoy😊😊

What is Natural Language processing?

Natural Language Processing, commonly dubbed NLP, can simply be defined as the branch of data science that helps computers process, analyze and manipulate humans’ natural languages. NLP seeks to bridge the gap between human communication and computer understanding. The evolution of NLP dates back to the 1950s, though it has rapidly advanced in the modern era due to increased interest in human-to-machine communication and the availability of big data and enhanced algorithms.

Though the field is very interesting, I have to admit that it is not an easy task to train a machine to understand how we communicate.

Uses of NLP

The applications of NLP cut extensively across every industry that generates and consumes big data. Basically, NLP is the technology behind all the virtual assistants, speech-to-text, chatbots, computer-based translations and many more.
Below are some common uses of NLP in the modern business world. You can read more on the same here.

  • Chatbots
  • Sentiment analysis
  • Hiring and recruitment
  • Advertising
  • Market intelligence

In this blog, we shall be working with a dataset of messages classified as Spam or Ham. The goal is to explore the data and then create an accurate classification model that identifies whether a message is spam or not. You can download the dataset here!

1. Loading the required Packages

  • Natural Language Toolkit (NLTK)- This is the leading module for natural language processing in Python. It provides easy-to-use interfaces and is very rich in libraries and functions.
  • StopWords- Stopwords are commonly used words which don’t add much meaning to a sentence or to our model, e.g. “the”, “a”, “is”, “on”. The stopwords module contains all the English stopwords, which we shall train our model to ignore.
  • Sklearn- Sklearn is the most popular machine learning and predictive analytics module in Python. It offers a wide range of functionality for preparing features, creating supervised and unsupervised models and measuring model performance.
  • WordCloud- This is the package we shall use to create visual representations of our text data.
  • Pandas and Numpy- We require these two packages to work with data frames and arrays.
  • Matplotlib and Seaborn- These are plotting packages in Python. I have comprehensively discussed them in this cool blog here!
Necessary packages for text analysis
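
For reference, a minimal sketch of these imports could look as follows (the stopwords corpus needs a one-time download):

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

from wordcloud import WordCloud

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# One-time download of the English stopwords corpus
nltk.download('stopwords')
```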

2. Loading and preparing the data

We use pandas to load and tidy up the dataset.

First 5 rows of our dataset

Well, there are a few things we need to do to make our data organized and ready for analysis.

  • Get rid of the unnamed columns with NaN values
  • Rename the columns
  • Add a column of labels where 0 represents ham and 1 represents spam
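
Here is a minimal sketch of these steps, assuming the widely shared spam.csv export, which is latin-1 encoded and keeps the class and text in its first two columns:

```
import pandas as pd

# Load the raw CSV (file name and encoding assumed; adjust to your copy)
data = pd.read_csv('spam.csv', encoding='latin-1')

# Get rid of the unnamed trailing columns that are full of NaN values
data = data.iloc[:, :2]

# Rename the columns to something readable
data.columns = ['Class', 'Text']

# Add a label column: 0 for ham, 1 for spam
data['Label'] = data['Class'].map({'ham': 0, 'spam': 1})

data.head()
```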

3. Exploring the data

Now that our data is organized, we can perform some exploratory analysis.
Let us begin by having a look at frequency summaries of each class using the describe function.

We have 4825 rows classified as ham messages, 4516 of them non-duplicates. Spam messages are 747, with 653 unique. We can use seaborn’s countplot function to visualize the frequencies.
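
A minimal sketch of both steps could look like this (the plot styling is just illustrative):

```
# Frequency summary (count, unique, top, freq) per class
print(data.groupby('Class')['Text'].describe())

# Visualize the class frequencies
sns.countplot(x='Class', data=data)
plt.title('Ham vs Spam message counts')
plt.show()
```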

Counting the frequent words in each class

We want to see which words are common in ham messages and which are common in spam messages. Before we create the word frequency summary, we need to clean up the data. By cleaning, I mean:

  • Break all the sentences into words
  • Get rid of punctuation marks
  • Convert all the words to lower case
  • Get rid of stopwords and all words with fewer than two characters

We create a words_cleaner function that takes the data to be cleaned as its single argument and returns a list of clean words.

Words_cleaner Function
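
A minimal sketch of such a function, following the cleaning steps listed above, could look like this:

```
import string

stop_words = set(stopwords.words('english'))

def words_cleaner(messages):
    """Break messages into clean, lowercase words."""
    clean_words = []
    for message in messages:
        # Strip punctuation and lowercase the message
        message = message.translate(str.maketrans('', '', string.punctuation)).lower()
        for word in message.split():
            # Drop stopwords and words shorter than two characters
            if word not in stop_words and len(word) > 1:
                clean_words.append(word)
    return clean_words
```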

Now we can extract all the words in ham messages and create a data frame of their frequencies.

Perfect! Now we have a dataframe of the top 10 most used words in Ham messages. We can now create a bar graph to visualize the frequencies.
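
One possible way to build the frequency dataframe and the bar graph, assuming Ham_texts holds the Text column of the ham subset:

```
from collections import Counter

# Ham_texts is assumed to be the Text column of the ham subset
Ham_texts = data[data['Class'] == 'ham']['Text']

# Count the clean words and keep the 10 most frequent
ham_freq = pd.DataFrame(Counter(words_cleaner(Ham_texts)).most_common(10),
                        columns=['Word', 'Frequency'])

# Bar graph of the frequencies
sns.barplot(x='Word', y='Frequency', data=ham_freq)
plt.title('Most used words in Ham messages')
plt.xticks(rotation=45)
plt.show()
```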

Most used words in Ham messages.

Yeah, I know, that’s cool! 😎 Now, you can do the same for Spam messages.

Creating Word Clouds

To get a whole overview of word frequency in both classes, we can leverage word clouds. It’s pretty easy to create word clouds in Python; we just need to create a function with two arguments, data and background colour, that returns a word cloud.

Function for creating word clouds
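
A minimal sketch of such a function, using the WordCloud package we loaded earlier (the sizing parameters are just illustrative):

```
def wc(data, bgcolor):
    """Build and display a word cloud from a list of clean words."""
    cloud = WordCloud(background_color=bgcolor, width=800, height=400,
                      max_words=100).generate(' '.join(data))
    plt.figure(figsize=(10, 5))
    plt.imshow(cloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
```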

Ham Messages word cloud

Now that we have a function for creating word clouds, we only need one line of code to return a word cloud. In a word cloud, the bigger a word appears, the more frequent it is.

Ham Messages word cloud

Spam Messages Wordcloud

wc(words_cleaner(Spam_texts), 'black')

Word cloud for Spam messages

Spam messages mostly contain words like Free, Call, Text, Mobile, Claim and Call Now.

Machine Learning

Creating a model to classify a message as either spam or ham.

Before creating the classifier, we need to clean our features/independent variable, i.e. the Text column. To clean our features, we shall follow this simple procedure:

  • Remove all punctuation marks in each text message
  • Convert the text message to lower case
  • Split each message into single words
  • Remove all the stopwords
  • Stem the words using the PorterStemmer function. This involves cutting each word to its root form, e.g. the stem for [car, cars, car’s, cars’] is car, and the stem for [loving, love, lovely, lovable] is lov
  • Join the cleaned words back into sentences

To achieve this, we shall create a function that loops through our data, cleaning one message at a time, then returns an array of cleaned messages.

A Function to clean all the features in our data
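
A minimal sketch of such a cleaning function (the name clean_features is illustrative):

```
import string

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_features(messages):
    """Clean every message and return a list of cleaned sentences."""
    corpus = []
    for message in messages:
        # Remove punctuation and convert to lower case
        message = message.translate(str.maketrans('', '', string.punctuation)).lower()
        # Split into words, drop stopwords, and stem each word to its root
        words = [stemmer.stem(word) for word in message.split()
                 if word not in stop_words]
        # Join the cleaned words back into a sentence
        corpus.append(' '.join(words))
    return corpus

corpus = clean_features(data['Text'])
```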

See how the dataset looks after cleaning

First 3 messages of raw data and the same after cleaning.

I think the difference is clear! 😊

Training the model.

We shall use the Naive Bayes classifier, which is proven to offer statistically satisfying results in text classification, especially email filtering.
Below is a snippet of the feature preparation (vectorizing with CountVectorizer) and model training.
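
A minimal sketch of that step, assuming an 80/20 train-test split and the Multinomial variant of Naive Bayes:

```
# Turn the cleaned sentences into a bag-of-words feature matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
y = data['Label']

# Hold out a test set, then fit the classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
```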

Testing the Accuracy of the model

Our model is 97.77% accurate. That’s a commendable performance.
We can also generate a confusion matrix to zoom in on the performance of the model. The leading diagonal values indicate the test values correctly predicted by our model.

Confusion matrix
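
A sketch of how both the accuracy and the confusion matrix can be computed with sklearn:

```
# Accuracy on the held-out test set
predictions = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, predictions))

# Confusion matrix: the leading diagonal holds the correct predictions
print(confusion_matrix(y_test, predictions))
```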

You can access the entire script here.

Recommendation

This blog was inspired by an auto-responsive user-request programme I helped engineer for Esque Kenya. They are the leading providers of school management and logistics solutions. You can check them out here.

If you enjoyed this blog, don’t forget to 👏🏼👏🏼👏🏼 and to follow for more.

Stay safe!
