Text-Based Communication Analysis with Machine Learning

Darshana Samaranayake
Published in Analytics Vidhya · Dec 31, 2020

This study takes a deeper look at historical text messages by applying various machine learning techniques. The purpose of the study is to apply Natural Language Processing (NLP) techniques to identify communication trends, evaluate the effectiveness of the existing process, and surface insights from the historical data.

Methodology

For the analysis of the data, I used Python as the programming language along with numerous software libraries such as pandas, math, numpy, sklearn, and matplotlib. In addition, I used Natural Language Processing (NLP), a family of machine learning (ML) techniques that help understand, interpret, and manipulate human language using artificial intelligence.

Data Exploration

The focus here is to explore the large set of unstructured data and uncover initial patterns, characteristics, and points of interest. This step is not expected to reveal every bit of information the dataset holds, but rather to build a broad picture of important trends and major points to study in greater detail.

Finding the In and Out Message trends for Spring and Summer

The current dataset contains a limited number of columns. In order to get the message counts by month, we first need to extract the year and month from the time field.

import pandas as pd

df_text = pd.read_csv('../Data/clean_text.csv')
df_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62513 entries, 0 to 62512
Data columns (total 7 columns):
message_type    62513 non-null object
text            62513 non-null object
time            62513 non-null object
ID              62513 non-null object
group_list      62513 non-null object
term            62513 non-null object
sent_by         62513 non-null object
dtypes: object(7)
memory usage: 3.3+ MB
# Extracting year and month from the 'time' column and appending the dataframe with new columns
df_text['year'] = pd.DatetimeIndex(df_text['time']).year
df_text['month'] = pd.DatetimeIndex(df_text['time']).month
df_text.head()

Now we can count the number of inbound and outbound messages by month.

# Getting the message count by message type and month
# reset_index() gives a column for counting, after groupby uses month and message_type
df_cnt = (df_text.reset_index()
          .groupby(['month', 'message_type'], as_index=False)
          .count()
          # rename isn't strictly necessary here, it's just for readability
          .rename(columns={'index': 'count'}))

# Sorting the values by month (sort_values returns a new frame, so assign it back)
df_cnt = df_cnt.sort_values(by='month', ascending=True)

Now it's time to use matplotlib to present the data.

import calendar
import matplotlib.pyplot as plt

df_all_cnt = df_cnt

# Adding month name for better presentation
df_all_cnt['month_t'] = df_all_cnt['month'].apply(lambda x: calendar.month_abbr[x])

fig, ax = plt.subplots(figsize=(5, 3), dpi=150)

# key gives the group name (i.e. message_type), data gives the actual values
for key, data in df_all_cnt.groupby('message_type'):
    data.plot(x='month_t', y='count', ax=ax, label=key)

# Hide the right and top spines
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)

# Draw the grid
ax.grid(linestyle='-', linewidth=0.2)

ax.legend()
ax.set_xlabel('Month')
ax.set_ylabel('No. of Messages')
ax.set_title('In and Out Messages By Month - Both Programs')

In the same way, we can plot other columns of the dataset on the x- and y-axes to produce different charts that highlight various aspects of the same data, as in the sketch below.
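For instance, a minimal sketch of one such variation, assuming the same df_text dataframe; the term column is just one example from the dataset, and df_term_cnt is a hypothetical name:

# Hypothetical example: message counts by 'term' instead of month
df_term_cnt = (df_text.reset_index()
               .groupby(['term', 'message_type'], as_index=False)
               .count()
               .rename(columns={'index': 'count'}))

# Pivot so each message_type becomes its own series, then draw a bar chart
(df_term_cnt.pivot(index='term', columns='message_type', values='count')
            .plot(kind='bar', figsize=(5, 3)))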

Text Analysis with NLP

NLP is an excellent method for analyzing and interpreting textual data such as student responses. It is an ML technique that machines use to understand human language in forms like text, speech, and emoji, and it is widely used in industry these days (e.g., Siri and Alexa).

Cleaning, removing stop words and stemming

It is important to clean your data by removing unnecessary words, symbols, and any text that will not support meaningful analysis. The functions below handle cleaning and stopword removal, but we first need a few libraries to help with that.

import re
import unicodedata
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.parsing.preprocessing import STOPWORDS
from contractions import contractions_dict

nltk.download('stopwords')
nltk.download('punkt')

def remove_punctuation(text):
    # Drop digits first, then replace common punctuation with spaces
    text = ''.join([i for i in text if not i.isdigit()])
    return re.sub("[!@#$+%*:()/|,;:'-]", ' ', text)

def removebrackets(text):
    # Remove bracketed/parenthesized spans along with their contents
    return re.sub(r'[\(\[].*?[\)\]]', ' ', text)

def remove_accented_chars(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

def remove_special_chars(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    return re.sub(pattern, '', text)

def remove_stopwords(text):
    # Extend gensim's stopword list with words that carry no meaning here
    all_stopwords_gensim = STOPWORDS.union(set(['thank', 'need', 'yes', 'okay']))
    text_tokens = word_tokenize(text)
    return ' '.join([word for word in text_tokens if word not in all_stopwords_gensim])

def stemming(text):
    ps = nltk.porter.PorterStemmer()
    return ' '.join([ps.stem(word) for word in text.split()])

def lemmatize(text):
    # nlp is a loaded spaCy model, e.g. nlp = spacy.load('en_core_web_sm')
    text = nlp(text)
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])

def expand_contractions(text, contraction_mapping=contractions_dict):
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match) \
            if contraction_mapping.get(match) \
            else contraction_mapping.get(match.lower())
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    return re.sub("'", "", expanded_text)
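To tie these helpers together, here is a minimal sketch of one possible cleaning pipeline. The clean_text wrapper, the df_filtered dataframe, and the text_clean column are my assumptions, chosen to match the names the word-cloud code below expects; the ordering of steps is a design choice, not the author's prescribed method.

def clean_text(text):
    # One possible ordering: expand contractions before stripping punctuation
    text = text.lower()
    text = expand_contractions(text)
    text = remove_accented_chars(text)
    text = removebrackets(text)
    text = remove_punctuation(text)
    return remove_stopwords(text)

# 'df_filtered' and 'text_clean' are assumed names used by the later examples
df_filtered = df_text.copy()
df_filtered['text_clean'] = df_filtered['text'].apply(clean_text)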

After cleaning our data, we can generate the word cloud for our analysis. A word cloud is one of the most effective ways to represent this kind of data: it indicates the relative importance of each word through its size, color depth, and boldness.

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

wc = WordCloud(stopwords=STOPWORDS, max_font_size=200, max_words=1000000,
               background_color="white", width=1000, height=1000).generate(' '.join(df_filtered['text_clean']))
plt.figure(figsize=(20, 20))
plt.imshow(wc)
plt.axis("off")
plt.show()
Word Cloud based on the historical data analysis

After generating the word cloud, you can identify any other words that you do not need for your analysis. You can simply extend the current stopword list by adding those identified words as follows.

def clean_text_again(text):
    from gensim.parsing.preprocessing import STOPWORDS
    all_stopwords_gensim = STOPWORDS.union(set(['thank', 'need', 'yes', 'okay', 'ok', 'thanks', 'morning',
                                                'hello', 'sure', 'hi', 'know', 'got', 'yesterday']))
    text_tokens = word_tokenize(text)
    return ' '.join([word for word in text_tokens if word not in all_stopwords_gensim])

Word Cloud extended — Ngram exploration

Ngrams are simply contiguous sequences of n words. Looking at the most frequent n-grams gives us a better understanding of the context.
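For example, a tiny illustration with NLTK's ngrams helper (the sample sentence is made up):

from nltk import ngrams

sentence = "please submit your assignment by friday"
list(ngrams(sentence.split(), 3))
# [('please', 'submit', 'your'), ('submit', 'your', 'assignment'),
#  ('your', 'assignment', 'by'), ('assignment', 'by', 'friday')]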

The function below presents the most frequent non-stop words in a bar chart.

import seaborn as sns
from collections import Counter
from nltk.corpus import stopwords

# Standard English stopwords to filter out of the chart
stop = set(stopwords.words('english'))

def plot_top_non_stopwords_barchart(text):
    new = text.str.split()
    new = new.values.tolist()
    corpus = [word for i in new for word in i]

    counter = Counter(corpus)
    most = counter.most_common()
    x, y = [], []
    for word, count in most[:40]:
        if word not in stop:
            x.append(word)
            y.append(count)

    plt.figure(figsize=(15, 8))
    sns.barplot(x=y, y=x, palette="colorblind")
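A possible call, assuming the df_filtered['text_clean'] series from the cleaning step above:

plot_top_non_stopwords_barchart(df_filtered['text_clean'])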
Most frequent non-stop words

However, in order to understand the context of these non-stop words and their relationships, we need to chart the words that tend to follow them immediately. I changed the above function to use sequences of three immediate words and draw the trigram chart as follows.

from sklearn.feature_extraction.text import CountVectorizer

def plot_top_ngrams_barchart(text, n=3):
    # n is the number of immediate words you need (3 = trigrams)
    def _get_top_ngram(corpus, n=None):
        vec = CountVectorizer(ngram_range=(n, n)).fit(corpus)
        bag_of_words = vec.transform(corpus)
        sum_words = bag_of_words.sum(axis=0)
        words_freq = [(word, sum_words[0, idx])
                      for word, idx in vec.vocabulary_.items()]
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
        return words_freq[:10]

    top_n_grams = _get_top_ngram(text, n)
    x, y = map(list, zip(*top_n_grams))
    sns.barplot(x=y, y=x)
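A possible call, again assuming the cleaned text column; passing n=2 would chart bigrams instead:

plot_top_ngrams_barchart(df_filtered['text_clean'], n=3)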
Trigram with most frequent words

Sentiment Analysis

Sentiment analysis is a widely used tool in the NLP toolkit that classifies the sentiment of textual data, such as student responses, in an automated way. First, we can find the distribution of sentiment using a polarity histogram. Polarity is a floating-point number that lies in the range of -1 to 1, where 1 means a positive statement and -1 means a negative statement.
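To make the polarity scale concrete, here is a small illustration with TextBlob (the sample sentences are made up):

from textblob import TextBlob

TextBlob("I love this program").sentiment.polarity       # > 0: positive
TextBlob("This schedule is terrible").sentiment.polarity  # < 0: negative
TextBlob("Class starts at nine").sentiment.polarity      # ~ 0: neutral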

from textblob import TextBlob

def plot_polarity_histogram(text):

    def polarity(text):
        return TextBlob(text).sentiment.polarity

    polarity_score = text.apply(lambda x: polarity(x))
    # Keep the scores on the dataframe for the categorization step below
    df_filtered['polarity_score'] = df_filtered['text'].apply(lambda x: polarity(x))
    polarity_score.hist()
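A possible call, assuming the df_filtered dataframe set up earlier:

plot_polarity_histogram(df_filtered['text_clean'])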
Overall sentiment polarity

We can further analyze the sentiment to see the distribution of positive, negative, and neutral messages. The function below categorizes the sentiment and presents a bar plot.

from textblob import TextBlob
import matplotlib.pyplot as plt
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

def sentiment_vader(text, sid):
    # Return the strongest of the 'neg', 'neu', 'pos' scores
    ss = sid.polarity_scores(text)
    ss.pop('compound')
    return max(ss, key=ss.get)

def sentiment_textblob(text):
    x = TextBlob(text).sentiment.polarity
    if x < 0:
        return 'Negative'
    elif x == 0:
        return 'Neutral'
    else:
        return 'Positive'

def sentiment_x(x):
    if x < 0:
        return 'Negative'
    elif x == 0:
        return 'Neutral'
    else:
        return 'Positive'

def plot_sentiment_barchart(text, method):
    if method == 'TextBlob':
        sentiment = text.map(lambda x: sentiment_textblob(x))
    elif method == 'Vader':
        nltk.download('vader_lexicon')
        sid = SentimentIntensityAnalyzer()
        sentiment = text.map(lambda x: sentiment_vader(x, sid=sid))
    else:
        raise ValueError("method must be 'TextBlob' or 'Vader'")

    # Also store the categorized polarity on the dataframe for later use
    df_filtered['polarity'] = df_filtered['polarity_score'].apply(lambda x: sentiment_x(x))

    plt.figure(figsize=(5, 2), dpi=100)
    plt.bar(sentiment.value_counts().index,
            sentiment.value_counts())
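And a possible call (the Vader branch downloads NLTK's pretrained lexicon on first use):

plot_sentiment_barchart(df_filtered['text_clean'], method='Vader')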
Overall sentiment categorization

Sentiment analysis technology is still in its infancy, so its predictions should be evaluated accordingly and with human intervention. But given a large amount of clean, labeled data, sufficiently large neural networks will become increasingly accurate, making this a great tool in the long term.
