Master The Skill of Text Analytics And Be Successful

Introduction

If you work in analytics or data science, as we do, you know that data arrives at incredible speed, and that data scientists and analysts are trained to handle tabular data that is mostly numerical and categorical. But today the majority of available business data is unstructured and text-heavy: books, articles, website text, blog posts, social media posts and so on. Corporate executives set critical business rules based largely on the keywords and phrases that prospects, customers and partners use in free-form text fields, as well as on how often and how close together those keywords and phrases appear. It’s our dream and our nightmare: there may be too much data to deal with.

That’s why we wrote this post on text analysis: to help you get up to speed with turning text into numbers and implementing some common text mining techniques. Eventually, we hope you will be able to apply powerful algorithms to your organization’s large text databases.

Hacker News is one of our favorite sites for catching up on technology and startup news, but navigating the minimalistic website can sometimes be tedious. Our plan in this post is therefore to show you how this social news site can be analyzed, in as non-technical a fashion as we can, to present some initial results, and to share some ideas about where we will take it next.

The Hacker News data set we downloaded includes one million Hacker News article titles from September 2013 to June 2017.

To begin, let’s look at a visualization of the most common words in Hacker News titles.

Some Initial Simple Exploration

Figure 1 (Source: deepPiXEL.ai)

Figure 1 shows the most frequent words that appeared in Hacker News titles from September 2013 to June 2017.

For the most part, this is a fairly standard list of the words we would expect to see in Hacker News titles. The top word is “hn”, because “ask hn” and “show hn” are part of the social news site’s structure. The next most frequent words, such as “google”, “data”, “app”, “web” and “startup”, are all what we would expect from a social news site like Hacker News.
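To make the idea concrete, here is a minimal sketch of how such word counts could be produced. It is not the code behind Figure 1: the sample titles, the tiny stop-word list and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample titles; the real data set holds about one million.
titles = [
    "Ask HN: How do you learn data science?",
    "Show HN: A web app for startup data",
    "Google releases a new data tool",
]

# A tiny illustrative stop-word list (an assumption, not the one behind Figure 1).
STOP_WORDS = {"a", "the", "for", "do", "you", "how", "new"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_counts(docs, stop_words=STOP_WORDS):
    """Count every non-stop word across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in tokenize(doc) if w not in stop_words)
    return counts

counts = word_counts(titles)
# Show the three most frequent words with their counts.
print(counts.most_common(3))
```

Even on these three titles, “data” and “hn” come out on top, which mirrors the pattern in Figure 1. The same function could be pointed at any list of strings, such as one text field taken from a set of trouble tickets.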

How can you apply this methodology to your business? Suppose you are troubleshooting customer trouble tickets. Rather than reading through thousands of them, you can search a particular field of the tickets for common words or phrases and count how often each one occurs.

Simple Sentiment Analysis

Let’s turn to sentiment analysis. Sentiment analysis detects whether the sentiment of a body of text is positive or negative. Used at scale, it can show you how people feel about the topics that are important to you, particularly your brand and product offerings.

We can analyze the word counts that contribute to each sentiment. For the Hacker News titles, we measured how much each word contributed to each sentiment.

Figure 2 (Source: deepPiXEL.ai)

Immediately, we found some problems: “cloud” and “slack” are classified as negative words by the lexicons we used. In the context of Hacker News, however, “cloud” refers to cloud computing and “slack” is a software company. This is where domain knowledge comes into play and helps us make the final judgement.
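The sketch below illustrates lexicon-based word counting and one simple way to apply that domain knowledge. The toy lexicon, the sample titles and the function name are assumptions made for illustration; they are not the lexicons or the data behind Figure 2.

```python
import re
from collections import Counter

# Toy lexicon, assumed for illustration; real lexicons hold thousands of words.
LEXICON = {
    "free": "positive", "fast": "positive", "love": "positive",
    "cloud": "negative", "slack": "negative", "bug": "negative", "fail": "negative",
}

# Words we choose to treat as neutral on a technology news site.
DOMAIN_NEUTRAL = {"cloud", "slack"}

def sentiment_word_counts(docs, lexicon=LEXICON, neutral=frozenset()):
    """Return {sentiment: Counter(word -> count)} for words found in the lexicon."""
    result = {"positive": Counter(), "negative": Counter()}
    for doc in docs:
        for word in re.findall(r"[a-z]+", doc.lower()):
            label = lexicon.get(word)
            # Skip words outside the lexicon and words overridden as neutral.
            if label is not None and word not in neutral:
                result[label][word] += 1
    return result

# Hypothetical sample titles.
titles = [
    "Slack raises new funding",
    "A fast and free cloud database",
    "Why startups fail",
]

raw = sentiment_word_counts(titles)
adjusted = sentiment_word_counts(titles, neutral=DOMAIN_NEUTRAL)
print(raw["negative"])
print(adjusted["negative"])
```

With the override in place, “cloud” and “slack” no longer count towards the negative total, while a word such as “fail” still does. Which words belong in the neutral set is a judgement call that depends on your own domain.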

A word cloud is a good way to identify trends and patterns that would otherwise be unclear or difficult to see in a tabular format. We can also use a word cloud to compare the most frequent positive and negative words.

Figure 3 (Source: deepPiXEL.ai)

How can you apply this methodology to your business? At the heart of your customer support efforts is the happiness of your customers. Naturally, you want them to love your products or services. Whether you are analyzing feedback forms, chat transcripts, emails or social media, sentiment analysis will help you hear the true voice of your customers and understand how they feel about your products or services.

Relationship Between Words

We often want to understand the relationships between words in a document. Which sequences of words are common across a text? Given a sequence of words, which word is most likely to follow? Which words have the strongest relationship with each other? Many interesting text analyses are based on these relationships. When we examine pairs of consecutive words, the pairs are called “bigrams”.

Figure 4 (Source: deepPiXEL.ai)

From Figure 4, we can see that the most common bigram in the Hacker News data is “machine learning”, and the second most common is “silicon valley”.
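Counting bigrams takes only a few lines. The following is a minimal sketch rather than the code behind Figure 4; the sample titles, the stop words and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

def bigrams(text):
    """Return the list of consecutive word pairs in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return list(zip(tokens, tokens[1:]))

def bigram_counts(docs, stop_words=frozenset()):
    """Count bigrams across documents, skipping pairs that contain a stop word."""
    counts = Counter()
    for doc in docs:
        for first, second in bigrams(doc):
            if first not in stop_words and second not in stop_words:
                counts[(first, second)] += 1
    return counts

# Hypothetical sample titles.
titles = [
    "Machine learning in Silicon Valley",
    "A machine learning primer",
    "Deep learning versus deep water",
]

counts = bigram_counts(titles, stop_words={"a", "in"})
# Show the single most frequent bigram with its count.
print(counts.most_common(1))
```

Note that “deep learning” and “deep water” are counted as separate pairs, so the two senses of “deep” are kept apart in a way that single-word counts cannot manage.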

How can you apply this to your business? The challenge in analyzing text data is understanding what the words mean. The word “deep” means something different when it is paired with “water” than when it is paired with “learning”. As a result, a simple summary of word counts will likely be confusing unless the analysis relates each word to the other words that appear with it, rather than assuming that words are chosen independently of one another.

Networks of Words

Word network analysis is one method for encoding the relationships between words in a text and constructing a network of the linked words. This technique is based on the assumption that language and knowledge can be modeled as networks of words and the relations between them.

Figure 5 (Source: deepPiXEL.ai)

For the Hacker News data, as demonstrated in Figure 5, we can visualize some details of the text structure. For example, we can see pairs or triplets of words that form common short phrases (“social media network” or “neural networks”).
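One simple way to build such a network is to treat each frequent bigram as a directed edge from the first word to the second. The sketch below does this with plain dictionaries. It is an illustration under stated assumptions, not the method behind Figure 5: the sample titles, the min_count threshold and the function name are all hypothetical.

```python
import re
from collections import Counter, defaultdict

def build_word_network(docs, min_count=2):
    """Build a directed graph {word: {next_word: count}} from frequent bigrams."""
    pair_counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc.lower())
        pair_counts.update(zip(tokens, tokens[1:]))

    graph = defaultdict(dict)
    for (first, second), n in pair_counts.items():
        # Keep only pairs seen often enough to be worth drawing.
        if n >= min_count:
            graph[first][second] = n
    return dict(graph)

# Hypothetical sample titles.
titles = [
    "Social media network analysis",
    "The social media network effect",
    "Neural networks explained",
    "Training neural networks",
]

network = build_word_network(titles, min_count=2)
for word, neighbours in network.items():
    print(word, "->", neighbours)
```

Following the edges from “social” to “media” to “network” recovers the three-word phrase, which is how triplets show up in a graph built only from pairs. A graph drawing library could then render the resulting dictionary as a picture.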

How can you apply this methodology to your business? This type of network analysis mainly shows us the important nouns in a text and how they are related. The resulting graph can be used to get a quick visual summary of the text and to find the most relevant excerpts.

We see text analytics as a set of tools and methods born out of customer relationship management. deepPiXEL has already done some helpful research applying text analytics to the contents of organizations’ service requests. We have learned that text analytics is a journey, and we are seeing more and more organizations turn to text analytics in order to retain more customers for longer. No structured survey data predicts customer behavior as well as the actual voice of the customer in text comments and messages!

We hope you found this short article helpful. Once organizations have the capability to derive insights automatically from text analytics, they can then translate those insights into actions.

What are your best tips on text analytics? Share them with us in the comments below.