Helping bots ‘get it’

Understanding natural language topics in chat messages

Inna Shteinbuk
Digg Data
9 min read · Mar 17, 2017


During my time at betaworks last summer, I worked on extracting principal topics from messages that Digg’s Facebook Messenger Bot received. This post explains a topic extraction method I created specifically for chat messages, which we later used to develop RIO — a topic mining engine that uses reinforcement to tune its algorithm and automatically predict tags for chat messages and news articles.

Conversational bots today understand user messages only when rules are hard-coded to match the exact message pattern. Such scripted bots are hard to scale and can be frustrating for users who want to have a basic natural language exchange with the bot. Every bot provides a service (for Digg this is news), but users don’t necessarily restrict their messages to that.

One of the open challenges in NLP around bots is to figure out if a user message is just a compliment or subjective reflection, rather than a service-related request. It would be very useful if instead of hard-coded rules, we could automatically attach some topics to user messages so that the bot would at least know what the user is talking about. Even if the bot wasn’t yet programmed to respond to the message, if it could give a contextual fallback response based on the topic of the user’s message, the interaction would be more valuable.

When you chat conversationally with Digg’s chatbot, you’ll notice that it doesn’t always understand the exact topic you’re talking about. As I mentioned before, this is a widespread limitation of all current chatbots. Diggbot, too, isn’t fully conversational, and actually responds better when you speak to it in keywords.

Going through the chat logs, I saw tons of different kinds (and lengths) of messages that users left behind. They ranged from serious queries about the election and other news topics to inappropriate comments. In some instances, Diggbot didn’t respond the right way. Below is a snapshot from one of my most recent interactions with Diggbot. I was delighted with the story it sent me, but notice how the bot lacks understanding of my compliment. When users want to be conversational, Diggbot just doesn’t get it.

Analyzing Chat Messages

To analyze the chat messages, I excluded button presses and specific service-related queries (like “trending”) so I was left with free-form and conversational text. From this data, I found we could split the chat messages people send to Diggbot into three groups. I created the chart below to illustrate the groups.

Group 1 consists of greetings or responses to the bot like oh cool. Group 2 consists of actual queries about news, the service that Digg provides, while Group 3 is purely conversational but unrelated to news: sometimes a little weird, with messages like I want to be your boyfriend. Although Group 3 is a lot larger for other bots like Poncho (which portrays the persona of being your friend), the fact that Group 3’s size for Digg is non-trivial goes to show that humans may have an inclination to chat casually with bots despite the focused service that the bot provides.

We found that Group 1 accounted for at least 14% of chat queries, meaning that even though Diggbot tells users to speak in keywords, some users try to be conversational. If we could determine beforehand which of these three groups a chat message falls into, Diggbot would become a little smarter (i.e. refrain from responding to a query like my day was great today with a news article).

Topic Prediction Pipeline

We designed a pipeline, consisting of a Phrase Generator and a State Generator, that takes chat messages and extracts a topic when the user is actually requesting the bot’s primary service, i.e. wants to see an article about some news topic.

  • The Phrase Generator identifies whether the message belongs to Group 1 or Group 2, and then calculates the most important phrase(s) for a Group 2 message. The phrase is then passed on to the State Generator to extract a topic (state). The state space includes all Digg tags (~80 unique tags) and interjections.
  • The State Generator depends largely on Digg’s internal API, which gives access to more than 100,000 editor-tagged articles. A state is a tuple of the predicted topic and a confidence.
Examples of some Digg tags. You can see the latest articles of a certain tag, e.g. “funny” by going to http://digg.com/channel/funny

We don’t try to categorize messages in Group 3 because (1) the topic space is just too vast, and (2) since news can literally be about anything, the words in Group 3 will always overlap with the words that compose Group 2, creating a confusing dilemma of whether to respond with just a news article or to artificially generate commentary about the news.

Filters in Phrase Generator:

Step 1: Interjection Classifier

We start by filtering out chat messages that are interjections (Group 1) by training an interjection classifier. We went through around 5,900 chat messages and labeled the interjections. Some examples of interjection messages include lolllll, oh, great, or whats up?.

For the interjection classifier, we used a Linear Support Vector Machine with bag-of-words features and some other binary features, such as whether the chat contained emoji, profanities, or parts-of-speech (POS) labeled interjections. Traditional POS taggers break down on noisy chat queries; for example, internet slang like lollll or hiiii is often labeled as a noun. CMU’s Twitter POS Tagger labels this type of text correctly far more often, so I used it for the features in our classifier.
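To make the setup concrete, here is a minimal sketch of such a classifier in Python. The toy messages, labels, and regex-based flags are illustrative stand-ins, not the real training data or feature set (the production features also drew on the CMU Twitter POS Tagger and a profanity list).

```python
# Sketch of the interjection classifier: bag-of-words features plus a few
# binary flags, fed to a linear SVM. Messages, labels, and regex flags are
# illustrative stand-ins only.
import re

from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

messages = ["lolllll", "oh, great", "whats up?", "hiiii",
            "kanye west", "election results", "news about the beatles", "nfl trades"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = interjection (Group 1), 0 = news query (Group 2)

def binary_features(text):
    """Crude stand-ins for the emoji / elongated-slang flags described above."""
    has_emoji = int(bool(re.search(r"[\U0001F300-\U0001FAFF]", text)))
    has_elongation = int(bool(re.search(r"(.)\1{2,}", text)))   # 'lollll', 'hiiii'
    return [has_emoji, has_elongation]

X_bow = CountVectorizer().fit_transform(messages)               # bag-of-words
X_bin = csr_matrix([binary_features(m) for m in messages])      # extra binary flags
X = hstack([X_bow, X_bin]).tocsr()

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=1/3, random_state=0)
clf = LinearSVC().fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```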

We obtained an accuracy of 94% after splitting the data into a ⅔ training and ⅓ testing set. This means we were very accurate in spotting messages that were just remarks, exclamations, or salutations and didn’t request news.

Step 2 + 3: Preprocessing + Phrase Extraction

In the preprocessing step, all POS interjections included in the query are excluded from further analysis. For example, lol kanye west becomes kanye west. During phrase extraction, we extract the noun phrase that will be used to search Digg CMS, which contains editorially-tagged news articles.
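A rough sketch of these two steps, using NLTK’s default tokenizer and tagger as a stand-in for the CMU Twitter POS Tagger used in the real pipeline (NLTK’s tagger will often mis-tag chat slang, which is exactly why the CMU tagger was preferred):

```python
# Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

def extract_noun_phrase(message):
    tagged = nltk.pos_tag(nltk.word_tokenize(message.lower()))
    # Step 2 (preprocessing): drop tokens tagged as interjections (Penn tag 'UH')
    tagged = [(tok, tag) for tok, tag in tagged if tag != "UH"]
    # Step 3 (phrase extraction): chunk a simple noun phrase to send to the Digg CMS
    tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}").parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        return " ".join(tok for tok, _ in subtree.leaves())
    return None

print(extract_noun_phrase("lol kanye west"))   # ideally 'kanye west'
```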

The main limitation here is that we can’t be 100% sure that we extracted the right phrase, in which case the error might propagate down the pipeline and cause a misclassification.

Filters in State Generator

The State Generator part of the pipeline is a series of filters used to parse a phrase and recover the most relevant Digg tag for it. It takes advantage of Digg’s CMS (~110k news articles), which powers Digg’s internal API and contains four years’ worth of indexed data, queryable through ElasticSearch. Here are the four steps in the pipeline:

This pipeline is also an integral part of RIO — a topic mining algorithm (which I will introduce in my next blog post).

Step 1: Search Digg Articles

Tag distribution for “the beatles”

Let’s say our input phrase is The Beatles. After querying Digg’s internal API for “the beatles”, we can get the topic for this phrase by simply choosing the highest count in a frequency distribution of the Digg tags associated with the results. The normalized distribution of the resulting tags confirms that “the beatles” maps most closely to music. This example tells us that if we extract the right noun phrases from the text, we’ll be able to correctly map a message to a Digg tag.
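As a minimal illustration of this step, the sketch below builds the tag frequency distribution from a few made-up articles standing in for results from Digg’s internal API:

```python
from collections import Counter

def top_tag(articles):
    # Count every editor-assigned tag across the returned articles
    counts = Counter(tag for article in articles for tag in article["tags"])
    total = sum(counts.values())
    tag, count = counts.most_common(1)[0]
    return tag, count / total              # predicted topic and its share of all tags

articles = [
    {"title": "The Beatles, remastered",   "tags": ["music", "culture"]},
    {"title": "Revisiting Abbey Road",     "tags": ["music"]},
    {"title": "A new Beatles documentary", "tags": ["music", "movies"]},
]
print(top_tag(articles))                   # ('music', 0.6)
```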

The Monet Problem: However, when we search a topic that does not come up as much in news media, like Monet, the frequency distribution of Digg tags is not as telling.

Tag distribution for “monet”

If we follow the method from the previous step, monet would be mapped to internet instead of art (see the distribution above). This example illustrates one of the problems of editor/media bias: if Digg editors don’t feature a threshold number of stories about some topic, the distribution becomes skewed towards irrelevant tags.

This is also due to the fact that Digg’s API uses partial-match search criteria in ElasticSearch. What happened here is that the Monet search was also matching articles about “monetization”. We found that searching phrases with full-match criteria returns too few articles to extract meaningful information from them.
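For illustration, here is roughly what the two search modes look like in ElasticSearch’s query DSL. The index and field names are assumptions, not Digg’s actual schema, and a prefix query is just one common way to get partial-match behaviour:

```python
partial_match = {
    "query": {
        "prefix": {"title": "monet"}          # matches any term starting with 'monet',
    }                                         # so articles about 'monetization' come back too
}

full_match = {
    "query": {
        "match_phrase": {"title": "monet"}    # the exact term must appear, which
    }                                         # typically returns far fewer articles
}

# e.g. es.search(index="articles", body=partial_match) with the official Python client
```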

In other instances, the underlying ElasticSearch data might not reflect a certain phrase, so we won’t get any results back. For example, there were no articles returned about the NY Giants. Unless we add NY Giants stories associated with their ground truth tag (i.e. sports) to ElasticSearch, we can’t hope to associate the NY Giants phrase with sports. However, this problem can be solved by simply adding focused channels like this one for sports to ElasticSearch with a sports tag. A solution to “the Monet problem” is explained next.

Step 2: N-gram Analysis

An alternative to analyzing a single frequency distribution (like the one shown above) is to split the returned articles into two distributions: one for retrieved articles that actually contain the n-gram in their title or description in the ElasticSearch results (Hits), and one for those that don’t (No-Hits). Then, we can re-evaluate these two distributions:

Distribution of tags for articles with no-hits and hits on “monet”. The difference between the two can be used to home in on the actual tag.
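A minimal sketch of the Hits/No-Hits split, with a couple of made-up articles standing in for the API results:

```python
import re
from collections import Counter

def split_distributions(phrase, articles):
    pattern = re.compile(r"\b" + re.escape(phrase.lower()) + r"\b")
    hits, no_hits = Counter(), Counter()
    for article in articles:
        text = (article["title"] + " " + article["description"]).lower()
        target = hits if pattern.search(text) else no_hits   # does the n-gram appear?
        target.update(article["tags"])
    return hits, no_hits

articles = [
    {"title": "A Monet exhibit opens", "description": "Impressionism returns",
     "tags": ["art", "culture"]},
    {"title": "How apps monetize you", "description": "The ad-tech economy",
     "tags": ["internet", "business"]},
]
hits, no_hits = split_distributions("monet", articles)
print(hits)      # Counter({'art': 1, 'culture': 1})
print(no_hits)   # Counter({'internet': 1, 'business': 1})
```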

Step 3: Entropy Analysis + Confidence Assignment

Mathematically, the information I contained in a distribution X can be calculated by subtracting the entropy H(X) of the distribution from the maximum possible entropy: I(X) = H_max − H(X), where H(X) = −Σᵢ pᵢ log pᵢ and H_max is the entropy of a uniform distribution over the tag space (log N for N tags). A peaked distribution has low entropy and therefore high information.

We calculate I for both distributions (Hits and No-Hits) and then determine the next steps using empirically defined thresholds for “low” and “high” values of I. For example, if I_hits is high and I_no-hits is low, we take the best tag in the Hits distribution. On the other hand, if both values of I are low, we check whether there is an intersection between the top tags of these two distributions and, if there is, assign that topic with low confidence. If there is no intersection, we must do further analysis, explained in Step 4.
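Here is a small sketch of that logic. It assumes entropy in bits, H_max taken over the full ~80-tag space, and made-up thresholds; the real thresholds were set empirically on Digg’s data.

```python
import math
from collections import Counter

def information(tag_counts, n_tags=80):
    """I(X) = H_max - H(X): large when the distribution is peaked on a few tags."""
    total = sum(tag_counts.values())
    h = -sum((c / total) * math.log2(c / total) for c in tag_counts.values() if c)
    return math.log2(n_tags) - h

LOW, HIGH = 3.5, 5.0    # illustrative thresholds only

hits = Counter({"art": 8, "culture": 2})
no_hits = Counter({t: 1 for t in ["internet", "business", "science", "politics",
                                  "technology", "culture", "sports", "movies",
                                  "design", "food"]})

i_hits, i_no_hits = information(hits), information(no_hits)
if i_hits >= HIGH and i_no_hits <= LOW:
    topic, confidence = hits.most_common(1)[0][0], "high"          # peaked Hits distribution
elif i_hits <= LOW and i_no_hits <= LOW:
    shared = {t for t, _ in hits.most_common(3)} & {t for t, _ in no_hits.most_common(3)}
    topic, confidence = (shared.pop(), "low") if shared else (None, None)
else:
    topic, confidence = None, None                                  # fall through to Step 4
print(topic, confidence)    # -> art high
```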

Step 4: Collocation Analysis

Digg editors usually tag each article with 5 or 6 relevant tags. Using a collection of Digg articles from the last four years, we did a collocation analysis of tags. This tells us which tags usually occur together, and helps us judge the plausibility of tag distributions. Based on tag co-occurrence, we assigned a score to every pair of tags. For example, art and business co-occur a lot less than culture and internet, and therefore have a smaller collocation score. We assume that the right tag to assign will be one of the top occurrences in the distribution. Using the top tags from both distributions, we find all pairwise collocation scores, and then choose the most unique tag, i.e. the one with the smallest co-occurrence score, as the final topic.
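One way to read this step, sketched below with made-up data: score every tag pair by how often the two tags appear on the same article, then pick the candidate tag whose total co-occurrence with the other top tags is smallest.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(tagged_articles):
    """Count how often each pair of tags is assigned to the same article."""
    scores = Counter()
    for tags in tagged_articles:
        scores.update(combinations(sorted(set(tags)), 2))
    return scores

def most_unique_tag(top_hits, top_no_hits, scores):
    """Pick the candidate tag with the lowest total co-occurrence score."""
    candidates = set(top_hits) | set(top_no_hits)
    def total(tag):
        return sum(scores[tuple(sorted((tag, other)))] for other in candidates if other != tag)
    return min(candidates, key=total)

# Toy corpus of editor-assigned tag lists (the real one spans four years of articles).
corpus = [["culture", "internet"], ["culture", "internet"], ["internet", "business"],
          ["art", "culture"], ["business", "culture"]]
scores = cooccurrence_scores(corpus)
print(most_unique_tag(["art", "culture"], ["internet", "business"], scores))   # -> art
```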

After this sequence of filters, we are able to assign a state, i.e. either (topic + confidence) or an interjection, to the chat message. Doing this reduces the number of times Diggbot responds with “Sorry! I didn’t quite catch that…”, and gives Diggbot more strategies to choose from, like recommending related content that the user may enjoy given their previous queries, or a topical deflection.

While my method was built and tested on Diggbot’s chat logs, it can generally be extended to predict topics for user messages in any bot.

Check out my next blog post (out soon) on RIO, where I discuss how to use this pipeline to evaluate the effectiveness of our automatic tagging system. The pipeline predicts topics for Digg’s trending stories, which are selected from the 6 million unique links that Digg aggregates from Twitter every day.
