Lessons Learned Doing Real World NLP For A B2B Chatbot

A couple of weeks ago, we launched Talla the Task Assistant. It’s a subset of the full Talla functionality coming later this year, and it focuses on natural language management of to-do/task lists in Slack (and Hipchat soon!). Our goal was to get 50 companies signed up in the first month. Three weeks in, we have signed up over 700 companies with 30,000 total users. The main thing we have learned is that NLP is !#*&) hard.

I’ve split this up into sections for easy navigation: What Talla Does, Our NLP Team and Tools, and Errors and Challenges.

What Talla Does

Talla the Task Assistant can do a few things, like set reminders, add tasks to a list, remove tasks from a list, show a task list, and reschedule task due dates. Talla can do all of these things for individual or group task lists. Talla can also assign tasks or reminders to another team member. Below are some snippets of Talla in action.

Our NLP Team and Tools

Our data science team consists of 6 people (some are consultants, some are FTEs). Two of the 6 have PhDs. One has an MS in NLP and Machine Learning, and one did NLP at Google. So, it’s not like we are starting from scratch. This team also did pretty well in the Kaggle Allen A.I. competition.

We tried a bunch of tools, like IBM’s Watson, Cortical.io, Wit.ai, and a few other platforms. Most of these either didn’t really work or were built for very basic NLP, so we ended up doing most of it ourselves. Our NLP stack is written in Python and uses NLTK for text tokenization, part-of-speech tagging, and classification; MITIE and parsedatetime for entity extraction; and a bit of duct tape and regular expressions to tie it all together.

We are very concerned about privacy, so our backend automatically flags errors and shows them only to the few people who have access. We can analyze these failed interactions to determine how to build better NLP models.

Errors and Challenges

So, what did we see when we launched? Well, Talla performed better than I expected, but man is NLP tough. Our error rate is in the low single digits, but that is still too high. Some of the errors weren’t NLP related. For example, if you travel, Slack doesn’t automatically change the time zone in your settings, although you will see the adjusted time zone in the UI message timestamps. This created a few problems where people set reminders and tasks that ended up with incorrect due dates, even though the dates looked like they should have been correct.
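
To make the time zone problem concrete, here is a minimal sketch of how a stale profile time zone skews a due date. The function name and the specific UTC offsets are assumptions for illustration, not how Talla or Slack actually store time zones.

```python
from datetime import datetime, timedelta, timezone

def due_time_utc(wall_clock, utc_offset_hours):
    """Interpret a naive wall-clock time in a fixed-offset zone and convert it to UTC."""
    zone = timezone(timedelta(hours=utc_offset_hours))
    return wall_clock.replace(tzinfo=zone).astimezone(timezone.utc)

# The user types "remind me at 9am" while traveling.
requested = datetime(2016, 5, 2, 9, 0)

# Hypothetical offsets: the profile still says UTC-4, but the user is at UTC-7.
stored = due_time_utc(requested, -4)   # what the bot schedules
actual = due_time_utc(requested, -7)   # what the user meant

# The reminder fires this much earlier than the user expects.
skew = actual - stored
print(skew)  # 3:00:00
```

The parse itself is correct in both cases. The error comes entirely from which offset gets attached to the parsed time.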

Users also tried to use some features that don’t exist. They said things like “Show me my overdue tasks.” That kind of query is coming (it’s a technology called NLIDB), but we aren’t at production quality with it just yet. They also tried to mark multiple tasks done at once (“mark 1 and 3 done”), which we don’t support.
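
For the multi-task case, one way such a command could be handled is sketched below. This is not something Talla does today. The pattern, the function name, and the accepted phrasings are all assumptions made up for this example.

```python
import re

# Hypothetical pattern: "mark <anything> [as] done|complete|completed".
MARK_DONE = re.compile(
    r"^\s*mark\s+(.+?)\s+(?:as\s+)?(?:done|complete|completed)\s*$",
    re.IGNORECASE,
)

def parse_mark_done(text):
    """Return the task numbers mentioned in a 'mark ... done' command, else []."""
    match = MARK_DONE.match(text)
    if match is None:
        return []
    # Pull every integer out of the middle of the command.
    return [int(number) for number in re.findall(r"\d+", match.group(1))]

print(parse_mark_done("mark 1 and 3 done"))  # [1, 3]
```

A sketch like this only covers numbered references. Something like “mark the first two done” would need more than a regular expression.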

The NLP errors we saw fell into 3 key categories. The most common was failing to extract the right time. For example, when people used 9.00 as a time, instead of saying 9:00 or 9:00 am, we didn’t parse it correctly. This one is interesting because we can’t fix it just by adding a way to catch the period, because it could cause problems parsing something like “remind me to give Jon $10.00 tomorrow at 4pm.” Numbers with decimals could represent lots of things (including time), so we have to use more context to solve this one.
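
Here is a rough sketch of what “using more context” could look like. It only rewrites a decimal as a clock time when a cue word such as “at” comes right before it. The cue list and the function name are assumptions for illustration, and this is not the fix we shipped.

```python
import re

# Hypothetical heuristic: "<cue> H.MM" is a time; any other decimal is left alone.
DECIMAL_TIME = re.compile(
    r"\b(at|by|before|after|around)\s+(\d{1,2})\.(\d{2})(?!\d|\.\d)",
    re.IGNORECASE,
)

def normalize_decimal_times(text):
    """Rewrite cue-introduced 'H.MM' as 'H:MM' when it is a valid clock time."""
    def to_clock(match):
        cue, hour, minute = match.group(1), match.group(2), match.group(3)
        if int(hour) <= 23 and int(minute) <= 59:
            return f"{cue} {hour}:{minute}"
        return match.group(0)  # not a plausible time, leave it untouched
    return DECIMAL_TIME.sub(to_clock, text)

print(normalize_decimal_times("remind me to call Bob at 9.00"))
# remind me to call Bob at 9:00
print(normalize_decimal_times("remind me to give Jon $10.00 tomorrow at 4pm"))
# remind me to give Jon $10.00 tomorrow at 4pm
```

Even this misfires on something like “sell at 10.00”, which is exactly why a single rule isn’t enough and a real solution needs to look at more of the sentence.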

Putting a comma between the date and time was another common problem. This was an easy fix, but it’s something we didn’t anticipate. When a user says “set a reminder to call Bob August 1st, 10am,” we don’t parse it correctly because we don’t see the entire “August 1st, 10am” piece as the full time entity.
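
One plausible version of this kind of fix is a small preprocessing step that drops the comma before the text reaches the date parser. The pattern and function name below are assumptions for this sketch, not our production code.

```python
import re

# Assumed shape: a day of the month (optional ordinal suffix), a comma,
# then something that looks like a clock time with am/pm.
DATE_COMMA_TIME = re.compile(
    r"(\b\d{1,2}(?:st|nd|rd|th)?),\s*(?=\d{1,2}(?::\d{2})?\s*(?:am|pm)\b)",
    re.IGNORECASE,
)

def join_date_and_time(text):
    """Remove the comma between a day number and a following clock time."""
    return DATE_COMMA_TIME.sub(r"\1 ", text)

print(join_date_and_time("set a reminder to call Bob August 1st, 10am"))
# set a reminder to call Bob August 1st 10am
```

Commas elsewhere in the sentence are left alone, so a task like “pick up eggs, milk at 5pm” passes through unchanged.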

To parse the date and time, we use a Python library called parsedatetime, along with our own logic and inference for interpreting those dates and times, but it doesn’t catch everything. For example, one problem we are seeing in Talla is when people give tasks a due date of “by the end of the day,” which we don’t parse correctly.
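
A phrase like this can be caught with a custom rule that runs before the general date parser. The sketch below is one way to do that. It assumes “end of the day” means 5:00 pm local time, and that cutoff, the pattern, and the function name are all assumptions for illustration.

```python
import re
from datetime import datetime, time

# Assumption: the working day ends at 5:00 pm. A real bot would probably
# make this configurable per user or team.
END_OF_DAY = time(17, 0)

EOD_PHRASE = re.compile(
    r"\b(?:by\s+)?(?:the\s+)?end\s+of\s+(?:the\s+)?day\b|\beod\b",
    re.IGNORECASE,
)

def resolve_end_of_day(text, now):
    """Return (task text without the phrase, due datetime), or (text, None)."""
    match = EOD_PHRASE.search(text)
    if match is None:
        return text, None
    due = datetime.combine(now.date(), END_OF_DAY)
    remaining = text[:match.start()] + text[match.end():]
    remaining = re.sub(r"\s{2,}", " ", remaining).strip()
    return remaining, due

task, due = resolve_end_of_day(
    "finish the report by the end of the day", datetime(2016, 5, 2, 10, 30)
)
print(task, "|", due)  # finish the report | 2016-05-02 17:00:00
```

This sketch ignores the case where the user asks after 5:00 pm, where the due time would already be in the past. That is another spot where the bot has to decide what the user probably meant.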

The problem with chatbots and NLP is that, if the bot is too simple, it isn’t that useful. But the more powerful the bot gets, the more complex the NLP can become, and quickly. This presents an interesting dynamic for chatbot builders and users. Users don’t want 50 chatbots, one for each thing they have to do, so that leads me to believe the chatbot space will consolidate into an oligopoly. But that consolidation will rapidly increase bot complexity and NLP failures, leading to user dissatisfaction, which will slow the bot consolidation. The user just can’t win here.

This leads me to a proposed hypothesis. Bots will consolidate into an oligopolistic structure that maximizes the difference between NLP assumptions per bot. What I mean is, if you assume bots will end up in an oligopolistic structure, there are many ways bots could consolidate: around related functional verticals, around data types, around legal/privacy/access issues for data types, around use cases, and many, many more. What will determine the ultimate structure is the set of NLP assumptions that you can make around a specific bot.

So as you think about avoiding Botageddon and wonder whether there will be a Finance bot, a Marketing bot, and a Product bot within an organization (functional area oligopoly), or a Common Sense bot, a Documentation Bot, a BusinessIntelligenceBot, and a WikiBot (data types and data access oligopoly), the answer is whichever bot configuration puts the NLP assumptions between bots at the greatest distance. This allows each bot to be maximally accurate for its domain.

In summary, NLP is hard, but if you like it, Talla is hiring.
