Bruce Wilcox
Jul 14, 2018

Machine Learning Needs Help

Machine Learning (ML) has done amazing things: teaching cars to drive, recognizing objects in images, outplaying humans in games like Go. But current bot platforms (places to define bots to perform tasks) induce companies to build crappy chatbots. It’s not the fault of ML per se. It’s more the fault of the platforms and how developers interact with them.

Bot Platforms

The big tech companies all have bot platforms, and they are all based on ML. They promise that anyone can create a bot just by feeding it sample sentences, no programming needed. It’s true: anyone can build a crappy bot just by feeding it sentences. A good bot will take almost any reasonable expression of a request and handle it. A crappy bot requires the human user to learn to make requests in a limited language. It’s training the human, not the AI.

Bot platforms exploded onto the scene in 2016. I tested those original bots, and they were uniformly crappy. Two years later, I tried them again. Consider Facebook’s Weather Channel bot. It comes from a well-respected website and has a highly limited task domain. It could work merely by detecting a city in the input, regardless of what else is said. And, in fact, it sort of does. Give it a city and it lets you see the forecast day by day (so it never has to actually interpret any date information). With two years to improve, you’d expect the bot to be pretty good. So consider these samples I tried recently:

What’s the weather in Seattle

The basic request works. It would be really awful if it didn’t.

What will the weather be in Seattle next monday?

Works, but you have to scroll by day to get to your request. It would be great if it went to it directly.

What’s the weather in Seattle next monday

Strangely, this fails, though the individual components worked in the two earlier sentences.

What’s the weather in Seattle next week

Gives you the scrollable day-by-day list, starting with today.

What’s the weather in Seattle next month

Gives you just the current weather with no scrolling.

Chicago — Ok, Chicago, IL? + menu

What a weird interface. I have to select get weather to proceed. But it works.

And Chicago? — Ok, Chicago Sanitary and Ship Cana, IL? + menu

Can’t spell Canal properly? And why would I want a sanitary canal anyway? Where’s the city?

Boston — works

And Boston? — fails

How can “And Boston?” not work?

What’s it like in boston — fails

Yeah, I didn’t really expect the idiom to work, but it’s not unreasonable to hope.

Will it rain in Boston? — works

Is it raining in Seattle — fails.

Odd that it can’t handle “is it raining”.

What is the weater in boston — fails

That last input is a problem. Developers treat ML as a complete natural language solution. It’s not. Every single website should strap on a spell checker. Typos are common, and to ML they are completely novel words. And when a typo is matched as a value in parameter detection and you echo the misspelling back to the user, it looks really awful.
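To make the point concrete, here is a minimal sketch of a spell-correction pass that runs before the input ever reaches the ML classifier. It uses Python's standard difflib module. The vocabulary list, the function name and the 0.8 similarity cutoff are all assumptions made for illustration; a real bot would draw its vocabulary from its training sentences and parameter values, and lowercasing everything is a simplification.

```python
import difflib

# Hypothetical vocabulary: words the bot knows from training, plus city names.
VOCABULARY = [
    "what", "is", "the", "weather", "in", "will", "it", "rain",
    "boston", "seattle", "chicago",
]


def correct_spelling(text, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each unknown word with its closest known word, if one is close enough."""
    corrected = []
    for word in text.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # get_close_matches returns the best candidates at or above the cutoff.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(correct_spelling("What is the weater in boston"))
```

Words with no close match are passed through unchanged, so the classifier still sees them and can fail in its usual way.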

ML’s weakness — training data

The problem with ML for natural language is the illusion that you just feed it a few sentences to train it, and that thereafter you will get logs and can sort failed sentences for retraining. Few developers know you need roughly a thousand training sentences per intent. A banking app chatbot might have 100 things it can do. That’s 100,000 training sentences needed. And few developers have that many sentences lying around. So they release a crappy bot, users get a failed experience initially, and the developer gets a bunch more sentences to train with. Lather, rinse, repeat. Actually, users get a crappy experience for a long time, as witnessed by my test of the weather bot above. And developers don’t get off easily either. Reading logs of failed detection is one thing. But developers also have to read through “successful” results to confirm they were truly successful. This is a tedious, time-intensive task for humans.

Consider the experience of JustAnswer, which connects customers to affordable professional help. Their chatbot, Pearl, acts as an intake assistant, gathering initial data from the user so the expert doesn’t have to. For legal intake, JustAnswer uses ML to classify which of roughly 20 legal specialties the user is talking about. They trained ML with around 8000 sample inputs (about a third of what I would claim is needed). This results in wildly erratic confidence results. If Pearl asks what state the user lives in and the user replies “Ky.”, ML is over 90% confident that means the user needs the family specialty. Really? I guess it had seen that word once in training for family and never in any other specialty.

All of the big successes you hear about with ML have millions of sample training inputs. One can hardly be surprised that things don’t work so well when you scale down to thousands. ML is great when you have the labelled data. Most developers underestimate just how much training data is needed.

Bot Platforms’ weakness — Dialog management

While ML may be good for question/answer and request/act bots, the dialog management systems of these platforms are horribly clunky for writing actual conversations. You have to create labeled nodes and join all the nodes together to create a flow. You may also have to name a context so that node and context names can be passed through their RESTful interfaces. Naming and joining become onerous when you have lots of nodes. This authoring friction limits how responsive your chatbot can be, because you won’t want to create many paths. So bots with limited interactivity become the norm.

Lesser problems with ML-only bot platforms

ML tells you how confident it is of its pick. But selecting what confidence level to accept is a challenge. In JustAnswer’s medical classifier, we get:

I have terminal cancer: oncology: 89%

My son has a fever: pediatrics: 89%

I am having delusions: mental health: 89%

This suggests 89% is a good level to accept, except that, taking the confidence at face value, roughly 11% of the picks we accept at that level will be wrong. And going in the wrong direction medically can really spook the user. But raising the required confidence level means making almost no use of ML. It would take a bunch of A/B testing to find the sweet spot of earnings vs. confidence. Yet humans looking at the above inputs have 100% confidence in their classification. So how many sample sentences are needed to get to that level? Or maybe the question is what kind of training sentences are needed? I don’t know.

Intents typically come with parameters, like “Seattle” and “tomorrow” in our weather intent. You are limited to the specific parameter types that the platform defines and to ones you define in simple ways. You can create an enumerated list, like a list of body parts (head, foot, hand) for “treat my head injury”. Or you can train it with the expected words that wrap before and after the parameter, so that it learns to find any word in the same place. You could train it to learn “treat my ____ injury”. But you don’t get any control over false recognition, so it will detect “minor” as a body part in “treat my minor injury”.
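The two parameter styles described above, and the false recognition that the positional one invites, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression stands in for a trained positional detector, and the BODY_PARTS set and both function names are hypothetical rather than taken from any platform.

```python
import re

# Hypothetical enumerated list of acceptable parameter values.
BODY_PARTS = {"head", "foot", "hand", "knee", "back"}

# Positional detection: whatever single word sits between "treat my" and "injury".
SLOT_PATTERN = re.compile(r"\btreat my (\w+) injury\b", re.IGNORECASE)


def extract_positional(text):
    """Return the word found in the slot position, with no check on what it is."""
    match = SLOT_PATTERN.search(text)
    return match.group(1).lower() if match else None


def extract_validated(text):
    """Return the slot word only if it is also a known body part."""
    candidate = extract_positional(text)
    return candidate if candidate in BODY_PARTS else None


print(extract_positional("treat my minor injury"))  # position alone accepts "minor"
print(extract_validated("treat my minor injury"))   # the enumerated list rejects it
```

Combining the two checks is the kind of control over false recognition that the positional approach lacks on its own.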

With a bot platform you have to name what bot you want to talk to. This is a discovery problem. On some platforms you find the bot and then click on it. With Alexa, you had to name two bots each time, saying something like “Alexa, ask TidePooler …” on every interaction. Of course, to name the skill, first you have to search a skills list to find what you wanted. And when you want to change bots, you have to find the new bot. What a painful user experience. Fortunately, Alexa is moving to make continuing to talk with the same bot easier, so you won’t have to name it on continuing interactions.

And ML can only handle one intent in an input. You can’t say “Play Rolling Stones and then the Beatles”, much less “Tell me tomorrow’s weather and then play me some Beatles”.

With bot platforms, you send all your customers’ input data and your sample sentences to a remote service. Even if you trust Google or Microsoft with your data (and certainly you don’t trust Facebook), they get to improve their speech recognition and ML systems and you don’t. And are you safe from a malevolent employee? And even if you are not paranoid about that, you are sending data over the internet, where someone can eavesdrop on the conversation. Your encryption is HTTPS, and the history of all encryption is that it is only a matter of time before it is broken and you need better encryption but don’t yet know it. Banks and the like are completely unhappy sending their data outside their premises.

All these issues make me amazed that bots have become popular at all. It just shows how hungry people are to interact with machines by voice and not GUI. The transition to voice will be more profound than that from command line to GUI.

There is help available for ML

Stewart Brand once said: “My mode is to look at where the interesting flow is and look the other direction … a cheap heuristic to find originality is don’t look where everybody else is looking. Look the opposite way.” Everyone is looking at ML and applying it everywhere. I’ve gone the other way. I created ChatScript, an open source rule-based scripting system and engine for natural language. It is used by various companies around the world.

ChatScript provides a powerful dialog manager with a pattern matching system oriented specifically for natural language. This can be used to bootstrap an ML-based system, using ML to guide the ChatScript dialog manager and using ChatScript’s pattern matching to create an initial draft of a chatbot that can interact with users and collect enough data to then train the ML.

ML and ChatScript both ultimately do the same thing: they use patterns to classify user inputs. With ChatScript, the human developer uses their understanding of language to figure out what the pattern should be and can write extremely sophisticated patterns concisely. The developer requires no external training samples to build a bot. My ChatScript chatbots have won the Loebner Prize four times for best conversational chatbot. You can ask one “what does your mother do” and then “where does she live”. My Rose chatbot has 9000 FAQ responses in addition to gambits to lead a conversation. She does this with only 13,000 rules. But it would take millions of sentences to properly train an ML system to do the same job.

ChatScript leverages how words relate to each other, particularly in ontologies and concepts. An ontology is a hierarchical relationship, like a collie is a dog is a carnivore is an animal. Concepts are lists of related things, like pets, fruits, all human occupations, etc. ChatScript starts by inheriting the ontology of WordNet (a dictionary ChatScript uses), and developers can add their own by defining concepts as lists of words and phrases or other concepts. This means you can write a rule like this:

s: (I *~2 ~like *~2 ~meat) I’m vegan, myself.

which detects only statements (s:) from the user, like “I absolutely adore good meat” or “I crave red juicy hamburger” or “I have the hots for chicken”, and then the bot replies “I’m vegan, myself.” This rule will ignore questions like “do I like meat”. The ~meat concept from WordNet has almost 200 entries. Multiply by 20 for ~like and you’ve got nearly 4000 combinations. Add in the wildcard *~2, which allows up to two other words to show up in each of those places. And the rule will also detect grammatically faulty sentences like “me really likes some steak” and “myself wanted some more bacon”. How many training sentences would ML need to properly detect this statement intent and ignore a similar question intent? Hundreds of thousands?
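For readers who don't know ChatScript, the following Python sketch imitates what the rule above does. It is emphatically not how the ChatScript engine is implemented. The concept sets are tiny hypothetical stand-ins (they have no phrase entries, so “have the hots for” is not covered), the canonical-form table is hand-made, and the question test is a crude assumption (a trailing question mark or a leading question word).

```python
import re

# Hypothetical, tiny stand-ins for ChatScript concepts.
CONCEPTS = {
    "~like": {"like", "love", "adore", "crave", "want"},
    "~meat": {"meat", "steak", "bacon", "chicken", "hamburger"},
}

# Crude canonical forms so "me really likes" can match "I ... ~like".
CANONICAL = {"me": "i", "myself": "i", "likes": "like", "wanted": "want"}

QUESTION_STARTERS = {"do", "does", "did", "is", "are", "can", "will",
                     "what", "who", "where", "why", "how"}

# Mirrors: s: (I *~2 ~like *~2 ~meat)
PATTERN = ["i", "*~2", "~like", "*~2", "~meat"]


def tokenize(text):
    words = re.findall(r"[a-z']+", text.lower())
    return [CANONICAL.get(word, word) for word in words]


def is_question(text, words):
    return text.strip().endswith("?") or (bool(words) and words[0] in QUESTION_STARTERS)


def match_from(pattern, words, p, w):
    """Match pattern[p:] against words starting at index w, with backtracking."""
    if p == len(pattern):
        return True  # pattern exhausted; trailing words are allowed
    token = pattern[p]
    if token.startswith("*~"):
        max_gap = int(token[2:])  # wildcard: skip zero up to max_gap words
        return any(match_from(pattern, words, p + 1, w + skip)
                   for skip in range(max_gap + 1)
                   if w + skip <= len(words))
    if w >= len(words):
        return False
    if token.startswith("~"):
        matched = words[w] in CONCEPTS[token]
    else:
        matched = words[w] == token
    return matched and match_from(pattern, words, p + 1, w + 1)


def respond(text):
    """Reply only to statements that match the pattern anywhere in the input."""
    words = tokenize(text)
    if is_question(text, words):
        return None
    if any(match_from(PATTERN, words, 0, start) for start in range(len(words))):
        return "I'm vegan, myself."
    return None


print(respond("I crave red juicy hamburger"))
print(respond("do I like meat"))
```

Even in this toy form, one pattern over two small word sets covers every pairing of a ~like word with a ~meat word, with or without filler words in between, which is the economy the rule-based approach is trading on.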

ChatScript has automatic spell checking. Additionally, other bot platforms discard punctuation and casing in order to achieve a more uniform input. While some users type in all lower case or all upper case and mess up punctuation, a fair number of users provide useful clues to meaning in their punctuation and casing, and ChatScript can take advantage of that.

ChatScript has built-in dialog management that is easy to author. You write interactive topics independently, drop them into your source folder, and ChatScript will compile them in and make them accessible without you having to wire individual nodes together or, usually, even name the nodes.

ChatScript is fast and small. Eight-year-old iPhones and Android phones can run ChatScript locally. A popular entertainment ChatScript chatbot has been downloaded over 60 million times and can carry on twenty solid hours of conversation with teenage girls (and did for some of them) in hundreds of topics. Authoring that took 5 person-months (16,000 rules). One would never have written that using the dialog tools of any of the other bot platforms.

Kore.AI’s bot platform

While all other bot platforms depend exclusively on ML, Kore.AI has one that combines ML and ChatScript to support enterprise bots (serious customers like major banks with millions of users). Third-party developers use fundamental meaning to define bot intents for the platform. Fundamental meaning is essentially pidgin English that could be recognized by a high school student. This means reducing a sample input to the minimal data needed, typically as a command. Given the sample “please tell me what the weather is in Seattle next tuesday”, that distills to “tell weather” as the fundamental intent. “Seattle” and “next tuesday” are parameters, which you can name using predefined types just like on the ML platforms, but you can also write patterns and idioms for custom detection, including rejecting idiomatic phrases that are false parameter detections.

Once you name the intent as a fundamental meaning, you add a list of synonyms for each word. Tell: explain, list, explicate, enumerate, talk, discuss… Weather: conditions, rain, snow, temperature … And a list of idiomatic synonym phrases: how is it, what is it like … Imagine how many training sentences you’d need to handle the combinatorics implied by these independent lists of words and phrases. You could imagine the bot platforms handling this by making a tool to generate sentences from synonym lists. They just haven’t yet.
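The sentence-generating tool imagined above is easy to sketch. The version below expands independent synonym lists into training sentences with itertools.product. The synonym lists come from the examples in the text; the two templates, the city list and the function name are assumptions made for illustration.

```python
from itertools import product

# Synonym lists taken from the examples in the text.
TELL_WORDS = ["tell", "explain", "list", "explicate", "enumerate", "talk", "discuss"]
WEATHER_WORDS = ["weather", "conditions", "rain", "snow", "temperature"]

# Hypothetical sentence templates and parameter values.
TEMPLATES = [
    "{tell} me the {weather} in {city}",
    "please {tell} the {weather} for {city}",
]
CITIES = ["Seattle", "Boston"]


def generate_sentences(templates, tell_words, weather_words, cities):
    """Yield one sentence for every combination of template and word choice."""
    for template, tell, weather, city in product(templates, tell_words, weather_words, cities):
        yield template.format(tell=tell, weather=weather, city=city)


sentences = list(generate_sentences(TEMPLATES, TELL_WORDS, WEATHER_WORDS, CITIES))
print(len(sentences))  # 2 templates * 7 * 5 * 2 cities
print(sentences[0])
```

Blind expansion also produces unnatural sentences such as “talk me the weather in Seattle”, so a real tool would need some filtering, and the count grows multiplicatively with every list you add.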

Once a bot has a list of fundamental meanings, you have no discovery problem. You just say what you want to do and a universal bot finds the appropriate fundamental meaning in various other bots. Kore uses similar fundamental meanings for detecting user-defined parameters.

Because their platform understands how English sentences are constructed, it can also handle multiple intents in a sentence like “what’s the weather in Boston and schedule a trip there for next tuesday”. That’s two different intents in two different bots, with information given for the first bot passed along as part of the data for the second. You can’t do any of that with ML bots.
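As a rough illustration of the idea, and not of Kore's actual implementation, here is a Python sketch that splits an input into clauses at “and” or “and then” and picks an intent for each clause. The keyword table and function names are hypothetical. A clause with no recognizable intent inherits the previous one, which covers “Play Rolling Stones and then the Beatles”. Resolving “there” back to Boston is not attempted.

```python
import re

# Hypothetical keyword table; a real system would use patterns or a classifier.
INTENT_KEYWORDS = {
    "get_weather": {"weather", "rain", "forecast"},
    "play_music": {"play"},
    "schedule_trip": {"schedule", "book"},
}

# Split at " and then " first, otherwise at " and ".
CLAUSE_SPLIT = re.compile(r"\s+and then\s+|\s+and\s+", re.IGNORECASE)


def detect_intent(clause):
    words = set(re.findall(r"[a-z']+", clause.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:
            return intent
    return None


def split_intents(text):
    """Return a list of (intent, clause) pairs, one per clause of the input."""
    results = []
    previous = None
    for clause in CLAUSE_SPLIT.split(text.strip()):
        # A clause with no intent of its own continues the previous intent.
        intent = detect_intent(clause) or previous
        results.append((intent, clause))
        previous = intent
    return results


print(split_intents("Tell me tomorrow's weather and then play me some Beatles"))
```

Splitting on every “and” is naive: it would break a request for “Simon and Garfunkel” in two, which is exactly where knowledge of sentence structure has to do better than string splitting.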

Third Wave AI

ChatScript represents 1st Wave AI (hand-crafted systems). ML represents 2nd Wave AI (machine learning where humans are required to label all sample inputs). 3rd Wave systems need to integrate features of the first two waves and go beyond that, learning on their own and in the moment. We don’t have 3rd wave tech yet. But it should not be ML (2nd wave) versus ChatScript (1st wave). Most serious ChatScript developers use both. ChatScript is particularly effective as a bootstrap system; ML gets trained from the results. ChatScript is then used to combine ML data and control the conversation, validating the ML results and deciding what thresholds to accept at different moments.

JustAnswer uses ML to classify specialties within a profession. ML is more reliable in some specialties than others, so ChatScript Pearl accepts different levels of confidence depending on the specialty she is currently in and which one ML wants to pivot to. Strictly using ChatScript, if the user comes into the consumer_electronics category in the TV specialty and Pearl asks “what is wrong with your TV” and the user replies “It’s not my tv, it’s my phone”, Pearl will detect the user’s denial and change over to a phone conversation. ML and ChatScript work together.
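The idea of accepting different confidence levels depending on the current specialty and the proposed one can be sketched as a small lookup. Everything here is hypothetical: the specialty names, the threshold values and the function names are invented for illustration and are not JustAnswer's actual settings or code.

```python
# Hypothetical thresholds for specific (current, proposed) specialty pairs.
PAIR_THRESHOLDS = {
    ("consumer_electronics", "family"): 0.97,
}

# Hypothetical thresholds keyed only on the specialty ML wants to pivot to.
TARGET_THRESHOLDS = {
    "family": 0.95,       # classifier assumed less reliable here, so demand more
    "immigration": 0.80,  # classifier assumed more reliable here
}

DEFAULT_THRESHOLD = 0.90


def required_confidence(current, proposed):
    """Most specific rule wins: pair first, then target, then the default."""
    if (current, proposed) in PAIR_THRESHOLDS:
        return PAIR_THRESHOLDS[(current, proposed)]
    return TARGET_THRESHOLDS.get(proposed, DEFAULT_THRESHOLD)


def accept_pivot(current, proposed, confidence):
    """Decide whether to act on the classifier's suggestion to change specialty."""
    if proposed == current:
        return False  # nothing to pivot to
    return confidence >= required_confidence(current, proposed)


print(accept_pivot("immigration", "family", 0.92))  # below the 0.95 needed for family
print(accept_pivot("family", "immigration", 0.85))  # above the 0.80 needed
```

Keeping the thresholds in data rather than in code makes them easy to tune as logs show which specialties the classifier handles well.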

Kore’s ChatScript bot passes ML the entire input and individual clauses of it in separate calls (being based in natural language, ChatScript has a notion of clause that ML doesn’t). For each call ML returns a proposed intent or parameter or nothing. These are all-or-nothing judgments, with no levels of confidence or multiple values. The bot then takes the ML-suggested intents and decides how they modify ChatScript’s judgment. For the parameter recommendations, ChatScript has the range in the sentence where the parameter was found, so it then focuses patterns on that range so the bot can confirm whether or not there is a reasonable parameter there. In training, for a sentence of the form “I want to go tuesday to”, ML might be told that tuesday is a date parameter. Similarly, if someone then said “I want to go lundi to”, ML might suggest that lundi is a date, and ChatScript can then try to validate or reject it.

The Virtual Patient project from Ohio State University uses an interesting combo system. They started with ChatScript, then built an ML equivalent. ChatScript was slightly better, but the systems failed in different ways. So they then built a binary classifier to judge which result was better on any given input, and that significantly improved their success rate.

The future

Large-volume ML training data is beyond most developers, and the dialog management tools of bot platforms are typically clunky and discourage extensive dialog. ML and these platforms will get there eventually. But ChatScript is here now. And combining both yields powerful chatbots.

Bruce Wilcox

https://en.wikipedia.org/wiki/Bruce_Wilcox I am Director of Natural Language Strategy for JustAnswer.com, former DNLS for Kore.AI. 4x Winner of Loebner Prize.