So You Want to Write a Chatbot?
You may have been reading about artificial intelligence, progress in machine learning, and maybe even deep learning: leaps of progress in recognising objects in photographs, playing Go, and processing text. Now you want to use this technology to write a chatbot.
That’s great! So great, in fact, that the dream of interacting with the computer by simply writing natural language predates digital computers completely. In 1950, Alan Turing proposed the imitation game, where a human judge and a computer chat via a teleprinter. Turing’s argument was a bit more subtle, but the setup is now known as the Turing test, and if the computer can fool the human judge into thinking it too is human, it has passed the test and is officially intelligent. Let’s not ponder too much right now the fact that a chatbot has become the one true test of AI, and move on.
In 1966, Joseph Weizenbaum created ELIZA, a computer program playing the role of a psychotherapist. People could chat with ELIZA, and it used simple pattern-matching rules to echo parts of their statements back (“I don’t like my job”, “Why don’t you like your job?”), or fell back on generic responses if this failed (“Please go on”). A modern version of ELIZA is built into Emacs (M-x doctor), and chatting with it still works much the same way.
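To see how little machinery this takes, here is a minimal sketch of an ELIZA-style responder in Python. The rules and reflections below are invented for illustration, not Weizenbaum’s actual script, but the principle is the same:

```python
import re

# ELIZA-style rules: a regex pattern plus a response template. These
# particular rules are invented for illustration; Weizenbaum's actual
# script was larger, but worked on the same principle.
RULES = [
    (re.compile(r"i don'?t like (.*)", re.I), "Why don't you like {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.I), "Tell me more about your {0}."),
]

# Swap first and second person, so echoed fragments read naturally.
REFLECTIONS = {"my": "your", "i": "you", "me": "you", "am": "are"}

def reflect(fragment):
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

def respond(statement):
    for pattern, template in RULES:
        match = pattern.search(statement)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."  # the generic fallback

print(respond("I don't like my job"))   # Why don't you like your job?
print(respond("The weather was nice"))  # Please go on.
```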
Weizenbaum created ELIZA to “demonstrate that the communication between man and machine was superficial”, but it rather backfired. Lots of people found conversations with ELIZA very rewarding and imagined deep understanding and empathy on the part of the computer, although nothing of the sort was happening. Weizenbaum wrote:
“I had not realised… that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.”
Whereas ELIZA was specifically made to mimic intelligence with cheap tricks and smoke and mirrors, researchers soon turned to more sincere efforts.
In 1970, Terry Winograd created SHRDLU. SHRDLU could talk about and carry out actions in a simple blocks world: a tabletop scene with a collection of blocks in different colours and shapes. Here is a bit of the standard example conversation:

Person: Pick up a big red block.
Computer: OK.
Person: Grasp the pyramid.
Computer: I DON’T UNDERSTAND WHICH PYRAMID YOU MEAN.
SHRDLU was much more sophisticated than ELIZA. Incoming text was parsed and actually understood, and a rich dialogue about things in the blocks world was possible. SHRDLU could reason about classes of things and the rules of the world (“Can a pyramid support a pyramid?”), as well as individual blocks (“Where is the red cube?”). It could keep track of context: which objects you were talking about right now (“Where is the red cube?” “It is on top of the blue cube.” “Put it in the box.”). You could name objects, and classes or collections of objects. You could interrogate the history of the world or the actions of the computer (“Where was the red box before?”, “How many objects did you touch?”). It was really quite amazing.
Now the question is — if it was so amazing in 1970 — how come I am sitting here writing about how to do chatbots 47 years later?
Why is this really tricky?
I do a lot of complicated tasks in cooperation with my computer today — I edit photos in Photoshop, create vector graphics in Inkscape, manage my files in Finder, or even write this text in a text-editor. In each case, the computer offers me a set of useful abstractions. In Photoshop, I think in terms of adjustment layers, selections, masks, etc. Sometimes these abstractions are leaky, and I have to worry about colour-spaces, or image resolution, but very seldom do I have to go down to the actual “bottom” representation, a massive 1D array of RGB colour-values for each pixel.
In the Finder I worry about file sizes, extensions, names and folders, permissions, etc., but unless things go seriously badly I never have to worry about how these things actually get saved to a disk (again essentially a massive 1D array of bytes).
All of these programs offer useful abstractions that let me do things that I would never be able to do at the lowest level. However, common to all of them is that all the work happens on the computer’s terms. I need to learn the computer’s language and express myself in terms of the metaphors the computer understands in order to get anything done. And this is fine! There is a learning curve, but it’s an acceptable tradeoff.
A chatbot reverses this — the interaction happens at the same level as human-to-human communication. Instead of me lowering myself to the level of the Finder, I expect the Finder-chatbot to work with natural language, which is the medium I’ve practised and inhabited for decades.
This is hard!
Natural Language is a mess
From a programmer’s point of view, natural language is total chaos. People rely on layers and layers of shared assumptions and common-sense background knowledge to communicate. They say “what places are open for lunch”, and they mean lunch as a time specification (roughly 11:00–14:00). They mean lunch as in a meal, so they want to eat, and probably at a restaurant, not a supermarket. They say “make an appointment with my sister for lunch”, and mean the sister in the same city, not the one working in another country. They use synonyms just to mess with you (“I love/adore/cherish coffee”), and they rephrase entire sentences to mean the same thing while sharing no words: “Where can I have lunch nearby?” “Any places for a bite to eat around here?”
Even if the user isn’t actively out to trick you, natural language has layers and layers of ambiguity all on its own (“John kissed his wife, and so did Sam” — which wife did Sam kiss?).
Of course, a lot of this can be fixed. We can code in information about lunch-the-time and lunch-the-meal. We can provide lists of synonyms. But all of this introduces domain dependence: our lunch-booking bot will know nothing about the weather, nothing about Pokémon, and nothing about politics.
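To make the domain dependence concrete, here is a sketch of what “coding it in” might look like for a hypothetical lunch-booking bot. All the phrases and mappings below are invented:

```python
# Hand-coded "knowledge" for a hypothetical lunch-booking bot. Every
# entry is lunch-specific; none of it transfers to weather, Pokémon
# or politics.
LUNCH_HOURS = ("11:00", "14:00")  # lunch as a time specification

MEAL_PHRASES = {  # synonyms and rephrasings, collected by hand
    "lunch": "lunch",
    "a bite to eat": "lunch",
    "something to eat": "lunch",
}

VENUES = {"lunch": ["restaurant", "café"]}  # a meal implies somewhere to eat

def interpret(utterance):
    """Map an utterance to (meal, time window, venue types), if possible."""
    text = utterance.lower()
    for phrase, meal in MEAL_PHRASES.items():
        if phrase in text:
            return meal, LUNCH_HOURS, VENUES[meal]
    return None  # out of domain: the bot simply has no idea

print(interpret("Any places for a bite to eat around here?"))
# ('lunch', ('11:00', '14:00'), ['restaurant', 'café'])
print(interpret("Will it rain tomorrow?"))  # None
```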
This was the main reason the success of SHRDLU failed to spread to other domains. SHRDLU had a very simple, closed world. It had a list of nouns, verbs and adjectives that made sense in that world, and a parser that could recognise various types of questions, requests to carry out actions, and so on. But in the end, it wasn’t very robust, and adapting the parser to larger, more complicated domains quickly became an unmaintainable mess. Also, the blocks world is wonderfully concrete and simple: an object either is a cube or it isn’t, it either is red or it isn’t.
The real world isn’t like this.
The ELIZA Effect
With ELIZA, people read layers of complexity into the conversation, and did things like ask Weizenbaum to leave the room so they could talk to ELIZA in private. Language is deeply soaked in meaning for us, and every human has years of practice reading very subtle cues in how things are phrased and how words are used. Even if you know better, it is very hard to stop doing this when talking to a computer. Even if you know that the words are mostly empty tokens to the computer, it is hard not to imbue them with the same semantics you would use with another person.
At first sight this may look great: people are easy to fool with simple syntactic tricks! You don’t even have to do the hard things to create a chatbot!
However, often it’s setting yourself up to fail: if people believe your bot is human-like, they will treat it like a human, and they will expect it to understand jokes, irony and flirting (when IKEA retired their virtual assistant Anna after 10 years, they said 50% of the questions she got were sexual in nature!). They will use ambiguous, sloppy language. They will assume a world of common background knowledge that is shared between humans, but not with your bot.
There is just so much rope to hang yourself with!
Failing gracefully and recovering
A good thing about human communication is that, if you can swallow a bit of pride, saying “I don’t know” or “I don’t understand” is perfectly fine. For a bot this is good: since we’re a long way off generic intelligence, there will always be things out of scope for your bot. It is, however, also bad, since knowing what you don’t know is often challenging in itself. And it is made worse by humans being superb at adapting dynamically:
“I don’t know what this golf-cart you speak of is.”
— “It’s like a small electric vehicle to be driven on the golf-course.”
“OK, fine.”
Dynamically acquiring novel concepts like this is very hard for a computer.
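What a bot can do is fail gracefully: only act when some classifier is confident enough, and otherwise admit defeat. A minimal sketch, with an invented threshold and toy handlers:

```python
# Only act when the classifier is confident; otherwise admit defeat.
# The threshold, intents and handlers here are all invented.
CONFIDENCE_THRESHOLD = 0.7

HANDLERS = {
    "greeting": lambda: "Hello!",
    "bus_times": lambda: "The next bus leaves in 5 minutes.",
}

def reply(utterance, classify):
    """classify is any function returning an (intent, confidence) pair."""
    intent, confidence = classify(utterance)
    if confidence < CONFIDENCE_THRESHOLD or intent not in HANDLERS:
        # Failing gracefully beats confidently answering the wrong thing.
        return "Sorry, I don't understand. Could you rephrase that?"
    return HANDLERS[intent]()

# Toy classifiers for demonstration; real confidences come from a model.
print(reply("hi there", lambda u: ("greeting", 0.95)))   # Hello!
print(reply("what's a golf-cart?", lambda u: ("unknown", 0.2)))
```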
So how do we do it?
As usual, programmers are lazy. The holy grail is as little work as possible — no hard-coded dialogue-trees, no domain adaptation, no rules. Ideally, this would work like a very independent, creative and hard-working human assistant: here’s the book of customer data, here’s a list of business processes, now pick up this phone when it rings and answer questions.
We call this an end-to-end machine learning system: raw data comes in, raw output comes out.
Just to be clear — we are nowhere near this scenario today. But people are making progress in using machine learning to carry out the three core processes in a chatbot brain: parsing and understanding the input, modelling the conversation and deciding what to say next, and finally actually writing the output sentences.
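Those three processes form a natural pipeline. Here is a sketch of its overall shape, with every stage stubbed out; the intent, policy and sentence below are placeholders, not the output of any real model:

```python
# The three core processes as a pipeline. Each stage is a stub with a
# hard-coded result; in an end-to-end system all three would be learned.

def understand(utterance):
    """1. Parse the input into some structured meaning representation."""
    return {"intent": "get_weather", "city": "Berlin"}  # placeholder

def decide(meaning, state):
    """2. Update the conversation state and pick the next action."""
    state["history"].append(meaning)
    return "tell_weather"  # placeholder policy

def generate(action, state):
    """3. Turn the chosen action into an actual sentence."""
    return "It is sunny in Berlin."  # placeholder realisation

state = {"history": []}
meaning = understand("What's the weather like in Berlin?")
action = decide(meaning, state)
print(generate(action, state))  # It is sunny in Berlin.
```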
If you want to sit down and implement a chatbot today, here are a few approaches.
The Dialogue Interface
This one is simple: just don’t do the actual “chat” part, i.e. don’t allow any natural language input at all. Instead, you can offer standard UI components: drop-downs, buttons, sliders, etc.
The system can still look like a chatbot visually. You can have bubbles and a conversation-like interaction: pick an option, get another choice, enter something, and so on. You have to code up the dialogue tree yourself, but once that is done the bot can respond sensibly to whatever the user picked.
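A dialogue tree can be as simple as a table of nodes, each with a message and the buttons leading onwards. A minimal sketch, with made-up airline content:

```python
# A hand-coded dialogue tree: each node has a message and the buttons
# offered, each leading to another node. The airline content is made up.
DIALOGUE_TREE = {
    "start": {
        "message": "Hi! What can I help you with?",
        "options": {"Flight status": "flight", "Boarding card": "boarding"},
    },
    "flight": {
        "message": "Your flight KL1234 departs on time at 14:35.",
        "options": {"Back": "start"},
    },
    "boarding": {
        "message": "Here is your boarding card QR code.",
        "options": {"Back": "start"},
    },
}

def step(node_id, choice=None):
    """Return the next message, following the chosen button if given."""
    node = DIALOGUE_TREE[node_id]
    if choice is not None:
        node = DIALOGUE_TREE[node["options"][choice]]
    return node["message"]

print(step("start"))                   # Hi! What can I help you with?
print(step("start", "Flight status"))  # Your flight KL1234 departs ...
```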
This may seem like cheating, but the easy implementation and low risk of things going wrong mean that many of today’s deployed “conversational interfaces” are just this! The airline KLM has a very useful bot on Facebook Messenger that keeps you up to date on your flight times, sends you your boarding card QR code, etc., and it accepts no text input. Likewise, the Quartz news app only offers buttons for interaction.
Pros:
- Avoiding all problems of handling natural language
- There is no machine learning involved, so no black-box to struggle with. The bot does as it is told, but no more
- No danger of the user moving outside the target domain
- Discoverability: all the options are right there
- Still delivers many of the benefits of dialog interfaces: it feels friendly, the user is guided through the process, never overwhelmed by choice
Cons:
- Why bother with a chatbot at all? Maybe a well-designed form or wizard is better?
- Scope (obviously) limited to what you code
- You must hardcode everything
Scripted Intents
One step up, we allow actual free-text input. Natural Language Processing (NLP) is used to classify intents and extract parameters. For instance, “hi” can map to the greeting intent, and “please transfer €50 to Bob” maps to the transfer intent, with the parameters amount set to €50 and receiver set to Bob.
This is what many chatbot platforms, like init.ai or the IBM Watson Conversation service, offer.
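The classification is normally done by a trained model, but the idea fits in a few lines. Here is a toy version with regex patterns standing in for the model (the patterns and intents are invented):

```python
import re

# A toy intent classifier: regex patterns standing in for a trained
# model. Real platforms learn these from example utterances instead.
INTENTS = [
    ("greeting", re.compile(r"\b(hi|hello|hey)\b", re.I)),
    ("transfer", re.compile(
        r"transfer\s+€?(?P<amount>\d+)\s+to\s+(?P<receiver>\w+)", re.I)),
]

def classify(utterance):
    """Return (intent, parameters) for the first matching pattern."""
    for intent, pattern in INTENTS:
        match = pattern.search(utterance)
        if match:
            return intent, match.groupdict()
    return "unknown", {}

print(classify("hi"))
# ('greeting', {})
print(classify("please transfer €50 to Bob"))
# ('transfer', {'amount': '50', 'receiver': 'Bob'})
```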
Pros:
- Actual chat, feels sophisticated
- Relatively easy to realise, many technology providers
- Requires relatively little training-data
Cons:
- Classification is not natural language understanding; you are still limited by the intents you program support for
- Discoverability is difficult: if your bot only supports weather reports, stock prices, and bus-times, why open up to a world where the user can ask anything?
- There is so much rope to hang yourself with: if you let people ask anything they will, and the initial feeling of sophistication can quickly give way to disappointment
The approach above can be trained with a relatively modest amount of training data; so little, in fact, that you can easily generate it by hand. However, if you have a large corpus of logs from an existing (presumably human) chatbot operator, you can do a bit better!
Learn to pick the right reply
Imagine a chatbot doing some sort of software support. You have probably already answered the vast majority of the questions you get, and they require no special handling.
There is probably also a long-tail of more complicated questions, but if you can automate the top 90% of incoming volume a lot has already been gained.
This is exactly what people do with machine learning. Given a large existing chat log and an input from a user, they try to find the most similar question/answer pair they have already seen. This is, for instance, the approach Facebook encourages with their bAbI dialog tasks.
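The simplest version of this is nearest-neighbour lookup over TF-IDF vectors. A minimal sketch using scikit-learn, with an invented three-entry chat log:

```python
# Retrieval-based replies: find the most similar previously answered
# question and reuse its answer. Uses scikit-learn; the three-line
# "chat log" is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chat_log = [
    ("How do I reset my password?", "Go to Settings > Account > Reset password."),
    ("The app crashes on startup", "Please reinstall the latest version."),
    ("How do I change my email address?", "Go to Settings > Account > Email."),
]

questions = [question for question, _ in chat_log]
vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

def best_reply(user_input):
    similarities = cosine_similarity(
        vectorizer.transform([user_input]), question_vectors)[0]
    # No confidence threshold here, so unseen questions happily get a
    # random-looking answer (the awkward failure mode listed below).
    return chat_log[similarities.argmax()][1]

print(best_reply("I forgot my password, how do I reset it?"))
# Go to Settings > Account > Reset password.
```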
Pros:
- Works well in places with a large number of repetitive questions, as a way of automating the lookup in a FAQ database.
Cons:
- Awkward failure mode: the bot will reply with random nonsense
- Hard to detect failure: if the question didn’t match anything we’ve seen before, how do we know?
- A black box model (typically), hard to tweak or adjust to customer needs.
A Knowledge base approach
Knowledge bases and explicit representations of formal knowledge have had a few setbacks, from the disappointment after the first SHRDLU excitement to the AI expert-system winter of the 1980s. But the ghost of the semantic web is haunting the corridors of the modern web: when you ask Google “how old is Barack Obama”, you get not just hits from the web, but the actual answer right there, straight out of the Google Knowledge Graph.
People have made great progress in mapping natural language questions to structured queries since the days of SHRDLU. Machine learning-based approaches are now much more robust and much less domain-specific. Embedding the actual knowledge graphs into neural networks, where inference can be combined with fuzzy, contradictory interpretations, is an active research area.
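Stripped to its core, the idea is to map the question to a structured query and run it against a store of facts. A toy sketch with a two-fact triple store and hand-written patterns; real systems learn this mapping instead:

```python
import re
from datetime import date

# A toy triple store. Real knowledge graphs hold billions of facts;
# these two are just for illustration.
TRIPLES = {
    ("barack obama", "birth_year"): 1961,
    ("barack obama", "spouse"): "Michelle Obama",
}

def answer(question):
    """Map a question to a structured lookup via hand-written patterns."""
    q = question.lower().rstrip("?")
    match = re.match(r"how old is (.+)", q)
    if match:
        birth_year = TRIPLES.get((match.group(1), "birth_year"))
        if birth_year is not None:
            return date.today().year - birth_year  # approximate age
    match = re.match(r"who is (.+) married to", q)
    if match:
        return TRIPLES.get((match.group(1), "spouse"))
    return None  # question shape or fact not in the knowledge base

print(answer("How old is Barack Obama?"))         # e.g. 64
print(answer("Who is Barack Obama married to?"))  # Michelle Obama
print(answer("Does Barack Obama like coffee?"))   # None
```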
Pros:
- Can actually deliver a system that can answer questions the chatbot developers did not anticipate
- Amusing failure modes
Cons:
- You need an actual knowledge base (duh). If your goal is to make a natural language API interface, it is less useful.
- Cutting edge research, maybe too difficult/poorly understood for your production environment
Conclusion
The above is an attempt to categorise possible approaches to writing a chatbot by the amount of machine learning used, or conversely the amount of manual hand-coding labour required.
There are of course many other ways to classify chatbots. We have also completely ignored the different challenges posed by different types of bots: an idle-chitchat bot for entertainment calls for very different solutions than an educational bot answering questions about WWII, or a customer support bot that is essentially a natural language interface to a reset-password or change-email API.
Any working, deployed system will probably end up being a hybrid of all these things. Possibly also a hybrid with human operators who take over when the bot is out of its depth, either explicitly (“I am confused, please wait for my colleague Bob to take over”) or behind the scenes, like Facebook’s M assistant.
The real pie in the sky here is that your chatbot should not be limited to the answers or questions you programmed into it. Just as Photoshop provides me with adjustment layers that I am free to use in ways not anticipated by the programmers, the chatbot should provide you with a language in which to do things, affording novel combinations and constructions that you, as the developer today, cannot imagine.
I half feel that chatbots are the skeuomorphism of 2018: “look, the system can talk! It’s just like chatting to a person!” is like saying “look, it’s leather! It’s just like your desktop calendar”, when actually it is something completely different. I still think they have their place. In particular, in cases where users may be overwhelmed by choices, chatbots offer a friendly, guided process. And if you are up-front about being a limited bot, you can manage expectations and avoid disappointment. I find that the Google Now Assistant, which doesn’t pretend to be more than an intelligent search field, works well.
Another benefit is that chatbots do not require people to learn a new (graphical) interface: they can be embedded in WhatsApp, Allo or Facebook Messenger, which people already know how to use. (That does not mean the chatbot doesn’t have its own limitations in what language it understands or what actions it can perform; that, too, forms an interface of sorts.)
Finally, once coupled with voice recognition, chat based interfaces are great if you are driving or otherwise engaged with your hands or visual attention.
Just remember that the technology to realise Star Trek’s Data isn’t quite there yet, and you’ll be fine!