Machine learning & language complexity: why chatbots can’t talk… yet

Letters — What do these symbols even mean for a machine?

First and foremost, let’s make sure we are on the same page! Defining terms is never a waste of time, so I will start by explaining what AI, ML and DL are.

AI <> ML <> DL 🗺

1) Basic mapping

What is the difference between AI (artificial intelligence), ML (machine learning) and DL (deep learning)?
Analogies and information mapping are great tools to help us understand concepts. I will use an information map I discovered on DL4J.

“You can think of DL, ML and AI as a set of Russian dolls nested within each other. DL is a subset of ML, which is itself a subset of AI.”

2) AI (artificial intelligence)

John McCarthy, one of the godfathers of AI, defined artificial intelligence as follows:

“the science and engineering of making intelligent machines.”

The thing is, AI is a suitcase word for any computer program that does something smart. And the concept of smartness is, itself, rather hard to pin down. AI can refer to science fiction and the future, but 20 years ago it also referred to the spam filters in our mailboxes. AI is all of this at the same time.
In other words, there are different types of AI, and they are not all as smart as they sound. In fact, most of them are not smart at all.

3) ML (machine learning)

ML is composed of two words that are each quite tricky to define; the concepts they entail carry different meanings in different contexts. You can picture machine learning from three main perspectives, as follows:

  • the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem (see the sketch after this list);
  • an application’s ability to automatically learn and improve from experience without being explicitly programmed;
  • the development of a generic algorithm that you feed with data and that builds its own logic based on that data.
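To make that first perspective concrete, here is a minimal sketch using scikit-learn (the bundled iris dataset and the choice of LogisticRegression are just illustrative assumptions): the exact same generic algorithm would work on completely different data, with no problem-specific code.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A bundled toy dataset; nothing below is specific to it.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The generic algorithm builds its own logic from the data...
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# ...and tells us something interesting about unseen examples.
print(model.score(X_test, y_test))
```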

4) DL (deep learning)

A neural network can be built like a club sandwich, with layers stacked one on top of the other. Within and between these layers, groups of artificial neurons (a neuron is a simple computational unit, e.g. a chain of multiplications and additions) are stimulated according to what happens in the neurons they are connected to. It is quite similar to what happens in our own brain when neurons are stimulated by other neurons, or at least it is inspired by our understanding of how the brain works.
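To show just how simple a single neuron is, here is a minimal sketch (the inputs, weights and bias are made-up toy values): a chain of multiplications and additions, followed by a squashing function.

```python
import numpy as np

def neuron(inputs, weights, bias):
    # Chain of multiplications and additions...
    z = np.dot(inputs, weights) + bias
    # ...followed by a non-linear activation (here a sigmoid).
    return 1.0 / (1.0 + np.exp(-z))

# Toy signals coming from three connected neurons.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.7, -0.2])
print(neuron(x, w, bias=0.1))
```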

The deep learning paradigm shift came in 2012, when Geoffrey Hinton and two of his students showed that deep neural networks (networks with more than two or three layers of neurons) beat the state of the art in image recognition.


Machine Learning for Language

As mentioned above, deep neural networks have come to dominate pattern recognition. They blew the previous state of the art out of the water on many computer vision tasks, but also in speech recognition and, of course, language-related tools (translation, summarization, etc.).

In this article, I would like to focus on perhaps one of the most fascinating areas of deep learning for language: word embeddings, a method which represents each word as a mathematical vector.

Word embeddings map the words of a language to high-dimensional vectors (e.g. 300 dimensions). The vectors are built on the idea that words with similar meanings have similar contexts: they appear in similar sentences, close to each other. The drawing below (from Christopher Manning’s lecture at Stanford University) illustrates how the model is trained to predict the context distribution.

Word2vec skip-gram prediction
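To see what “predicting the context” means in practice, here is a minimal sketch of how skip-gram training pairs are extracted (the sentence and the window size are arbitrary toy choices): for every centre word, the model learns to predict the words appearing within a fixed window around it.

```python
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2  # number of context words considered on each side

# For each centre word, collect (centre, context) training pairs.
pairs = []
for i, centre in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((centre, sentence[j]))

print(pairs[:4])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```

A library such as gensim wraps this whole process (pair extraction plus training) behind a single Word2Vec class.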

After training, you can get a sense of the word embedding space by visualizing the vectors with t-SNE (a popular technique for visualizing high-dimensional vectors).

Left: Country/Capital City; Right: Company/CEO
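A minimal sketch of such a visualization with scikit-learn and matplotlib, assuming the embeddings are already available as a NumPy matrix (random vectors stand in for trained ones here):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in for trained embeddings: 1,000 "words" in 300 dimensions.
vectors = np.random.rand(1000, 300)

# Project down to 2-D; perplexity is a knob worth tuning.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.show()
```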

Let’s take a popular example with the words King, Man, Queen and Woman.

All of this knowledge comes simply from looking at lots of words in context, with no other information provided about their semantics.
Another advantage of such a technique is that you go from a space whose dimension equals the vocabulary size (~100k) to a dense 300-dimensional space.

High-dimensional and sparse vs. low-dimensional and dense: word embeddings reduce the complexity of a model and are more computationally efficient.
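A quick back-of-the-envelope sketch of that gain (the word index 42 is arbitrary):

```python
import numpy as np

vocab_size = 100_000  # one dimension per vocabulary word
embed_dim = 300       # dense embedding size

# One-hot: a single 1 lost among 100k zeros.
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0

# Dense embedding: every dimension carries information.
dense = np.random.rand(embed_dim)

print(one_hot.nbytes, dense.nbytes)  # 800000 vs 2400 bytes as float64
```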

A vector decomposition can look like this:

King − Man + Woman = Queen
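With gensim and a set of pre-trained vectors, this decomposition can be checked in a couple of lines (a sketch assuming the 'glove-wiki-gigaword-100' model available through gensim’s downloader):

```python
import gensim.downloader as api

# Download small pre-trained GloVe vectors on first run.
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected: [('queen', <similarity score>)]
```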

Chatbots can’t talk… yet!

Chatbots can’t talk yet… but humans can, anywhere and at any time.

With machine learning we have new tools allowing human beings to help each other out. Among many possibilities, here are a few examples of what human beings can do with machine learning:

  • receive the questions they are the most skilled to answer;
  • quickly assess whether a request is worth acting on;
  • easily reuse previous advice given for similar recommendation needs;
  • detect similar patterns in customer requests, and act accordingly (see the sketch after this list);
  • discover recurring requests related to UI misunderstandings, process difficulties or frustrating issues.
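As an illustration of the pattern-detection point above, here is a minimal sketch using scikit-learn: a new customer request is matched against previously handled ones by comparing TF-IDF vectors (all the requests are made-up examples).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Past requests already handled by human agents (toy examples).
past_requests = [
    "I cannot reset my password",
    "Which hiking shoes do you recommend for the Alps?",
    "My invoice shows the wrong amount",
]

new_request = "How do I change my password?"

# Represent every request as a vector and compare them.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(past_requests + [new_request])
similarities = cosine_similarity(vectors[-1], vectors[:-1])

# Surface the most similar past request (and its answer) to the agent.
best = similarities.argmax()
print(past_requests[best])  # "I cannot reset my password"
```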

Thanks to Alexa, Siri and countless chatbots and automated customer support lines, computers are gradually learning to talk. The trouble is that they are still very easily confused.

1) How well can chatbots actually talk?

There are two main branches of bots: task-oriented bots and chit-chat bots.

  • Task-oriented bots are the most common ones: personal assistants that help users perform certain tasks, combining rules and statistical components.
  • Chit-chat bots don’t have a specific goal; they focus on natural responses, using variants of seq2seq models that convert sequences from one domain to another (see the sketch below).
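A minimal sketch of the seq2seq idea in Keras (vocabulary and layer sizes are hypothetical): an encoder LSTM summarizes the input sequence into a state vector, and a decoder LSTM generates the output sequence conditioned on that state.

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 10_000  # hypothetical vocabulary size
embed_dim = 256
latent_dim = 512

# Encoder: read the input sequence and summarize it in its final state.
enc_in = keras.Input(shape=(None,), name="source_tokens")
x = layers.Embedding(vocab_size, embed_dim)(enc_in)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(x)

# Decoder: generate the output sequence, conditioned on the encoder state.
dec_in = keras.Input(shape=(None,), name="target_tokens")
y = layers.Embedding(vocab_size, embed_dim)(dec_in)
y = layers.LSTM(latent_dim, return_sequences=True)(y, initial_state=[state_h, state_c])
out = layers.Dense(vocab_size, activation="softmax")(y)

model = keras.Model([enc_in, dec_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

In a real chit-chat bot the decoder is run token by token at inference time; this skeleton only shows the training-time wiring.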

Andrew Ng (co-founder of Coursera and former Chief Scientist of Baidu) explains it well:

“Most of the value of deep learning today is in narrow domains where you can get a lot of data. Here’s one example of something it cannot do: have a meaningful conversation. There are demos, and if you cherry-pick the conversation, it looks like it’s having a meaningful conversation, but if you actually try it yourself, it quickly goes off the rails.”

In fact, anything too open-domain (like fixing a technical problem or finding the perfect pair of shoes for a hike in the Alps) is beyond what we can currently do. In the meantime, we can use these systems to assist human workers, who then refine and correct the suggested responses. That’s much more feasible.

2) Why chatbots have yet to learn how to talk

When they interact with others, people tend to express the same intent with different words, potentially spread over several sentences with different word orders. Talking to chatbots can therefore be challenging: current chatbot solutions don’t handle this diversity, so you’d better format your dialogue carefully in order to be understood. This is frustrating.

Today, AI barely does better than sheer coincidence when deciphering the ambiguity in sentences like the following [from the Winograd Schema Challenge, a test for machine intelligence]:

“The city councilmen refused the demonstrators a permit because they feared violence.” (Who feared the violence?)

Human beings resolve this so easily that we rarely even notice the ambiguity exists. It’s not that easy for AI yet. Put differently, AI can stop you from going outside without your umbrella thanks to the weather forecast, or book a flight ticket for you, but it is far from ready to help you fix a technical problem.

3) How chatbots can possibly learn to talk

With deep learning, the goal is to learn the authentic variations in the way people express different thoughts, using bottom-up data flows built from real examples. But more research is needed, because current deep learning faces two major drawbacks:

  1. A lack of model interpretability (i.e. why did my model make that prediction?).
  2. The amount of data that deep neural networks require in order to learn (i.e. they are data hungry).

First, research into so-called one-shot learning may address deep learning’s data hunger. Second, deep symbolic learning, i.e. enabling deep neural networks to manipulate, generate and otherwise cohabit with concepts expressed as character strings, could help solve the interpretability problem.


The fundamental challenges related to reasoning (e.g. analogical reasoning, hypothesis-based reasoning), transfer learning, interpretability, robustness against adversarial examples and hierarchical representations may be addressed by combining symbolic reasoning with deep neural networks and deep reinforcement learning. The machine learning community and deep learning researchers are increasingly moving in that direction.

If you are interested in reading more about these different topics, you will find more resources in the publications below: