Current Chatbot Conversation Designs

(Part 1: Data Analysis)

Mohammed Terry-Jack
6 min read · Sep 13, 2018

(Insights from the analysis of the Loebner Prize 2017 & 2018 chatbot transcripts)

Introduction

The insights in this article highlight the current strengths and weaknesses of chatbot conversation designs and provide a clear-cut strategy for improving your own bot. Enjoy!

Methodology

In this two-part article, chatbot transcripts from the 2018 and 2017 Loebner Prize were classified and analysed to discover which bots currently perform the best (or worst) in different areas of human conversation (part 1). These bots were then researched and reverse engineered to uncover which AI methodologies and techniques are responsible for the best (and worst) performances in each area of conversation (part 2), thus allowing others to better decide which strategies to adopt (or avoid) for their own bot designs.

The Loebner Prize was the first Turing test (a method of inquiry for determining whether or not a computer is thinking like a human) to be held annually, and it is still run to this day. The competition is supported by the AISB, the world's first AI society, founded at Bletchley Park, where Alan Turing (father of the Turing test) worked as a code-breaker during WW2.

Dataset

The finalist selection quiz consists of 20 questions, each worth a maximum of 2 points (and a minimum of 0), for a total of 40. The best bots across both years scored 27/40 points, well below the average human's score on the same test (approx. 39/40 points).

Image source: http://aisb.org.uk/events/loebner-prize#Process16

The full transcripts used in this analysis can be found here and here. (NB: throughout this article, questions referenced from the transcripts will be cited in the following format: “[Q3, 2018]”, where “Q3” means “question 3” and “2018” is the year of the Loebner Prize competition.)

Question Classifications

Using the finalist selection transcripts for the Loebner Prize 2018 and 2017, the questions have been classified into several categories (loosely clustered around the core problem to be solved). Each category has also been ranked according to its level of difficulty (determined by the amount of sophistication and artificial intelligence required to actually solve the underlying problem). The question categories are:

1) Personal Opinions [*]

2) Personal Facts [**]

3) General Facts [***]

4) Recommendations [***]

5) Requests [***]

6) Remembering [***]

7) Disambiguation [****]

8) Common-Sense [*****]

What does it all mean???

1. Personal Opinions [Difficulty: *]

Personal opinion-type questions are things like greetings, pleasantries, social customs or norms. They are questions for which the bot is limited to a fairly specific set of replies from which it should not deviate (e.g. human: “what’s up?” bot: “nothing much/not much/nothing/i’m fine thanks/etc”). If the bot were to provide a more creative or factually correct answer to these types of questions, it would quickly estrange the user from the conversation (e.g. human: “what’s up?” bot: “up is the opposite of down”. human: “ugh. Another bot”). Example questions include: Good Afternoon [Q1, 2018]. Do you consent to having this conversation recorded? [Q2, 2018]. How are you feeling right now? [Q8, 2018]. Thank you for participating in this test. Is there anything else you’d like to add? [Q20, 2017]
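
To make the “fixed set of replies” idea concrete, here is a minimal sketch of a canned small-talk responder. Everything in it (the patterns, the reply pools, the function name) is invented for illustration and is not taken from any of the competing bots.

```python
import random
import re

# Hypothetical canned-reply table for small talk: each pattern maps to a
# pool of safe, socially expected answers the bot may pick from.
CANNED_REPLIES = {
    r"\bwhat'?s up\b": ["Not much.", "Nothing much, you?"],
    r"\bhow are you\b": ["I'm fine, thanks. How are you?"],
    r"\bgood (morning|afternoon|evening)\b": ["Hello!", "Good day to you too."],
    r"\bthank you\b": ["You're welcome."],
}

def small_talk_reply(utterance: str) -> str:
    """Return a canned reply if the utterance is small talk, else a neutral fallback."""
    text = utterance.lower()
    for pattern, replies in CANNED_REPLIES.items():
        if re.search(pattern, text):
            return random.choice(replies)
    return "I see."  # neutral fallback keeps the conversation moving

print(small_talk_reply("What's up?"))      # e.g. "Not much."
print(small_talk_reply("Good afternoon"))  # e.g. "Hello!"
```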

2. Personal Facts (about bot) [Difficulty: **]

Personal fact-type questions are factual questions (often with only a single correct answer) that are focused solely around the bot’s personality. Example questions include: Which languages can you use? [Q5, 2018]. How old are you? [Q10, 2018]. What will you do later today? [Q12, 2018]. Who is your favourite artist? [Q14, 2018]. Do you have any legs? [Q16, 2018]. Hello, my name is Andrew. What’s your name? [Q1, 2017]. Why don’t you tell me a little more about yourself? [Q2, 2017]. Will you tell me about your dreams [Q4, 2017]
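
One simple way to keep such answers consistent is to give the bot a single persona profile it consults for every question about itself. The sketch below uses invented profile values and slot names purely for illustration.

```python
# Hypothetical persona profile: a single source of truth for facts about the bot,
# so its answers about its name, age, languages, etc. never contradict each other.
PERSONA = {
    "name": "Ada",
    "age": "27",
    "languages": "English and a little French",
    "favourite_artist": "Banksy",
}

def persona_answer(question: str) -> str:
    """Answer a question about the bot by looking up the matching persona slot."""
    q = question.lower()
    if "name" in q:
        return f"My name is {PERSONA['name']}."
    if "old" in q or "age" in q:
        return f"I'm {PERSONA['age']} years old."
    if "language" in q:
        return f"I can use {PERSONA['languages']}."
    if "artist" in q:
        return f"My favourite artist is {PERSONA['favourite_artist']}."
    return "That's a good question about me; I'm not sure how to answer it."

print(persona_answer("How old are you?"))              # "I'm 27 years old."
print(persona_answer("Which languages can you use?"))  # "I can use English and a little French."
```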

3. General Facts [Difficulty: ***]

General fact questions are similar to personal facts, in that the questions require definitive answers; however, they are not limited to a single, closed-domain topic like the bot’s personality. Instead, these questions can be about anything in general. Example questions include: Do you know how to make toast? [Q4, 2018]. Where should one look for love? [Q7, 2018]. Who said “I have a dream”? [Q9, 2018]. Do you understand Winograd Schemas? [Q18, 2018]. I am a researcher in Artificial Intelligence at Goldsmiths University, do you know what that is? [Q3, 2017]. How do you recommend I make tea? [Q11, 2017]. What do you get if you bake dough? [Q12, 2017]. Now im going to ask some Winograd schemas, do you know what they are? [Q16, 2017].
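
Open-domain questions like these are typically answered by querying some store of world knowledge (or a search API) rather than a hand-written reply list. The toy sketch below uses an invented in-memory “knowledge base” just to show the lookup-plus-fallback pattern.

```python
# Toy open-domain knowledge base (invented entries, for illustration only).
# Each entry maps a tuple of keywords to the answer returned when all keywords match.
KNOWLEDGE_BASE = {
    ("who said", "i have a dream"): "Martin Luther King Jr.",
    ("what do you get", "bake dough"): "Bread.",
    ("how", "make toast"): "Put a slice of bread in a toaster until it browns.",
}

def general_fact_answer(question: str) -> str:
    """Answer with the first knowledge-base entry whose keywords all appear in the question."""
    q = question.lower()
    for keywords, answer in KNOWLEDGE_BASE.items():
        if all(k in q for k in keywords):
            return answer
    return "I don't know, but I could look that up."  # fall back to e.g. a web search

print(general_fact_answer('Who said "I have a dream"?'))         # "Martin Luther King Jr."
print(general_fact_answer("What do you get if you bake dough?"))  # "Bread."
```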

4. Recommendations [Difficulty: ***]

Recommendation questions are similar to general fact-type questions, because they can be on any topic. However, there are often multiple possible answers to these questions, and the bot has the additional task of ranking the candidate answers according to some predefined preference. Example questions include: Can you recommend me a film? [Q17, 2018]. Can you tell me about a film you haven’t seen? [Q7, 2017]. What do you think of Trump? [Q8, 2017]. Where in the world would you like to visit? [Q13, 2017].
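
The extra step these questions add (generating candidates, then ranking them against a preference) can be sketched as follows. The film catalogue and the scoring rule are invented for illustration; a real bot might rank by user ratings, popularity, or the user’s stated tastes.

```python
# Hypothetical film catalogue with attributes used for ranking (invented data).
FILMS = [
    {"title": "The Matrix", "genre": "sci-fi", "rating": 8.7},
    {"title": "Paddington 2", "genre": "family", "rating": 8.2},
    {"title": "Blade Runner", "genre": "sci-fi", "rating": 8.1},
]

def recommend_film(preferred_genre: str) -> str:
    """Rank candidate films: prefer the requested genre, break ties by rating."""
    ranked = sorted(
        FILMS,
        key=lambda f: (f["genre"] == preferred_genre, f["rating"]),
        reverse=True,
    )
    best = ranked[0]
    return f"I'd recommend {best['title']} ({best['genre']}, rated {best['rating']})."

print(recommend_film("sci-fi"))  # "I'd recommend The Matrix (sci-fi, rated 8.7)."
```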

5. Requests [Difficulty: ***]

Requests are questions which require the bot to call upon additional (sometimes external) functions (e.g. to check the date-time, to calculate a mathematical equation, to perform an internet search, etc). Example questions include: How many letters are in the word ‘abracadabra’? [Q6, 2018]. Can you rephrase that? [Q5, 2017]. Anything else? [Q9, 2017]. What is the answer to “Add 24957 to 70765”? [Q10, 2017]. Do you have the time? [Q14, 2017]. With which type of question do you have most difficulty? [Q19, 2017]
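
Handling requests usually means recognising the intent and dispatching to a dedicated function. The sketch below (the patterns and handler names are my own, illustrative only) covers three of the request types quoted above: counting letters, arithmetic, and telling the time.

```python
import re
from datetime import datetime

# Each handler is an ordinary function the bot can call once the matching intent is detected.
def count_letters(word: str) -> str:
    return f"There are {len(word)} letters in '{word}'."

def add_numbers(a: int, b: int) -> str:
    return f"The answer is {a + b}."

def tell_time() -> str:
    return f"It's {datetime.now().strftime('%H:%M')}."

def handle_request(question: str) -> str:
    """Route a request to the right handler by matching simple patterns."""
    m = re.search(r"letters are in the word '(\w+)'", question)
    if m:
        return count_letters(m.group(1))
    m = re.search(r"add (\d+) to (\d+)", question, re.IGNORECASE)
    if m:
        return add_numbers(int(m.group(1)), int(m.group(2)))
    if "time" in question.lower():
        return tell_time()
    return "I'm not sure how to do that."

print(handle_request("How many letters are in the word 'abracadabra'?"))  # 11 letters
print(handle_request('What is the answer to "Add 24957 to 70765"?'))      # 95722
```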

6. Remembering (Information from Conversations) [Difficulty: ***]

Remembering questions ask for pieces of information that the bot learnt during its conversation with the user (e.g. human: “my name is Allen and I’m 23 years old. I’m a workaholic.” bot: “hi Allen! I’m here to support you if you need me”. human: “how old am I?”. bot: “why, you are 23, Allen”). Example questions include: Have we met before? [Q3, 2018].
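
A minimal way to support this is to extract facts from each user turn into a per-conversation memory and consult it when asked. The extraction patterns below are illustrative, not those of any competing bot.

```python
import re

class ConversationMemory:
    """Stores facts learnt from the user's own utterances during the conversation."""

    def __init__(self):
        self.facts = {}

    def observe(self, utterance: str) -> None:
        """Extract and remember simple facts such as the user's name and age."""
        name = re.search(r"my name is (\w+)", utterance, re.IGNORECASE)
        if name:
            self.facts["name"] = name.group(1)
        age = re.search(r"i'?m (\d+) years old", utterance, re.IGNORECASE)
        if age:
            self.facts["age"] = age.group(1)

    def recall(self, question: str) -> str:
        """Answer questions about the user from what has been remembered so far."""
        q = question.lower()
        if "how old am i" in q and "age" in self.facts:
            return f"You are {self.facts['age']}, {self.facts.get('name', 'friend')}."
        if "my name" in q and "name" in self.facts:
            return f"Your name is {self.facts['name']}."
        return "You haven't told me that yet."

memory = ConversationMemory()
memory.observe("My name is Allen and I'm 23 years old.")
print(memory.recall("How old am I?"))  # "You are 23, Allen."
```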

7. Disambiguation [Difficulty: ****]

Disambiguation questions contain some words which make the meaning slightly ambiguous (i.e. which entity does “it”, “they”, “this”, “that” refer to?).

Winograd Schemas are sentences specially crafted to confuse chatbots using this type of ambiguity.

Example questions include: If a chicken roosts with a fox they may be eaten. What may be eaten? [Q19, 2018]. I had to go to the toilet during the film because it was too long. What was too long? [Q20, 2018]. I was trying to open the lock with the key, but someone had filled the keyhole with chewing gum, and I couldn’t get it out. What couldn’t I get out? [Q17, 2017]. The trophy doesn’t fit into the brown suitcase because it’s too small. What is too small? [Q18, 2017].

8. Common-Sense (Inferring Information that is Not in Memory) [Difficulty: *****]

Common-sense reasoning questions require the bot to infer additional, implicit pieces of information from the explicit knowledge it already has in memory. Example questions include: When might I need to know how many times a wheel has rotated? [Q11, 2018]. What is the third angle in a triangle with internal angles of 90 degrees and 30 degrees? [Q13, 2018]. What do you hold when you shake hands? [Q15, 2018]. What is my occupation? [Q6, 2017]. What does it take to Tango? [Q15, 2017]
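
The triangle question above is a concrete case of “inferring information that is not in memory”: no bot stores the fact that the third angle is 60 degrees, but it can derive it from the general rule that a triangle’s internal angles sum to 180 degrees. A toy sketch of that single inference:

```python
def third_angle(angle_a: float, angle_b: float) -> float:
    """Infer the missing angle from the implicit rule that a triangle's angles sum to 180 degrees."""
    return 180.0 - angle_a - angle_b

# [Q13, 2018]: internal angles of 90 and 30 degrees -> the third angle is 60 degrees.
print(third_angle(90, 30))  # 60.0
```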

Sample Answers

Analysis

On the whole, all the bots from the Loebner competition performed consistently well up until the 2nd question category, beyond which they underperformed badly. For questions of difficulty level 3 and higher, chatbots tended to specialise in one area of conversational question types. Unfortunately, there was no ‘one bot to rule them all’ (none of the bots dominated all question categories).

For example, Colombina proved best at answering personal opinion questions, whereas Mitsuku came out on top for general fact-type questions. Individual bots were often very polar in their performances, and being the best in one area left them the worst in another (e.g. Uberbot was the best for requests, but the worst for recommendations. Momo was best at disambiguation, yet worst when it came to common-sense!).

Coming up in Part 2: Chatbot Techniques and Strategies

Winning (& Losing) Chatbot Strategies & A.I. Techniques Revealed!

In the next section, the inner workings of the best (and worst) bots are explained to reveal what clever NLU tips and A.I. tricks are under their hoods.
