Mitsuku wins Loebner Prize 2018!
The Loebner Prize 2018 was held at Bletchley Park, England on September 8th this year and Mitsuku won it for a 4th time to equal the record number of wins. Only 2 other people (Joseph Weintraub and Bruce Wilcox) have achieved this. In this blog post, I’ll explain more about the event, the day itself and a few personal thoughts about the future of the contest.
What is the Loebner Prize?
The Loebner Prize is an international contest run by the AISB (The Society for the Study of Artificial Intelligence and the Simulation of Behaviour) where chatbots compete against each other to be judged the most humanlike. This is an incredibly difficult challenge, as recreating human conversation is a complex task: rather than serving as an assistant or working towards a goal, human conversation covers a vast multitude of topics.
Just to make it even more challenging, no internet access is allowed. This is to avoid the possibility of cheating by having a human operator pretend to be the bot and type its responses.
How Do You Enter?
Any chatbot from around the world is welcome to compete. However, it must follow certain rules, which are available on the AISB website. The main ones are that the bot has to use a special protocol to talk to the judges and must work as a standalone piece of software on a local network.
Once the deadline for entries has passed, there is a qualifying round to determine the 4 best chatbots. This is done by asking each of the bots the same 20 questions and scoring them on the humanlike nature of their answers. The questions vary from a simple, “How old are you?” to the much more complex, “When might I need to know how many times a wheel has rotated?”, which even I would struggle with!
Being humanlike is not the same as being intelligent though. If I were to ask you what the population of Norway is and you gave me the exact answer, I wouldn’t think that was very humanlike. A more human response would be something like, “no idea”. Although this is certainly more human, it is neither intelligent nor useful. This is one of the criticisms of the contest but it’s a great platform to encourage and develop natural language understanding.
To give you an idea as to the standard of the entries and the difficulty of the challenge, someone unofficially entered Siri into the contest in 2013 and it would have finished in 14th place. However, Siri is designed to be an assistant rather than a conversational partner.
The transcripts for each bot are available here and the top 4 chatbots were:
1 — Tutor — 27 points
2 — Mitsuku — 25 points
3 — Uberbot — 22 points
4 — Colombina — 21 points
These progressed to the final on September 8th to try and win the Loebner Prize. The format of the contest is based on the Turing Test, where a judge types messages to two entities at the same time: one is a real person, called a human confederate, and the other is a computer program. It’s the judge’s task to try and determine which is which. There are four judges and four rounds of questioning.
To win the silver medal and a prize of $25,000, a program must fool at least half of the judges into believing it is a real person, so not only does it have to be humanlike but it has to be more humanlike than the actual real person that the judge is talking to! This is an almost impossible challenge but if any bot manages to do it, the contest moves into an audio/visual stage where the winner would get the gold medal and $100,000. There are no details about this stage, as it isn’t likely to ever happen. The prize that we can realistically expect to see awarded at each event is a bronze medal and a $4,000 award to the bot that is most humanlike. The three other entrants receive $1,500, $1,000 or $500 depending on the position in which they finish.
The Loebner Prize Final at Bletchley Park
I arrived at Bletchley Park just after 10am, signed in to the visitor’s centre and made my way to the contest rooms. There are always at least two rooms used in the competition. One contains the computer programs and the human confederates; the other has the judges and a public viewing area containing interactive exhibits. It’s important to keep the judges and the confederates separated, as an easy way to find the human would be to say something like, “lift your arm up” and see who did it.
The first task I always like to do is to find the computer where I need to install Mitsuku, just to make sure everything is ok. Will Rayer, the developer of Uberbot, was already on site configuring his entry. As usual, each device was a Windows 7 laptop, so I installed Mitsuku in a few minutes and made sure the software loaded before waiting for the rest of the competition network to be set up.
Once I’d installed Mitsuku’s software, I installed the software for Tutor on a second laptop. Ron Lee was unable to attend in person for the final, so I’d offered to help him as Mitsuku and Tutor are both hosted at Pandorabots and use the same software.
After checking that both programs started ok, there was no more I could do and so I went to the room containing the judges to say hello to Bertie Muller and Nir Oren from the AISB who were busy setting things up. Two laptops were available for the public to use. Will had brought his own equipment to demonstrate Uberbot and so I set up Mitsuku’s Turing Test on one of the machines, and an interface to talk to the “robot” version of Mitsuku on another.
Andrew Martin from the AISB had configured Colombina and so at around 11am, everything was set up and we were ready to test. All the programs were started and waited for some test input from Nir who had gone into the judging room. This is always a tense moment. Will the bots and judge programs communicate with each other or will the screens remain blank? Fortunately, all worked ok and a “hello” message appeared on each of the 4 consoles for the bots to respond to. The bots responded successfully. Just to make sure all was ok, a few more messages were exchanged, the round was ended, a new one started and again all the bots worked ok. All tests indicated that everything looked good for the final, so we broke for lunch. During lunch, I gave an interview to a TV crew from a Dutch news channel and to Harry Ridgewell, a journalist from Wikitribune. It’s always nice to feel like a rock star for the day!
The Main Event
It’s just before 1:30pm. All bots, judges and human confederates are in place and ready for action. Naturally, I’m in the room with the computers and so can’t hear Bertie introducing the contest, explaining what it is all about and making sure everyone knows what they are doing. He briefly visits our room to ensure the confederates are ok and at 1:30 a message saying new round appears on our consoles. The contest takes place over 4 rounds, so each judge can talk to each of the humans and programs. We’re up and running for the 2018 Loebner Prize!
The bots were started and the humans got ready to type. This part of the contest is always nerve-wracking, as although we’ve tested the bots earlier, there’s always a chance that something may go wrong now that 4 judges, 4 humans and 4 chatbots are all about to start communicating with each other at the same time. I needn’t have worried though. After a few seconds a message saying, “hi how are you today?” appeared on Mitsuku’s console. Great! I just need her to respond now. She didn’t disappoint and soon sent a message back saying “Ah. Pretty good thanks How about you?”. We were up and running. As I was ensuring Tutor was working ok, I checked his computer and saw an interaction had appeared:
Judge: Hello there!
Tutor: Hi! What’s up?
All was working fine. I could see messages coming back and forth between all the bots and judges. Phew! I can now relax. Just like a coach at Wimbledon, I had done all I could do to train my athlete but now it was up to her to perform.
The first judge’s tactic with Mitsuku was to have a general conversation. She responded ok but could have been better in the early part of the round, as she misunderstood a few things the judge said. However, I was reasonably pleased with her performance.
All the bots worked well in this round.
In round 2, again all the bots worked well. The judge I had this time round was the classic Loebner Prize judge. Not so much conversation but more interrogating and testing Mitsuku. Questions like:
What is bigger, a boat or a car?
John is tired. Who is tired?
What’s the next number in the sequence 1 2 4 8
Mitsuku handled the vast majority of them with no trouble. Her common sense database really shone in this round and I was very pleased with her performance.
I also took a closer look at Colombina in this round, as it was the first time this bot had taken part in the finals and I could see it was using the old version of the Loebner Prize Protocol with a bridge program to connect it to the new version.
Originally, the computers communicated with the judges 1 character at a time, so the judge could see each character being typed. This was done so the developers could add in human attributes like change of typing speed, deliberate back spacing, correcting typos etc. However, the main criticism of this protocol was that the chatbots never knew when the judges had finished typing, as there was no requirement for them to press enter after each message. Many of the developers assumed that if the judge hadn’t pressed a key for 3 or 4 seconds, chances are that the message was complete and the bot should process it but this didn’t always work. Messages were often split in two or the judge would stop typing to the bot and then continue later. The flaws in this protocol were especially noticeable in the 2014 contest where computer issues split almost every message and passed gibberish to the bots.
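To show why the old protocol caused trouble, here is a minimal Python sketch of the idle timeout guess described above. It is only an illustration and not the code any entrant actually used: the function names, the 3 second threshold and the typing speed are all my own assumptions.

```python
from typing import Iterable, List, Optional, Tuple

# Assumed threshold: the post only says developers waited "3 or 4 seconds".
IDLE_SECONDS = 3.0


def split_messages(keystrokes: Iterable[Tuple[float, str]],
                   idle_seconds: float = IDLE_SECONDS) -> List[str]:
    """Group (timestamp, character) pairs into messages.

    A message is treated as finished whenever the gap since the
    previous keystroke is longer than idle_seconds.
    """
    messages: List[str] = []
    current: List[str] = []
    last_time: Optional[float] = None
    for timestamp, char in keystrokes:
        paused = last_time is not None and timestamp - last_time > idle_seconds
        if paused and current:
            messages.append("".join(current))
            current = []
        current.append(char)
        last_time = timestamp
    if current:
        messages.append("".join(current))
    return messages


def type_text(text: str, start: float, per_char: float = 0.2) -> List[Tuple[float, str]]:
    """Hypothetical helper: simulate a judge typing at a steady speed."""
    return [(start + i * per_char, ch) for i, ch in enumerate(text)]


# A judge who stops to think for 5 seconds in the middle of a question.
first = type_text("What is bigger, ", start=0.0)
second = type_text("a boat or a car?", start=first[-1][0] + 5.0)
print(split_messages(first + second))
```

With these made-up timings, the single question reaches the bot as two fragments, “What is bigger, ” and “a boat or a car?”, which is exactly the split message problem. Raising the threshold reduces the splits but makes the bot slower to reply, so no value works for every judge.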
The new protocol works like a regular messaging app where you type a message and then press enter so it looks like the image below. Could you be a Loebner Prize judge? The judge’s messages are in blue, the computer’s are in green but which side do you think is a computer and which is a real person?
All the bots responded well in this round too. A surprise in this round was that Colombina’s developer, Savva Kuznetsov, had arrived from Moscow and introduced himself. It was great to see that 3 of the 4 developers were on site, especially as this is probably the last Loebner Prize in its current format.
Round 3 wasn’t quite as successful as the previous two. Mitsuku, Tutor and Colombina all worked well but Uberbot wasn’t receiving or sending any messages. Upon investigation, Will found that, due to a miscommunication, round 3 had started while Uberbot was still waiting for the new round signal. As the round was underway, the signal wasn’t sent again and Uberbot waited in vain for it. There appeared to be no way for Will to manually send the signal and so Uberbot missed out on round 3 altogether.
Mitsuku’s judge in this round seemed very critical of chatbots, deliberately trying to catch the bot out, asking many contextual questions and misspelling things on purpose:
i think it might reign
i was born in whales
Mitsuku coped well though and managed to respond ok to most of this line of questioning.
Round 4 started as normal. All bots were responding to the judges and Uberbot was connected again. However, about 10 or 15 minutes into the round, a message came up on Colombina’s screen saying that it couldn’t connect to the network. I mentioned this to one of the organisers who came over to take a look but by this time, Bertie had come into the room to say that everything had stopped. The judges couldn’t talk to any of the humans or the programs. No messages were being sent or received.
Unfortunately, communications couldn’t be re-established and so round 4 finished approximately 10 minutes earlier than planned. Looking on the bright side, all bots had the same amount of time in round 4 and luckily it was the final round, so the problem didn’t need resolving for the rest of the contest.
That was it. The judging part of the contest had finished and the organisers tallied up the scores before announcing the results of who had won.
The scoring system had changed this year. Each judge ranked each of the 4 bots in order from best to worst and gave them a score.
Best — 5 points
2nd — 3 points
3rd — 2 points
4th — 1 point
The scores from the four judges were added together and a percentage was calculated. I’m not sure how the percentage was arrived at but I assume that as the maximum score was 20, it was a percentage of that, so a bot scoring 5 points out of the possible 20 received a final mark of 25%.
There are two parts of the Loebner Prize I get nervous about. The start of round 1 where I don’t know if Mitsuku will work and the announcement of the results. This year was no different.
Bertie took to the stage to announce the outcome of the contest. Unsurprisingly, no bot had managed to fool any of the judges and so the silver medal was not awarded. This was to be expected. Fooling someone over a 25 minute conversation is incredibly difficult and so the award, as usual, went to the most humanlike program.
1 — Mitsuku — 33% — $4000 and the bronze medal
2 — Tutor — 30% — $1500
3 — Colombina — 25% — $1000
4 — Uberbot — 23% — $500
Yay! I’d won!!!! Not by a big margin but I’d actually won for a 4th time. Only 2 other people had won it four times, so I was absolutely overjoyed to join this very exclusive club.
Bertie explained that even if Uberbot had responded in round 3, it wouldn’t have received enough points to make a difference. He also explained to me that 2 judges ranked Mitsuku as being the best, one ranked it in 2nd place and the 4th ranked it as last for some reason. By my reckoning, this was 5 + 5 + 3 + 1 = 14 points out of a possible 20, so I’m still not sure how the 33% result was achieved.
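For anyone who wants to check my reckoning, here it is as a short Python sketch. The points per rank are the ones listed earlier, but the formula (total points as a percentage of the maximum 20) is only my assumption and is not confirmed by the AISB. The function name is made up for this example.

```python
from typing import List, Tuple

# Points awarded for each rank, as listed in the scoring table.
RANK_POINTS = {1: 5, 2: 3, 3: 2, 4: 1}


def percentage_of_max(ranks: List[int]) -> Tuple[int, float]:
    """Return (total points, percentage of the maximum possible).

    Assumed formula: the maximum is a 1st place from every judge,
    so 4 judges x 5 points = 20.
    """
    total = sum(RANK_POINTS[rank] for rank in ranks)
    maximum = RANK_POINTS[1] * len(ranks)
    return total, 100.0 * total / maximum


# Mitsuku's ranks from the four judges: best, best, 2nd and last.
mitsuku_total, mitsuku_percent = percentage_of_max([1, 1, 2, 4])
print(mitsuku_total, mitsuku_percent)  # 14 70.0
```

Under this assumption Mitsuku would have scored 70% rather than the announced 33%, so the official percentage must have been worked out some other way.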
I received my bronze medal, said a few words of thanks and then Bertie made some announcements on the future of the contest.
Future of the Loebner Prize
Hugh Loebner sadly passed away in December 2016 and it appears that since then, the sponsorship money for the event has stopped. Unfortunately, the contest simply cannot continue in its current form and so Bertie outlined some possible future plans.
The AISB has an interest in all aspects of artificial intelligence, not just chatbots, and one big issue with AI is the public’s perception of it. Nearly every news article is illustrated by a picture of the Terminator. Hollywood loves to portray killer robots and unfortunately, this is how the average person sees AI.
It’s important that this view is changed to show that AI is a tool to assist humans, not to destroy us, and with this in mind, the AISB may be planning a yearly event to showcase all aspects of AI, chatbots included but also poetry, art and music created by computers. It’s hoped that the Loebner Prize contest will still be part of it but probably not in its current format.
Rather than having just 4 finalists pretending to be human, there may be many more chatbots just being themselves, as long as they pass a basic quality standard. The award would go to the most engaging bot, decided by the judges and possibly by a public vote too.
It’s all still in the very early stages though and the AISB are open to ideas.
Personal Thoughts on the Future of the Loebner Prize
I personally would LOVE to see The Loebner Prize continue in one form or another. If that means selling the name to a sponsor and turning it into something like The Walmart Prize then so be it.
I agree that it needs wider publicity than the current arrangement at Bletchley Park to make it attractive to a sponsor. The history of Alan Turing and the place itself make it an ideal location but only for that reason. Most of the visitors to the Loebner Prize room just happen to have wandered in there by chance and have no clue (or interest) as to what is going on. From what I’ve seen there, a tourist group of senior citizens just isn’t going to care about humanlike conversational AI and would prefer to reminisce about the war instead. However, when tech-savvy children visit, they seem to love talking with the demo bots and watching the contest.
Holding it as part of a larger science or tech event would make perfect sense to me. A dedicated, knowledgeable audience would attract more publicity, more sponsorship and even a slot on TV. I’m sure Channel 4 in the UK would love this sort of quirky contest, similar to “Child Genius” or any other of their niche reality shows. Have you seen some of the e-sports contests? They play to packed out arenas. I’m not saying we could hope for those numbers but I would confidently guess that a few hundred people would be interested in watching the event if they knew about it.
Personally, I dislike the “pretending to be human” element of the Loebner Prize and feel it would be a better use of our time to run a “best chatbot” contest instead. Let’s face it, no chatbot is going to convince a reasonable judge that it’s a human over a mammoth 25-minute session and it seems strange to try and make humanlike chatbots when they usually give themselves away after the first 4 or 5 interactions.
Let the judges talk to each bot and vote for which they liked best. This will do away with the prohibitive LPP protocols which prevent many people from entering quality chatbots. No Bruce Wilcox this year and only 11 entries. Let’s just allow the developers to make their own interfaces for 1 judge to chat to 1 bot. No need for human confederates and sure, internet access raises the temptation of cheating but online bots should be allowed. As they are not pretending to be human, the bots could reply pretty instantly which would rule out human interaction. You could even have some kind of award for best interface as part of the event.
I also agree that we shouldn’t limit the contest to just 4 finalists, as it’s pretty much the same old faces each year. Let them all take part, assuming they pass a basic level of competence. Anyone remember pi bot from a few years back? It responded to every qualifying question with something like, “I don’t know that but let me wow you with 33 digits of pi. 3.14159265…..” This just makes the chatbot scene look bad and so I would still advise a basic screening set of questions first.
The possibilities for this event are endless. The contest in its current form has become a tad stale and it’s good to get the ball rolling to try to elevate it to the heights it deserves. I’m sure a good marketing team could have this unique event at the forefront of the tech world.
One final thought is that it would be good to keep Hugh’s memory somewhere in the event if at all possible. This is his legacy and without him, none of this would have ever taken off in the first place. Perhaps keeping his image on the awards or projecting his picture on some display screen or even an announcement at the start of the event would be a fitting tribute?
But yes, let the Loebner Prize run forever. After all, I’ve now won the contest 4 times. It would be pretty awesome to go for the world record of 5 wins!
A huge thank you to all the organisers of this year’s contest: Bertie, Andrew, Nir and all the staff from the AISB, all the judges, confederates and chatbot developers and, of course, Bletchley Park. A big thank you too to everyone who attended the event. Hopefully, see you all (plus many more) at whatever form the contest takes in 2019.