The Loebner Prize for Bots

Buster Benson
Written on BART
Aug 31, 2015


I just learned about this from Rob Ellis, who’s building SuperScript. The Loebner Prize is a yearly competition to write a chatbot that passes the Turing Test. Or at least does better than all the other participants.

It’s been going on since 1991.

The gold medal will be awarded to a bot that is indistinguishable from a human.

Silver medals will go to any bots that can fool more than 50% of the human judges into thinking they're human (close to the classic Turing Test).

And every year, a bronze medal will go to the bot that does the best compared to other entrants, regardless of overall quality.

Amazingly, transcripts are posted for many of the years, so if you’re interested you can watch bots slowly getting smarter over the last 24 years.

This year there were 16 participants. None of them won the gold or silver medal, so it was mostly about the bronze. Each bot was asked the same 20 questions and scored on 3 criteria (relevance, correctness, and plausibility) from 0–2, where 0 means the criterion was not met, 1 means it was partially met, and 2 means it was fully met.
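To make the scoring concrete, here's a minimal sketch of how a bronze-round tally could work under that rubric. The function and the sample marks are hypothetical, not the contest's actual scoring code.

```python
# Hypothetical tally for a Loebner-style bronze round: 20 questions,
# each marked 0 (not met), 1 (partially met), or 2 (fully met)
# on three criteria: relevance, correctness, and plausibility.

def total_score(marks):
    """marks: a list of (relevance, correctness, plausibility) tuples, one per question."""
    return sum(sum(per_question) for per_question in marks)

perfect_bot = [(2, 2, 2)] * 20   # 6 points per question
print(total_score(perfect_bot))  # 120, the maximum possible score
```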

Here are this year’s questions:

With 20 questions and 6 possible points per question, the total possible score is 120. This year the winner, Mitsuku, got 100 points. They haven’t posted transcripts for this year, but here’s a sample from last year where Rose got 107 points:

I’m surprised that there are only 16 entrants in this competition.

With the recent resurgence of bot popularity amongst the tech community, I wonder if there will be more attention paid to this prize. Next year might have a much larger showing.

In the meantime, I’ve been thinking about the competition format.

Those 20 questions are a good place for new bot makers to start. They’re functional tests.

What if there was a simple way to run these tests across a broad range of bot submissions year-round?

How much can the relevance, correctness, and plausibility scores be determined by scripts rather than humans?

Does it take a Turing Test-passing bot to recognize a Turing Test-passing bot, or can it be a bit dumber, since it has full control of the questions and is only judging responses? There's no need for a judge bot to keep the conversation flowing naturally.
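As a rough illustration of what script-based judging could look like, here's a minimal sketch that feeds fixed questions to a bot and checks each answer against a crude keyword rule. The bot interface, the questions, and the pass criteria are all invented for this example, and a regex match is a much blunter instrument than a human judge scoring relevance, correctness, and plausibility.

```python
import re

# Hypothetical functional tests: each question is paired with a regex the
# answer should match. This only approximates "correctness"; relevance and
# plausibility would need fuzzier checks, or a human in the loop.
TESTS = [
    ("What is two plus two?", re.compile(r"\b(4|four)\b", re.IGNORECASE)),
    ("What color is the sky on a clear day?", re.compile(r"\bblue\b", re.IGNORECASE)),
]

def run_tests(bot):
    """bot: any callable that takes a question string and returns an answer string."""
    score = 0
    for question, pattern in TESTS:
        answer = bot(question)
        passed = bool(pattern.search(answer))
        score += 2 if passed else 0  # all-or-nothing; a human judge could award a 1
        print(f"Q: {question}\nA: {answer}\n-> {'pass' if passed else 'fail'}")
    return score

# A trivial stand-in bot that only knows one answer.
print(run_tests(lambda q: "It's four." if "plus" in q else "I'm not sure."))
```

A harness like this wouldn't settle the relevance and plausibility questions, but it could run year-round against any bot that exposes a question-in, answer-out interface.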

Of course, if the questions are public, any bot could pass all the tests by just memorizing the questions and answers (classic junior high cheating strategy).

But what if this hypothetical bot-testing service also let people submit new questions and passing criteria, which could then be put to all the bots and scored?

Bots could be ranked on their ability to correctly answer new questions.

New questions could be ranked on their ability to properly trip up insufficiently clever bots.
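Both rankings could fall out of the same pass/fail grid. Here's a minimal sketch of that idea; the bot names, question IDs, and results are all made up for illustration.

```python
# Hypothetical pass/fail results: results[bot][question] is True if the bot's
# answer met the submitted passing criteria.
results = {
    "bot_a": {"q1": True,  "q2": True,  "q3": False},
    "bot_b": {"q1": True,  "q2": False, "q3": False},
    "bot_c": {"q1": False, "q2": False, "q3": False},
}

# Rank bots by how many new questions they answer correctly.
bot_rank = sorted(results, key=lambda b: sum(results[b].values()), reverse=True)

# Rank questions by how many bots they trip up (more failures = a sharper question).
questions = {q for per_bot in results.values() for q in per_bot}
question_rank = sorted(
    questions,
    key=lambda q: sum(not results[b][q] for b in results),
    reverse=True,
)

print(bot_rank)       # ['bot_a', 'bot_b', 'bot_c']
print(question_rank)  # ['q3', 'q2', 'q1']
```

A real version would probably want to weight questions by difficulty or use something Elo-like, but even raw counts give both leaderboards from one table.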

It seems to me that a build, test, improve feedback loop that could function on the order of days rather than years, especially if done in a public format where people could learn from each other's strategies and challenges, would result in even faster progress than we're currently seeing.

Who’s interested?

— written on BART

