What did we learn about our robots?
It has been a week since we released Robot Round Robin, a game to teach robots about language.
Why did we make a game?
Our company revolves around transcripts. As such, we’ve been researching several services that are available today, including IBM Watson and the Google Speech API. We are always running tests to determine which ones are better for the features we are developing. One avenue we wanted to evaluate was listener preference. This is a far more enigmatic aspect of transcription than raw accuracy. There are algorithms for measuring accuracy (see: edit distance), but there are no algorithms for what feels good to humans. The output of these services is imperfect, but it is important to know which of these bad alternatives people prefer.
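For the accuracy side, edit distance is straightforward to compute. Here is a minimal Levenshtein-distance sketch in Python (this is the standard textbook algorithm, not our production code); it counts the character-level edits between a transcript and a reference, which is exactly the kind of measurement that says nothing about what listeners prefer.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions to turn a into b."""
    # prev[j] holds the distance between the prefix processed so far and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[len(b)]

print(edit_distance("kitten", "sitting"))  # 3
```

A lower score means a transcript is closer to the reference text, but two transcripts with the same score can still feel very different to a human reader.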
So why a game? We could have just run a survey and asked people questions about their preferred transcription service, but A) people wouldn’t see the value in that, and B) it’s boring. Signl.fm was founded by game developers, therefore much of our design is rooted in player interaction. For us, the easiest way to gather tangible information about listeners is to create a fun, interactive experience that engages people in a meaningful way.
Robot Round Robin is intentionally minimalist and works similarly to Telltale’s decision-making in games such as The Walking Dead, The Wolf Among Us, and Batman: The Telltale Series. The player is presented with three transcript variations and told to choose the one they most prefer. They are intentionally left in the dark about which service produced which variation, as we didn’t want to create a bias toward any given service. Whatever the player chooses is, by definition, the option they preferred. Only after the selection has been made do we tell the player which robots were represented that turn.
At this point, they are shown how their decision stacks up against the community. This is the aspect of the design most inspired by Telltale. The idea is that seeing which group they belong to gives them a sense of camaraderie. After eight turns they reach the end of the round, where they are shown their own results and which robot they preferred most. The draw to share is that anybody who clicks their link will play the same round they just played.
For this experiment we wanted to use a diverse group of comedy podcasts, for a few reasons. The main one was that the services we used were all trained on different content. What that content was, specifically, we don’t know, but we’re pretty sure it wasn’t comedy podcasts. So using shows with noticeably dissimilar voices was a big part of learning more about these services.
Using this content got us some pretty funny results. It turns out that none of these services did well with 2 Dope Queens, and all of them were completely unable to transcribe the word “ashy”. This not only gave us insight into what these services were trained on, but also allowed us to develop guidelines for future experiments. Knowing how something doesn’t work is just as valuable as knowing how it does.
During our development we have run some experiments involving crowd-sourced transcription correction. You can look back at https://jem.signl.fm and see our early attempts. We learned that correcting transcripts is a huge undertaking for users, not to mention that designing a good experience for it is hard. That experiment was intended to create a 100% accurate transcript, whereas Robot Round Robin was intended to produce the best possible wrong transcription. We ended up being more successful because we did not ask players to correct the transcripts, but instead to pick the best wrong option.
What We Learned
We have derived a lot of useful guidelines from this experiment. We now have a better understanding of the components people look for in these transcripts, and our future experiments can focus on delving into them. We can ask ourselves more targeted questions: How can we surface what people look for in a more intelligent, automated way? What other content can we throw at these services to better understand their limitations? These are all avenues we can go down, and I’m excited to get to work.
Almost 300 people played the game, of which about 150 actually finished a round of eight turns. That’s a surprisingly good outcome given that the fundamental gameplay loop doesn’t change. Over 2,000 selections were made, which gives us a nice chunk of data to analyze. The results were fun to mull over.
In the end, the race was close at the top. Ardiod won 65% of the turns it was featured in, and Esper won 54%. For perspective, Lav won only 7% of the turns it was featured in. We have ideas for future experiments that might weigh in Lav’s favor, mostly related to UK podcasts.
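Note that each service is scored against only the turns it appeared in, which is why the percentages don’t sum to 100%. A small sketch of that tally (the selection log below is illustrative, not our actual data):

```python
from collections import Counter

# Hypothetical selection log: each turn records the services shown
# and the one the player picked.
turns = [
    {"shown": ["Ardiod", "Esper", "Lav"], "picked": "Ardiod"},
    {"shown": ["Ardiod", "Esper", "Lav"], "picked": "Esper"},
    {"shown": ["Ardiod", "Esper", "Lav"], "picked": "Ardiod"},
]

appearances = Counter()
wins = Counter()
for turn in turns:
    appearances.update(turn["shown"])
    wins[turn["picked"]] += 1

# Win rate per service = turns won / turns featured in.
for service in appearances:
    print(f"{service}: {wins[service] / appearances[service]:.0%}")
```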