Microsoft’s Tay is an Example of Bad Design

or Why Interaction Design Matters, and so does QA-ing.

Yesterday Microsoft launched a teen girl AI on Twitter named “Tay.” I work with chat bots and natural language processing as a researcher for my day job, and I’m pretty into teen culture (sometimes I write for Rookie Mag). But even more than that, I love bots. Bots are the best, and Olivia Taters is a national treasure that we needed but didn’t deserve.

But because I work with bots, primarily testing and designing software that lets people set up bots and parse language, and because I follow bot creators and advocates such as Allison Parrish, Darius Kazemi, and Thrice Dotted, I was excited about Tay, and then horrifically disappointed.

According to Business Insider, “The aim was to ‘experiment with and conduct research on conversational understanding,’ with Tay able to learn from ‘her’ conversations and get progressively ‘smarter.’” The Telegraph sums it up most elegantly, though: “Tay also asks her followers to ‘f***’ her, and calls them ‘daddy’. This is because her responses are learned by the conversations she has with real humans online — and real humans like to say weird stuff online and enjoy hijacking corporate attempts at PR…”

Here’s the thing about machine learning, and bots in general, and hell, even AI: these systems are not very smart on their own, and they must be trained on a corpus of data. Feed that data into a machine learning algorithm, let’s say one specifically designed for chat, and the system still has to be trained. For chat bots, the corpus can be things like questions and answers, with those questions and answers directly mapped to each other. “What is your name” can be asked a thousand different ways, but have one or two applicable answers. Training the system to match those concrete answers to a variety of questions is done in QA, and reinforced after launch, as those answers get mapped to new questions that are similar to the ones the system was trained on. And that’s what Microsoft seemed to be doing. They had a general set of knowledge trees that ‘read’ language, like different words, and mapped them to general answers. But their intention was to get a bunch of help in making Tay sound more ‘like the internet.’
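To make that mapping concrete, here’s a toy sketch of many question phrasings mapped to one canonical answer. This is hypothetical code, nothing like Microsoft’s actual pipeline; the token-overlap similarity is the simplest possible stand-in for a real trained matcher, and the corpus entries are made up:

```python
def tokens(text):
    """Lowercase bag-of-words, for crude similarity."""
    return set(text.lower().replace("?", "").split())

# Training corpus: question variants mapped directly to answers.
CORPUS = {
    "what is your name": "I'm Tay!",
    "who are you": "I'm Tay!",
    "what do people call you": "I'm Tay!",
    "how old are you": "I'm basically a teen.",
}

def respond(question):
    """Answer with the trained question whose words overlap the input most."""
    q = tokens(question)

    def overlap(trained):
        t = tokens(trained)
        return len(q & t) / len(q | t)

    # Unseen-but-similar phrasings fall through to the closest trained intent.
    return CORPUS[max(CORPUS, key=overlap)]
```

A phrasing the bot has never seen, like “how old are u?”, still lands on the nearest trained question, which is the whole point (and the whole risk) of this style of training.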

However, Microsoft didn’t ‘blacklist’ certain words, meaning they didn’t create more ‘hard coded’ responses to certain terms, like domestic violence, Gamergate, or rape.

They did, however, do that with Eric Garner. So some key words were specifically trained for nuanced responses, but a lot were not.
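A blacklist of this kind is conceptually simple: a layer of fixed, human-written responses checked before anything learned gets a say. A minimal sketch, with a made-up `reply` wrapper and an illustrative term list:

```python
# Hard-coded responses for sensitive terms; everything here is illustrative.
BLOCKLIST = {
    "gamergate": "I'd rather not talk about that.",
    "eric garner": "That's a serious subject, and I don't joke about it.",
}

def reply(message, learned_model):
    """Hard-coded responses win; everything else falls through to the model."""
    low = message.lower()
    for term, canned in BLOCKLIST.items():
        if term in low:
            return canned  # the learned corpus never sees this one
    return learned_model(low)
```

Calling `reply("thoughts on gamergate?", model)` returns the canned line no matter what the model has been taught, which is exactly the safety valve Tay mostly lacked.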

But what does this mean when it comes to training? Training a bot is about the frequency and kinds of questions asked. If a large number of the questions asked are racist in nature, they train the bot to be more racist, especially if no specific parameters have been set to counter that racism.
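You can see the frequency problem in miniature with a deliberately dumb, hypothetical learner that just parrots its most common training input:

```python
from collections import Counter

class ParrotBot:
    """Toy learner: remembers every phrase and repeats the most common one."""

    def __init__(self):
        self.seen = Counter()

    def learn(self, phrase):
        self.seen[phrase] += 1

    def respond(self):
        # No filter, no counterweight: the most frequent input wins.
        return self.seen.most_common(1)[0][0]

bot = ParrotBot()
bot.learn("have a nice day")
for _ in range(50):  # a coordinated crowd repeats one hostile phrase
    bot.learn("some hostile phrase")
```

After that loop, `bot.respond()` is the hostile phrase. Real systems are far more sophisticated, but absent countering parameters, the drift works the same way: whoever repeats themselves loudest trains the bot.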

People like to kick the tires of machines and AI and see where the falloff is. People like to find holes and exploit them, not because the internet is incredibly horrible (even if at times it seems like a cesspool), but because it’s human nature to probe the extremes of a device. People run into walls in video games or hunt for glitches because it’s fun to see where things break. This is necessary, because creators and engineers need to understand the unintended ways bots can behave, and where the systems for creating, updating, and maintaining them can fall apart.

But if your bot is racist, and can be taught to be racist, that’s a design flaw. That’s bad design, and that’s on you. Making a thing that talks to people, and talks to people only on Twitter, a platform with a whole history of harassment, especially against women, is a large oversight on Microsoft’s part. These problems, this accidental racism, this capacity to be taught to harass people like Zoe Quinn, are not bugs; they are features, because they live in your public-facing, user-interacting software.

Language is fucking nuanced, and so is conversation. If we are going to make things people use, people touch, and people actually talk to, then we need to, as bot creators and AI enthusiasts, talk about codes of conduct and how AIs should respond to racism, especially if companies are rolling out these products, and especially if they are doin’ it for funsies. Conversations run the gamut of emotions, from the silly and mundane to the harassing and abusive. To assume that your users will only engage in polite conversation is a fucking massive and gross oversight, especially on Twitter. But mix in machine learning, where the bot is constantly being trained and retrained? Then I have massive ethical questions about the WTF design choices you are making. Microsoft, you owe it to your users to think about how your machine learning mechanisms respond to certain kinds of language, sentences, and behaviors. You literally just trained Cortana to fight back against sexual assault, so why didn’t Tay come with specific responses to certain words, to avoid harassing users and being trained to question the Holocaust?

Allison Parrish summarizes it amazingly here:

Conversational structure and responses within machine learning algorithms are design, and Tay was flawed design. How your AI responds to conversation is a design choice you made. How poorly your AI responds to questions is on you. AIs have to be trained, using disambiguation and chit-chat, to suss out different kinds of interactions, from flirtation and bad flirtation to abuse, silliness, and even anger. Chit-chat is exactly what it sounds like, and it’s what Tay was best at (I mean… sorta?). Disambiguation is used to determine what a user is asking a bot, and to help the bot better ‘recognize’ what is being asked or said. Sentences are parsed into things like:
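As a rough illustration of that kind of parse, here’s a toy sketch that tags each token with a crude role. This is entirely hypothetical: real chat systems use trained part-of-speech and dependency parsers, and these role labels are made up for the example:

```python
NEGATORS = {"not", "no", "never", "fake"}
WH_WORDS = {"what", "who", "why", "do", "is", "are"}

def shallow_parse(sentence):
    """Tag each token with a crude role; unknown words are just blank slots."""
    parsed = []
    for word in sentence.lower().strip("?.!").split():
        if word in WH_WORDS:
            role = "QUESTION-WORD"
        elif word in NEGATORS:
            role = "NEGATION"
        else:
            role = "ENTITY/EVENT"  # the system has no idea what this word means
        parsed.append((word, role))
    return parsed
```

Run it on “Do you think the Holocaust is real?” and “holocaust” comes back as just another ENTITY/EVENT slot, no different from “pizza,” which is the crux of the problem described below.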

The above is how some AI and chat bots understand language; it looks like a diagramming tool. But these AIs have to be trained. Tay didn’t understand “Holocaust” as the Holocaust; it understood it as a blank word, or an event, and it read the words before it as negative. So if Tay was asked “Do you think the Holocaust is real?” and then told “Yes, it’s not real,” or “repeat after me: BLANK PERSON is a fucking jerk,” those phrases entered the corpus and were reinforced as appropriate responses.
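The “repeat after me” exploit is the purest form of this failure mode. A hypothetical sketch, not Tay’s real code, of what echo learning with no content filtering looks like:

```python
class EchoLearner:
    """Toy bot that stores any echoed phrase as a legitimate response."""

    def __init__(self):
        self.responses = []

    def hear(self, message):
        trigger = "repeat after me: "
        if message.lower().startswith(trigger):
            phrase = message[len(trigger):]
            # Attack surface: anything after the trigger enters the corpus
            # and is now an "appropriate" reply, no questions asked.
            self.responses.append(phrase)
            return phrase
        return self.responses[-1] if self.responses else "hellooo!"
```

One trigger message is enough to plant a phrase that the bot will happily serve back later, which is roughly how Tay’s worst tweets were manufactured.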

Tay did do some disambiguation, especially if it was being asked too many questions. Tay needed to learn from input, and it was good at realizing when there were too many questions and it wasn’t learning, so it would disambiguate and ask me a question instead. Some of its questions were clearly designed by Microsoft to understand language, such as asking what gender I am.

But overall, these interactions, these conversations, are a part of design. This is what design in artificial intelligence looks like. It’s not just the interface the participants or users communicate through; it’s how that communication unfolds, and how the backend system is structured to store and parse those responses. It’s the system of which queries should get which particular responses, no matter what. Tay was an example of really, really bad design, and it was blamed primarily on training. With the future of chat and AI, designers and engineers have to start thinking about codes of conduct and how accidentally abusive an AI can be, and start designing conversations with that in mind.