A candid conversation about machine learning with Madison May

Welcome to Candid Machine Learning Conversations. I’m Ryan Louie, and I interview machine learning practitioners about the thoughtfulness they bring to their profession.

My full conversation with Madison May, a machine learning practitioner and CTO of text/image machine learning startup indico

Madison May has been the CTO of indico, a Boston-based machine learning startup, since fall 2014. Before Madison worked at indico (yes, that’s a lowercase “i”), he was an undergraduate computer science student at Olin College. For readers not familiar with the Olin College community, indico is one of the college’s most successful startup stories. The way Madison described how he got involved with indico is a candid view into the type of person he is.

Please listen as Madison describes how he got to where he is today — as CTO of machine learning startup indico.

I was fortunate enough to intern at indico during the summer of 2015, so I am familiar with indico as a company and Madison’s role in it as CTO. It didn’t take us too long to start having meaningful discussions about the thoughtfulness that is required when deploying machine learning models for the world to use.

Now, I’ll discuss some of the highlights that came up in our conversation.

1. How do you measure success for an ML model? How do you know when it’s “good enough” to deploy?

When I asked Madison how indico measures success for their models, he admitted that this question is actually a challenging one.

“Ideally as a data scientist you’re looking to get things into a format where you have an error metric or a loss that you can optimize for. But sometimes, you are not given a training set, you’re not given true labels, you’re not given sufficient quantities of data to get accurate metrics about how well you’re doing.” — Madison May

“What? There’s no true validation data to report numbers on?”, one might exclaim. Madison said this situation comes up often enough that it can’t be ignored. The most common option is to “suck it up and label more data”; the team has built internal tools to make labeling image and text data more time-efficient.

When labeling more data isn’t feasible, another option is to bring in a team member who isn’t familiar with the project, show them the output of two models, and have them give a qualitative assessment of which one works better. This comparative assessment is another way to measure progress.

I gasped a little when he said this.

I have held a certain conception that there are a few standard ways of assessing model performance: error metrics (or accuracy scores, a rephrasing of error) and losses. That conception comes from the pervasive comparisons of models in data science competitions hosted by Kaggle, and from the tables in ML research papers showing how a new method sets the state of the art on some benchmark dataset relative to a list of alternatives. Using qualitative comparisons to judge which model is performing better is not part of this common repertoire.

I took this screenshot from an active data science competition hosted on Kaggle.com. The competition is an image classification task for different species of fish — and every competitor is assessed by the multi-class logarithmic loss of their submissions on the competition dataset. See, now there’s a quantitative measure by which to rank competitors.
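For the curious, a multi-class logarithmic loss like the one on that leaderboard is straightforward to compute by hand. Here is a minimal sketch (not Kaggle’s scoring code; the function name and the toy fish-species arrays are just illustrative):

```python
import numpy as np

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Multi-class logarithmic loss: the average negative log-probability
    assigned to the correct class. Lower is better.

    y_true: integer class labels, shape (n_samples,)
    y_prob: predicted class probabilities, shape (n_samples, n_classes)
    """
    y_prob = np.clip(y_prob, eps, 1 - eps)               # avoid log(0)
    y_prob = y_prob / y_prob.sum(axis=1, keepdims=True)  # renormalize rows
    correct_class_probs = y_prob[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(correct_class_probs))

# Toy example: 3 samples, 8 fish species (classes)
y_true = np.array([0, 3, 7])
y_prob = np.full((3, 8), 1.0 / 8)           # a "random guess" model
print(multiclass_log_loss(y_true, y_prob))  # ~log(8) ≈ 2.079
```

A uniform random guesser over 8 classes lands at log(8) ≈ 2.08, so even modest improvements over that baseline show up clearly on a leaderboard.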

However, as I started thinking about it, I realized there are instances where a qualitative assessment is either necessary or a more natural way of presenting results. As an example, I found a demo on indico’s front page of a clothing similarity application, which relies on indico’s image features product.

In this clothing similarity example, a machine learning model finds which photos in a clothing dataset look most similar to a query image, highlighted by the purple box. (Screenshot taken from the “Clothing Similarity” example on https://indico.io)

A qualitative demo tells me a lot about the power behind the image features product. Similar colors, cuts, and patterns are all dimensions the product seems to capture. Seeing a diversity of examples helps me as a user judge what might be going on behind the scenes when it finds similarities.
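The post doesn’t describe how the demo is wired up internally, but a similarity feature like this is typically built by embedding every catalog image as a feature vector and returning the nearest neighbors of the query’s vector. A minimal sketch, with made-up array shapes and cosine similarity standing in for whatever distance the real product uses:

```python
import numpy as np

def most_similar(query_vec, catalog_vecs, k=5):
    """Return indices of the k catalog items whose feature vectors are
    closest to the query, by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = catalog_vecs / np.linalg.norm(catalog_vecs, axis=1, keepdims=True)
    sims = c @ q                   # cosine similarity to each catalog item
    return np.argsort(-sims)[:k]   # highest similarity first

# Hypothetical usage: 1000 catalog images embedded as 512-dim feature vectors
catalog_vecs = np.random.randn(1000, 512)
query_vec = np.random.randn(512)
print(most_similar(query_vec, catalog_vecs, k=5))
```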

I could imagine evaluating clothing similarity in a more rigorous, quantitative way by connecting the feature to a business metric: if releasing a clothing similarity feature helps customers discover similarly styled clothes they wouldn’t have found otherwise, and purchases go up, that is good quantitative evidence that the product should be deployed. I think this is what a lot of sites do when launching a new feature: they A/B test it, measure the effect of including it, and assess whether it increases user engagement or purchases.
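As a concrete illustration of that kind of A/B test, here is a sketch of a two-proportion z-test on purchase conversion rates for a control group (without the similarity feature) and a treatment group (with it). All of the visitor and purchase counts below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def ab_test_purchase_rate(purchases_a, visitors_a, purchases_b, visitors_b):
    """Two-proportion z-test: did the group shown the similarity feature (B)
    purchase at a higher rate than the control group (A)?"""
    p_a = purchases_a / visitors_a
    p_b = purchases_b / visitors_b
    p_pool = (purchases_a + purchases_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)   # one-sided: is B better than A?
    return p_a, p_b, z, p_value

# Hypothetical numbers: 10,000 visitors per arm
print(ab_test_purchase_rate(purchases_a=480, visitors_a=10_000,
                            purchases_b=545, visitors_b=10_000))
```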

But I have to bring up again that getting to the stage of A/B testing a new algorithmic product requires some initial validation — that the algorithm actually does the basic task of finding visual similarities in clothes. I would argue that the visual, qualitative demo is necessary to convey whether the visual matching is working as expected. We could create hand-labeled tags for each item (e.g. floral print, dark colors, dresses that fall above the knee), and count how many in-class and out-of-class examples are retrieved in our similarity search. And while that would make the assessment explicit, I don’t know how much more value it would add than having several fashion-savvy co-workers judge whether the model’s matches are reasonable enough to deploy.
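If we did go the hand-labeled-tag route, the measurement itself would be simple: something like precision at k over the retrieved items. A sketch, with hypothetical tags and item lists:

```python
def precision_at_k(retrieved_tags, query_tags, k=10):
    """Fraction of the top-k retrieved items that share at least one
    hand-labeled tag (e.g. "floral print", "dark colors") with the query."""
    top_k = retrieved_tags[:k]
    hits = sum(1 for tags in top_k if tags & query_tags)
    return hits / k

# Hypothetical labels for a query dress and its top-3 retrieved matches
query_tags = {"floral print", "above the knee"}
retrieved = [{"floral print"}, {"dark colors"}, {"above the knee", "dress"}]
print(precision_at_k(retrieved, query_tags, k=3))  # 2/3 ≈ 0.67
```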

Either way, Madison shared one surprising anecdote that helped me understand some of the limits of qualitative model assessment. It was a recent story of “machine learning gone wrong” that cautions data scientists to check whether a task is human-solvable before they even approach it with a machine learning algorithm.

Listen to Madison tell his humorous story of machine learning gone wrong. Thankfully, the mistake was caught and the model was never deployed!

In the task, Madison and the team were attempting to predict personality type from a piece of text. With only a few labeled examples mapping text to the 16 Myers-Briggs personality types, the problem was inherently difficult to evaluate via normal error metrics (e.g. does the trained model predict the same Myers-Briggs type for other writing authored by the same individual?).

When Madison turned to “bringing in a team member, showing them the output of two models, and having them provide a qualitative assessment of which one works better”, he discovered that qualitative human feedback on this task was not a reliable measure of success.

When shown a comparison between two models,

  1. A sophisticated natural language processing model trained on the dataset mapping paragraphs to Myers-Briggs personality types
  2. A model that randomly outputted 1 of the 16 Myers-Briggs personality types

the team member gave the random model the “thumbs up” that the product was ready to ship to customers! If random output is indistinguishable from a trained model’s output to a human reviewer, then human evaluation definitely cannot be relied on to decide whether a model is successful.
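The moral I took from this story: before trusting human judgment (or any other evaluation), check that the trained model at least beats chance, which for 16 personality types is 1/16 ≈ 6.25%. Below is a minimal sanity-check sketch, assuming you can scrape together even a small held-out labeled set; all of the names and labels are hypothetical:

```python
import random

MBTI_TYPES = ["INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
              "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP"]

def accuracy(predictions, labels):
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def chance_baseline(labels, trials=1000, seed=0):
    """Accuracy a uniformly random 1-of-16 guesser gets, averaged over
    many trials (should hover around 1/16 ≈ 0.0625)."""
    rng = random.Random(seed)
    scores = [accuracy([rng.choice(MBTI_TYPES) for _ in labels], labels)
              for _ in range(trials)]
    return sum(scores) / trials

# Hypothetical held-out labels and model predictions
held_out_labels = ["INTP", "ENFP", "ISTJ", "INTP", "ENTJ"]
model_predictions = ["INTP", "INFP", "ISTJ", "INTJ", "ENTJ"]
print("model accuracy: ", accuracy(model_predictions, held_out_labels))
print("chance baseline:", chance_baseline(held_out_labels))
```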

2. What do you imagine could go wrong with your machine learning research and development?

“indico took a stance a long while ago [when we were discussing what we would be okay with building now and in the future]. We threw out the idea of predicting demographic information like age, gender, and ethnic background based on text that was written. This type of information is regularly used to target advertising, so we had folks in the social media space interested in having this information even if a direct label was not provided. We… decided that this was just not a line we wanted to cross — we didn’t want to produce an algorithm that could lead to instances of discrimination.” — Madison May

If all I wanted was to hear how Madison displays thoughtfulness as a practicing data scientist, this quote alone would satisfy that need. It’s relieving, inspiring, and fantastic on so many levels. But I’m going to unpack why this story supports my belief that the indico team has great morals.

Algorithms can be discriminatory and have a disparate impact on others.

I think it’s important to start by saying that the engineers at indico acknowledge that algorithms can be discriminatory, and see that as a problem.

That firm stance against algorithmic bias is not held by everyone. The arguments refuting it often fall into predictable modes, as highlighted by the algorithmic fairness blog post “‘Racist Algorithms’ and learned helplessness”:

  • Algorithms cannot be biased or racist because they are fundamentally neutral; they don’t know what racism is.
  • Algorithms are only trained on historically discriminatory data; they learned it from the data in our inherently racist world.

A lot of effort goes into protecting the helpless position of the algorithm, whose intent is not to be racist. Not enough focus is placed on the fact that the “disparate impact” (an actual legal term, close to the “differential impact” Allen Downey uses to describe the effects of algorithmic bias) resulting from applying the algorithm is the issue, no matter where the intent of the discrimination comes from. The final layer of the problem is the “amplifying” effect that occurs when an algorithm is deployed in systems with broad reach across so many types of people.

“Even if ‘all an algorithm is doing’ is reflecting and transmitting biases inherent in society, it’s also amplifying and perpetuating them on a much larger scale than your friendly neighborhood racist. And that’s the bigger issue.“— Suresh Venkat, author on algorithmic fairness

Systems using demographics as a predictive feature sit on a slippery moral slope.

Using demographic info as a predictive feature is a potential problem for data scientists creating automated decision-making systems. My interview with Allen Downey delved into this topic.

“When you are making predictions, very often there are factors that you should not include for ethical reasons even though they might make the predictions better. If you’re trying to make good predictions — about people — you will almost always find that if you include race, religion, age, gender… they almost always make your predictions better… your accuracy scores go up. But there are a lot of contexts, like criminal justice for example, where we make the deliberate decision to exclude relevant information because it would be unethical to use that information… We have decided as a society to not include them.” — Allen Downey
I cut to the part of the conversation with Allen Downey where we discuss factors that, for ethical reasons, should not be included in a model. When making predictions about people, these factors include race, religion, gender, age — all sorts of demographic information.

I appreciate how indico — despite being one step removed from the process of using demographic info to make predictions — was well aware that their predictive API could end up as part of a larger pipeline that does use demographic information to make predictions.

Designers, Engineers, Data Scientists — you are the technical key holders who can decide the use and misuse of technologies

With a tool that could infer demographic information from a user’s social media profile or text, advertisers could use those inferred attributes to drive their models. Not building the tool keeps that demographic information out of the algorithms advertisers use.

I saw a similarity in philosophy between indico deciding not to build the demographic text-prediction API and the messaging service Whatsapp deciding to encrypt all messages, with the consequence that any government interested in tapping into conversations has no possible way of doing so.

I draw a parallel between indico’s refusal to build an inherently discriminatory algorithm based on demographic information and how the messaging service Whatsapp uses its development powers to express the view that all users deserve privacy.
“We are the technical key holders, and we want to explicitly say, ‘We don’t want to give those features out. We don’t want to enable the misuse of them. [It’s our] decision not to build something that could be used in that way.’” — Ryan Louie

I’d invite the critical reader to think of more examples, even outside the data science realm, where it is up to the ethics of engineers to decide what to build and what not to build. There is a wealth of wisdom on deciding what is ethical to bring into the world, for designers and engineers beyond data science. Below is one link that walks through how to decide what is worth working on from an ethical standpoint.