ML Simplified: Process, Dilemmas and Stereotypes

Vikas Yadav
Design Studies in Practice
7 min read · Dec 2, 2017

But wait, are you a designer?

I am a designer myself, so trust me when I say that it's too easy to call machine learning a black box and walk away. AI is here to stay, and so are its nuts and bolts, a.k.a. machine learning. Machine learning will push the envelope of brain-computer interfaces, natural language understanding, linguistics, computer vision, object recognition and translation in the years to come. The following article deals with natural language processing and explains one of the learning models adopted by the developer community. I have tried to break it down to its nuts and bolts, but most importantly this article tries to make a complex process transparent. It's imperative for the design community to understand the role of data in this domain.

Data can be good or dirty, and so will be the machine learning model trained on it.

What is this post about?

This post is a simplified version of Rob Speer's article on stereotypes in machine learning. We will talk about word vectors/embeddings, how they are extracted, the factors that affect their quality, and the implications of different approaches.

What is this post not about?

The title of this post might be misleading: it covers the very specific topic of word embeddings and everything around them. Word embeddings are one way to accomplish Natural Language Processing (NLP), and they use concepts from machine learning. This post by no means covers machine learning models other than word embeddings/vectors.

So, what are word vectors?

As explained in Rob's article,

Word embeddings or word vectors are a way for computers to understand what words mean in text written by people. The goal is to represent words as lists of numbers, where small changes to the numbers represent small changes to the meaning of the word. This is a technique that helps in building AI algorithms for natural language understanding — using word vectors, the algorithm can compare words by what they mean, not just by how they’re spelled

image from https://www.slideshare.net/hadyelsahar/word-embedings-why-the-hype-55769273

Word vectors as a concept are deeply mathematical. We don't have to dive deep into the math, but remember that, like most machine learning algorithms, word vectors use a probabilistic model to understand word meanings. Such an understanding comes with pros and cons. The pro is that the model can build on its previous understanding to further increase its intelligence; perhaps the biggest con is making sure it understood the correct meaning in the first place. We will talk about this a lot in the rest of this article. For now, let's take a step back and understand what we are looking at.
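To make the "lists of numbers" idea concrete, here is a minimal sketch with invented 4-dimensional vectors (real embeddings have hundreds of dimensions learned from a corpus). It compares words by the angle between their vectors, which is how embeddings measure meaning rather than spelling:

```python
import numpy as np

# Hypothetical 4-dimensional vectors, invented for illustration only.
vectors = {
    "cat":    np.array([0.90, 0.80, 0.10, 0.00]),
    "kitten": np.array([0.85, 0.75, 0.20, 0.10]),
    "car":    np.array([0.00, 0.10, 0.90, 0.80]),
}

def cosine_similarity(a, b):
    """Similarity of direction: close to 1.0 = similar meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cat" is far closer in meaning to "kitten" than to "car", even though
# "cat" and "car" differ by only one letter of spelling.
print(cosine_similarity(vectors["cat"], vectors["kitten"]))  # near 1.0
print(cosine_similarity(vectors["cat"], vectors["car"]))     # much smaller
```

The point is only the mechanism: small changes in the numbers mean small changes in meaning, and nearness in this space stands in for nearness in meaning.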

Basic three components of NLP query

Any NLP query is answered with an appropriate result from a processed library. This library consists of all the words and their meanings as perceived by the machine. But the interesting stuff happens under the hood, in the machine learning 'black box' of NLP. Let's break down this black box to understand some components.

Let's discuss these three components in more detail:

DATA

Data is the whole internet: an algorithm's accuracy depends on what data you decide to train it on. In the past, people have trained their NLP models on different types of data. For example, "word2vec Google News" trained its word vectors on Google News, "GloVe 1.2 840B", a system from Stanford, was trained on the whole web, and "fastText enWP" is Facebook's set of word vectors trained on English Wikipedia.

Looking closely at these models, word2vec, when queried with the operation "king - man + woman", displayed the result "queen": the same offset that separates "man" from "woman" also separates "king" from "queen".

This is remarkable: word vectors capture gender-based relationships and use them to draw analogies. Pushed further, the same model displayed the following results:

man : woman :: shopkeeper : housewife

man : woman :: carpentry : sewing

man : woman :: pharmaceuticals : cosmetics
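The analogy arithmetic above is literal vector arithmetic. A toy sketch, with hand-made 2-dimensional vectors chosen so the offsets line up (real models learn such directions from data):

```python
import numpy as np

# Toy vectors; dimension 0 is "royalty", dimension 1 is "maleness".
# Purely illustrative -- real embeddings learn these axes implicitly.
vectors = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def analogy(a, b, c):
    """Solve 'a - b + c = ?' by nearest neighbour over the vocabulary."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - target))

print(analogy("king", "man", "woman"))  # -> queen
```

The biased analogies above come out of the exact same mechanism: if the training text places "housewife" where "woman" offsets "shopkeeper", the arithmetic faithfully reproduces that association.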

Yes, the model holds a gender bias. Trained on the Google News feed, word2vec picked up gender-biased connotations as its foundational understanding of professions and activities. Rob describes it in the original article:

I had tried building an algorithm for sentiment analysis based on word embeddings — evaluating how much people like certain things based on what they say about them. When I applied it to restaurant reviews, I found it was ranking Mexican restaurants lower. The reason was not reflected in the star ratings or actual text of the reviews. It’s not that people don’t like Mexican food. The reason was that the system had learned the word “Mexican” from reading the Web.

If a restaurant were described as doing something illegal, that would be terrible for its reputation. The issue in this context, however, is that people use "Mexican" disproportionately alongside the word "illegal", particularly to associate "Mexican immigrants" with "illegal immigrants". This is what happens when you train your word vectors on web content. Let's talk about another large influence: porn. Yes, a lot of the web is porn. This means an NLP model trained on the open web will end up learning very inappropriate associations for many kinds of words, like "girlfriend", "teen" and "asian". Many of these terms acquire negative connotations, which causes gender, ethnic, age and sexual biases.
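A toy sketch of how this plays out in sentiment analysis. The lexicon scores below are invented for illustration; think of them as sentiment an algorithm learned from web co-occurrence, where "mexican" has absorbed negativity from nearby words like "illegal":

```python
# Invented word-level sentiment scores, standing in for values learned
# from web text. The negative score on "mexican" is the baked-in bias.
learned_sentiment = {
    "great": 2.0, "food": 0.5, "restaurant": 0.0,
    "italian": 0.2, "mexican": -0.8,
}

def review_score(text):
    """Average the learned scores of the known words in a review."""
    words = [w for w in text.lower().split() if w in learned_sentiment]
    return sum(learned_sentiment[w] for w in words) / len(words)

# Identical reviews, different scores -- purely because of the cuisine word.
print(review_score("great italian food"))  # higher
print(review_score("great mexican food"))  # lower
```

Nothing in the pipeline is "racist" on purpose; the cuisine word simply carries a contaminated score into every review it appears in, which is exactly the ranking gap Rob observed.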

Problems of data, summarized

MACHINE LEARNING

This is a niche topic, but in simplified form, machine learning in this context can be understood as feeding data to the algorithm so it can start building an understanding. What is interesting is when we can put in checks and balances for de-biasing. For the purpose of this discussion, I will use the comparison Rob draws in the original article between Google's and Microsoft's ways of training their algorithms.

Google's approach: feed in all the data and let the algorithm learn biases as they exist in the real world. Google believes data is what it is: a reflection of an unfair world. To make it usable, Google's goal is to identify the point at which this understanding affects someone (say, classifying whether someone qualifies for a loan) and de-bias the actual decision, providing equal opportunity irrespective of what the algorithm learned from the data.

Microsoft's approach: if the data feels biased, modify it until it is least biased, or most accommodating, so that the algorithm learns from clean data. This process is not as easy as it sounds; there is real technical nuance to it, but we will not dive into the technical rationale here.
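One published form of this data-side de-biasing is "hard de-biasing" (Bolukbasi et al., 2016), which removes from each word vector its component along a learned gender direction. A minimal sketch with toy 2-dimensional vectors (the direction and values are invented for illustration):

```python
import numpy as np

# Toy setup: dimension 0 is an assumed gender direction (e.g. he - she),
# dimension 1 is "profession-ness". Values are invented for illustration.
gender_direction = np.array([1.0, 0.0])
doctor = np.array([0.4, 0.9])  # leans "male" before de-biasing

def debias(v, direction):
    """Remove the component of v that lies along the bias direction."""
    direction = direction / np.linalg.norm(direction)
    return v - np.dot(v, direction) * direction

doctor_debiased = debias(doctor, gender_direction)
print(doctor_debiased)  # gender component removed, profession kept
```

After the projection, "doctor" is equidistant from "he" and "she" along that axis while keeping the rest of its meaning, which is the spirit of cleaning the data before the algorithm learns from it.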

As a recommended practice, an ideal algorithm should draw on both approaches: be trained to be least biased in the first place, and remain aware of the biases that may have made their way into the database somehow.

IN CONCLUSION

So now that we know its nuts and bolts, what can we do? Designing for AI grants tremendous power, but it also comes with great responsibility. The next generation of smart products will rely heavily on machine learning, and as designers we can take some steps. What follows is a synthesis of my experience in the industry and of how the knowledge in this article can be used by a collaborative team of designers, product managers and developers.

  • Context is king: in design, we spend a lot of time understanding context. This context can be used to dig for specific, relevant data on which the machine learning algorithm can train. Such specificity allows us to purge data before training, as in Microsoft's approach. Designers can communicate this context to developers for a unified understanding.
  • Turing test + personality/tone testing: in the context of NLP, can we build early, iterative concept prototypes that test the personality of an algorithm, perhaps through a conversational bot that conveys the connotational tonality of its understanding of the world? Such an intervention can help surface the hidden biases of a given algorithm.
  • Selective omission of biases: designers spend a lot of time understanding the system in which they are trying to propose a solution, and with that comes a pronounced understanding of perhaps all possible stakeholders in the landscape. Can an approach be adopted where the collected data is de-biased against the possible biases that would directly affect these stakeholders?

TL;DR

  • Word embeddings/vectors are a means of understanding the structure of human language through numerical representation.
  • NLP can be broadly understood through three components: Data (the raw data that trains the ML algorithm), Algorithm (the program that holds the logic to understand the data) and Library (the resulting understanding of the data and its relationships).
  • Depending on its source, data can hold biases around gender, age, ethnicity, race, sexuality, etc. One should take care when choosing a source of data.
  • Both Google's approach (de-biased results) and Microsoft's approach (de-biased data) are reliable, but a combination of the two yields the least biased results.
