Data is the new oil, but it needs refineries!

Ashutosh Sharan
Sep 2, 2018 · 3 min read

I wonder whether, if a truly ‘human-like’ artificial intelligence ever arrives, its political views will lean ‘Left’, it will know too little about Hispanics, Indians or Africans, and it will be quite capable of mansplaining.

This is based on the fact that most of the people trying to build it (mostly at top tech companies in Silicon Valley) are ‘white’ ‘men’ with ‘liberal’ political opinions. Hence the data they feed into the machines, to find the underlying equation that will let them see, read, understand, speak and ‘think’, is biased as well. We already have quite a few examples of AI use cases not working right because of population biases in the data and in the people creating it:

  1. Video conference software built by Google was not able to identify people of color
  2. Most AI assistants are ‘women’: Siri, Alexa etc.
  3. Tech companies have been accused of liberal bias in the past
  4. The Twitter bot that Microsoft built learnt from hyperactive trolls and became racist

In the cases above, biased data fed into the learning algorithm resulted in the AI application inheriting those biases. The Microsoft Twitter ‘bot’ was overexposed to data from abusive Twitterati and learnt their language. And the video conference software at Google was fed predominantly videos of white people.

So whoever said that data is today’s oil forgot to mention that it is only ‘crude’ oil. We need refineries to make it useful.

Today, the key sensors that collect large amounts of data about human behavior are social media, internet search, digital transactions, ecommerce, census data etc. Any AI built on these data will have inherent biases: it will be more aligned with people who are more active on social media (and with their typical age and gender). Moreover, some people are more aggressive and less thoughtful about what they say on social media, and hence their point of view will be more prominent. Similarly, internet search data will be biased towards people with better internet connectivity. And digital transaction data will poorly represent the unbanked and underbanked population.

Bias is not the only problem with data. Another source of poor-quality data is outdated data. For example, a bank building an ML-based fraud solution on 12-month-old data is unlikely to find it useful, considering the pace at which fraudsters change their approach and the pace at which consumers change their behavior in this digital age. Yet another source of poor data is a poorly implemented data collection process or sensor.
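
To make the point about outdated data concrete, here is a minimal sketch of a freshness check that could run before training. Everything in it is an assumption made for illustration: the record layout (a dict with a `date` key), the function names `fresh_records` and `stale_fraction`, and the 90-day default window. It is not a description of how any particular bank does this.

```python
from datetime import date, timedelta


def fresh_records(records, today, max_age_days=90):
    """Keep only records whose 'date' falls within the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [r for r in records if r["date"] >= cutoff]


def stale_fraction(records, today, max_age_days=90):
    """Share of records that are older than the allowed window."""
    if not records:
        return 0.0
    fresh = fresh_records(records, today, max_age_days)
    return 1 - len(fresh) / len(records)


# Hypothetical transactions: one recent, one about a year old.
transactions = [
    {"id": 1, "date": date(2018, 8, 20)},
    {"id": 2, "date": date(2017, 9, 1)},
]
today = date(2018, 9, 2)
print(stale_fraction(transactions, today))
```

A team could treat a high stale fraction as a signal to collect new data before retraining, rather than quietly fitting a model to last year's behavior.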

Therefore, to build effective AI, the companies building it need to:

  1. invest in building sensors to collect data about people who are underrepresented: for example, enabling internet access in rural areas, investing in a thorough census, making data collection easy for even the poorest hospitals, and making existing products (e.g. search engines and social media) friendly for all population segments and available in all languages. They also need to collect fresh data on a regular basis to build relevant AI
  2. use a representative proportion of every segment of the population to build any AI: e.g. had Microsoft enabled a way to underrepresent the tweets from abusive racist trolls, the chatbot might not have become one
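
One simple way to approximate ‘representative proportions’ is to reweight the collected sample so that each group counts in line with its share of the real population. The sketch below assumes you already know (or can estimate) those population shares; the function name `representation_weights`, the group labels and the numbers are all hypothetical, and this is only one of several possible approaches.

```python
from collections import Counter


def representation_weights(sample_groups, population_share):
    """Per-group weights that rebalance a sample towards known population shares.

    sample_groups: one group label per collected record.
    population_share: dict mapping group label to its share of the population.
    A group with no records gets None: no weight can fix missing data.
    """
    counts = Counter(sample_groups)
    total = len(sample_groups)
    weights = {}
    for group, target in population_share.items():
        observed = counts.get(group, 0) / total if total else 0.0
        weights[group] = target / observed if observed > 0 else None
    return weights


# Hypothetical sample: 80 urban records, 20 rural, for a 50/50 population.
sample = ["urban"] * 80 + ["rural"] * 20
weights = representation_weights(sample, {"urban": 0.5, "rural": 0.5})
print(weights)
```

Overrepresented groups get a weight below 1 and underrepresented groups a weight above 1. A `None` weight is the more important signal: it means the group is absent from the data, and only better data collection can solve that.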

All businesses that are building future AI, or that aspire to use AI significantly, need to invest massively in data sensors and refining. Over and above this, they also need to build or leverage algorithms that consider and manage ‘known’ biases in data. More importantly, to ‘know’ about these biases, the engineers and data scientists working on these initiatives need to be more aware, and hence must come from diverse backgrounds, races and ethnicities. They should also be trained to understand unconscious bias and how to keep it from affecting their work.

Essentially these businesses need an entire set of systems and people focused on making their data useful and effective. They need a data refinery!
