Overcoming the Limitations of Learning from Data

and all that Jazz

Sherol Chen
Published in The Eliza Effect
6 min read · Aug 1, 2017


It seems like AI/ML is the big thing in Silicon Valley. If you check out how the VC money is moving, what the TechCrunch articles say, and what’s being posted on Hacker News, it’s like ML is the greatest thing since social networks. If Machine Learning is so great, why don’t we use it everywhere? Why not get rich building prediction models for the stock market? According to this NVIDIA blog post, Deep Learning, which is responsible for the current ML boom, wasn’t so great until recently:

Over the past few years AI has exploded, and especially since 2015. Much of that has to do with the wide availability of GPUs that make parallel processing ever faster, cheaper, and more powerful. It also has to do with the simultaneous one-two punch of practically infinite storage and a flood of data of every stripe (that whole Big Data movement) — images, text, transactions, mapping data, you name it.

In a previous article, I talked about the differences between ML, deep learning, and AI. It’s important to understand that each one behaves differently and has its own gotchas. For example, because Deep Learning builds intelligence without explicit rules, it can also produce unintended consequences that are harder to detect. A famous example is Microsoft’s Twitter chatbot, Tay, which made headlines such as “Microsoft silences its new A.I. bot Tay, after Twitter users teach it racism.” Here’s another headline-making example from Google: “Google Apologizes For Tagging Photos Of Black People As ‘Gorillas.’” So, while the algorithms are sound, we’re still bound by garbage-in / garbage-out. Let’s take a look at some other edge cases.

Machine Learning Falling Short — Example: ML for Games

Because the field is still fairly young, there are many areas where Machine Learning is too unreliable. For example, in the computer game industry, AI is widely used for content generation, mixed-initiative authoring, NPCs, and computer opponents. The cognitive dissonance caused by the randomness in Machine Learning is too costly for expressive and immersive experiences like games. As a result, a lot of game AI is heavily scripted, with limited outcomes and adaptations (seen as a “glorified roller coaster ride”). Argument Champion is an example of a purposely awkward AI-driven game. Jeff Orkin’s Restaurant Game, from the MIT Media Lab, is a more academic study of Machine Learning in games.

Subtle Biases of Data — Example: Data Driven Curation

Subtler cases can be costly as well. Netflix uses big data to inform its next series, creating content based on what people already seem to like. That may sound benign, but what about Facebook news feeds filtering out content we don’t agree with? By creating ideological echo chambers, we miss out on new ideas, innovations, and criticisms. What if your company gets a million resumes, and you decide to train a model to screen them based on the people you’ve successfully hired in the past? You may greatly constrict the diversity of future hires.
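To make the resume scenario concrete, here is a minimal sketch of how a screener trained only on past hiring outcomes carries the old pattern forward. Everything in it is an assumption for illustration: the records are made up, the `school` feature and the 0.5 threshold are hypothetical, and a real screening model would be far more complex.

```python
from collections import Counter

# Hypothetical historical records: (group, was_hired). The data is made up.
history = [
    ("school_a", True), ("school_a", True), ("school_a", True), ("school_a", False),
    ("school_b", True), ("school_b", False), ("school_b", False), ("school_b", False),
]

def fit_hire_rates(records):
    """Learn the historical hire rate for each group."""
    seen = Counter(group for group, _ in records)
    hired = Counter(group for group, was_hired in records if was_hired)
    return {group: hired[group] / seen[group] for group in seen}

def shortlist(applicants, rates, threshold=0.5):
    """Keep applicants whose group's historical hire rate meets the threshold."""
    return [name for name, group in applicants if rates.get(group, 0.0) >= threshold]

rates = fit_hire_rates(history)

# New applicants are judged only by who was hired before, not by their own merits.
applicants = [("Ana", "school_a"), ("Ben", "school_b"), ("Cy", "school_b"), ("Dee", "school_a")]
picked = shortlist(applicants, rates)
print(rates)
print(picked)
```

In this toy setup, nobody from the historically under-hired group makes the shortlist, so the next round of training data is even more skewed than the last.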

The Grey Areas of What’s Fair or True — Example: Search Results

Reaching a consensus on how something should or should not be is tricky. For example, do you like jazz? Could you explain what is or isn’t jazz? When I was working on YouTube’s Music Mixes in 2014, I noticed that the “jazz” search results were not to my personal liking. As someone who’s been in various jazz bands, I even found them slightly irksome. With a quick search today, the top 5 video results are all mood music. The “top tracks” on the right are a collection of pop/fusion jazz.

(May 30th, 2017)

So what are the results that I’d like? If you noticed, the top result is not a video; it’s a Mix. The videos in that mix are more appropriately curated and a great sample of jazz music.

So which of the three types of results is “jazz”: the Results, the Top Tracks, or the Mix? The real question is whether my opinion of jazz results is more meaningful than what most of the world thinks. It’s not an easy question. If we go by search results, jazz could be whatever everyone else thinks it is. I, on the other hand, would rather jazz be strictly more informed. I’d love for the rest of the world to meet my standards for jazz, but is that fair? There doesn’t seem to be a straightforward answer, but as data drives our technology, we have to be mindful of such things. It all comes back to garbage in, garbage out. When you’ve got an AI being trained on human-provided data, you’re going to get skewed results. One perfect example is a project on crash test dummies.

Discrimination in Data and Design — Example: Crash Test Dummies

Algorithms in themselves long predate computers. An algorithm is simply a sequence of instructions. Law codes can be seen as algorithms. The rules of games can be understood as algorithms, and nothing could be more human than making up games. Armies are perhaps the most completely algorithmic forms of social organisation. Yet too much contemporary discussion is framed as if the algorithmic workings of computer networks are something entirely new. It’s true that they can follow instructions at superhuman speed, with superhuman fidelity and over unimaginable quantities of data. But these instructions don’t come from nowhere. Although neural networks might be said to write their own programs, they do so towards goals set by humans, using data collected for human purposes. If the data is skewed, even by accident, the computers will amplify injustice. (The Guardian, 2016)

A 2011 study showed that seat-belted female drivers had a 47% higher chance of serious injury than belted male drivers in comparable collisions. This was due to the lack of female crash-test dummies. For the 2011 Sienna, the federal government replaced average-sized male dummies with average-sized female dummies to test the discrepancy. They found that, “when the 2011 Sienna was slammed into a barrier at 35 mph, the female dummy in the front passenger seat registered a 20 to 40 percent risk of being killed or seriously injured, according to the test data. The average for that class of vehicle is 15 percent.” The difference in statistics was even greater for minor injuries. (Washington Post, 2012)

Discrimination may be as subtle as inappropriate auto-completes for “why are women…,” or as life-altering as the under-diagnosis of diabetes in Asian Americans. Anything driven by data, whether medical research or Artificial Intelligence, should be thoughtfully and ethically driven by accurate representation. Just as a doctor ought not to knowingly misdiagnose a patient based on race or gender, we should make our best attempts to build technologies that suit all people.


Augmenting Fairness instead of Automating Bias

Now that data-driven practices are becoming widespread, here are 10 questions to keep in mind when designing with data. These staple questions are presented by analogy to the sections of a generic research paper, since data has been driving science for centuries.

  1. Abstract: What sort of prediction model are we building?
  2. Problem Statement: What problem are we trying to solve?
  3. Previous Work: How are we currently solving the problem?
  4. Methods: How will we collect data and build the model?
  5. Data Analysis: How do we debug and monitor good and fair data?
  6. Results: How will we measure and vet the success of the model?
  7. Discussion: Who are we helping and who benefits most?
  8. Limitations: How could this work be exploited or fall short?
  9. Conclusions: How will this technology be marketed and used?
  10. Future Work: How will this model be improved and maintained in the future?
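
For question 5 (Data Analysis), one simple starting point is to compare how often each group appears in the data against how often we would expect it to appear. The sketch below is only an illustration under stated assumptions: the sample, the 50/50 reference shares, and the 10-point tolerance are all made up, and the function names are hypothetical.

```python
from collections import Counter

def representation_gaps(samples, reference_shares):
    """Return (share in the data - expected share) for each reference group."""
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(samples)
    total = len(samples)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }

def underrepresented(samples, reference_shares, tolerance=0.10):
    """List groups whose share falls short of the reference by more than tolerance."""
    gaps = representation_gaps(samples, reference_shares)
    return sorted(group for group, gap in gaps.items() if gap < -tolerance)

# Made-up test population, loosely echoing the crash-test-dummy example.
samples = ["male"] * 9 + ["female"] * 1
reference = {"male": 0.5, "female": 0.5}  # assumed expected shares

gaps = representation_gaps(samples, reference)
flagged = underrepresented(samples, reference)
print(gaps)
print(flagged)
```

A check like this does not prove a dataset is fair, but it can surface obvious gaps before a model is ever trained on it.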

Finally, here is a very cool interactive demo and visualization based on a paper on Supervised Learning practices for Equal Opportunities: https://research.google.com/bigpicture/attacking-discrimination-in-ml/

As machine learning is increasingly used to make important decisions across core social domains, the work of ensuring that these decisions aren’t discriminatory becomes crucial.
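
As I understand the idea behind the demo linked above, “equal opportunity” asks that people who truly deserve a positive outcome get it at the same rate in every group, meaning the true positive rate should match across groups. The sketch below is a rough illustration of that check, not the paper’s method: the scores, labels, and thresholds are made up, and the function names are hypothetical.

```python
def true_positive_rate(scores, labels, threshold):
    """Fraction of truly positive cases (label 1) scored at or above the threshold."""
    positives = [score for score, label in zip(scores, labels) if label == 1]
    if not positives:
        raise ValueError("need at least one positive label")
    return sum(score >= threshold for score in positives) / len(positives)

def opportunity_gap(group_a, group_b, threshold_a, threshold_b):
    """Absolute difference in true positive rate between two groups."""
    tpr_a = true_positive_rate(group_a["scores"], group_a["labels"], threshold_a)
    tpr_b = true_positive_rate(group_b["scores"], group_b["labels"], threshold_b)
    return abs(tpr_a - tpr_b)

# Made-up model scores; label 1 means the person truly qualified.
group_a = {"scores": [0.9, 0.8, 0.6, 0.4, 0.3], "labels": [1, 1, 1, 0, 0]}
group_b = {"scores": [0.7, 0.5, 0.45, 0.35, 0.2], "labels": [1, 1, 1, 0, 0]}

# One shared threshold: qualified people in group_b are approved less often.
shared_gap = opportunity_gap(group_a, group_b, 0.55, 0.55)

# Per-group thresholds chosen (by hand, for this toy data) to equalize the rates.
adjusted_gap = opportunity_gap(group_a, group_b, 0.55, 0.45)

print(shared_gap)
print(adjusted_gap)
```

With one shared cutoff, the model’s lower scores for one group turn into fewer approvals for its qualified members; moving the cutoff per group is one way to close that gap, at the cost of other trade-offs that the demo explores.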