On Labeled data

Andrej Karpathy, director of AI at Tesla, recently wrote a blog post “Software 2.0” where he said:

It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data (…) than to explicitly write the program. A large portion of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze and visualize data that feeds neural networks.

I wholeheartedly agree and want to elaborate on how “deep learning” has brought this shift about, and what it means for data teams inside businesses and for the businesses themselves.

“it is significantly easier to collect the data than to explicitly write the program”

Source: https://xkcd.com/1425/

In 2013, Randall Munroe of XKCD fame drew this comic, which states that checking whether a photo has a bird in it is borderline impossible. At the time that was fair: we had to write an explicit program, and deep learning wasn’t commonly used yet.

Fast forward to 2016 and the image above: we have systems that can recognize birds, count them and form proper sentences describing the contents of an image. And you can download them from GitHub.

So what exactly was impossible in 2013 that became trivial only three years later? As Karpathy put it, we no longer have to “explicitly write the program”. Instead, we show a neural network labeled data and it “writes the program” for us.
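To make the contrast concrete, here is a toy sketch, nowhere near a bird detector. Rather than hand-writing a rule, we hand a perceptron a few labeled points and let it find the weights. The data, the hidden rule (label 1 when x + y > 1) and the function names are all made up for illustration.

```python
def train_perceptron(examples, epochs=1000, lr=0.1):
    """Learn weights (w1, w2, b) from labeled 2-D points with the perceptron rule."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in examples:
            predicted = 1 if w1 * x + w2 * y + b > 0 else 0
            error = label - predicted          # -1, 0 or +1
            if error:
                w1 += lr * error * x
                w2 += lr * error * y
                b += lr * error
                mistakes += 1
        if mistakes == 0:                      # every training example is classified correctly
            break
    return w1, w2, b


def predict(weights, point):
    """Apply the learned weights to a single (x, y) point."""
    w1, w2, b = weights
    x, y = point
    return 1 if w1 * x + w2 * y + b > 0 else 0


# Hypothetical labeled data: the label is 1 when x + y > 1, but that rule
# never appears in the training code; the perceptron only sees the examples.
labeled = [((0.0, 0.0), 0), ((0.2, 0.3), 0), ((0.5, 0.1), 0), ((0.1, 0.6), 0),
           ((1.0, 1.0), 1), ((0.9, 0.6), 1), ((0.4, 1.2), 1), ((1.5, 0.2), 1)]

weights = train_perceptron(labeled)
```

Nobody typed the decision rule in; changing the behavior means changing the labeled examples, not the code.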

Deep Learning goes commodity

It’s not just the algorithms that have advanced. The availability of those algorithms to engineers and businesses has exploded, and that has changed the way we work with data. I think Yann LeCun, Facebook’s head of AI, captured this succinctly when he rechristened deep learning as “differentiable programming”.

Implicitly, both LeCun and Karpathy are echoing the same sentiment: deep learning / differentiable programming is becoming an activity for software engineers as much as it is an activity for data scientists.

Why?

The “hard part” of deep learning was getting the math right and implementing it:

There is an intellectual overhead in implementing efficient matrix multiplication, in recalling the definition of cross entropy and implementing it in a way that is numerically stable. It takes some sophistication to recall that a convolution can be computed with a Fourier transform or written as multiplication by a Toeplitz matrix, and to have both the mathematical and engineering pedigree to know when to prefer which. And when you want to scale, you also need an intimate understanding of what a GPU is and how to best use it.
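For a taste of that overhead, here is a minimal pure-Python sketch of two of the details mentioned above: a cross entropy that stays numerically stable through the log-sum-exp trick, and one small convolution computed three ways (from the definition, as a Toeplitz matrix product, and through a Fourier transform). The function names and the toy signal are my own, and the naive DFT merely stands in for a real FFT.

```python
import cmath
import math


def cross_entropy(logits, target):
    """Cross entropy of a softmax over `logits` against the class index `target`.

    Subtracting the largest logit before exponentiating (the log-sum-exp trick)
    keeps math.exp from overflowing on large inputs.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def conv_direct(signal, kernel):
    """Full 1-D convolution straight from the definition."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def conv_toeplitz(signal, kernel):
    """The same convolution as a Toeplitz matrix times the signal."""
    rows = len(signal) + len(kernel) - 1
    # Entry (r, c) depends only on r - c, which is what makes the matrix Toeplitz.
    matrix = [[kernel[r - c] if 0 <= r - c < len(kernel) else 0.0
               for c in range(len(signal))] for r in range(rows)]
    return [sum(m * s for m, s in zip(row, signal)) for row in matrix]


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; a real FFT does this in O(n log n)."""
    n = len(values)
    sign = 1 if inverse else -1
    result = [sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t, v in enumerate(values)) for k in range(n)]
    return [r / n for r in result] if inverse else result


def conv_fourier(signal, kernel):
    """The same convolution via the convolution theorem: multiply in frequency space."""
    n = len(signal) + len(kernel) - 1
    padded_signal = list(signal) + [0.0] * (n - len(signal))
    padded_kernel = list(kernel) + [0.0] * (n - len(kernel))
    product = [a * b for a, b in zip(dft(padded_signal), dft(padded_kernel))]
    return [value.real for value in dft(product, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, -1.0, 0.25]
```

All three convolution routines agree up to floating point error. Which one is fastest depends on the sizes involved, and that is exactly the kind of judgment call described above.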

Today, to do a deep learning project, we don’t need to know any of that.

The open source frameworks out there, like TensorFlow and PyTorch, have abstracted a lot of the math and engineering stuff away, to the point where we can execute a very successful deep learning project without knowing what a Toeplitz matrix is.

To drive that point home, take a look at the following equations which define an LSTM, one of the building blocks of neural networks for sequences.

Equations for an LSTM
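The original figure is not reproduced here. For reference, one common way to write the LSTM update is shown below. The symbols are a conventional choice and not necessarily the notation of the original figure: W, U and b are weights and biases, σ is the logistic sigmoid, ⊙ is the element-wise product, x_t is the input, h_t the hidden state and c_t the cell state.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```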

And compare that with the following code, which is what we write today when we want to use an LSTM.

Code for an LSTM
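The original code figure is not reproduced here. As a rough stand-in, assuming PyTorch (the original may well have used a different framework), using an LSTM looks something like this. The sizes and the function name run_lstm are made up for illustration.

```python
def run_lstm(batch=4, seq_len=10, n_features=32, hidden_size=64):
    """Push a batch of random sequences through an LSTM layer and return the output shape."""
    # Imported inside the function so this sketch can be loaded without PyTorch installed.
    import torch
    from torch import nn

    lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size, batch_first=True)
    x = torch.randn(batch, seq_len, n_features)   # (batch, time, features)
    output, (h_n, c_n) = lstm(x)                  # every gate update happens inside this call
    return tuple(output.shape)
```

The gates, the cell state and the gradients are all handled by the framework. What is left for us to decide is the shape of the data going in.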

Deep Learning for everyone

That simplicity and abstraction mean that virtually anyone with programming skills can take an online course and do something useful in a very short time. This isn’t just a change for the individual, it’s also a huge shift for businesses. The cost of trying deep learning has gone down dramatically.

To go back to that XKCD comic, 3 years ago we needed a research team and five years to find a bird in a photo. Today we need an engineer and two weeks.

For a business, dabbling in machine learning is risky. There is an investment of time and resources which might not yield anything. In the “old days”, the upfront costs of taking such risks were large. You needed a team of scarce and expensive professionals, and you had to give them the time to flesh out their infrastructure before they could start being productive. Those costs add up fast.

For many businesses, the old way made doing an ML project prohibitively risky. Today, because of the amazing tooling that is available to anyone, it is riskier to not try.

But deep learning is not a panacea. I’d say it has shifted the burden of labor, as well as the costs, from the algorithms to the data.

“programmers of tomorrow … collect, clean, manipulate, label, analyze and visualize data that feeds neural networks”

Data Shock

In parts of academia and in online courses, training data falls from the sky like manna. To fill our neural network’s belly with sustenance, we just go to the Internet and download an already curated, separated, annotated and validated dataset that is clean and free of errors.

Unfortunately, the real world doesn’t work like that.

Every “new” data scientist I’ve met, myself included, has had some deep shock when they encountered the state of data in real-world settings. Oftentimes their reaction reminds me of this great verse from the Bible, where the children of Israel revolt against Moses as he leads them out of enslavement:

“The Israelites (data scientists) said to them (management), ‘If only we had died by the LORD’s hand in Egypt (academia)! There we sat around pots of meat (labeled data) and ate all the food we wanted, but you have brought us out into this desert to starve this entire assembly to death.’” (Exodus 16:3)

(No offense to academia intended)

While it’s true that a data scientist’s appetite for data is insatiable, stakeholders of data projects are often unwittingly misleading about the condition, quantity and availability of the data they believe they have.

Kaggle’s recent survey on the state of ML captures the frustration of the data practitioner well, with 50% of respondents complaining about dirty data and 30% saying data is hard to access.

What barriers do data scientists face at work? From Kaggle’s 2017 ML Survey (https://www.kaggle.com/surveys/2017)

Training Data

It’s been popular to say that “Data is the new oil”, as if just having data makes us the owners of a valuable commodity that can easily be transformed into cash. But the original quote, from Clive Humby (in 2006, no less), was a little more nuanced:

Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity;

In the era of Differentiable Programming or Deep Learning or Software 2.0 or whatever you’d like to call it, Christopher Ré’s modification is even more fitting:

Source (http://dawn.cs.stanford.edu/2017/05/08/snorkel/)

In this era, just having “data” is not even having a raw material. “Data” is now a precursor for the raw material, Training Data. To complete the analogy, training data is “refined” by a model to produce a commodity of value.

The Key to Success

This has implications for data practitioners and for their stakeholders. For the stakeholder, it’s crucial to understand that the “data” part of the organisation encompasses multiple functions: sourcing raw data, converting the raw data to Training Data and finally producing models.

For us practitioners, the key to a successful delivery is managing our data. That means executing an annotation process, having controls to measure its throughput and quality, and being able to consistently report progress against a static, high-quality test set.
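What such controls look like will differ from team to team. As one possible sketch, here are three small measurements in plain Python: Cohen’s kappa as a stand-in for annotation quality (how well two annotators agree beyond chance), items per hour for throughput, and accuracy against a frozen test set for progress. The metric choices, function names and sample labels are my own assumptions, not a prescribed process.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators on the same items, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:        # both annotators used a single, identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def throughput_per_hour(items_labeled, hours_worked):
    """Labeled items per annotator-hour."""
    return items_labeled / hours_worked


def accuracy(predictions, gold_labels):
    """Share of predictions that match the frozen test set's labels."""
    return sum(p == g for p, g in zip(predictions, gold_labels)) / len(gold_labels)


# Hypothetical annotations of the same six photos by two annotators.
annotator_1 = ["bird", "bird", "no_bird", "bird", "no_bird", "no_bird"]
annotator_2 = ["bird", "no_bird", "no_bird", "bird", "no_bird", "no_bird"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Tracked week over week, a falling kappa points to unclear labeling guidelines, and accuracy is only comparable between model versions if the test set stays the same.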

When Karpathy said “programmers of tomorrow … collect, clean, manipulate, label, analyze and visualize data” he wasn’t kidding. The algorithms for leveraging data are a git pull away; the differentiator for success is in execution on the data side. Expecting otherwise seems puzzling. To quote Charles Babbage:

On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
Charles Babbage, Passages from the Life of a Philosopher