Some things I’d like you to know about data science

Things I’ve learned mostly by making mistakes

Masses of data + cutting edge machine learning + cheap compute = Profit.

Right?

It’s not that simple.

Data science isn’t a replacement for asking difficult questions and doing hard work based on the answers. In fact, it’s quite the opposite. Enabled by increasingly powerful algorithms and ever larger datasets, the breadth and depth of problems we can work on are only increasing. And that means choosing what to work on can be hard. This post is a loose collection of lessons I’ve learned about making the most of your analytical effort.

If you’re a project lead, product owner or decision maker, it’s vital to make the right choices with the information available to you. If you’re a data scientist/analyst/engineer, you need to be laser-focused on furthering the goals of your organisation. The thread that ties these together is asking the right questions.

Asking the right question

Data science is not a magic bullet. It’s more like a good recipe. I like simple recipes with room to tailor them (or more realistically, use what’s in the fridge).

Since good analysis is essentially telling an informative story, I’ve taken this recipe from a textbook on journalism. It might seem simplistic, but often the goals, processes and environment are complex, and it is easy to get each of the following things wrong in subtle ways.

  • What are we doing? Ask a question with a quantifiable answer
  • Why are we doing it? Know the upside of doing the work and how to turn an answer into an action
  • How will we get & process the data? The ideal data vs. the actual data
  • When do you need it? Timeliness vs accuracy
  • Who is going to do it? Having the right skillset on your team

Let’s step through these. The focus will be on using data science to aid decision making, but some of this carries over to building data products. For a much deeper discussion of putting machine learning into products, check this out.

What are we doing? Ask a question with a quantifiable answer

Ask questions so that the answers are quantifiable and actionable. If the answer isn’t quantifiable, it’s unlikely that a model can be built. If the answer you get from the model doesn’t imply a clear action, you won’t get anything useful from even the best model.

Bad example: What patterns are there in my data?

This question will drive your data scientists mad. They probably have less domain knowledge than you, didn’t collect the data themselves, and don’t know what actions you could take.

This isn’t to say you shouldn’t explore new data, but exploring data without some defined directions of inquiry is unlikely to go anywhere. Exploratory analysis should be thought of as the zeroth step in the analysis process, not an end unto itself.

Good example: Which customers are we most likely to lose unless we intervene? Which of them should we target for intervention? Which of a specific set of interventions will bring the highest expected ROI?

With this, data scientists can investigate which signals in the data led to customers leaving historically. These signals, and their combination, can be used to build a model to predict which customers are most likely to leave. They can propose targeted interventions for high-value customers and a threshold below which you shouldn’t intervene.
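
A rough sketch of how that threshold might fall out of the numbers, assuming a churn model already gives a probability per customer. The Customer fields, the flat intervention cost and the success_rate are all made up for illustration; a real project would estimate them from data.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    churn_probability: float  # output of a (hypothetical) churn model
    annual_value: float       # revenue lost if the customer leaves


def expected_gain(customer, cost, success_rate):
    """Expected net gain of intervening for one customer.

    success_rate is the assumed chance that the intervention
    retains a customer who would otherwise have left.
    """
    saved = customer.churn_probability * success_rate * customer.annual_value
    return saved - cost


def choose_targets(customers, cost, success_rate):
    """Return (customer, gain) pairs worth intervening on, best first."""
    scored = [(c, expected_gain(c, cost, success_rate)) for c in customers]
    worthwhile = [(c, g) for c, g in scored if g > 0]
    return sorted(worthwhile, key=lambda pair: pair[1], reverse=True)


# Made-up customers for illustration only.
customers = [
    Customer("A", churn_probability=0.80, annual_value=1000.0),
    Customer("B", churn_probability=0.10, annual_value=5000.0),
    Customer("C", churn_probability=0.50, annual_value=100.0),
]

targets = choose_targets(customers, cost=100.0, success_rate=0.5)
for customer, gain in targets:
    print(f"{customer.name}: expected gain {gain:.2f}")
```

Note that customer C has a high chance of leaving but is still not worth the intervention, while the low-risk, high-value customer B is. The threshold is on expected gain, not on churn probability alone.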

Why are we doing it? Know the upside of doing the work and how to turn an answer into an action

Unless the work being defined can be clearly mapped to furthering the goals of the organisation (or something obvious like provably improving a product or decreasing a cost), the project should be treated with skepticism.

One failure mode is an unengaged stakeholder. Do they really want you to do the work? Will they act based on the results? Are they expecting a cat riding a fire-breathing unicorn with a gold desert eagle to solve their problems? Hint: they aren’t going to get it. They are going to get some very detailed input on what they should do, and that conversation may go badly.

Is this what you were expecting? (Original image here)

Typically, the results from the models will be idealized, and some suggested actions may be impractical for domain-specific reasons. This is where your stakeholders need to shine — you could always build a more complex model, or use the human (hopefully expert) judgement they bring. Often the latter yields higher ROI — the goal is to augment human intelligence, not replace it!

How will we get & process the data? The ideal data vs. the actual data

Not all use cases are the same but here are some thoughts about making data science projects valuable.

Designing ad-hoc analysis/experiments

As far as is practical, it is helpful to know in advance:

  • the criteria for success,
  • the metrics that define it,
  • the variables that may influence those metrics,
  • the data needed to find & represent these variables. The location, use restrictions and quality of the data have a huge impact on your project, so it’s good to sort that out early on.
  • A baseline: Measuring something is greatly devalued if you don’t have a specification for how the system you are observing is supposed to perform in the first place.

With this in place, a data scientist may not even be needed. Self-service visualisation/analytics tools will work in most cases.

This ideal case is rare; more frequently you need to “use what’s in the fridge”. A common problem is mistakenly believing you have something in the fridge that isn’t there. Double-check the fridge before getting elbow deep in the cooking.

The cost of data

Not all data collection can be defined up-front, or be gathered as a side benefit of something else. Generally, data has collection, extraction and storage costs, and we must be selective.

At the beginning, it is tempting to get as much data as possible, and that’s usually the right intuition. Over time, this isn’t sustainable, so how do we trace back from the actions taken as a result of analysis to quantify the contribution of any given data item, so we can optimize data collection for maximum benefit?

  • If the question is regression or classification, feature selection can be used to determine which data to collect going forward
  • customizable instrumentation helps: be prepared to iterate on the system that generates the data, because you (or the team running that system) may benefit from collecting different or additional data. Software engineers are your friends.
  • purely technological solutions exist — cold storage, streaming, logical data warehousing, and most importantly, not choosing vendors who make it difficult to access your data for analysis.
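
As a very simple stand-in for the feature selection idea in the first bullet, the sketch below ranks candidate signals by the absolute value of their correlation with the outcome. The column names and numbers are invented, and correlation is only one crude filter: it looks at one variable at a time and only sees linear relationships, so treat it as a starting point rather than a decision rule.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Sort feature names by |correlation| with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up data: three candidate signals and whether the customer churned.
columns = {
    "support_tickets": [0, 1, 5, 6, 2, 7],
    "account_age_months": [12, 30, 8, 20, 25, 15],
    "newsletter_opt_in": [1, 1, 1, 1, 1, 1],
}
churned = [0, 0, 1, 1, 0, 1]

ranking = rank_features(columns, churned)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A column that never varies, like the opt-in flag here, scores zero, which is a hint that collecting and storing it may not be paying for itself.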

Quantifying uncertainty

Some actions require near certainty, and some situations require different action when you are uncertain. Data science also includes working with small datasets (or that less-than-ideal data you had in the fridge). Popular methods (deep learning, classical hypothesis testing) aren’t designed to tell you how uncertain your results are. Check out BEST for hypothesis testing, Gaussian processes for classification/regression and PyMC3 for density estimation.
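
To give a flavour of what quantified uncertainty looks like on a small dataset, here is a minimal sketch using only the Python standard library. It is not BEST or PyMC3: it assumes a simple Beta-Binomial model with a uniform prior, and the two experiment arms and their counts are made up.

```python
import random


def posterior_samples(successes, trials, n_samples, rng):
    """Draw samples from the Beta posterior of a success rate.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(1 + successes, 1 + failures).
    """
    failures = trials - successes
    return [rng.betavariate(1 + successes, 1 + failures) for _ in range(n_samples)]


def credible_interval(samples, mass=0.95):
    """Central interval containing roughly `mass` of the sampled values."""
    ordered = sorted(samples)
    tail = (1 - mass) / 2
    lower = ordered[int(tail * len(ordered))]
    upper = ordered[int((1 - tail) * len(ordered)) - 1]
    return lower, upper


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up small experiment: 3 of 20 conversions for A, 9 of 20 for B.
a = posterior_samples(3, 20, 10_000, rng)
b = posterior_samples(9, 20, 10_000, rng)

low, high = credible_interval(b)
prob_b_better = sum(y > x for x, y in zip(a, b)) / len(a)

print(f"B rate: 95% credible interval {low:.2f} to {high:.2f}")
print(f"P(B better than A) = {prob_b_better:.2f}")
```

The useful output is not a single number but a range and a probability, which is what a decision maker needs when the right action depends on how sure you are.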

When do you need it? Timeliness vs accuracy

Timeliness ~ usability of results + interpretability of analysis + required level of certainty

Do you even need a model? Often dashboards or visualisation suffice to maximise the above. It’s OK to solve a problem without a model! You don’t need Temporally Recurrent Online Learning to solve every problem, despite the barrage of hype and pushy vendors.

Jeff Bezos’ letter to Amazon shareholders had a wealth of advice on timeliness. One highlight was this:

Second, most decisions should probably be made with somewhere around 70% of the information you wish you had. If you wait for 90%, in most cases, you’re probably being slow. Plus, either way, you need to be good at quickly recognizing and correcting bad decisions. If you’re good at course correcting, being wrong may be less costly than you think, whereas being slow is going to be expensive for sure.

Another good guideline before you implement that algorithm hot off arxiv.org is to consider the benefit per human hour of effort. My lab mates and I once spent all night coding a faster image in-painting algorithm for Carola Schonlieb’s image processing course. Tired and proud of ourselves, we presented our work, and the only comment we got from a professor in the audience was “human time is expensive, computer time is cheap”. Our time would have been better spent elsewhere.


Tangent: The rise in popularity of deep learning over other machine learning methods can be partially explained by this. Deep learning often requires only a relatively straightforward understanding of the mathematical problem, plus the ability to manipulate tensors and use gradient descent. Contrast this with classical methods that require extensive feature engineering and model comparison across different families of problems. Bayesian methods are even harder to apply, requiring a solid understanding of probability distributions, their interaction and the computational quirks of approximating them efficiently. So, it’s easier to go from software engineering to deep learning, which means more people use it, and the rise becomes self-perpetuating. That’s awesome, but it doesn’t make deep learning the right tool for every job (at least right now). It helps to keep other tools in mind.

Who is going to do it? Having the right skillset on your team

Aside from the well-known data science skills, you need some support from other folks in your organisation. Trying to be/hire a unicorn is a lot less efficient than just talking to someone who knows something you don’t. So go and talk to some people. Maybe even stare at their shoes!

Deploying a model is non-trivial — you need some software chops to do it right. At scale, Python can be a nightmare (what was the type of that data you passed?). Your nearest friendly software engineers will almost certainly have some opinions.
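
One small habit that takes some of the pain out of the "what was the type of that data?" problem is to validate inputs at the boundary of the model. The sketch below is only an illustration: ModelInput and its fields are hypothetical, and your engineers may well prefer a dedicated validation library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInput:
    """Hypothetical input record for a deployed churn model."""
    customer_id: str
    support_tickets: int
    account_age_months: float

    def __post_init__(self):
        # Dataclass annotations are not enforced at runtime, so check explicitly.
        if not isinstance(self.customer_id, str):
            raise TypeError("customer_id must be a str")
        if isinstance(self.support_tickets, bool) or not isinstance(self.support_tickets, int):
            raise TypeError("support_tickets must be an int")
        if not isinstance(self.account_age_months, (int, float)):
            raise TypeError("account_age_months must be a number")
        if self.support_tickets < 0 or self.account_age_months < 0:
            raise ValueError("counts and ages cannot be negative")


def parse_row(row):
    """Convert a raw dict (e.g. from a CSV, all strings) into a ModelInput."""
    return ModelInput(
        customer_id=str(row["customer_id"]),
        support_tickets=int(row["support_tickets"]),
        account_age_months=float(row["account_age_months"]),
    )


record = parse_row({"customer_id": "c-17", "support_tickets": "3", "account_age_months": "14.5"})
print(record)
```

Bad data then fails loudly where it enters the system, instead of quietly producing a strange prediction three functions later.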

If your model crashes or the data gets weird, what happens? Can you roll back to a previous model or fail over to a simpler system? Get a dev-ops person to school you.
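
As a toy illustration of failing over to a simpler system, the sketch below wraps a stand-in model so that a crash or a nonsensical output falls back to a fixed baseline. Both models, the feature name and the 0.2 baseline rate are invented for the example; a production setup would also log the failure and alert someone.

```python
import math


def complex_model(features):
    """Stand-in for a deployed model; fails on input it cannot handle."""
    tickets = features["support_tickets"]
    return 1 / (1 + math.exp(-(tickets - 3)))


def baseline_model(features):
    """Simpler fallback: always predict an average churn rate (made-up value)."""
    return 0.2


def predict_with_fallback(features, primary=complex_model, fallback=baseline_model):
    """Return (prediction, source), falling back if the primary model fails
    or returns something that is not a valid probability."""
    try:
        prediction = primary(features)
        if isinstance(prediction, float) and 0.0 <= prediction <= 1.0:
            return prediction, "primary"
    except (KeyError, TypeError, ValueError, OverflowError):
        pass  # in a real system, log the failure here
    return fallback(features), "fallback"


good = predict_with_fallback({"support_tickets": 3})
weird = predict_with_fallback({"support_tickets": None})
missing = predict_with_fallback({})
print(good, weird, missing)
```

Returning the source alongside the prediction makes it easy to monitor how often the fallback is being used, which is itself a signal that the data has gone weird.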

Good decisions require human insight as well as data. Have you done everything necessary for a human expert to make an informed choice and mitigated the risks sufficiently? Get a domain expert to critique your model.

Are you sure your analysis and its consequences are ethical and legal? Are you storing the data according to regulations and informing users of your system appropriately? Make friends with Legal: they can make things much easier, or disastrously difficult.

I hope this helps! Comments and corrections gratefully received.

I’ll post again soon on why AI != Machine learning != Data science


Thanks to John Tipple, Al Grant, Pete Harris & Will Matthews for discussions at the first ARM Data & Insights Summit that led to this post. All errors and opinions my own.