The “Joel Test” for data readiness

Damon Civin
5 min read · Jan 30, 2017


So you’re a data scientist. Great! You have the sexiest job of the 21st century! You’re going to use cool distributed computing platforms. You’re going to apply cutting-edge machine learning techniques. You’re going to visualise the results so beautifully it will make David McCandless weep. You are going to disrupt, optimise and…

Wait, what is this data you are about to science? Where did it come from? Can you trust it? Do the fields mean what you think they mean? Does Harriet from the Boston office agree? Had she been up all night when she manually entered that data into a spreadsheet? What does that weird five-letter German abbreviation in the SAP data mean? Can you even aggregate the transactional data meaningfully? What were we trying to do again?

Should we even start this project? How long will it take? How can we trust the results?

Eventually, you get the data into a workable state and into the models. With some clever hacks and ample doses of caffeine, you move the dial on a key business metric and everyone goes home happy. Until your next assignment, when you have to go through all of this again. How long is it going to take this time?

Is there a way to get all of this done once so we can move on with our lives? Maybe, but it will be expensive, slow and painful. Where do you even start?

A partial solution

If we relax the task to just identifying and classifying the data issues in an organisation, we can better plan data science projects and coordinate data clean-up exercises.

With this in mind, here is a partial solution, composed of 2 ideas and a definition.

The first idea comes from Neil Lawrence’s talk on personalised healthcare at NIPS 2016 and his insightful post on data readiness. Neil’s idea is to classify the quality of a specific data set into “bands”.

The second idea is to come up with a “Joel test” for data readiness. The original Joel test is a lightweight survey of 12 yes/no questions, designed to give a quick, easy way to measure how good a software engineering team is. It’s supposed to take less than 3 minutes to complete. A score of 12 is perfect, 11 is OK, and 10 or lower means things are going wrong.

We can merge these ideas with the following definition of data readiness.

Data readiness: the ability of a business unit/function to collect, curate and use data for operational and analytical purposes.

We can now define version 0.1 of the test, which gives a team-centric measurement of data readiness.

Questions

Please note the questions are binary (yes/no); if in doubt, choose the answer that is most often true.

  1. Do you know what data is available/would help you in your everyday job?
  2. Can you find & get the data you need within 1 day?
  3. Do you know the source of your data (e.g. system/person/3rd party)?
  4. Do you get data directly from a system (as opposed to relying on people to get it)?
  5. Can you verify the data is trustworthy for your purposes without going through every item?
  6. Can you use the data for your purposes with less than 1 hour of manual manipulation?
  7. Can you combine data from different sources with less than 1 hour of manual manipulation?
  8. Do you make sure the data you generate can be joined to other data (for example, by including a universal identifier such as Product ID or Staff Number so that your output rows can be joined to rows from other sources)?
  9. Do you analyse data to quantify how well things have gone in the past?
  10. Do you analyse data to forecast?
  11. Do you analyse data to make decisions?
  12. Is the analysis you do repeatable? (It’s not repeatable, for example, if you have to manipulate a spreadsheet every time you need the answer or every time new data arrives.)
  13. Do you share your work (data, insights) and use it to collaborate with other business units/functions?

Optional bonus questions (free-text answers)

  1. Can you see a clear opportunity to use existing data to do things better in the organisation? If yes, please describe:
  2. Do you feel that you need more/better data to do meaningful/more useful analysis? If yes, please describe:
  3. Please tell us any other problems you encounter or opportunities you see related to the organisation’s use of data.

What’s a good score?

The questions are designed to evaluate views on the maturity of the organisation’s data collection (Q1–4), verification (Q5–8) and analysis (Q9–12); Q13 covers collaboration.

The overall score and each category can be assigned a maturity level (0–4).

Since data science is a younger field than software engineering, we soften the threshold: 10 or above is considered good.

A score of 3 or above in each category is considered good.
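
To make the scoring concrete, here is a minimal Python sketch. It assumes answers are recorded as a list of 13 booleans in question order, and it treats a category’s maturity level as its count of yes answers (0–4); that mapping is my reading of the banding above, not a prescribed formula.

```python
# Minimal scoring sketch. Assumes 13 yes/no answers in question order;
# the yes-count-per-category maturity mapping is an assumption,
# not part of the original test.

CATEGORIES = {
    "collection": range(0, 4),    # Q1-Q4
    "verification": range(4, 8),  # Q5-Q8
    "analysis": range(8, 12),     # Q9-Q12 (Q13, collaboration, sits outside the bands)
}

def score_survey(answers):
    """Return the overall score and per-category maturity levels (0-4)."""
    assert len(answers) == 13, "expected 13 yes/no answers"
    overall = sum(answers)  # True counts as 1
    maturity = {name: sum(answers[i] for i in qs) for name, qs in CATEGORIES.items()}
    return {
        "overall": overall,                    # 10 or above is good
        "overall_good": overall >= 10,
        "maturity": maturity,                  # 3 or above per category is good
        "categories_good": all(m >= 3 for m in maturity.values()),
    }

# Example: a team that answers yes to everything except Q2 and Q7
answers = [True] * 13
answers[1] = answers[6] = False
print(score_survey(answers))
# overall 11 (good); collection 3, verification 3, analysis 4
```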

If you do the survey carefully (and with the right permissions), the results can be sliced by business unit, function, seniority of respondent, etc. This gives a detailed view on where a group is in their data journey before your data science team engages with them.
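
As a sketch of that slicing, assuming responses are collected into a table; the schema here (business_unit, seniority, score) is hypothetical, not a prescribed format:

```python
# Hypothetical slicing of survey results with pandas; the column names
# are illustrative assumptions about how responses might be recorded.
import pandas as pd

responses = pd.DataFrame({
    "business_unit": ["Sales", "Sales", "Ops", "Ops", "Finance"],
    "seniority": ["junior", "senior", "junior", "senior", "senior"],
    "score": [8, 11, 6, 9, 12],  # overall score out of 13
})

# Mean score per business unit shows who is furthest along the data journey
print(responses.groupby("business_unit")["score"].mean())
```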

What’s in it for the respondent?

A quick, easy way to inform the organisation on how best to:

  • minimise manual work so the respondent can spend more time on interesting/important tasks
  • reduce uncertainty in data-related aspects of the respondent’s daily job
  • provide more transparency on how data is used in the organisation

Limitations

  • Binary: issues like joinability of data are knotty and this test has no room for nuance
  • Hacky: it’s quick and dirty; it’s meant to be a start, not an in-depth solution
  • Survey: we are asking people to self-evaluate. This is going to bias the results in some cases
  • Team-centric, not data-set-specific: the Joel test takes a team-centric view, which may not always be what you want. See Neil’s post referenced above for a data-set-specific approach.

Side effects

In addition to the core goal of capturing data readiness levels across the organisation, there are some beneficial side effects.

  • Stimulate the organisation’s people to ask the right questions about data and analysis
  • Focus on how best to remove major sources of pain and drudgery
  • Make more effective use of data in running processes/building products/decision making
  • Inform an agile response to the issues identified: the questions translate neatly into user stories, giving you a roadmap to good data.

Credits

Many thanks to Andrew Sloss, Dragos Petrescu, Simon Duckett & Ed Dupree for helping develop this test. Thanks to Neil Lawrence for interesting discussions and feedback. All errors my own.
