Big Data Checklist

The Joel Test for big data.

Oct 16, 2013

I started with Big Data before the term was coined.

Today it is trending. More and more projects understand that “Data is King” and are moving towards it.

Big data requires a different framework for your gut feeling and intuition to keep finding the right routes. The good thing is, it’s a skill and it can be mastered.

Below is my “Joel Test” for big data projects.

It is not aiming at making everyone a big data expert. But it should help great engineers become great big data engineers. And make it more fun too.

If you are just starting to look into big data, you may think of the questions below as a guide to follow. If you already are into big data, you may find a few areas to sharpen your focus in.

Do you have big data separated from the rest of your product?
Do you have established metrics?
Do you have a live dashboard?
Do you have the fastest possible prototype-evaluate-ship-repeat cycles?
Do you log everything?
Do your designs enforce repeatability of results?
Do you run live experiments?
Do you run regression tests?
Do you have infrastructure ready for big data?
Do you understand the importance of organic data?
Do you know how much headroom you have?
Do you do research outside your main task?

Deep Dive

Having identified and prioritized the twelve bullet points above, I am going to elaborate on them in more detail.

Throughout the text below, “big data” is used in a broad sense that includes machine learning, data mining, information retrieval and other scientific disciplines, along with infrastructure, evaluation pipelines and other implementation details.

On Metrics

The metrics are to help you track your progress, but their value drops below zero once improving the metric no longer improves the end user experience.
The metrics should allow comparing apples to apples.
If the metric you like does not allow comparing its value today to yesterday, you may want to introduce another metric to track day to day progress.
The metrics should have a rationale behind them.
The metrics should be easy to explain to people who are not into engineering.
Imposing numerical goals on metrics can be a curse.
If you do impose them, base your expectations on headroom analysis and on proven big ideas that are about to go live, not on the past rate of improvement or on how well your competitors are doing.
Do reality checks on your metrics often.
Your localized precision or relevance or engagement should correlate well with higher-order metrics like growth rate, retention rate, how often your users comment on or share what you show them, etc.
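One way to run such a reality check is to correlate the daily values of a local metric with a higher-order one. The sketch below is only an illustration: the metric names and the numbers are made up, and Pearson correlation is just one possible choice of measure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily values, for illustration only.
daily_precision = [0.61, 0.63, 0.62, 0.66, 0.68, 0.67, 0.70]
daily_retention = [0.40, 0.41, 0.41, 0.43, 0.44, 0.44, 0.46]

correlation = pearson(daily_precision, daily_retention)
print(f"precision vs retention: r = {correlation:.2f}")
```

If the correlation stays low or turns negative over a longer period, the local metric has probably drifted away from what the users actually care about.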

On Evaluation

Along with data infrastructure, the evaluation pipeline is what enables you to run a dozen iterations per week instead of barely one.
With big data, 10x more iterations is what makes the difference between good and great.
The evaluation should be as automated as possible.
The evaluation should be as fast as possible.
Offline evaluation should ideally run in a few minutes.
If it’s all about going through under a gigabyte of labeled validation data, why on Earth should it be taking longer?
Online evaluation, on the other hand, often does take longer.
You may have to settle for something on the scale of 24-hour timeframes. It’s OK — but make sure you can run multiple experiments in parallel within one 24-hour window.
Sometimes it helps to have a model created manually, without involving machine learning techniques, to have the baseline to compare against.
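A minimal sketch of an offline evaluation against a hand-written baseline could look as follows. The feature names, the rules and the validation examples are all hypothetical, and `candidate_model` is a stand-in for whatever trained model you would actually plug in.

```python
def accuracy(predict, examples):
    """Fraction of (features, label) pairs the predictor gets right."""
    if not examples:
        raise ValueError("empty validation set")
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)

def manual_baseline(features):
    """Hand-written rule, no machine learning involved (hypothetical)."""
    return features["visits_last_week"] >= 3

def candidate_model(features):
    """Stand-in for a trained model; here just a slightly richer rule."""
    return features["visits_last_week"] >= 3 or features["shared_content"]

# Tiny made-up validation set: (features, "will the user come back?").
validation = [
    ({"visits_last_week": 5, "shared_content": False}, True),
    ({"visits_last_week": 1, "shared_content": True}, True),
    ({"visits_last_week": 0, "shared_content": False}, False),
    ({"visits_last_week": 2, "shared_content": False}, False),
]

baseline_acc = accuracy(manual_baseline, validation)
candidate_acc = accuracy(candidate_model, validation)
print(f"baseline={baseline_acc:.2f} candidate={candidate_acc:.2f}")
```

A candidate that cannot beat the manual rule on the same validation data is a strong hint that something is off with the features or the training setup.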

On Dashboard

Put a big screen in the office with the sole purpose of showing off how great you are doing numbers-wise.
Be honest with yourself and don’t just hide it if the numbers don’t look good.
A dashboard is far more useful if it uses post-cooked big data.
This way it serves as the first customer of the log-processing logic and pipelines.
Key big data metrics should be on the dashboard, along with the basic usage numbers.
The basic usage numbers may come from outside of the big data infrastructure, but most should go through it.

On Organic Data

Organic data is the data that captures the behavior of your users best, without pruning or filtering.
Be well aware of user behavioral patterns and the 80%/20% rule.
If some type of action accounts for the majority of what your users do, it will account for the same majority of the organic data.
If some content accounts for the majority of content your users go through, organic data will have the same skew.
Evaluations using the organic data are the most valuable.
Whenever possible, top-level metrics should be based on organic data sets.
Having said that, you will need to sub-sample the data for more accurate metrics on deeper levels. But make sure the top-level metric does improve as well. It will take more work and more time to get noticed, but one should be able to see the improvement.
Have a good idea of your headroom.
It is not the absolute best value for the metric you have crafted: it is where you can get in the short, mid and long term in a decent, yet realistic, scenario.
Understanding headroom requires manual work. The people doing the job of data scientists should get their hands dirty from time to time.
It also requires creativity, so make it as much fun and as little routine as possible.
In smaller teams, a useful habit is to dedicate a day or half a day per month to pretending you are “the real users”, to get a feel for how their lives differ from what you thought they were.
At times, you may want to involve more people, unaware of your current direction, whose only job would be to tell you how and where you can do better.
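To make the organic versus sub-sampled distinction from this section concrete, here is a minimal sketch. The log contents and the notion of a per-event "success" flag are invented for illustration; the point is only that the same log yields very different numbers depending on whether the natural skew is preserved.

```python
def organic_success_rate(log):
    """Success rate over raw events: frequent queries weigh more, as in real traffic."""
    return sum(ok for _, ok in log) / len(log)

def per_item_success_rate(log):
    """Success rate after de-duplication: every distinct query weighs the same."""
    by_item = {}
    for item, ok in log:
        by_item.setdefault(item, []).append(ok)
    return sum(sum(v) / len(v) for v in by_item.values()) / len(by_item)

# Hypothetical log: one head query dominates traffic, the tail is served badly.
log = [("weather", True)] * 8 + [("rare query a", False), ("rare query b", False)]

organic = organic_success_rate(log)
per_item = per_item_success_rate(log)
print(f"organic={organic:.2f} per_item={per_item:.2f}")
```

The organic number reflects what users experience on average, while the de-duplicated one highlights the tail. Both are useful, as long as you know which one you are looking at.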

On Labeled Data

Labeled data is orders of magnitude more expensive than unlabeled data.
Don’t hesitate to label some data yourself.
Ten minutes of looking through some side-by-side results of old and new models is a good start of the day.
Reuse your labeled data as much as you can. Don’t invalidate it until you absolutely have to — or until it becomes obsolete on its own.
In particular, keep at least part of your labeled data excluded from training for validation purposes.
Looking at the value of your high-level metric on labeled data is OK.
Manually examining concrete cases in labeled dataset instantly marks this set ineligible for further clean experiments.
Don’t do it, and don’t let your teammates do it either.
Rotating labeled data is a good habit.
If you can afford 1,000 labeled examples per week, keep the ones from the most recent weeks “clean” and use the older ones for deep dives.
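The rotation described above can be sketched in a few lines. The `week` field, the `clean_weeks` parameter and its default of two weeks are assumptions made for this example, not a recommendation.

```python
def split_labeled_data(examples, current_week, clean_weeks=2):
    """Split examples into a 'clean' set (recent, never inspected by hand)
    and a 'deep dive' set (older, fine to examine case by case)."""
    clean, deep_dive = [], []
    for example in examples:
        age = current_week - example["week"]
        (clean if age < clean_weeks else deep_dive).append(example)
    return clean, deep_dive

# Hypothetical labeled examples, tagged with the week they were labeled in.
examples = [{"id": i, "week": week} for i, week in enumerate([10, 10, 11, 12, 13, 13])]

clean, deep_dive = split_labeled_data(examples, current_week=13)
print("clean:", [e["id"] for e in clean])
print("deep dive:", [e["id"] for e in deep_dive])
```

Only aggregate metrics should ever be computed on the clean set. Once a week ages out of it, its examples become fair game for manual inspection.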

On Live Experiments

Live experiments help a lot. Unless you have a good reason, don’t hesitate to route 1% or 0.1% of traffic to experimental models.
In fact, in a lively product with a big data team at work, multiple live experiments running non-stop is a healthy environment.
Sharding your traffic may be harder than you think.
For a stateless search engine you can afford to shard by query hash. But the world seems to be pretty much done with building stateless search engines.
Sharding should be designed in a way where splitting off 0.1% of traffic keeps both 0.1% and 99.9% parts organic.
For example, if you are building an app store and some app has a significant share of traffic, sharding by app ID does not work since it denies you the opportunity to fan out 0.1% of it.
Shard by user sessions or come up with something smarter.
For most applications it is perfectly fine for the same query from the same user to end up in different shards from time to time. The users would not hate you if sometimes the results they see get altered by a bit — while in return you will get valuable apples-to-apples comparison results to explore.
Once you have established live experiments infrastructure, back tests are a great way to confirm you are going in the right direction.
Tee-ing some traffic to test/canary machines is a good thing too, assuming your data stays immutable.
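Sharding by user session, as suggested above, can be sketched with a stable hash. The function name, the session ID format and the experiment name below are made up. Salting the hash with the experiment name is my own assumption, meant to keep concurrent experiments from always picking the same sessions.

```python
import hashlib

def in_experiment(session_id, experiment_name, traffic_fraction):
    """Deterministically route a fraction of sessions to an experiment."""
    key = f"{experiment_name}:{session_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < traffic_fraction

# Hypothetical session IDs; route roughly 0.1% of them to the experiment.
sessions = [f"session-{i}" for i in range(100_000)]
routed = sum(in_experiment(s, "new-ranker", 0.001) for s in sessions)
print(f"{routed} of {len(sessions)} sessions routed to the experiment")
```

Because the decision depends only on the session and the experiment name, it can be recomputed later from the logs, which helps when analyzing which requests went where.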

On Serving Infrastructure

Big data logic should run on dedicated machines.
At the very least this covers logs cooking jobs, modeling processes, evaluation pipelines and serving infrastructure.
Of all components, serving infrastructure is the first one you want to have a dedicated environment for. Now.
REST APIs are your friend.
Have your results repeatable. Take it seriously. More seriously.
Two decent ways to ensure repeatability are: 1) put everything into source control (usually git) and make the configuration parametric on commit hash, 2) keep server code and models as VM images.
Spawning a new serving job, production-ready on a fresh machine or locally for testing, should be a matter of running one script.
Top-level logs cooking jobs should also be possible to spawn via one script.
It is a shame not to have regression tests.
It only takes gathering some data, anonymizing if necessary, running it through your service, saving the results into a golden results file and diff-ing against that file later on.
A good regression test can also test live experiments and sharding logic.
A good regression test is also a load test.
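The golden-file approach described above can be sketched as follows. The `serve` function is a placeholder for a call to the real service, and in practice the golden lines would be loaded from a file saved by an earlier run rather than computed in place.

```python
import difflib
import json

def serve(request):
    """Stand-in for the real service call (hypothetical)."""
    return {"query": request["query"], "results": sorted(request["query"].split())}

def run_requests(requests):
    """Run every request through the service; one canonical JSON line per response."""
    return [json.dumps(serve(r), sort_keys=True) for r in requests]

def diff_against_golden(actual_lines, golden_lines):
    """Return unified-diff lines; an empty list means no regression."""
    return list(difflib.unified_diff(golden_lines, actual_lines,
                                     fromfile="golden", tofile="actual", lineterm=""))

requests = [{"query": "big data checklist"}, {"query": "joel test"}]
golden = run_requests(requests)  # in practice, read from the saved golden file
diff = diff_against_golden(run_requests(requests), golden)
print("OK" if not diff else "\n".join(diff))
```

Serializing with sorted keys keeps the output stable across runs, so any line in the diff points at a real change in behavior.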

On Data Infrastructure

Along with the evaluation pipeline, data infrastructure is what enables you to run a dozen iterations per week instead of barely one.
At the risk of repeating myself, with big data, 10x more iterations is what makes the difference.
“It’s not big data yet when it doesn’t fit into Excel!”
Early on you may well live with CSV/JSON/BIN files and Python/Ruby/Clojure scripts.
There is no need to set up or maintain a Hive/Hadoop cluster until you hit terabyte scale.
Make sure you log all the data you need.
It goes beyond the user clicks on your website. Consider the viewport, where this particular user came from when landing on your service, IP / proxy information, browser / OS / screen resolution, mouse hovering, actions involving external websites, co-installed applications on mobile, and much more.
Mobile is especially important: lots and lots of data is available once you have an actively used mobile app.
You never know where the next insight would come from — but, chances are, it will come from the data.
Log your own work along with user actions.
Which live experiments were running and when, which user requests got routed to which experiments, labels you have obtained, by what means, which portions of data did you send out for labeling and for what reason — all these in-house things count as the data you must log and keep.
Log data cooking is usually harder than serving.
And it is one of the few pieces that fall in between the big data part and the rest of the product.
KISS is your friend. I’d totally bless something like “the server stores logs in certain directory in the cloud, the big data infrastructure parses those logs as they arrive”.
Normally, most features would be computed on data infrastructure side.
If this is the case for you, a plugin structure works best.
Oftentimes it is more efficient time-wise to first implement the logic that adds a new feature and keep it running for a few days. After the new feature is already stamped along with the existing ones, it is much easier to experiment with.
Therefore, make sure new featurizers are easy to hook up — perhaps, automatically, when the code is checked in.
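One possible shape for such a plugin structure is a registry that featurizers add themselves to. This is a minimal sketch: the `featurizer` decorator, the feature names and the event fields are all invented for illustration.

```python
FEATURIZERS = {}

def featurizer(name):
    """Decorator that registers a function as a named feature extractor."""
    def register(func):
        if name in FEATURIZERS:
            raise ValueError(f"duplicate featurizer: {name}")
        FEATURIZERS[name] = func
        return func
    return register

@featurizer("query_length")
def query_length(event):
    return len(event["query"])

@featurizer("is_mobile")
def is_mobile(event):
    return event.get("platform") in {"ios", "android"}

def featurize(event):
    """Stamp every registered feature onto a log event."""
    return {name: func(event) for name, func in FEATURIZERS.items()}

features = featurize({"query": "big data", "platform": "ios"})
print(features)
```

With this layout, adding a feature means checking in one decorated function. The cooking job picks it up the next time it runs, without any changes to the pipeline itself.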

On Modeling

Make it enjoyable and comfortable to dig into your data: the world of modeling is where most of the creative time is spent.
The efficiency of modeling depends largely on how fast the iterations can be.
Multiple full-stack iterations per day should be your goal.
If viable, make it possible to run modeling on a single machine.
Running stuff locally is way faster and has fewer or zero external dependencies.

On Prototyping

Do whatever you want and have fun, as long as you are moving forward.
Try any idea you feel is worth trying, but aim at getting a headroom estimate soon and don’t hesitate to drop the idea as soon as you believe there may be lower-hanging fruit.
Use any tools you feel like using.
Don’t hesitate to invest into building reusable tools for the future.
At the same time, don’t hesitate to live on dirty hacks if it lets you run a reality check of some idea quickly.
Don’t worry if the implementation looks ugly. It’s one of the very few places where it is allowed to.
However, once you have something you can demonstrate business value with, switch from prototyping to productization and clean up the mess before it hits the fan.

On Research

No matter how strong of a team you have, make sure to communicate to the outside world.
Sending data scientists to conferences and asking them for trip reports in exchange is a practice that works well.
Dedicate some time to explorations that do not have immediate value.
For example, if your job is to do supervised learning and categorize your users into paying and not paying customers, find time to train a few unsupervised models and look at the clusters you get.
A few insights coming from this may be well worth it very soon.
Give talks, open-source stuff that does not carry immediate business value, write blog posts about how amazing your data challenges are: make sure you establish a presence in the community.
Interns are a great way to accomplish all or most of the above.

Bottom Line

The field of big data is different from other software engineering disciplines. The intuition and gut feeling you used to rely on may play a joke on you. And with data-driven projects it often takes more time to realize the wrong route was taken — and sometimes it may be too late.

Getting big data done right should become easier with the twelve high-level concepts outlined above.

I have done plenty of machine learning and can say with confidence that changing a “no” into a “yes” for the questions above has been the right thing to do consistently — and would probably keep being the right thing for quite some time.

Written by Dima