The 5 Stages of Grief

On the Road to Big Data

Data infrastructure is really hard and it’s hard for a lot of different reasons. Each company’s needs feel so completely unique, which is only compounded by the fact that there are thousands of different software tools available with widely varying ideal use cases, costs, and benefits. It’s difficult to choose which tools to use and, in addition, the right tool for the job now almost certainly isn’t the best software to use later.

My co-founder and I have been working in the data infrastructure space for a while now. We were the earliest engineers at RethinkDB, helped run the data infrastructure at Airbnb, and now have started our own infrastructure company, Pachyderm.

Over that time, we’ve talked to hundreds of companies about their data infrastructure, ranging from tiny startups to huge tech giants, and we’ve noticed some really interesting patterns. Most notably, we found that almost every company goes through the same 5 stages as they scale from a small scrappy startup to a data-crunching juggernaut.

Hopefully the information here will help you identify where your company currently stands and help you plan for the road ahead.

Stage 1: I don’t know, log everything!

Your company consists of just the founding team and maybe a few early engineers if you’ve raised seed funding.

At the earliest stages of a fledgling company, data analysis is about as far from the highest priority as you can get. Product-market fit, or even just product existence, is (and should be) the main focus. That said, there is still tons of valuable data being produced, and oftentimes it is some of the most valuable data. You might not have many users yet, so tracking everything possible about the users you do have is really important.

Many companies learn this pretty quickly and just say “screw it, log everything!” You don’t actually have that much data and it’s totally fine if you have no clue what you want to do with it. You may eventually want to sift through that data and extract insights, but since you can’t spare the manpower at the moment, just take the fisherman approach and “cast a wide net to catch everything.”

What should you use to collect and store your data? By far the most common storage solution is to just dump everything in Amazon S3 or Google Cloud Storage. It’s cheap, it’s simple, and it can hold everything you throw at it. What you use to collect data depends a lot on your product, but user events and site traffic are good starting points. Google Analytics or Mixpanel should be enough for now.

Stage 2: I can haz data analysis?

You’re still in the scrappy startup phase, nearing your series A funding. Your team is growing rapidly, although it’s still fewer than 15 people.

You’ve likely been storing all the data you can in the cloud, hopefully in a vaguely logical structure, but data still isn’t your top priority — growth is. Nonetheless, data is certainly becoming more and more useful for measuring your growth and expanding your early customer base.

Your data infrastructure is still pretty basic, but you’ve had a few engineers write Python scripts that manually run over an S3 bucket to extract some key metrics — maybe looking at new signups, trying to get a rough idea of customer churn, or generating a dashboard of customer spending habits. Still pretty simple stuff, but it lets you track progress and show your investors a nice up-and-to-the-right graph. Woot!
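To make that concrete, here is a minimal sketch of the kind of script Steve might write at this stage. Everything about it is an assumption for illustration: the event schema (`user`, `type`, `day`), the sample records, and the definition of churn are all hypothetical, and the events are read from an in-memory string rather than from an actual S3 bucket.

```python
import json
from collections import Counter

# Hypothetical newline-delimited JSON events, standing in for files in a bucket.
RAW_EVENTS = """\
{"user": "a", "type": "signup", "day": "2016-01-01"}
{"user": "b", "type": "signup", "day": "2016-01-01"}
{"user": "c", "type": "signup", "day": "2016-01-02"}
{"user": "a", "type": "login", "day": "2016-02-10"}
{"user": "c", "type": "login", "day": "2016-02-15"}
"""


def parse_events(raw):
    """Parse one JSON event per non-empty line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def signups_per_day(events):
    """Count signup events grouped by day."""
    return Counter(e["day"] for e in events if e["type"] == "signup")


def rough_churn(events, cutoff):
    """Fraction of users who signed up before `cutoff` and show no activity
    on or after it. ISO dates (YYYY-MM-DD) compare correctly as strings."""
    signed_up = {e["user"] for e in events
                 if e["type"] == "signup" and e["day"] < cutoff}
    active = {e["user"] for e in events if e["day"] >= cutoff}
    if not signed_up:
        return 0.0
    return len(signed_up - active) / len(signed_up)


events = parse_events(RAW_EVENTS)
print(signups_per_day(events))
print(rough_churn(events, "2016-02-01"))
```

A script like this is fine while the data fits on one machine and one person runs it by hand. Nothing in it tracks which input files produced which number, and that gap is part of what bites later.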

Now that you have analytics, you’re definitely tracking site traffic and user behavior. Fancy dashboards are a luxury, but there are loads of tools out there that offer data insights for event-driven applications. Choosing one can be a nightmare — they all claim to be great at everything and it’s nearly impossible to know their limitations until you become a power user.

Your team is still mostly engineers and product-related roles — definitely no full-time data scientists in sight. One of your backend engineers, let’s call him Steve, has been dubbed the data infrastructure guru ever since the CTO made him evaluate all those event tools and pick one.

Stage 3: All the cool kids are data-driven

You’re growing fast, definitely post series A and potentially closing in on series B funding, depending on your industry. You have a few PMs and the company is subdivided into cohesive teams. Data science isn’t one of them. Your sales and customer success teams are plenty data-driven, but that’s all wrapped up in whatever CRM they use.

By now your backend infrastructure is fairly mature and it’s finally time to put some formal processes around analyzing your data. There’s a clear inflection point where optimizations can start to pay significant dividends for the company, which can come in a variety of different forms. For example, maybe your company has a solid $10+ million ARR (nice job!) — now, rigorously analyzing user behavior for a 10% user conversion or retention improvement is worth $1 million! It totally makes sense to start analyzing that data, right? Or how about an in-depth study of your pricing model? Or maybe it’s time to use a more advanced log-aggregation platform?
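The back-of-the-envelope math in that example can be written down in a couple of lines. This is only a sketch, and it assumes that revenue scales linearly with the conversion or retention improvement, which is a simplification; the function name is hypothetical.

```python
def value_of_lift(arr_dollars, lift_percent):
    """Extra annual revenue from a relative improvement, assuming revenue
    scales linearly with the metric being improved."""
    return arr_dollars * lift_percent / 100


# $10 million ARR with a 10% improvement
print(value_of_lift(10_000_000, 10))
```

The same 10% improvement at $500k ARR is worth only $50k under this assumption, which is why this kind of analysis rarely makes sense in the earlier stages.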

Whatever your high-value data problems might be at this stage, there’s almost certainly a specialized SaaS tool with a spiffy UI that is designed for your particular need. Instead of hiring a flock of data scientists to build internal models from the ground up, just get one of your engineers, probably Steve again, to set up a few special-purpose tools and you’re good to go!

Data silos. The tools listed are just examples.

Your data infrastructure will eventually become a series of silos with specific storage and analysis tools for each type of data you’re collecting.

Stage 4: The Web of Chaos

Kicking butt and taking names… on the outside. Your company is growing rapidly and your user base is exploding. Internally, everyone’s hair is on fire because your backend systems are melting under the increased load. You have a small team of infrastructure engineers and data scientists producing growth reports, A/B testing new features, and tracking user behavior. Unfortunately, they’re using such a multitude of internal and external tools that your system has become fragile and occasionally produces incorrect results when pipelines break. The biggest problem, though, is that every time your analysis is wrong, project managers stop trusting data for months at a time and start making decisions based on gut feel. This is not where you want to be as a data-driven organization. When it comes to trusting your data, even 99% accuracy isn’t good enough, so you need to find a more reliable solution.

How did we get here, though? Your nicely isolated silos from stage 3 have multiplied as different teams chose overlapping tools and your infrastructure engineers built custom connectors to let everything talk to everything else. Now, data is constantly shuffled from your app, to production databases, to a bunch of different storage solutions depending on which business intelligence tools your PMs wanted to use for that data. Data provenance is a nightmare because if two sources have conflicting data, it’s nearly impossible to trace it backward and resolve the issue.

Tool creep is extremely common and can be problematic, but dictatorial restrictions on tools usually aren’t the right answer. You want your teams to be flexible and have the freedom to pick what works best for them, but then sometimes you land here… the web of chaos!

The Web of Chaos. This is just an illustrative example, but we’ve seen even more complicated infrastructures.

Tons and tons of companies get trapped here. We’ve worked with many of them, and I promise, there is a light at the end of the tunnel. Unfortunately, that light is the blazing bonfire of your current infrastructure as you move to stage 5.

Stage 5: The Data Lake

Your company is getting big and has big data problems that need solving. There are three major data infrastructure challenges that keep you up at night:

  • Scalability — How to build a data infrastructure that works now and will be able to grow with you forever.
  • Dependency management — You’ve got a large DAG of jobs and you need a good tool to manage those pipelines, monitor progress, and efficiently fight fires when processes break.
  • Collaboration — Everyone in your company benefits from access to data. Your infrastructure needs to handle multi-tenancy and collaboration.

Scalability is solved by introducing the concept of a data lake and consolidating all of your data into one system. Wikipedia’s definition is actually spot on:

A massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing “big data”. Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level aggregation, a data lake is designed to retain all attributes, especially so when you do not yet know what the scope of data or its use will be.

Storing everything in one place and then pulling subsets of that data into the various business intelligence or analytics tools is much more manageable because it clearly establishes your source of truth. For every piece of analysis, you should always be able to trace the provenance of data back to your data lake.
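As a rough illustration of what "tracing provenance back to the data lake" can mean in practice, suppose every derived dataset records the inputs it was built from. The dataset names and the `LINEAGE` structure below are hypothetical, and real systems store this metadata in many different ways; this is just a sketch of the idea.

```python
# Hypothetical lineage records: each derived dataset lists its direct inputs.
# Anything without an entry is treated as a raw object in the data lake.
LINEAGE = {
    "dashboard/revenue": ["warehouse/orders_daily"],
    "warehouse/orders_daily": [
        "lake/orders/2016-01.json",
        "lake/orders/2016-02.json",
    ],
    "reports/churn": ["warehouse/orders_daily", "lake/events/2016-02.json"],
}


def trace_to_lake(dataset, lineage):
    """Walk lineage records backward and return the raw sources of `dataset`."""
    sources, seen, stack = set(), set(), [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        inputs = lineage.get(current)
        if inputs:
            stack.extend(inputs)   # derived dataset: keep walking backward
        else:
            sources.add(current)   # no recorded inputs: a raw source
    return sources


print(sorted(trace_to_lake("reports/churn", LINEAGE)))
```

When two reports disagree, comparing the sets returned for each one is a quick first check on whether they were even built from the same raw data.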

The most commonly used data lakes are Amazon S3, Google Cloud Storage, and HDFS (Hadoop). All of these options offer “infinite” scale for relatively low cost. You do trade away significant read/write performance, but there are fairly well-established workarounds. By just adding a data lake and consolidating data flow, your infrastructure can become radically simpler and look something more like this:

The Data Lake and surrounding architecture: this is an overly simplified picture, but it helps to visualize the flow of data

For Dependency management, there is no clean list of de facto tools because, in reality, almost every major company builds their own internally. This is generally a terrible idea, yet so many companies do it because there aren’t any great options available. There’s Oozie for Hadoop, Luigi made by Spotify, Airflow built by Airbnb, Azkaban created at LinkedIn… you can see the pattern here. Every big company built an internal tool which worked great for their specific needs, but unfortunately, internally-built solutions don’t always translate well into generic products.
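To show the core problem all of these tools solve, here is a minimal sketch of a job DAG using Python’s standard-library `graphlib` module (Python 3.9+). It is not how Oozie, Luigi, Airflow, or Azkaban work internally; the job names and the `downstream_of` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
PIPELINE = {
    "clean_events": {"ingest_logs"},
    "daily_signups": {"clean_events"},
    "churn_report": {"clean_events"},
    "dashboard": {"daily_signups", "churn_report"},
}


def run_order(pipeline):
    """Return one valid execution order in which dependencies come first."""
    return list(TopologicalSorter(pipeline).static_order())


def downstream_of(pipeline, failed):
    """Jobs that depend, directly or transitively, on the failed job."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for job, deps in pipeline.items():
            if job not in affected and (failed in deps or deps & affected):
                affected.add(job)
                changed = True
    return affected


print(run_order(PIPELINE))
print(sorted(downstream_of(PIPELINE, "clean_events")))
```

Ordering the jobs is the easy part. The real tools earn their keep on everything around it: scheduling, retries, monitoring, and figuring out what to rerun when a job in the middle of the graph fails.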

Collaboration, the last of the three major data infrastructure challenges, is the most poorly addressed. Large enterprises need a better way to manage data access for teams of data scientists across numerous departments. In addition, data scientists need better collaboration tools so they can work together and build on each other’s work more effectively. Although HDFS does have some basic permissioning configurations, the above tools offer neither fine-grained user and data access control nor sharable workflows.

The collaboration tools that do exist (e.g. Kaggle, Mode Analytics, IPython notebooks) either don’t work at the scale we’re dealing with or are restrictive and cumbersome in other ways. In most large organizations today, collaboration either isn’t a common practice or is governed by internal policies such as naming schemes instead of being exposed through software.

So how do you actually transition from stage 4 to stage 5 effectively?

Or better yet, how do you skip the Web of Chaos completely and build scalable infrastructure from the very beginning?

Well, the major challenge with going from stage 3 to 5 is that you have to give up all those easy-to-use tools in exchange for scalability and generalization, which is why some companies get stuck in stage 4. In addition, most of the best stage 5 tools are in the Hadoop ecosystem and are insanely expensive and complicated to use (I wrote an entire post on this). This is the big problem that my company, Pachyderm, is trying to solve.

Pachyderm provides the scalability of the other stage 5 tools, but with the ease of use from those in stage 3. We accomplish this by packaging everything up in containers so you can use all your favorite tools at scale, yet a single engineer can set up and manage Pachyderm on their own. Pachyderm also comes with built-in dependency management and collaboration primitives, the key challenges that are severely underserved by existing options. If you want to learn more about Pachyderm, check out our website or read some of my other posts.

Conclusion

Hopefully this framework for thinking about the different stages of data infrastructure is useful. Knowing what’s ahead can help you avoid the Web of Chaos and make decisions early that will scale with your company instead of accruing technical debt.

It used to be that only the absolute biggest companies actually had petabyte-scale data workflows. But these days, many teams reach that point at much earlier stages and there’s tons of interesting development happening to meet the growing demand and new challenges ahead. Stay tuned.