Over the past few years, I’ve had many conversations with friends and colleagues frustrated with how inscrutably complex the data infrastructure ecosystem is. Although not quite as bad as the front-end world, things are changing fast enough to create a buzzword soup.
As a beginner, it’s super challenging to decide what tools are right for you. The Apache Foundation lists 38 projects in the “Big Data” section, and these tools have tons of overlap on the problems they claim to address. For example, Flink, Samza, Storm, and Spark Streaming are all “distributed stream processing engines”, while Apex and Beam both “unify stream and batch processing”.
In this post, I hope to provide some help navigating the options as you set out to build data infrastructure. These are roughly the steps I would follow today, based on my experiences over the last decade and on conversations with colleagues working in this space.
At the start of your project, you’re probably setting out with nothing more than a goal of “get insights from my data.” Mapping this to a specific set of technologies is extremely daunting. You probably don’t have a great sense for what tools are popular, what “stream” or “batch” means, or whether you even need data infrastructure at all.
My hope is that this guidance helps you get off the ground quickly and extract value from your data. I strongly believe in keeping things simple for as long as possible, introducing complexity only when it is needed for scalability. This post follows that arc across three stages. In many ways, it retraces the steps of building data infrastructure that I’ve followed over the past few years.
Note that there is no one right way to architect data infrastructure. For the experts reading this, you may have preferred alternatives to the solutions suggested here. That’s fantastic, and highlights the diversity of amazing tools we have these days. We’ve come a very long way from when Hadoop MapReduce was all we had.
Stage 1: You have small data, and that’s a good thing
So here’s the thing: you probably don’t have “big data” yet. Almost 4 years later, Chris Stucchio’s 2013 article Don’t use Hadoop is still on point. If you have less than 5TB of data, start small. This will save you operational headaches with maintaining systems you don’t need yet. But hey, if you love 3am fire drills from job failures, feel free to skip this section…
Otherwise, stay away from all of the buzzword technologies at the start, and focus on two things: (1) making your data queryable in SQL, and (2) choosing a BI Tool. These will be the “Hello, World” backbone for all of your future data infrastructure.
Everything in SQL
This is really important, because it unlocks data for the entire organization. With rare exceptions for the most intrepid marketing folks, you’ll never convince your non-technical colleagues to learn Kibana, grep some logs, or to use the obscure syntax of your NoSQL datastore.
Providing SQL access enables the entire company to become self-serve analysts, getting your already-stretched engineering team out of the critical path. It also turns everyone into a free QA team for your data. The “hey, these numbers look kind of weird…” moments are invaluable for finding bugs in your data and even in your product.
If your primary datastore is a relational database such as PostgreSQL or MySQL, this is really simple. You can just set up a read replica, provision access, and you’re all set.
With a NoSQL database like ElasticSearch, MongoDB, or DynamoDB, you will need to do more work to convert your data and put it in a SQL database. If you’re new to the data world, we call this an ETL pipeline. Avoid building this yourself if possible, as wiring up an off-the-shelf solution will be much less costly with small data volumes. Depending on your existing infrastructure, there may be a cloud ETL provider like Segment that you can leverage.
If you find that you do need to build your own data pipelines, keep them extremely simple at first. Write a script to periodically dump updates from your database and write them somewhere queryable with SQL.
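As a concrete illustration, here is a minimal sketch of such a script, using SQLite as a stand-in for whatever SQL-queryable destination you choose (the table and column names are hypothetical, and `extract_updates` stands in for a query against your primary datastore or a 3rd-party API):

```python
import sqlite3

def extract_updates():
    """Pull recent records from the source system.

    In practice this would query MongoDB, DynamoDB, a vendor API, etc.
    Here we return hypothetical records inline to keep the sketch runnable.
    """
    return [
        {"user_id": 1, "email": "a@example.com", "signup_ts": "2017-05-01"},
        {"user_id": 2, "email": "b@example.com", "signup_ts": "2017-05-02"},
    ]

def load(records, db_path="warehouse.db"):
    """Write records into a SQL-queryable table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS users (
               user_id INTEGER PRIMARY KEY,
               email TEXT,
               signup_ts TEXT)"""
    )
    # Upsert so the script is safe to re-run over overlapping dumps.
    conn.executemany(
        "INSERT OR REPLACE INTO users (user_id, email, signup_ts) "
        "VALUES (:user_id, :email, :signup_ts)",
        records,
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(extract_updates())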
The story for ETLing data from 3rd party sources is similar to the one for NoSQL databases. Use an ETL-as-a-service provider or write a simple script, and just deposit your data into a SQL-queryable database.
Set up a machine to run your ETL script(s) as a daily cron, and you’re off to the races.
Choosing a BI Tool

A good BI tool is an important part of understanding your data. Some great tools to consider are Chartio, Mode Analytics, and Periscope Data — any one of these should work well to get your analytics off the ground. In most cases, you can point these tools directly at your SQL database with a quick configuration and dive right into creating dashboards.
Pulling this all together, here’s the “Hello, World” of data infrastructure:
Stage 2: Let’s call it “medium” data
At this point, you’ve got more than a few terabytes floating around, and your cron+script ETL is not quite keeping up. Perhaps your datastores have proliferated and you now have a heterogeneous mixture of SQL and NoSQL backends. You may also be gathering data from a handful of third parties. Finally, you may be starting to have multiple stages in your ETL pipelines, with dependencies between steps.
Workflow Management & Automation
Your first step in this phase should be setting up Airflow to manage your ETL pipelines. Airflow will enable you to schedule jobs at regular intervals and express both temporal and logical dependencies between jobs. It is also a great place in your infrastructure to add job retries, monitoring, and alerting for task failures. It’s a running joke that every startup above a certain size writes its own workflow manager / job scheduler. Among others, Spotify wrote Luigi, and Pinterest wrote Pinball. However, these have less momentum in the community and lack some of Airflow’s features.
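To make those two core ideas concrete — logical dependencies between tasks, and retries on failure — here is a stdlib-only toy sketch of what a workflow manager does for you. The task names and pipeline shape are hypothetical; in a real deployment you would express this as an Airflow DAG rather than writing your own runner.

```python
import time

def run_pipeline(tasks, deps, max_retries=2, delay=0.0):
    """Run callables in dependency order, retrying failed tasks.

    tasks: mapping of task name -> zero-arg callable
    deps:  mapping of task name -> list of upstream task names
    """
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):  # logical dependencies
            run(upstream)
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # alerting/monitoring would hook in here
                time.sleep(delay)  # simple fixed retry delay
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

# Hypothetical three-stage ETL: extract -> transform -> load
log = []
tasks = {
    "extract":   lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load":      lambda: log.append("load"),
}
deps = {"transform": ["extract"], "load": ["transform"]}
```

Airflow adds the pieces deliberately left out of this toy: cron-style scheduling, backfills, a UI, and per-task alerting.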
Building ETL Pipelines
As your business grows, your ETL pipeline requirements will change significantly. You will need to start building more scalable infrastructure because a single script won’t cut it anymore. Your goals are also likely to expand from simply enabling SQL access to supporting other downstream jobs that process the same data.
To address these changing requirements, you’ll want to convert your ETL scripts to run as a distributed job on a cluster. The number of possible solutions here is absolutely overwhelming. I’d strongly recommend starting with Apache Spark. Spark has a huge, very active community, scales well, and is fairly easy to get up and running quickly. On AWS, you can run Spark using EMR; for GCP, using Cloud Dataproc. If you’re ingesting data from a relational database, Apache Sqoop is pretty much the standard.
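For a sense of what one of these cluster jobs looks like, here is a minimal PySpark sketch of a transform step. The bucket paths, field names, and schema are all hypothetical, and running it assumes access to a Spark installation (locally, or submitted to EMR/Dataproc with spark-submit).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of a distributed transform step.
spark = SparkSession.builder.appName("etl-transform").getOrCreate()

# Extract: read raw JSON dumps from persistent storage (hypothetical path).
raw = spark.read.json("s3://my-bucket/raw/events/")

# Transform: standardize the schema and drop malformed records.
events = (
    raw.filter(F.col("user_id").isNotNull())
       .withColumn("event_date", F.to_date("timestamp"))
       .select("user_id", "event_type", "event_date")
)

# Load: write back in a columnar format for downstream jobs and SQL engines.
events.write.mode("overwrite").parquet("s3://my-bucket/clean/events/")
spark.stop()
```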
At this point, your ETL infrastructure will start to look like pipelined stages of jobs which implement the three ETL verbs: extract data from sources, transform that data to standardized formats on persistent storage, and load it into a SQL-queryable datastore.
Building a Data Warehouse

At this stage, getting all of your data into SQL will remain a priority, but this is the time when you’ll want to start building out a “real” data warehouse.
For those just starting out, I’d recommend using BigQuery. BigQuery is easy to set up (you can just load records as JSON), supports nested/complex data types, and is fully managed/serverless so you don’t have more infrastructure to maintain. As with many of the recommendations here, alternatives to BigQuery are available: on AWS, Redshift, and on-prem, Presto. I have a strong preference for BigQuery over Redshift due to its serverless design, simplicity of configuring proper security/auditing, and support for complex types. Presto is worth considering if you have a hard requirement for on-prem.
When thinking about setting up your data warehouse, a convenient pattern is to adopt a 2-stage model, where unprocessed data is landed directly in a set of tables, and a second job post-processes this data into “cleaner” tables.
Treat these cleaner tables as an opportunity to create a curated view into your business. For each of the key entities in your business, you should create and curate a table with all of the metrics/KPIs and dimensions that you frequently use to analyze that entity. For example, a “users” table might contain metrics like signup time, number of purchases, and dimensions like geographic location or acquisition channel.
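Sketching the 2-stage pattern with SQLite standing in for your warehouse (the table and column names are hypothetical): raw records land in one table, and a post-processing step builds a curated, per-user table on top of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stage 1: unprocessed data landed directly in a raw table.
conn.execute(
    "CREATE TABLE raw_events (user_id INT, event TEXT, amount REAL, region TEXT)"
)
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?, ?)",
    [
        (1, "signup",   None, "US"),
        (1, "purchase", 9.99, "US"),
        (1, "purchase", 4.99, "US"),
        (2, "signup",   None, "EU"),
    ],
)

# Stage 2: post-process into a "cleaner" curated table keyed on a business
# entity (users), with the metrics and dimensions you analyze most often.
conn.execute(
    """CREATE TABLE users AS
       SELECT user_id,
              SUM(event = 'purchase')  AS num_purchases,  -- metric
              SUM(COALESCE(amount, 0)) AS total_spend,    -- metric
              MAX(region)              AS region          -- dimension
       FROM raw_events
       GROUP BY user_id"""
)
```

In your actual warehouse, stage 2 would be a scheduled job that rebuilds or incrementally updates the curated tables from the landed tables.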
At the end of all this, your infrastructure should look something like this:
🚀 Stage 3: Going big
With the right foundations, further growth doesn’t need to be painful. You can often make do simply by throwing hardware at the problem of handling increased data volumes.
The most challenging problems in this period are often not just raw scale, but expanding requirements. For example, perhaps you need to support A/B testing, train machine learning models, or pipe transformed data into an ElasticSearch cluster.
Some things you may want to consider in this phase:
- Near-Realtime: You probably won’t need a distributed queue or near-realtime infrastructure until much later than you might think. It comes with a lot of added complexity to handle all possible failure modes, which isn’t worth it early on. Once the ROI calculus makes sense, try Kafka or Cloud Pub/Sub.
- Scalability: With a single monolithic Spark cluster, you’ll almost certainly run into issues with resource contention. When you do, fully elastic job-scoped Spark clusters are worth exploring.
- Security & Auditing: At some point you may want to enforce more granular access controls to your data warehouse data. If you use BigQuery, you can provision dataset access to Google Groups, and programmatically manage that access using Deployment Manager. BigQuery also provides audit logs to understand user queries. Other tools like Apache Knox and Apache Sentry are available for on-prem security solutions.
- A/B Testing: For building in-house A/B testing and supporting experimentation, there unfortunately are not many off-the-shelf solutions. Building a pipeline of Spark jobs that populate tables in your data warehouse is probably your best bet.
- Machine Learning: For feature extraction, you can build additional data pipelines in Spark. For the models themselves, you should again start small. Your processed features are likely small enough to fit on one machine, so you can train models using scikit-learn or TensorFlow. When one machine is no longer enough, you can use Spark MLlib or distributed TensorFlow.
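As a toy illustration of “start small”: model training at this scale fits comfortably on one machine. The sketch below trains a tiny logistic-regression classifier with stdlib-only batch code; the features, labels, and hyperparameters are made up, and in practice you would reach for scikit-learn rather than hand-rolling this.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Train a tiny logistic-regression model with per-sample gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                      # gradient of log loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z > 0 else 0

# Hypothetical features, e.g. [num_purchases, days_since_last_visit]
X = [[0.0, 1.0], [0.5, 1.5], [2.0, 0.5], [3.0, 0.0]]
y = [0, 0, 1, 1]
```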
It’s exciting to see how much the data infrastructure ecosystem has improved over the past decade. We’ve come a long way from babysitting Hadoop clusters and performing gymnastics to coerce our data-processing logic into maps and reduces in awkward Java. Back then, building data infrastructure felt like trying to build a skyscraper with a toy hammer.
Today, we have an amazing diversity of tools. Spark has clearly dominated as the jack-of-all-trades replacement for Hadoop MapReduce; the same is starting to happen with TensorFlow as a machine learning platform. With very few exceptions, you don’t need to build infrastructure or tools from scratch in-house these days, and you probably don’t need to manage physical servers. The skyscraper is already there; you just need to choose your paint colors.
Looking ahead, I expect data infrastructure and tools to continue moving towards entirely serverless platforms — Databricks just announced such an offering for Spark. Serverless infrastructure permits an elegant separation of concerns: the cloud providers can worry about the hardware, devops, and tooling, enabling engineers to focus on the problems that are unique to (and aligned with) their businesses. The future is one without hardware failures, ZooKeeper freakouts, or problems with YARN resource contention, and that’s really cool.
Edit: adding links out to some previous posts I wrote about Thumbtack’s data infrastructure: