Building Thumbtack’s Data Infrastructure

By: Nate Kupp

Image for post
Image for post
Photo by Joshua Sortino on Unsplash

Part 1: Organizing Chaos

Over the past year, we’ve built out Thumbtack’s data infrastructure from the ground up. In this two-part blog post, I wanted to share where we came from, some of the lessons we’ve learned, and key decisions we’ve made along the way.

Image for post
Image for post

Legacy Event Data Pipeline

Our pre-existing event-logging pipeline was designed to collect events from our front-end servers and deposit those events into . All event data was batched on the production front-end servers and an hourly cron was executed to send these batches off to MongoDB for storage. Once in MongoDB, an aggregation process would summarize inbound events.

@events.capture('type = "home page/view" AND browser_engine')
def on_homepage_view(event):
on_view(event)
@events.capture(type='category group page/close lightbox')
@events.capture(type='landing page/close lightbox')
@events.capture(type='service/close lightbox')
@events.capture(type='request form/close lightbox')
def on_interaction_event(event):
on_view(event)

Planning: Our Roadmap for Success

In planning our data infrastructure development project, we had three primary goals:

Scalability

A scalable data infrastructure to accommodate our growth was top-of-mind, given how our existing infrastructure wasn’t able to support our data volume. We wanted to ensure our redesigned version could support significant, sustained growth.

Decoupling

Decoupling of analytics and production resources was also important to ensure that failures in our analytics infrastructure did not impact production resources and vice versa. Decoupling these systems also enables us to scale our production and analytics infrastructures independently.

Ease of access

In our revised data infrastructure, we wanted to ensure that everyone at Thumbtack had easy access to all of our data, even for teams outside of engineering. For us, this meant our planning included both (1) a SQL interface, and (2) dashboarding/analytics tools.

Image for post
Image for post

Batch Processing

One of the most pressing problems that we needed to solve at Thumbtack was a system for distributed batch jobs to process our data. As described above, our event analytics were encountering serious scaling issues, and replacing this system with some kind of distributed event analytics was our highest priority. Our events data was essentially unusable, given that it was locked up in an underpowered MongoDB cluster with no parallelism for aggregations or downstream processing. This was a key decision point for us, since quite a few different batch processing systems were available. In my previous work, I used vanilla Hadoop MapReduce, but had seen data engineering teams elsewhere having success with / and/or Apache Spark .

AWS Deployment

Once we decided to use Spark, we also had to decide how to deploy it in production. We considered several options for our Spark deployment, based on either Amazon Elastic MapReduce (EMR) or Cloudera CDH . These were:

  1. Amazon EMR: write data to EMR HDFS, run Spark jobs on EMR and write to EMR HDFS
  2. Cloudera CDH: write data to CDH HDFS on EC2 nodes, run Spark jobs on CDH and write to CDH HDFS
  • Spark at the time we began this project in early 2015, Spark was not yet available on EMR (Spark on EMR was introduced in June 2015), and so we’d need to manage our own Spark installation on EC2 nodes.
  • Package Versions EMR tends to package and support somewhat old versions of tools in the Hadoop ecosystem, a point that became particularly important when we were considering what distributed SQL engine to use. For example, at the time of writing, EMR packages Hadoop 2.4.0, Hive 0.13.1, and Impala 1.2.4, whereas we currently have Hadoop 2.6.0, Hive 1.1.0, and Impala 2.3.0 deployed on our production CDH cluster.
  • Availability Amazon’s EMR releases differ from the Apache releases (see notes ). Most importantly, the high availability features of HDFS and YARN are not included. We were particularly concerned about availability guarantees with our cluster, and this was one of the biggest factors in choosing not to use EMR.
  • Experience two members of our team had prior experience maintaining large production CDH clusters.
Image for post
Image for post

Distributed SQL Infrastructure

With this infrastructure in place to address running batch jobs, we also needed to support SQL queries for offline analytics. This was critical to the business, as everyone from engineering to product management to finance to marketing needed rapid access to large amounts of relational data. Largely, the organization was still using PGAdmin to run queries against a production PostgreSQL replica. This had a few major problems:

  • Coupling since analysis queries were hitting a read-only replica of our production PostgreSQL database, offline analytics were tightly coupled to the state of production infrastructure.
  • Performance many of our offline analytics queries involved large batch reads (e.g. summarize all actions taken on Thumbtack by pro over the past X years) from a single database replica, which meant that performance was terrible. Many queries would take nearly an hour to complete and any queries hitting our largest tables would frequently just time out.
  • Authentication and Authorization all queries were run against the database using a single read-only user, which meant that we had very little visibility or control regarding access to data.

Dashboarding & BI Tools

The widespread access to a PostgreSQL replica meant that a plethora of downstream tools had accumulated and it was now time to painstakingly migrate to Impala (fun!). At the onset, many fragile pipelines had popped up to move data around to teams outside of engineering, and we had no idea all the ways that teams were querying this data.

Data Ingest Pipelines

At Thumbtack, we have many sources of data that we wanted to ETL into a single data warehouse. However, at the beginning of this project, we had two data sources we were most interested in ingesting and using. First, we wanted to ingest all of our relational data to provide access in Impala. This data is frequently used for tracking key company metrics, financial reporting, and other critical business analytics. Second, we wanted to ingest our event data, and transform it (with Spark batch jobs) to make it accessible for downstream analytics, most importantly in support of A/B testing.

Relational ETL

For our relational data, we needed a way to retrieve data from our PostgreSQL database and ingest it into HDFS. Our goal was to get all of our relational data regularly imported from Postgres into Parquet files on HDFS, without too much delay from the most recent records stored in Postgres. We use Parquet to reap the performance advantage of a columnar data storage format.

Event ETL

When we began our effort to bring event data into Hadoop, we needed a fast, production-ready solution. The existing event ETL pipeline, which was loading events into MongoDB, batched events in hourly event logs local to each of the front-end servers. A cron ran hourly, picked up these logs, and pushed them into MongoDB.

Time to Put Our Plan Into Play

Over the past year, we’ve spent a lot of time thoughtfully designing and building out our data infrastructure and made great progress towards our goals of scalability, decoupling, and ease of access. We’ve now scaled up to handle thousands of SQL queries per day across billions of records, our data infrastructure is fully decoupled, and the vast majority of the organization is designing and running reports and dashboards touching data from every corner of the organization.

Thumbtack Engineering

From the Engineering team at Thumbtack

Thumbtack Engineering

Written by

We're the builders behind Thumbtack - an online marketplace that matches customers with local professionals to accomplish their projects.

Thumbtack Engineering

Stories from the Engineering team at Thumbtack

Thumbtack Engineering

Written by

We're the builders behind Thumbtack - an online marketplace that matches customers with local professionals to accomplish their projects.

Thumbtack Engineering

Stories from the Engineering team at Thumbtack

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store