Rethinking our data platform

Holt Calder
Published in In the weeds · Jul 17, 2023

I’m Holt, and I lead the Data Platform team here at Greenhouse. Recently we have been modernizing our data stack, moving away from an all-in-one vendor in favor of a platform that uses the best technology for each job. I’d like to walk you through our platform modernization initiative and open a series focused on the technical details of how we implemented each of the technologies that powers our platform.

Why we embarked on this journey

When I joined Greenhouse, our data group looked similar to that of a lot of born-in-the-cloud organizations. We started with a small team responsible for every aspect of providing data to our users to support reporting, data science, and even some early machine learning workloads. Early on, it was clear that this team could be most impactful by minimizing the time required to keep the lights on for the platform, which led us to select a managed platform that handled almost every aspect of data engineering for us.

Earlier this year we learned that we would be required to migrate away from the all-in-one vendor, with a decommission date at the end of 2023. Due to the aggressive timeline, we had to be strategic about the technologies we selected and focus on delivering value quickly. Much of our implementation journey involved iterating on and improving an MVP, which helped us move workloads over as quickly as possible once the foundation was in place. In a lot of ways this was a blessing in disguise because, like so many others, we had outgrown our tech stack.

The existing platform was full of issues we needed to address in our migration. As our users and data volumes grew, basic tasks became difficult and time-consuming. For example, we leveraged Terraform to manage the users in our Redshift environment. When someone wanted to join the platform as a power user, something that should have been celebrated turned into a weird ritual that looked like this:

  • We provision the user with Terraform
  • We securely share credentials with the end user
  • We share some documentation for how to connect to our VPN
  • We share more documentation for how to configure DBeaver

A lot of analysts and less technical users dropped out before they were even able to begin exploring ways they could use the platform beyond the traditional reports in our BI tools.

In addition to workflow inefficiencies, as our data volumes grew, performance issues became more common and harder to troubleshoot. Over time, the number of incidents we encountered turned our support rotation into more than a full-time job, and earlier this year we jumped at the opportunity to redesign the platform.

Technologies we used

[Diagram: a visual representation of how data flows through our platform]

I’d be remiss not to call out that technology is only a small part of the equation, and my team at Greenhouse has been absolutely world class at making these technologies work for our end users. When selecting tools, our guiding principle was that every tool needed, at some level, to be self-service for our end users as we grow.

The tools we chose are:

  • Snowflake
  • AWS Database Migration Service (DMS)
  • Fivetran
  • dbt

We built our entire platform around Snowflake as the central warehousing technology. Snowflake worked for us because it let us right-size our compute resources for each of our use cases and get out of the business of fighting for time in the middle of the night for jobs to run on our warehouse. In addition to the separation of storage and compute that is foundational to the Snowflake ecosystem, we have seen a huge benefit from features that come with Snowflake being a more modern platform (for example, the user signup flow I described earlier was replaced by Snowflake’s Okta integration and is now part of our enterprise-wide Identity Governance program). We will share more detail in future posts about the features we’re using in Snowflake, so keep an eye out.
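To make that idea concrete, here’s a minimal sketch, not our production setup, of what per-workload warehouses and SSO-based connections can look like with the snowflake-connector-python package. The account, user, warehouse names, and sizes below are all illustrative.

```python
# A minimal sketch: one right-sized warehouse per workload, created with
# the snowflake-connector-python package. All names/sizes are illustrative.
import snowflake.connector

# authenticator="externalbrowser" hands authentication off to the IdP
# (Okta, for example), so no long-lived passwords are shared with end users.
conn = snowflake.connector.connect(
    account="my_account",         # hypothetical account identifier
    user="analyst@example.com",   # hypothetical user
    authenticator="externalbrowser",
    role="SYSADMIN",
)

# Each warehouse suspends when idle, so every use case only pays for the
# compute it actually needs.
warehouses = {
    "REPORTING_WH": "MEDIUM",
    "DATA_SCIENCE_WH": "LARGE",
    "INGESTION_WH": "XSMALL",
}

with conn.cursor() as cur:
    for name, size in warehouses.items():
        cur.execute(
            f"CREATE WAREHOUSE IF NOT EXISTS {name} "
            f"WAREHOUSE_SIZE = '{size}' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE"
        )

conn.close()
```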

For ingestion, we decided to move away from the idea that every table should use the same solution. Some tables need to be ingested daily, some hourly, and others even more frequently, so we leverage a handful of tools to populate our data lake inside of Snowflake. For our product datasets, we stayed within the AWS ecosystem and use AWS Database Migration Service (DMS) to publish our datasets to S3, where we then ingest them into Snowflake using Tasks. This solution gives us the flexibility to scale for the foreseeable future at a fraction of the cost of other tools. Along the way we learned that DMS can be a complex system, and we will share more about our lessons learned in a follow-up blog post. For other ingestion pipelines where a connector is available, we leverage Fivetran for its ease of use and reliability. We’re still solving for some of our near-real-time workloads, but are excited to explore Snowpipe Streaming when it becomes generally available.
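As a rough illustration of the Snowflake side of that pipeline, the sketch below creates an external stage over the S3 prefix DMS writes to and a scheduled Task that copies new files into a raw table. The stage, table, integration, and schedule names are hypothetical, and our real setup layers on CDC merge logic and error handling that’s out of scope here.

```python
# A simplified sketch of the Snowflake side of a DMS pipeline: DMS lands
# Parquet files in S3, and a scheduled Snowflake Task copies them into a
# raw table. All object names below are hypothetical.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",                       # hypothetical
    user="platform_svc",                        # hypothetical service user
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="SYSADMIN",
    warehouse="INGESTION_WH",
)

with conn.cursor() as cur:
    # External stage pointing at the S3 prefix DMS writes to.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS raw.dms_candidates_stage
          URL = 's3://example-bucket/dms/candidates/'
          STORAGE_INTEGRATION = s3_integration
          FILE_FORMAT = (TYPE = PARQUET)
    """)

    # A Task that ingests new files on a schedule, so the lake stays fresh
    # without an always-on ingestion service. Assumes raw.candidates exists.
    cur.execute("""
        CREATE TASK IF NOT EXISTS raw.ingest_candidates
          WAREHOUSE = INGESTION_WH
          SCHEDULE = '60 MINUTE'
        AS
          COPY INTO raw.candidates
          FROM @raw.dms_candidates_stage
          MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)

    # Tasks are created suspended; resume to start the schedule.
    cur.execute("ALTER TASK raw.ingest_candidates RESUME")

conn.close()
```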

To take data from our data lake and make it valuable for our users’ different workloads, we rolled out dbt and made it self-service for power users. dbt really seems like the de facto choice for the T in ELT, and the things our power users have been able to bring back to their departments have solidified our decision to build transformations across the platform with it.
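For a flavor of what that can look like, here’s a small, hypothetical sketch of a scheduled job running a slice of a dbt project through dbt’s programmatic invocation API (available in dbt-core 1.5+). The selector and directory paths are made up, and most power users would simply use the dbt CLI directly.

```python
# A hypothetical sketch of running a tagged slice of a dbt project from a
# scheduled job via dbt's programmatic invocation API (dbt-core >= 1.5).
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Build the models tagged for one department, along with their tests.
# The tag and directories are illustrative.
res: dbtRunnerResult = dbt.invoke([
    "build",
    "--select", "tag:recruiting_analytics",
    "--project-dir", "analytics",
    "--profiles-dir", "analytics",
])

if not res.success:
    raise SystemExit(f"dbt build failed: {res.exception}")

# Print a node-level summary of what ran and how it finished.
for r in res.result:
    print(f"{r.node.name}: {r.status}")
```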

Why we selected the tools

At the end of the day, our new platform is about delivering value to our stakeholders and our customers as quickly as possible, and these tools are already helping us do that. We based our vision for the new platform on the idea of the Modern Data Stack, which at its core focuses on selecting the right tool for the job and using services that are easy to try and easy to deploy. We’re confident this approach will allow us to stay at the cutting edge, make tooling changes easily as the industry matures, and take full ownership of workloads by bringing them in-house when doing so produces a competitive advantage.

Using this approach, we’ve been able to fail fast, iterate on our work quickly, and focus on delivering value to our users.

I’m excited to continue sharing the details of our journey and what we’ve learned along the way (I’m especially excited for us to put the DMS post out). If you have any thoughts or questions, please feel free to reach out; we’d love to hear about and learn from your experiences!

Our team at Greenhouse is responsible for delivering data to the entire organization and supports delivering data to our customers to promote more fair and equitable hiring. I remember encountering Greenhouse regularly as a candidate during my own job search, and learning more about Greenhouse’s ethos behind hiring once I applied. The idea that the most important decision a company makes is its next hire resonated deeply with me, and everything we do with data at Greenhouse serves the goal of making our customers better at hiring. Whenever I encounter a Greenhouse customer I’m proud to represent our organization, and I’m thrilled to start sharing the journey we’ve been on over the last few months. So, with that, I hope you’ll follow along as we dig into exactly how we’ve gone about modernizing our data platform.
