[Podcast] Greenhouse’s Data Infrastructure

Published in

In the weeds

1 min read · Jun 7, 2019

Two weeks ago I was a guest on the Data Engineering Podcast along with the founder of Datacoral, Raghu Murthy. In the episode, I discuss how we use Datacoral to manage our data infrastructure at Greenhouse. We use Datacoral’s ETL infrastructure (hosted in our AWS VPC) to pull data from many different sources, including our production Postgres databases, Salesforce, Zendesk, Jira, Asana, Datadog, and more, into our S3 data lake/Redshift data warehouse.

Leveraging this toolset, our data science team has deployed hundreds of automated “materialized views” (SQL queries automatically converted into a DAG and run on a regular cadence) that shape and reshape the data in S3/Redshift into forms that are more easily queryable in our BI tools (for dashboards and other reporting) and that are ready for modeling. We are also able to leverage these ETLs to push data from S3/Redshift out to other systems, such as the Customer Success Management software that our team uses to ensure our customers are getting as much value as possible out of Greenhouse’s Talent Acquisition Suite.
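To make the "SQL queries converted into a DAG" idea concrete, here is a minimal sketch in Python. It is not Datacoral's implementation, which this post does not describe. The view names, table names, and the regex-based dependency detection are all assumptions made for illustration; a real system would use a proper SQL parser.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical views (name -> SQL). Every table and view name here is made up.
VIEWS = {
    "open_tickets": (
        "SELECT account_id, COUNT(*) AS n FROM zendesk_tickets "
        "WHERE status = 'open' GROUP BY account_id"
    ),
    "account_health": (
        "SELECT a.id, t.n FROM salesforce_accounts a "
        "JOIN open_tickets t ON t.account_id = a.id"
    ),
    "cs_export": "SELECT id, n FROM account_health WHERE n > 5",
}


def referenced_tables(sql):
    """Naively collect the identifiers that follow FROM or JOIN."""
    pattern = r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))


def build_dag(views):
    """Map each view to the other views it reads from.

    Raw source tables (anything that is not itself a view) are ignored,
    since they are loaded upstream rather than computed here.
    """
    names = set(views)
    return {
        name: (referenced_tables(sql) & names) - {name}
        for name, sql in views.items()
    }


def run_order(views):
    """Return an order in which every view runs after the views it depends on."""
    return list(TopologicalSorter(build_dag(views)).static_order())


if __name__ == "__main__":
    for view in run_order(VIEWS):
        print("refresh", view)
```

A scheduler built on this idea would refresh each view in the returned order on its cadence, so that a downstream view such as the hypothetical `cs_export` is only rebuilt once the views it reads from are up to date.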

I go into more detail in the episode. Please give it a listen!

P.S. We’re hiring for many roles!


Written by Aaron Gibralter