Building a Modern Data Platform on AWS Glue at Hootsuite

As Hootsuite becomes an even stronger data-driven organization, the demand for data across the company grows. Whether it’s the sales teams that want to know more about their enterprise customers, the marketing teams that want to build better user journeys, or the product teams that want to build the best user experience, teams across Hootsuite are using data to make decisions.

With over 50 inbound and outbound data pipelines and a growing analytics team, we needed a robust, standardized platform to create new data pipelines quickly, while still supporting the legacy pipelines we’ve created over the past 7 years and the variety of use cases our team has.

As the team responsible for building and maintaining new data integrations, we want to build them quickly and reliably. To that end, we wanted our data platform to be highly standardized, opinionated, and templated, both to reduce the time it takes to build a new integration and to make integrations easier to maintain. In a standardized, opinionated platform, there is one way to do common tasks. We named the data platform DataHub.

We chose AWS Glue as the backbone of DataHub. It provides mechanisms to scale for the various workloads of our data pipelines and is easier to maintain than the custom Docker images we used previously. Additionally, taking advantage of Glue Tables gives our team better visibility into the data at all points of our data pipelines. The Glue jobs are orchestrated by Airflow, which we have used for many years and which has served us incredibly well as an orchestration engine.

Figure: A typical data pipeline using the DataHub framework

Developers can create a copy of our template data pipeline, which creates all of the infrastructure and code to run a simple data pipeline. This allows us to create a new data pipeline in just a few minutes.

Next, let’s look at each of the key pieces that make up DataHub.

Airflow is an industry-standard orchestration engine and is popular for orchestrating data pipelines. One change we made with DataHub was moving from configuring Airflow DAGs in Python files to configuring them as YAML with dag-factory.

In the theme of keeping DataHub highly opinionated, standardized, and templated, we built tooling on top of dag-factory to have a “Standard Pipeline”. A Standard Pipeline looks like the pipeline pictured above. However, in the YAML file, we define it as a single task that our build magic turns into Airflow operators with the correct arguments. The process of editing a YAML file and validating that it is still correct is much simpler than doing the same with a Python file, which is one of the advantages of using dag-factory and YAML configuration.
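To make this concrete, here is a minimal sketch of what a dag-factory YAML definition can look like. The DAG, task, and job names are illustrative assumptions, not DataHub's actual configuration:

```yaml
# Hypothetical dag-factory config: one DAG with two Glue tasks.
salesforce_accounts_dag:
  default_args:
    owner: data-team
    start_date: 2021-01-01
  schedule_interval: "0 6 * * *"
  tasks:
    extract:
      operator: airflow.providers.amazon.aws.operators.glue.GlueJobOperator
      job_name: salesforce-accounts-extract
    load:
      operator: airflow.providers.amazon.aws.operators.glue.GlueJobOperator
      job_name: salesforce-accounts-load
      dependencies: [extract]
```

dag-factory reads definitions like this and generates the corresponding Airflow DAG objects, so a pipeline change is a YAML edit rather than Python code.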

We believe that DevOps should make your life easier, not harder, so we centralized deployment code in a shared library on Jenkins. All pipelines are deployed in a standardized way that is easy to maintain across a large number of pipelines. Deploying a data pipeline includes deploying Glue scripts and Airflow config to S3 and updating Glue table schemas. The common deployment logic handles the vast majority of data pipelines but is extensible for complex data pipelines that need custom steps during deployment.
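One benefit of a shared deploy library is that artifact locations become a pure convention. The sketch below illustrates the idea with a hypothetical layout; the paths and names are assumptions for illustration, not Hootsuite's actual convention:

```python
# Illustrative only: a shared deploy step that computes the standardized
# S3 keys where a pipeline's Glue script and Airflow config are uploaded.
def artifact_keys(pipeline: str, env: str) -> dict:
    """Return the S3 keys for a pipeline's deployment artifacts."""
    prefix = f"datahub/{env}/{pipeline}"
    return {
        "glue_script": f"{prefix}/glue/job.py",
        "airflow_config": f"{prefix}/airflow/dag.yaml",
    }

keys = artifact_keys("salesforce_accounts", "production")
```

Because every pipeline's artifacts live at a predictable location, the same Jenkins logic can deploy any of them without per-pipeline configuration.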

Writing infrastructure as code helps maintain parity across dev, staging, and production environments. At Hootsuite we use Terraform as our infrastructure-as-code tool. In DataHub, we terraform as much as possible, including Glue jobs, IAM permissions, S3 buckets, and the Redshift clusters.

By taking advantage of Terraform modules, we define all of the infrastructure code for a data pipeline in one place. This reduces the time to create a new data pipeline, and making changes that affect all pipelines (e.g. changing default Glue job arguments) is much simpler. This also makes setting and managing tight permissions much easier, as described in the next section.

Locking down permissions is critically important, both to protect an organization from bad actors and to prevent bugs caused by programs having more permissions than they should.

Using a Terraform module for infrastructure makes it much easier to manage IAM permissions. Each data pipeline has its own IAM role with the IAM permissions it needs to access the resources (and only those resources) created for that pipeline. This makes it impossible to accidentally read or write another pipeline’s data.

If any additional resources are needed for a data pipeline (e.g. an external S3 bucket), an additional IAM policy can be passed into the Terraform module, and it will be attached to the IAM role for that pipeline.
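Invoking such a module might look like the sketch below. The module name, inputs, and policy are hypothetical, since the actual module interface isn't shown in this post:

```hcl
# Illustrative usage of a shared data-pipeline Terraform module.
module "salesforce_pipeline" {
  source = "../modules/data_pipeline"

  name     = "salesforce-accounts"
  schedule = "cron(0 6 * * ? *)"

  # Extra IAM policies (e.g. read access to an external S3 bucket)
  # are attached to the pipeline's dedicated IAM role.
  additional_policy_arns = [aws_iam_policy.external_bucket_read.arn]
}
```

The module would create the Glue jobs, S3 buckets, and scoped IAM role in one `terraform apply`, so a new pipeline's infrastructure is a few lines of configuration.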

Figure: Each data pipeline has one IAM role, with multiple IAM policies attached

Redshift is the core of our data warehouse. We have two Redshift clusters: a service cluster that data pipelines read from and write to, and an analyst cluster that analysts can run ad-hoc queries and data models against. We took advantage of data sharing on Redshift to share data between the service cluster and the analyst cluster, which lets us run two different clusters while maintaining only one source for the data.
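For readers unfamiliar with Redshift data sharing, the DDL involved looks roughly like this. The share, schema, and database names are illustrative, and the namespace IDs are placeholders:

```sql
-- On the service (producer) cluster: create a datashare and grant it
-- to the analyst cluster's namespace.
CREATE DATASHARE analytics_share;
ALTER DATASHARE analytics_share ADD SCHEMA public;
ALTER DATASHARE analytics_share ADD ALL TABLES IN SCHEMA public;
GRANT USAGE ON DATASHARE analytics_share TO NAMESPACE '<analyst-cluster-namespace>';

-- On the analyst (consumer) cluster: expose the shared data as a database.
CREATE DATABASE shared_db FROM DATASHARE analytics_share OF NAMESPACE '<service-cluster-namespace>';
```

Queries on the analyst cluster then read the producer's live data without any copy or ETL step between the clusters.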

By creating a highly standardized, opinionated, and templated data platform, we have reduced the time to create new data pipelines and reduced the complexity of maintaining them. In the future, we are looking at migrating our legacy data pipelines onto the new data platform, as well as improving data observability on the platform.

If you enjoy working with data pipelines, data analytics, or just like the sound of working at a data-driven company, check out the openings at careers.hootsuite.com.
