As much as we’d like to generate all the data we need in-house (and we do generate a lot), there is often a need to import datasets from external sources, and make them available for querying. Examples of imported data include: analytics metrics, advertising data, and currency exchange rates, all of which are used by Skyscanner engineers and data scientists.
The main entry-point for data at Skyscanner is our Kafka-based streaming platform which allows near real-time processing and archiving. However, in the case of third-party batch data ingestion, realtime processing was not a requirement and the use case was not aligning very well with the stream processing paradigm. For these reasons, we decided to come up with an alternative approach which would allow users (i.e. Skyscanner engineers) to import external data sets without abusing our existing streaming infrastructure.
With this blog post we want to share the approach we followed to solve this problem and the particular aspects of the system we decided to optimise. Hopefully, you’ll get a better understanding of the employed services and a high-level outline of the steps required to implement a similar solution.
To provide a solution that scales across the company we approached the problem with an emphasis on:
- Minimizing dependency on the Data Platform engineers when onboarding new datasets — preventing our availability from becoming a blocker.
- Moving ETL pipeline ownership to the user — allowing users to own the data they produce.
- Automating boilerplate code and config generation — ensuring that infrastructure, permissions and deployment setup are abstracted away from users.
- Scalability and cost management — maintaining flexibility and cost efficiency, and ensuring the solution is future-proof.
With this in mind, we’ve opted to use managed services provided by AWS to enable the various ETL stages and have adopted Cookiecutter for project templating. Cookiecutter is a templating utility widely used throughout Skyscanner and promotes a convention over configuration approach when creating new projects.
These templates we provide can be used to generate the components for a full third-party data pipeline — however, ownership of the resulting components is left with the user. The Data Platform Tribe does still maintain ownership of some basic infrastructure required to integrate the pipeline components, store the ingested data, make ingested data queryable, and simplify monitoring.
Should we go with AWS Batch, Glue or both?
Data in Skyscanner is stored in Parquet, an open source column-oriented format widely used in the Hadoop ecosystem. Parquet provides better compression and improved query times compared to row-based formats like CSV. However, most of the third-party datasets we import arrive in a row-based format (CSV or JSON) so in order for us to gain the advantages of column-based storage, a Parquet conversion step is incorporated in our pipeline.
AWS Batch is a service which allows Docker containers to be executed on both managed and unmanaged compute environments to a schedule. At Skyscanner we use an ECS cluster, which is an unmanaged compute environment. For each container the provisioned vCPU and memory resources are user configurable.
Conversely, AWS Glue ETL is a fully managed service which essentially offers “Spark-as-a-service”. It allows the execution of Python or Scala Spark jobs at scale without having to run your own cluster. The resource allocation of a job is configured with a DPU (Data Processing Unit) parameter, where 1 DPU corresponds to 4 vCPUs and 16GB of memory.
Initially we’d only planned to provide a job template for AWS Glue allowing users to convert imported data into Parquet and apply any necessary transformations. This would have made users responsible for setting up their own process to import data into S3 in a format consumable by Spark. In order to ease this process, we decided to also provide a simple AWS Batch template for importing third-party data, allowing users to import data to S3 by writing only the logic required to access the external dataset.
So, we opted to utilise both solutions; exploiting the unique characteristics of each where appropriate in our pipeline. Although AWS Glue is much easier to scale, it’s limited by the fact that it can only read from sources supported by Spark, whereas AWS Batch can be used to run any kind of logic e.g. hit an http API and store response data in S3.
What does it look like?
A high-level illustration of the third-party data ETL pipeline is presented in the following figure. The two components of the pipeline, which are owned and deployed by the user, are indicated with a cookie 🍪 on the diagram.
The following steps are executed as part of the ETL pipeline:
- An AWS Batch job is triggered on a schedule, imports data from a third-party source, and writes JSON or CSV to the intermediate batch-input S3 bucket under the project specific prefix key. When an import completes a “_SUCCESS” file is written to the same location.
- The generated “_SUCCESS” file is used as an event trigger for an AWS Lambda function which executes the corresponding AWS ETL Glue job, based on the S3 prefix extracted from the event trigger.
- The AWS ETL Glue job handles the Parquet conversion of the batch input data and outputs to the data-platform-archive bucket. Other transformation, validation, and enrichment steps can be executed as part of the job.
- A scheduled job in airflow creates and updates a table in the AWS Glue Catalog. This table can be used to query the generated parquet data using a number of available query engines.
- Metrics indicating the success or failure of the pipeline components are generated by capturing the corresponding CloudWatch events using AWS Lambda.
A naming convention based on the <project> and <event-name> parameters passed to the Cookiecutter templates is used to determine how data gets ingested. As data transitions through the pipeline it is placed in the following locations:
- /<project>/<event-name> prefix within the batch-input S3 bucket, containing the raw third-party data in JSON or CSV format.
- /parquet-batch/raw/<project>/<event-name> prefix within the data-platform-archive S3 bucket, containing the parquet data generated by the glue job.
- <project>_<event-name> table in the AWS Glue Catalog.
Users can create a project for either of the jobs employing Cookiecutter templates and providing a number of configuration parameters, in addition to the <project> and <event-name>. As illustrated in the following figure, after the project is created locally, the user pushes it to GitHub which generates an event that triggers the corresponding Drone build. Subsequently, our internal deployment tool handles the deployment of the actual job (AWS Batch or Glue) along with any dependent CloudFormation resources. By the end of this process, the third-party data pipeline is deployed, and data can be ingested.
Here’s what we learned
- Making the onboarding process as simple as possible greatly improves adoption rate, especially by teams that do not have significant engineering resources.
- Glue jobs have a minimum charge of 10 minutes and a minimum DPU of 2 meaning that it’s not the best option if you are processing small amounts of data.
- If a Glue job runs for less than a minute, no CloudWatch metrics are generated. This is likely due to the polling frequency of the AWS mechanism used to receive spark metrics.
- Scaling AWS Batch for third-party datasets consisting of TBs of data per day might not be very straightforward.
- AWS Glue jobs should be made to be idempotent so that replays of the same job don’t require manual intervention to remove existing output data.
Special thanks to my colleagues Omar Kooheji and Michael Okarimia for their contributions to the current architecture of the system and the preparation of this blog post.
Join Skyscanner, see the world
Life-enriching travel isn’t just for our customers — it’s for our employees too! Skyscanner team members get £500 (or their local currency equivalent) towards the travel trip of their choice in 2019 — and that’s just one of the great benefits we offer. Read more about our benefits and apply online right here.
About the author
Hi! My name is Nikolaos Tsipas and I’m a Distributed Systems Engineer working in the Data Platform Tribe in London. We’re an enablement squad developing intuitive tools allowing anyone in Skyscanner to easily extract information from Petabytes of accurate historical data. Outside of work, I love travelling, playing music and exploring new ideas!