Skyline: ETL-as-a-Service

Published in

Pinterest Engineering Blog

4 min readOct 2, 2015

Mao Ye | Pinterest engineer, Discovery

Data plays a big role at Pinterest, in both enabling new product experiences for Pinners and providing insights into Pinner behavior. We’re always thinking of ways to reduce the friction of converting our data into actionable insight. A question often asked by our data users (primarily analysts, product managers, engineers and data scientists) is, “How can I build a reporting dashboard to track xyz metrics daily?”. As the number of employees accessing data, the volume of data and the number of queries increased, scaling and supporting emerging requirements became a big challenge. In order to tackle this, we built a new data platform called Skyline.

Before the birth of Skyline, data users were dependent on data engineers to:

Find the right tables to query
Productionize the query as an Extract, Transform and Load (ETL)
Backfill historical data (e.g., past one year data) for a given ETL job, because data users often want to track metrics for a whole year

Skyline solves the above requests and enables three core use cases:

Self-service UI to create and schedule workflows/jobs that process this data
Visual exploration of the data in Pinterest warehouse
Ability understand the lineage of data (i.e., data dependences)

Self-service ETL

We built a self-service ETL manager to simplify job creation and scheduling. Using this tool, data users can create and schedule data jobs using simple SQL queries to do the following tasks all by themselves:

Move data from Hive to Redshift
Build Pinalytics (Pinterest’s data analytics engine) reports
Backfill historical data on demand

Because of the simple design and ease of use, ETL Manager has been adopted by data analysts and product managers quickly. More than 40 percent of Pinalytics reports have been created using Skyline.

Data Exploration

Along with providing a self-service ETL, we knew users wanted to explore the data that exists in the Pinterest warehouse (Amazon S3). We built a feature called Skyline Data Warehouse which provides a catalog view of core datasets in our warehouse and is automatically updated every day. Skyline Data Warehouse provides the following information about data tables:

Which workflow/job generated the data
The owner (i.e., a person or a team to can contact with questions)
Whether a table is up-to-date
Column names and types
Other details like its size in bytes, estimated row counts, table partitions and sample rows
Data users’ comments and notes

Visualizing data lineage

At our scale, we run thousands of data jobs, and each has inter-dependencies. Before, if a job broke in production, it was difficult for users to find all the affected downstream tables. In order to make this easier, we built an automatic lineage detection system that monitors and instruments all data jobs and helps to:

Understand how a table is derived from other tables (in MySQL, Hive, HBase, Redis and Redshift)
Figure out how a Pinalytics report is generated
Get a quick glimpse of owner/job/workflow information of a table

Architecture

Skyline has the following components:

Skyline UI: We built the web application on top of the Flask framework. The frontend uses the React.js library to create user interface components, such as simple interactive visualizations, and SQLAlchemy interacts with the backend database.
Thrift service and database: Job’s input/output information is the essential knowledge that enables Skyline Data Warehouse, Data Lineage and ETL manager features. All data is stored in a MySQL database, and we built a Thrift server that the clients interact with to read and write this data. Because the data volume isn’t huge, and because we needed to do many join operations to get data lineage relationship, we use MySQL as the persistent storage instead of other choices like HBase.
ETL Driver: ETL Driver is a Python library we built to parse ETL workflows/jobs, created through Skyline ETL manager. After parsing, ETL Driver converts the jobs as tokens that are recognized by Pinball (our workflow management system), which later schedules those jobs for processing.
ETL query/job parser: In order to automatically instrument the input and output of the jobs, we built a pair of parsers:

Query parser, which automatically detects the input/output tables for all queries
Hadoop configure parser, which automatically detects input/output S3 file paths for given Hadoop jobs.

Whenever a user creates a new ETL job on Skyline, its input/output information will automatically be extracted, stored and made available for easy exploration through Skyline UI.

Roadmap

We hope to open-source Skyline soon. In the midterm, we’re thinking about extending Skyline by implementing the following features:

The ability to run ad hoc queries on a variety of backends like Hive, Redshift and Presto
The ability to search any data-related information (e.g., user queries, comments, column information, sample data)
Support for prioritized ETL workflow/job scheduling

If you’re interested in working on data infrastructure projects like Skyline, join our team!

Acknowledgements: Skyline was built by Mao Ye, Changshu Liu, Yinian Qi and Julia Oh along with the rest of the Data Engineering team. We got a lot of useful feedback from the Business Analytics team during the development phase and other data users across the company.