Taking the data engineer out of ETL

A story of how Atlassian is reducing data engineering involvement in day to day ETL

Feliks Krawczyk
6 min read · Sep 6, 2017

Extract, Transform and Load

Extract, Transform and Load, or ETL for short, is a beautifully simple concept. Most data scientists, analysts, product managers, developers, or anyone else working with data should be familiar with the idea of extracting data from somewhere, transforming it in a way that makes it more useful to them, and then loading the new data into a new location to be used later.

A typical ETL diagram you might see

Although ETL is a simple concept to understand, in reality it can be quite the opposite. Implementing ETL using big data tools can be extremely complex, if not unnecessarily difficult. Atlassian primarily uses the AWS stack (S3, EMR, Hive, and Athena) to fulfill its data engineering needs, and although AWS provides us with a wide range of useful tools that allow data engineers to do their jobs, for anyone sitting outside of that sphere the learning curve can be steep.

AWS Services

Atlassian’s ETL landscape

For those not familiar with Atlassian, Atlassian is a software-as-a-service company offering a wide suite of tools for teams, from startups to enterprises.

With such a wide variety of tools, Atlassian has some complex data engineering needs. Conforming different product data, tracking customer events across products, and calculating company metrics for each product are some of the challenges faced. Internally this translates to numerous data scientists, data analysts, product managers and data engineers working closely together.

Atlassian products

With differing teams and independent goals, it can be hard for an Atlassian data engineer to be across everything, nor should they need to be. Data engineers should be solving the big engineering problems, or providing services and infrastructure that others can easily use. However, if those services aren’t there, or aren’t user friendly, data engineers can easily become blockers.

Returning to our ETL theme, here at Atlassian we’ve identified that ETL for non data engineers was becoming a major point of contention. Atlassian’s data lake is growing exponentially, and there is a critical need to extract and transform data from our major datasets and load it into smaller subsets that differ depending on each person’s use case.

The contention arose from the fact that data engineers are the ETL domain experts, but have limited time to solve every use case and aren’t willing to support minor ETLs going forward. Non data engineers, on the other hand, are also time constrained and want to find insights quickly, but generally lack the data engineering nuance to ensure their ETL will work, or to understand why it might not. For the most part these minor ETLs are “simple” initial CREATE TABLE AS SQL queries, possibly run on a schedule, in which case they can easily become INSERT INTO (append) type queries (an example is sketched below).

Example of a simple ETL someone might want
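In code form, that kind of request might look like the sketch below, expressed here through Spark SQL. Every table, schema and column name is hypothetical; the point is that the submitter thinks in terms of a query, not infrastructure.

    from pyspark.sql import SparkSession

    # Hypothetical "minor" ETL: every table and column name here is made up.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Initial run: CREATE TABLE AS
    spark.sql("""
        CREATE TABLE analytics.daily_signups AS
        SELECT signup_date, product, COUNT(*) AS signups
        FROM raw.events
        WHERE event_type = 'signup'
        GROUP BY signup_date, product
    """)

    # Later scheduled runs: INSERT INTO (append)
    spark.sql("""
        INSERT INTO analytics.daily_signups
        SELECT signup_date, product, COUNT(*) AS signups
        FROM raw.events
        WHERE event_type = 'signup' AND signup_date = current_date()
        GROUP BY signup_date, product
    """)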

Behind the “Create Table As” ETL

When data engineers see a CREATE TABLE AS statement, these are the questions that come to mind:

  • How much data is being scanned? Are data partitions being hit?
  • How is the SQL statement being run? Hive? Presto? Spark? Are there enough resources to process the request?
  • Where is the output data being stored?
  • How is it being stored? What file format? Parquet, ORC, Text?
  • Is the output table partitioned?
  • Is the metastore being correctly updated?
  • How often is this going to be run? How is it going to be scheduled?

Most of these questions relate to data engineering best practices, and are a case of writing correct data definition language (DDL) SQL statements, ensuring the right tool is used for the job, and linking everything together so it works.
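For a single table, answering those questions might boil down to DDL along these lines, issued here through PySpark with Hive support. The bucket, schema and partition names are placeholders, not Atlassian’s actual layout.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical DDL: a partitioned, Parquet-backed external table on S3.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.daily_signups (
            product STRING,
            signups BIGINT
        )
        PARTITIONED BY (signup_date STRING)
        STORED AS PARQUET
        LOCATION 's3://example-bucket/analytics/daily_signups/'
    """)

    # Tell the metastore about any partitions already written under that location.
    spark.sql("MSCK REPAIR TABLE analytics.daily_signups")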

However, non data engineers (the people actually submitting the SQL) shouldn’t need the additional burden of figuring out how to answer these questions, or of learning these data engineering best practices. This is especially true given they should be focused on the data output and the insights they can find in it.

It became apparent there was a clear need for a service that would:

  • Make it easy to create simple ETL while adhering to data engineering best practices
  • Reduce actual data engineer involvement, as they shouldn’t need to be involved all the time

So that’s what Atlassian has done. We think that in an ideal world, minor ETL should really just be a case of submitting some SQL and having the rest magically taken care of for you.

Transformation Service

Leveraging Spark on AWS EMR, Atlassian has built a service that allows users to submit SQL and have its output populated into a table of their choosing. The service essentially abstracts away all the data engineering and takes care of it for you in the background.

Transformation Service Diagram

Using a data portal front end, users can submit SQL, and the service initially:

  • Ensures the SQL is valid
  • Outputs the table schema of what the SQL would return
Verifying SQL
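Spark makes this kind of check cheap: it parses and analyses a query, and can therefore report the output schema, without executing anything. A minimal sketch of the idea (not the service’s actual code):

    from pyspark.sql import SparkSession
    from pyspark.sql.utils import AnalysisException, ParseException

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    def validate_sql(user_sql):
        """Return (schema, error) for the submitted SQL without running it."""
        try:
            # Parsing and analysis happen here; no Spark job is executed yet.
            df = spark.sql(user_sql)
            return df.schema.simpleString(), None
        except (ParseException, AnalysisException) as exc:
            return None, str(exc)

    schema, error = validate_sql("SELECT 1 AS id, 'jira' AS product")
    print(schema or error)  # struct<id:int,product:string>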

Users can then select:

  • Which schema to create a table in
  • Whether they wish to partition it on any of the columns returned
  • Whether it should be run on a schedule.
Schema, table and partition selection

The service then compiles a Spark submit job that is sent to an EMR Spark cluster that:

  • Processes and runs the SQL
  • Writes the results as Parquet files to a unique S3 location, with partitions (if any were selected)
  • Updates Atlassian metastores by issuing DDL commands
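The core of that compiled Spark job could look roughly like the sketch below. The SQL, paths and table names are illustrative, and the real service also issues DDL against Atlassian’s other metastores rather than relying on saveAsTable alone.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical inputs the service compiles into the job.
    user_sql = """
        SELECT signup_date, product, COUNT(*) AS signups
        FROM raw.events
        GROUP BY signup_date, product
    """
    output_table = "analytics.daily_signups"
    output_path = "s3://example-bucket/analytics/daily_signups/"
    partition_cols = ["signup_date"]

    # Run the SQL (execution is triggered by the write below).
    df = spark.sql(user_sql)

    # Write Parquet to a unique S3 location, partitioned as requested,
    # and register the table in the metastore.
    (df.write
        .format("parquet")
        .mode("overwrite")
        .partitionBy(*partition_cols)
        .option("path", output_path)
        .saveAsTable(output_table))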

Should users want to schedule their SQL (to append or refresh), we utilise Airflow and create a DAG that compiles the Spark submit job and sends it off on the requested schedule.

Airflow DAG
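Airflow DAGs are just Python, so the generated DAG can be very small. The schedule, script location and spark-submit arguments below are assumptions for illustration.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {
        "owner": "transformation-service",
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    # One generated DAG per scheduled SQL; the name is hypothetical.
    dag = DAG(
        "refresh_daily_signups",
        default_args=default_args,
        start_date=datetime(2017, 9, 1),
        schedule_interval="@daily",
    )

    # Compile and send the Spark submit job on the requested schedule.
    spark_submit = BashOperator(
        task_id="spark_submit_sql",
        bash_command=(
            "spark-submit --master yarn --deploy-mode cluster "
            "s3://example-bucket/jobs/run_sql.py --sql-id daily_signups"
        ),
        dag=dag,
    )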

With this service now running, we are seeing a reduction in data engineering requests about how to create and populate tables. This is the desired outcome, as there is now a user-friendly way to self-serve ETL.

Learnings

Although the transformation service has had successful adoption, resource management to ensure everyone’s ETL will run successfully is still difficult. There are cases where our Spark cluster can’t cope with the amount of data that needs to be processed. We are looking at introducing ephemeral clusters that would exist for a single Spark submit job, allowing users to specify the cluster size should they need something larger than usual.

In the meantime we have also implemented autoscaling on our cluster, triggered on memory use. This has proven quite successful, provided the submitted SQL is reasonable. The cluster scales up aggressively when load requires it, and gracefully scales down when nothing is running. This has the added benefit of cost savings, as we keep the cluster size low during quiet times (e.g. weekends).
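One way to express memory-triggered scaling is an EMR automatic scaling policy on the task or core instance group, for example via boto3. The IDs, thresholds and capacities below are placeholders, not our production settings.

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    emr.put_auto_scaling_policy(
        ClusterId="j-XXXXXXXXXXXXX",          # hypothetical cluster ID
        InstanceGroupId="ig-XXXXXXXXXXXXX",   # hypothetical instance group ID
        AutoScalingPolicy={
            "Constraints": {"MinCapacity": 2, "MaxCapacity": 20},
            "Rules": [
                {
                    "Name": "ScaleOutOnLowMemory",
                    "Description": "Add nodes aggressively when YARN memory runs low.",
                    "Action": {
                        "SimpleScalingPolicyConfiguration": {
                            "AdjustmentType": "CHANGE_IN_CAPACITY",
                            "ScalingAdjustment": 4,
                            "CoolDown": 300,
                        }
                    },
                    "Trigger": {
                        "CloudWatchAlarmDefinition": {
                            "ComparisonOperator": "LESS_THAN",
                            "EvaluationPeriods": 1,
                            "MetricName": "YARNMemoryAvailablePercentage",
                            "Period": 300,
                            "Statistic": "AVERAGE",
                            "Threshold": 15.0,
                            "Unit": "PERCENT",
                        }
                    },
                },
                # A mirrored rule scales in gently (e.g. -1 node) when
                # YARNMemoryAvailablePercentage stays high for a while.
            ],
        },
    )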

Finally, education in SQL is still proving to be necessary. Optimising users’ SQL solves 90% of our failures within the service. Many users are unaware of how much redundant data they might be scanning with their SQL. We are still working on the best way to help inform users of SQL best practices.
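The most common failure is a query that ignores the partition column and scans the entire table. A hypothetical before and after, where signup_date is the partition column:

    # Scans every partition of the (hypothetical) raw.events table.
    wasteful_sql = """
        SELECT product, COUNT(*) AS signups
        FROM raw.events
        WHERE event_type = 'signup'
        GROUP BY product
    """

    # Prunes to a single day because signup_date is the partition column.
    pruned_sql = """
        SELECT product, COUNT(*) AS signups
        FROM raw.events
        WHERE signup_date = '2017-09-01' AND event_type = 'signup'
        GROUP BY product
    """

Nudging users from the first kind of query to the second is where most of that education effort is going.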
