Ben Parker: Thinking Like A Data Engineer

Published in Kobalt Music Group · 6 min read · May 2, 2019

Ben Parker, a Senior Data Engineer at Kobalt, discusses the world in which he works, the challenges faced by data engineers, and why testing like a scientist is the way forward.

Ben Parker, Senior Data Engineer, Kobalt

There are many challenges when transitioning from back-end to data engineering. I’ve found I’ve needed to change my mindset, particularly when thinking about my team’s role within the wider business, my approach to testing, and how to work effectively with the slower feedback cycle. These challenges are especially present at Kobalt, where explosive growth in the music streaming market creates large, complex data sets that need effective management. Processing royalties on a global scale requires a coordinated and efficient data engineering function, where accuracy and speed allow us to build trust and transparency within a traditionally opaque industry.

Before diving into the detail, it’s worth asking what I mean by data engineering.

What is a data engineer?
Consider a tech department in a service-oriented architecture. There is a front-end that communicates with a collection of independent services to handle specific requirements. Each back-end team has its own area of responsibility and business understanding, ideally with a significant degree of independence. Let’s say we have a team that gathers royalties, a team that investigates streaming data (Spotify plays, YouTube hits) and a team that is responsible for client agreements (and the commercial split between them). With these teams established, someone in the company would like to know: “Which of our clients have a lot of Spotify traffic but low total royalties?” There is no unified way to talk to all three of these business functions, and the teams responsible for them are very careful about integration for solid architectural reasons.

Enter data engineering:

Data engineers will extract the data out of back-end systems and external sources (or have those systems do the extract for them), organise it and present it to those who have questions. Working as a data engineer in such an architecture requires a broad understanding of the business, often relying on relationships with 10+ different teams to gather the data required and surface it to a variety of users. I similarly find myself needing to understand far more of the wider architecture of the department than when I was a back-end engineer with a specific responsibility.
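To make that concrete, here is a minimal sketch of the kind of cross-team query a data engineer ends up serving once the three teams’ data has been landed in one place. The paths, column names and thresholds (streaming, royalties, clients, clientId, playCount, amount) are hypothetical, not Kobalt’s actual schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HighTrafficLowRoyalties {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("high-traffic-low-royalties")
      .master("local[*]") // local mode for the sketch; a real job would run on a cluster
      .getOrCreate()

    // Hypothetical extracts from the three teams' systems, already landed in the data lake.
    val streaming = spark.read.parquet("/lake/streaming") // clientId, playCount
    val royalties = spark.read.parquet("/lake/royalties") // clientId, amount
    val clients   = spark.read.parquet("/lake/clients")   // clientId, name

    // "Which of our clients have a lot of Spotify traffic but low total royalties?"
    val answer = streaming
      .groupBy("clientId").agg(sum("playCount").as("totalPlays"))
      .join(royalties.groupBy("clientId").agg(sum("amount").as("totalRoyalties")), "clientId")
      .join(clients, "clientId")
      .where(col("totalPlays") > 1000000 && col("totalRoyalties") < 1000) // illustrative thresholds

    answer.show()
    spark.stop()
  }
}
```

The point is not the query itself, but that no single back-end service could have answered it on its own.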

Acquiring this breadth of business knowledge usually comes at the cost of an in-depth understanding of each individual system. I can’t hold the entire business in my head, so a compromise must be reached. I learn just enough about each team to get the job done. This broad but shallow knowledge can miss crucial edge cases, so I always consult experts from a given team when doing work in their area.

In addition to a different organisational approach and set of priorities, the actual day-to-day workflow of back-end engineering and data engineering differs significantly. I’ve split these differences into two broad categories: testing and development workflow.

Testing

Testing of data transformation jobs can be generalised into categories based on a generic data pipeline structure.

An example of this structure could be:

  1. A DynamoDB table of payments
  2. An extract step
    - Input: CSV extract from this DynamoDB table
    - Output: Parquet file to be put in a data lake
  3. A class that represents a row of data
  4. A collection of joins and aggregations

Each section of the pipeline deserves some testing and can be roughly translated into a tier of standard software testing:

  • 3 -> 4: unit tests
  • 2 -> 3 -> 4 (small data): integration tests
  • 2 -> 3 -> 4 (real data): end-to-end tests

Testing 1 -> 2 tends to be very painful. As such, I try to put as little logic or transformation as possible in this step and to use reliable tools. CSVs at this stage should be heavily discouraged; instead, I prefer some form of typed data structure with a verified schema attached (like Parquet or Avro). Ideally, no data transformations should happen here, and the data should almost always retain the schema of the original source. The bulk of the testing can then be concentrated further down the pipeline.
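As an illustration of keeping the 1 -> 2 step dumb, here is a hedged sketch of an extract that pins an explicit schema, fails fast on bad rows and lands the data as Parquet without reshaping it. The table name, columns and paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object PaymentsExtract {
  // Schema declared up front, mirroring the source system rather than reshaping it.
  val paymentSchema: StructType = StructType(Seq(
    StructField("paymentId", StringType, nullable = false),
    StructField("clientId", StringType, nullable = false),
    StructField("amount", DecimalType(18, 2), nullable = false),
    StructField("paidAt", TimestampType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("payments-extract").master("local[*]").getOrCreate()

    // If a CSV drop is unavoidable upstream, at least pin the schema and fail fast on bad rows
    // rather than letting Spark silently infer types.
    val payments = spark.read
      .option("header", "true")
      .option("mode", "FAILFAST")
      .schema(paymentSchema)
      .csv("/landing/payments/")

    // No transformation here: write straight to the lake as Parquet, schema intact.
    payments.write.mode("overwrite").parquet("/lake/payments/")
    spark.stop()
  }
}
```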

From my back-end days, I carry forward my desire for a large suite of automated tests that are easy to read, easy to write and easy to change. Ideally, our tests could be read by someone with limited technical knowledge. To achieve this, I think it’s important to be able to define some example input data adjacent to your expected output, along with a declaration of the transformation you intend to perform, so that the full story of a test can be read in one place. That is why I don’t consider tests that read from an external file to be unit tests: the example input data is stored elsewhere and is therefore harder to read as one cohesive test. The majority of the testing should therefore be done at the 3 -> 4 -> 3 stage of the pipeline.

This is one place where the choice of technology makes a large difference. I like using Spark because it can fulfil this requirement using the breadth of functionality available in Scala.
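A minimal sketch of what such a 3 -> 4 -> 3 unit test might look like, assuming a hypothetical Payment row class and a totalsByClient transformation, and using a plain assertion rather than any particular test framework:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Stage 3: a class that represents a row of data (amounts in pence, for simplicity).
case class Payment(clientId: String, amountPence: Long)
case class ClientTotal(clientId: String, totalPence: Long)

object Transformations {
  // Stage 4: a simple aggregation standing in for the real joins and aggregations.
  def totalsByClient(payments: Dataset[Payment]): Dataset[ClientTotal] = {
    import payments.sparkSession.implicits._
    payments
      .groupByKey(_.clientId)
      .mapGroups((id, rows) => ClientTotal(id, rows.map(_.amountPence).sum))
  }
}

object TotalsByClientTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("totals-unit-test").master("local[2]").getOrCreate()
    import spark.implicits._

    // Example input sits right next to the expected output, so the whole test reads in one place.
    val input    = Seq(Payment("a", 1000), Payment("a", 500), Payment("b", 100))
    val expected = Set(ClientTotal("a", 1500), ClientTotal("b", 100))

    val actual = Transformations.totalsByClient(input.toDS()).collect().toSet
    assert(actual == expected, s"expected $expected but got $actual")

    spark.stop()
  }
}
```

Because the input rows and the expected totals sit next to each other, a reviewer with limited Spark knowledge can still follow what the transformation is supposed to do.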

We still need to test file reading and writing, so it’s worth adding some integration tests for this purpose. The core functionality should already be covered at the unit-test level, and testing with real data sizes would take too long, so we naturally hit a sweet spot: integration testing with small files.
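A hedged sketch of that sweet spot: write a handful of rows to Parquet in a temporary directory, read them back and check the round trip. The Payment class is the same hypothetical row class as above, redeclared so the snippet stands alone.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

case class Payment(clientId: String, amountPence: Long)

object ParquetRoundTripTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-roundtrip").master("local[2]").getOrCreate()
    import spark.implicits._

    // A tiny, fast integration test: real file I/O, but only a handful of rows.
    val dir = Files.createTempDirectory("payments-it").toString + "/payments"
    val input = Seq(Payment("a", 1000), Payment("b", 100))

    input.toDS().write.parquet(dir)
    val readBack = spark.read.parquet(dir).as[Payment].collect().toSet

    assert(readBack == input.toSet, s"round trip lost or changed rows: $readBack")
    spark.stop()
  }
}
```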

Testing with real data and actual files is essential for end-to-end testing and for investigating the performance of your jobs. By definition, these tests take longer, so it isn’t feasible to put them into your standard test suite, and they can be very difficult to automate from a data-checking perspective. A common pattern I implement is to require, before a pull request can be merged, a link to a successful run of the full job against production data. Significant changes may take some time to code review anyway, so the time required to get a cluster run shouldn’t be a major impediment.

Development workflow

Now that we are thinking broadly about the business and have written our detailed code with a healthy set of unit tests, it’s time to start actually running the jobs and looking at how they behave with real data.

End-to-end test like a scientist: meticulously. Running end-to-end tests for a big data process is time-consuming, so the development process needs to adapt. My three favourite tricks are:

  1. Reduce the scope of your test to make it faster. If you are optimising a subsection of your job, then just run that part (see the sketch after this list).
  2. Test one change at a time. I’ll often make a table (on paper) of all the tests I’ve run, noting which setting I changed, how long the run took and whether the output has been verified. This way I can use one methodology for both performance and correctness tweaks.
  3. Run multiple tests in parallel, while still making sure each test verifies one change at a time. No one wants to re-run their eight-hour ETL job because they changed two things and one of them broke the output.
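One way to make the reduced-scope runs in point 1 possible (a sketch, not necessarily how Kobalt’s jobs are structured) is to expose each stage of a job as its own entry point, so you can re-run only the part you are tuning against data a previous full run already left in the lake. The stage names and paths are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RoyaltiesJob {
  def extract(spark: SparkSession): DataFrame =
    spark.read.parquet("/lake/raw/payments") // hypothetical raw extract from a previous run

  def transform(raw: DataFrame): DataFrame =
    raw.groupBy("clientId").sum("amountPence") // stand-in for the real joins and aggregations

  def load(result: DataFrame): Unit =
    result.write.mode("overwrite").parquet("/lake/curated/client_totals")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("royalties-job").getOrCreate()
    // Pass a stage name to run only that slice of the pipeline while tuning it.
    args.headOption.getOrElse("all") match {
      case "transform" => transform(extract(spark)).show(20)
      case "all"       => load(transform(extract(spark)))
      case other       => sys.error(s"unknown stage: $other")
    }
    spark.stop()
  }
}
```

Something like spark-submit --class RoyaltiesJob job.jar transform then exercises only the middle step against data that is already in the lake.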

Once you have an organised log of all the configurations and changes you have tested, you can explain on a pull request exactly what you have done, why you have done it and what impact it had.

Code Quality

Once we’ve implemented good practices in our testing structure, it becomes a lot easier to focus on code quality. Unit testing and refactoring go hand in hand in this regard. The catch is that your red-green-refactor cycle becomes a fair bit longer when testing your changes requires a two-hour job run. Some discipline is required to persevere with a focus on quality and minimise the impact of the slower cycle time.

In summary

  • Data engineering is a horizontal field that requires a breadth of knowledge and integration
  • Unit testing is essential and feasible with the correct setup
  • End-to-end testing requires meticulous attention to detail and isolated, one-change-at-a-time runs
  • Code quality is no less important and can be organised as usual when testing is sufficiently prioritised

— — —

Find out more about Kobalt and how we’re changing the music industry on our website.
