ETL as a service

Yinon Eitan
Plarium-engineering
2 min read · Aug 11, 2022


This article describes why and how we built a lean ETL solution that can be set up easily and reused across many use cases.

Background

At Plarium, the data engineering department develops the company’s centralized data platform that provides data solutions and services.

Alongside the main data pipeline, which handles billions of events every day, we developed a side solution for transforming data in different use cases. Its purpose is to give us a fast way to ETL data without complex design, development, and testing.

In many cases, the data is sent to our GCP bucket in its original structure (game server logs or third-party data), and at the end of the process it is available to query in BigQuery.

ETL as a service allows generic and simple ETL without the need for development and tests.

So how did we do it?

Using Jupyter notebooks, which can easily be scheduled, the user inputs all the relevant parameters (a sketch follows the list):

Source and target path

Enrich columns

Deduplication based on a key

Data partitions: based on existing or enriched columns

Slack channel for alerts: further details can be found in this article.
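For illustration, a minimal sketch of what such a parameter setup might look like. Every path, table, column name, and channel below is hypothetical, not our actual configuration:

```python
# Hypothetical notebook parameters; all names and paths here are illustrative only.
params = {
    "source_path": "gs://example-bucket/incoming/campaign_costs/",  # where the raw files land
    "target_table": "analytics.campaign_costs",                     # BigQuery destination
    "enrich_columns": {"data_source": "ad_network_x"},              # constant columns added to each row
    "dedup_key": ["campaign_id", "event_date"],                     # deduplication key
    "partition_columns": ["event_date"],                            # existing or enriched columns
    "slack_channel": "#etl-alerts",                                 # where failures are reported
}
```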

After that, the user just clicks run and can query the data they just transformed.

Input params: the user sets them up and is ready to go

In the background, the process, written in Python, reads the files in the source path, moves them to a pending folder (using gsutil), and starts to transform and enrich them according to the input parameters. The process handles Parquet, ORC, CSV, and NDJSON; in order to support multiline JSONs, we use Spark to write them in ORC format.
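As a rough, non-authoritative sketch of that flow (not the actual implementation), the batch step could look something like this, assuming pandas with pyarrow and gcsfs for reading from GCS and the hypothetical params above:

```python
import subprocess
import gcsfs
import pandas as pd

def load_and_enrich(params: dict) -> pd.DataFrame:
    """Hypothetical sketch of the batch flow: move files aside, read them, enrich them."""
    source = params["source_path"].rstrip("/")
    pending = source + "_pending/"

    # Move the incoming files to a pending folder with gsutil, so a rerun cannot pick them up twice.
    subprocess.run(["gsutil", "-m", "mv", source + "/*", pending], check=True)

    # Read the pending files; pandas (with pyarrow and gcsfs) covers Parquet, ORC, CSV, and NDJSON.
    fs = gcsfs.GCSFileSystem()
    frames = []
    for path in fs.ls(pending):
        uri = "gs://" + path
        if uri.endswith(".parquet"):
            frames.append(pd.read_parquet(uri))
        elif uri.endswith(".orc"):
            frames.append(pd.read_orc(uri))
        elif uri.endswith(".csv"):
            frames.append(pd.read_csv(uri))
        else:  # NDJSON; multiline JSON is handled separately with Spark and written as ORC
            frames.append(pd.read_json(uri, lines=True))
    df = pd.concat(frames, ignore_index=True)

    # Add the user-defined enrichment columns as constants.
    for column, value in params["enrich_columns"].items():
        df[column] = value
    return df
```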

Duplicates are checked in each batch load and against the target table; if found, they are moved to a maintenance table for further investigation. In case of failure, we can copy the pending files back and reprocess them after fixing the issue. If the process succeeds, the files are moved to an archive folder.
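A rough sketch of how such a deduplication step could look, assuming the batch is a pandas DataFrame and the target table lives in BigQuery; the key and table names come from the hypothetical params above, not our real schema:

```python
import pandas as pd
from google.cloud import bigquery

def split_duplicates(df: pd.DataFrame, params: dict, client: bigquery.Client):
    """Return (clean_rows, duplicate_rows); duplicates go to a maintenance table for investigation."""
    key = params["dedup_key"]

    # Duplicates inside the batch itself.
    in_batch = df[df.duplicated(subset=key, keep="first")]
    df = df.drop_duplicates(subset=key, keep="first")

    # Keys that already exist in the target table.
    existing = client.query(
        f"SELECT DISTINCT {', '.join(key)} FROM `{params['target_table']}`"
    ).to_dataframe()
    merged = df.merge(existing, on=key, how="left", indicator=True)
    against_target = df[merged["_merge"].eq("both").to_numpy()]
    clean = df[merged["_merge"].eq("left_only").to_numpy()]

    duplicates = pd.concat([in_batch, against_target])
    return clean, duplicates
```

Checking both within the batch and against the table means a re-sent or reprocessed file does not create double rows in the target.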

One of the use cases

Plarium runs marketing campaigns to publish and distribute our games.

The cost data for these campaigns is sent to us alongside our main pipeline (from a different source) and needs to be processed in order to present it in our KPIs and to calculate metrics such as ROAS.

We ETL this data using ETL as a service because it can be loaded as-is (with small adjustments) and then processed and joined further inside BigQuery.
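For illustration, a downstream join of that kind could look like the query below. The table and column names are made up, and we assume both tables are already aggregated to one row per campaign and day:

```python
from google.cloud import bigquery

# Illustrative ROAS (return on ad spend) query; table and column names are hypothetical.
ROAS_QUERY = """
SELECT
  campaign_id,
  SUM(r.revenue) / NULLIF(SUM(c.cost), 0) AS roas
FROM `analytics.campaign_costs` AS c
JOIN `analytics.campaign_revenue` AS r
  USING (campaign_id, event_date)
GROUP BY campaign_id
"""

client = bigquery.Client()
roas_by_campaign = client.query(ROAS_QUERY).to_dataframe()
```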

Thanks for reading!
