Designing a Dynamic Automated Reporting Pipeline — 2

Alperen Yüksek
Published in MagicLab
Mar 8, 2022 · 4 min read

The previous post of this series can be found here.

Any pipeline involves many steps. Whether you are doing ETL or ELT, you will have to use many different tools, techniques and technologies. Which type of pipeline you end up with ultimately depends on the tools you choose for it. You can find a nice comparison between the two by Kevin Bartley here. It should be noted that, in the end, our pipeline turned out to be a mix of ETL and ELT.

Like any other design decision, what we ended up using depended on many different factors. How soon we want to reach our data, the know-how of our team, the learning curve of the tools we could use, how well those tools integrate with each other, and our resources, whether money or people, all played a role in the decisions we made.

Regardless of the order of operations, we would need an Extraction phase, a Transformation phase, a Loading phase and a Reporting phase. For every phase there are various methodologies to choose from; but as mentioned above, we were limited in many ways and had to find a solution optimized for our particular case.

Considering all of this, we decided to stay in the AWS Cloud for the whole pipeline. The easy integration of its services, and especially AWS CloudFormation’s ability to manage other services, swayed us.

Extraction — What to Collect and How?

What information we want to collect ultimately depends on how thoroughly we want to monitor our games. Since we were creating casual games, we needed as much data as possible. We will take a detailed look at our data events and their parameters in upcoming posts, but suffice it to say we have a data event for nearly every game action.

So, we created JSON schemas for our different data events. I only have a general understanding of how exactly we create our data events client-side and have to refer you to our game developers. Using our schemas, they built a beautiful system that creates and validates our data events on the client.
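To give a rough idea of what this looks like, here is a minimal sketch of a data-event schema checked with Python’s jsonschema package. The level_completed event and its fields are made-up examples for illustration only; our real schemas are richer and live in the client-side system our developers built.

```python
# Minimal sketch: validating a hypothetical data event against its JSON schema.
# The "level_completed" event and its fields are illustrative, not our real schema.
from jsonschema import validate, ValidationError

LEVEL_COMPLETED_SCHEMA = {
    "type": "object",
    "properties": {
        "event_name": {"const": "level_completed"},
        "user_id": {"type": "string"},
        "level": {"type": "integer", "minimum": 1},
        "duration_sec": {"type": "number", "minimum": 0},
        "event_time": {"type": "string", "format": "date-time"},
    },
    "required": ["event_name", "user_id", "level", "event_time"],
    "additionalProperties": False,
}

def is_valid_event(event: dict) -> bool:
    """Return True if the event conforms to its schema."""
    try:
        validate(instance=event, schema=LEVEL_COMPLETED_SCHEMA)
        return True
    except ValidationError:
        return False
```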

After defining and creating the data, we needed to collect it. After some research, and in keeping with our earlier decision to stay on AWS, we decided to use Kinesis Data Streams for collecting our data, Lambda for validation and partitioning, and Kinesis Data Firehose for writing data files to our S3 bucket in Parquet format.
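As a rough sketch of where Lambda fits, here is what a validation and partitioning function could look like when used as a Firehose data transformation. Our production function does more than this, and the field names are placeholders.

```python
# Minimal sketch of a Firehose data-transformation Lambda that validates
# incoming events and emits partition keys for dynamic partitioning.
# Field names are illustrative; our real function does more than this.
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # Reject anything that does not look like a well-formed event.
        if "event_name" not in payload or "event_time" not in payload:
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
            # Partition by event name and date so Firehose can write
            # neatly partitioned Parquet files to S3.
            "metadata": {"partitionKeys": {
                "event_name": payload["event_name"],
                "dt": payload["event_time"][:10],
            }},
        })
    return {"records": output}
```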

Obviously, we will go deeper into these in the relevant posts, but for now let’s move on to the next step.

Load — How to Access Data?

With our raw data sitting in an S3 bucket in a neatly partitioned manner, we began to think about how to easily access, process and query it. To keep our costs as low as possible, we decided on Redshift with external schemas. We also used the Glue Data Catalog to manage table partitions.

That way, during our daily update jobs we would only read the relevant data partitions and avoid additional storage costs in our data warehouse.
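For illustration, a minimal sketch of the idea: an external schema in Redshift pointing at the Glue Data Catalog, queried over a single partition. The cluster endpoint, IAM role ARN, database, table and credentials below are all placeholders, and psycopg2 is just one way to run the SQL.

```python
# Minimal sketch: expose Glue Data Catalog tables to Redshift via an external
# schema so raw S3 data can be queried in place. All names are placeholders.
import psycopg2

CREATE_EXTERNAL_SCHEMA = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS raw_events
FROM DATA CATALOG
DATABASE 'game_events'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# Only the partition touched by the daily job is scanned,
# which is what keeps query and storage costs down.
DAILY_QUERY = """
SELECT event_name, COUNT(*) AS events
FROM raw_events.level_completed
WHERE dt = '2022-03-07'
GROUP BY event_name;
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="analytics",
                        user="etl_user", password="***")
conn.autocommit = True
cur = conn.cursor()
cur.execute(CREATE_EXTERNAL_SCHEMA)
cur.execute(DAILY_QUERY)
print(cur.fetchall())
```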

We decided to store our fact tables and aggregated tables on Redshift itself for ease of access and to reduce query durations.

Transformation — Lots of SQL, lots…

On a daily basis, we needed to check our raw data, update our fact tables, increment our aggregated tables and create our reporting tables. To make that process as efficient as possible, we needed an orchestration tool, and we decided to use Glue for it. By assigning individual SQL queries to different jobs and creating a hierarchical workflow of these jobs that respects their dependencies, we got ourselves a daily update ETL.
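To make the workflow idea concrete, here is a minimal sketch of wiring two Glue jobs together with boto3: a scheduled trigger starts the fact-table job, and a conditional trigger runs the aggregation job only after it succeeds. The job, workflow and trigger names are hypothetical, not our actual setup.

```python
# Minimal sketch of a Glue workflow with dependency-aware triggers (boto3).
# Workflow, trigger and job names below are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="daily-update-mygame")

# Start the workflow every day at 05:00 UTC.
glue.create_trigger(
    Name="start-daily-update-mygame",
    WorkflowName="daily-update-mygame",
    Type="SCHEDULED",
    Schedule="cron(0 5 * * ? *)",
    Actions=[{"JobName": "mygame-update-fact-tables"}],
    StartOnCreation=True,
)

# Run the aggregation job only when the fact-table job has succeeded.
glue.create_trigger(
    Name="after-fact-tables-mygame",
    WorkflowName="daily-update-mygame",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "mygame-update-fact-tables",
        "State": "SUCCEEDED",
    }]},
    Actions=[{"JobName": "mygame-increment-aggregates"}],
    StartOnCreation=True,
)
```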

But writing the queries, jobs, dependencies and workflow for one game was a chore in and of itself. For multiple games, it would quickly become a tedious errand. So we automated the whole thing by creating templates for queries, jobs and workflows, which we will delve into in the relevant posts.
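The gist of the templating approach, in a deliberately tiny sketch: one parameterised SQL template rendered once per game instead of hand-written queries. The template text, table names and game identifiers are simplified examples, not our real ones.

```python
# Minimal sketch of templating one SQL query per game.
# Template, table names and game identifiers are illustrative.
from string import Template

DAILY_REVENUE_TEMPLATE = Template("""
INSERT INTO ${game}_agg.daily_revenue
SELECT event_date, SUM(amount_usd)
FROM ${game}_fact.purchases
WHERE event_date = '${run_date}'
GROUP BY event_date;
""")

GAMES = ["puzzle_blast", "merge_town"]  # hypothetical game identifiers

for game in GAMES:
    sql = DAILY_REVENUE_TEMPLATE.substitute(game=game, run_date="2022-03-07")
    print(sql)  # in the real pipeline this would become the script of a Glue job
```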

Reporting — Visuals!!!

We used Tableau. There are many things that can be said about its pros and cons, its capabilities and shortcomings, its advantages and disadvantages compared to other BI tools. But as explained above, we, like everybody else, had limited resources, whether human or monetary.

We used Tableau extracts for our reports. Using the Tableau Server Client library, we managed to automate most of the reporting and extract-refresh work. But in the end, this part of the pipeline remains our weakest phase automation-wise.
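As a small taste of that automation, here is a sketch of refreshing published extracts with Tableau Server Client. The server URL, token, site and the “daily_” naming convention are placeholders, not our actual configuration.

```python
# Minimal sketch of refreshing published extracts with Tableau Server Client.
# Server URL, token name/value, site and datasource names are placeholders.
import tableauserverclient as TSC

auth = TSC.PersonalAccessTokenAuth("automation-token", "***", site_id="analytics")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(auth):
    all_datasources, _ = server.datasources.get()
    for datasource in all_datasources:
        if datasource.name.startswith("daily_"):  # our reporting extracts
            job = server.datasources.refresh(datasource)
            print(f"Refresh queued for {datasource.name}: job {job.id}")
```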

To sum it up, the basic architecture of our pipeline (very crudely, of course) looked like this.

Image by author, made with app.diagrams.net

In the next post, we will start to take a detailed look at each of these steps.

The next post of this series can be found here.
