Designing a Dynamic Automated Reporting Pipeline — 4

Published in

MagicLab

3 min read · Apr 1, 2022

The previous post in this series can be found here.

In the last post we mentioned a three-tiered event structure. From here on, we call the first tier global and the second tier user_data; the name of the third tier depends on the particular event. Now let’s go over them one by one to create a minimal version of the ad_impression event for our games. With that end goal in mind, we start by writing our global schema.

What information do we need on all of our data events? Some of it depends on the company’s architecture, goals and app type. For example, if you are making multiplayer games, you would want to know the player’s server, or you may need to know the player’s device_type. But most parameters that belong in a global schema are fairly common: the id of the user, the id of the user’s session, the name and/or id of the app, platform information, the name of the market (to distinguish users), the version of the app, the name of the data event, the timestamp of the data event, and so on.

If we write these down in a JSON Schema structure with the relevant constraints, it looks like this:

Now let us go over it line by line. Under the definitions key, we created a global key that will contain our parameters. Because we are going to reference another schema, we used the allOf key in line 7. Then we used the oneOf key in line 9, so that we can version our global schema if needed. After that, we gave it a title and started adding parameters.
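The nesting described above can be sketched roughly as follows. This is only an illustrative skeleton written as a Python dict: the title global_v1, the two sample properties and the reference target #/definitions/user_data are assumptions, most parameters are left out, and the layout will not match the line numbers quoted in the text.

```python
import json

# Hypothetical skeleton of the global tier; not the full schema from the post.
global_schema = {
    "definitions": {
        "global": {
            # allOf: an event must satisfy every member of this array.
            "allOf": [
                {
                    # oneOf: one entry per version of the global schema.
                    # Only version 1 exists so far.
                    "oneOf": [
                        {
                            "title": "global_v1",  # assumed title
                            "type": "object",
                            "properties": {
                                "user_id": {"type": "string"},
                                "global_schema_version": {"const": 1},
                            },
                            "required": ["user_id", "global_schema_version"],
                        }
                    ]
                },
                # Second allOf member: the reference to the user_data tier.
                {"$ref": "#/definitions/user_data"},  # assumed reference target
            ]
        }
    }
}

print(json.dumps(global_schema, indent=2))
```

A new version of global would be added as a second object inside the oneOf array, leaving the reference to user_data untouched.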

We described user_id as a 36-character string in UUID format. The minLength and maxLength keywords are redundant here, because the regex pattern already fixes the length. Starting with JSON Schema draft 2019-09, “uuid” is available as a built-in format (listed among the resource identifiers), but we use a regex pattern for backwards compatibility.
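To see why the length keywords add nothing, here is a small check with an assumed pattern. The exact regex used in the schema is not shown in this post, so the lowercase 8-4-4-4-12 hex pattern below is a stand-in.

```python
import re
import uuid

# Assumed pattern: lowercase hex digits in 8-4-4-4-12 groups.
# 32 hex digits plus 4 hyphens always add up to 36 characters.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def is_valid_user_id(value):
    """Return True when value is a string matching the assumed UUID pattern."""
    return isinstance(value, str) and UUID_PATTERN.fullmatch(value) is not None


sample = str(uuid.uuid4())
print(len(sample), is_valid_user_id(sample))  # a generated UUID has 36 characters
print(is_valid_user_id(sample[:-1]))          # 35 characters can never match
```

Any string that passes the pattern is exactly 36 characters long, so minLength and maxLength can never reject something the pattern accepts.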

The session_id parameter is an integer with a minimum value of 0. This choice depends on the company’s architecture. We could have used another UUID as session_id, but we decided that incrementally increasing integers would be easier to work with.
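As a rough illustration, the constraint and a hand-rolled equivalent of it might look like this. The helper name is hypothetical, and a real pipeline would leave this check to a JSON Schema validator.

```python
# The constraint from the text, written as a schema fragment.
session_id_schema = {"type": "integer", "minimum": 0}


def is_valid_session_id(value):
    """Hand-rolled equivalent of the fragment above (illustrative only)."""
    # bool is a subclass of int in Python, but JSON true/false are not integers.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return value >= session_id_schema["minimum"]


print(is_valid_session_id(0))    # first session
print(is_valid_session_id(-1))   # below the minimum
print(is_valid_session_id("3"))  # a string, not an integer
```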

event_uuid is the unique identifier of one particular data event. We will use it for deduplication in the later stages of the pipeline.
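The deduplication step itself comes later in the series. As a minimal in-memory sketch of the idea (the function name and the sample events are made up for illustration), keeping the first occurrence of each event_uuid could look like this:

```python
def deduplicate(events):
    """Keep the first event seen for each event_uuid, preserving order."""
    seen = set()
    unique = []
    for event in events:
        key = event["event_uuid"]
        if key in seen:
            continue  # a retry or duplicate delivery of an event we already have
        seen.add(key)
        unique.append(event)
    return unique


# Made-up events: the second one is a duplicate delivery of the first.
events = [
    {"event_uuid": "a", "event_name": "ad_impression"},
    {"event_uuid": "a", "event_name": "ad_impression"},
    {"event_uuid": "b", "event_name": "ad_impression"},
]
print(deduplicate(events))
```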

app_name and app_id are fairly straightforward parameters. They exist because we use the same data stream for all of our clients and need to distinguish between apps.

The value of the market_name parameter can only be one of those in the enum array. Knowing the user’s app market is important to us for many business reasons, from UA strategy to A/B test planning.
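An enum constraint is a plain membership test. The market names below are hypothetical placeholders, since the actual list is not given in this post.

```python
# Hypothetical market names; the real enum values are not shown in the post.
market_name_schema = {
    "type": "string",
    "enum": ["google_play", "app_store", "huawei_app_gallery"],
}


def is_valid_market_name(value):
    """Return True when value is exactly one of the enum entries."""
    return isinstance(value, str) and value in market_name_schema["enum"]


print(is_valid_market_name("google_play"))
print(is_valid_market_name("Google_Play"))  # enum matching is case-sensitive
```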

event_ts is epoch time with seconds precision. We send it as a string to minimize data corruption in transfer.
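A minimal sketch of producing and reading such a value is shown below. The helper names and the 10-digit pattern are assumptions; the post does not state the exact constraint placed on event_ts.

```python
import re
import time

# Assumed constraint: exactly 10 ASCII digits, which covers epoch seconds
# from September 2001 until the year 2286.
EVENT_TS_PATTERN = re.compile(r"[0-9]{10}")


def make_event_ts(now=None):
    """Return the current (or given) epoch time, truncated to seconds, as a string."""
    seconds = int(time.time() if now is None else now)
    return str(seconds)


def parse_event_ts(value):
    """Convert an event_ts string back to an int, rejecting anything malformed."""
    if not isinstance(value, str) or not EVENT_TS_PATTERN.fullmatch(value):
        raise ValueError(f"not a seconds-precision epoch string: {value!r}")
    return int(value)


ts = make_event_ts()
print(ts, parse_event_ts(ts))
```

A millisecond timestamp such as "1648771200000" has 13 digits and is rejected, so a unit mix-up on the client is caught at validation time.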

event_name, event_schema_version, user_data_app_name and user_data_version are placeholders at this point; we will give them constant values based on game and event type.

global_schema_version is the version of our global schema; in this case it is 1.

is_tester is useful for QA purposes and for separating real users from your testers and admins.

platform can be android, ios or any other value, depending on the app.

app_version is the version of your app. We version our apps with integers, but you could use floats or a regex pattern as the constraint.

After defining our parameters, we listed the required ones; in this particular case, all of them. If one of the parameters in the required array is missing from a data event, our validation will throw an error.
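The effect of the required array can be imitated with a few lines of plain Python. This is a hand-rolled illustration rather than the validator used in the pipeline; the function names, the shortened required list and the sample event are made up.

```python
# Shortened, illustrative list; the real schema requires every global parameter.
REQUIRED = ["user_id", "session_id", "event_uuid", "event_name", "event_ts"]


def missing_required(event, required=REQUIRED):
    """Return the required parameter names that are absent from the event."""
    return [name for name in required if name not in event]


def check_required(event, required=REQUIRED):
    """Raise an error, as a validator would, when a required key is missing."""
    missing = missing_required(event, required)
    if missing:
        raise ValueError(f"missing required parameters: {missing}")


# Made-up event with event_uuid and event_ts left out.
event = {"user_id": "u", "session_id": 3, "event_name": "ad_impression"}
print(missing_required(event))
```

Note that required only checks presence: a key whose value is null still counts as present, and it is the type constraint on that parameter that rejects it.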

As of right now, we don’t have another version of global, so we close the oneOf array, and in line 106 we add a reference to the user_data schema to include it in the global schema. This way, an event that validates against the final global schema must also satisfy the definitions in the user_data schema. It will come in handy later.
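To make the reference less abstract, here is a small sketch of how a local $ref is followed inside a schema document. The resolver handles only references starting with "#/", the reference target and the user_data body are placeholders, and a real validator does this resolution for you.

```python
def resolve_local_ref(document, ref):
    """Follow a local JSON Pointer reference such as '#/definitions/user_data'."""
    if not ref.startswith("#/"):
        raise ValueError("only local references are handled in this sketch")
    node = document
    for token in ref[2:].split("/"):
        # JSON Pointer escapes: '~1' stands for '/', '~0' stands for '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[token]
    return node


# Placeholder document: the user_data body is written in the next post.
schema_document = {
    "definitions": {
        "global": {
            "allOf": [
                {"oneOf": [{"title": "global_v1", "type": "object"}]},
                {"$ref": "#/definitions/user_data"},
            ]
        },
        "user_data": {"title": "user_data", "type": "object"},
    }
}

ref = schema_document["definitions"]["global"]["allOf"][1]["$ref"]
print(resolve_local_ref(schema_document, ref))
```

Because the reference sits inside allOf, whatever user_data ends up defining becomes an additional condition on every event that is validated against global.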

This concludes our first tier. In the next post we will write our user_data schema to continue our journey to the ad_impression event.


Written by Alperen Yüksek