Data modeling in the world of the Modern Data Stack 2.0

Nikolay Golov
Published in Manychat Tech Blog
Jun 16, 2022 · 6 min read

In my previous article, I raised the question of whether data modeling is still necessary in the world of the modern data stack. In this article, I want to switch from abstract questions to a very practical case.

Readers of the previous article commented that comparing data modeling to the data stack is like comparing cars to smart navigation systems. In this article, I want to illustrate how the technologies of the modern data stack give analysts the ability to search for insights without any modeling (that is, without any permanent data transformations).

Imagine we have to analyze data about the items clients see in a list in a web application. It can be search results, items for sale, the most relevant posts in a news feed, or a list of new conversations for customer support; from an analytics point of view, they are all the same.

Analysts can use such data to estimate CTR (click-through rate, or impression-to-click rate), to develop recommendation algorithms, or to estimate the visibility of different positions in the list (the 1st compared with the 10th, for example).

In the modern data stack, this type of data is easy to collect:

1. Let the front end or a mobile application show a list to a user and, at the same time, send an event to the analytics pipeline as a JSON payload (a concrete example is shown below).

2. “Send to the analytics” means sending it to an API endpoint, which delivers the JSON to the data delivery pipeline (based on Kafka, or on more modern tools like Airbyte or Fivetran).

3. The data delivery pipeline will add each incoming JSON to the target table in the modern analytical database, Snowflake. Let’s call this table RAW_Events.

4. That’s all!

The beauty and power of the modern data stack lie in the fact that you don’t need to think about the scalability of your data delivery pipeline, about the volume of data in the target database (it is practically boundless), or about transforming the incoming data.

How can an analyst work with such data? Modern analytical databases like Snowflake work directly with JSON and require analysts to know only SQL.

Let’s return to our example of an event:
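For example, the payload could look something like this (the field names user_id, event_ts, and items are illustrative, the timestamp is written as an ISO-8601 string, and the three middle item IDs are omitted):

```json
{
  "user_id": 3299223,
  "event_ts": "2022-04-21T12:31:55Z",
  "items": [1037287139, 1727228887, ..., 1665899805]
}
```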

It means that the user with ID = 3299223 opened an application at ‘2022-04-21 12:31:55 UTC’ and saw a list of six items, where 1037287139 was the first one, 1727228887 was the second, and 1665899805 was the last. The items are stored as an array, where a position in the array corresponds to the position on the screen. Such an approach to storing data violates the First Normal Form, but who cares nowadays? In the past, analysts were unable to work with such data (JSON, arrays in a single field) using only SQL. But now we can easily “unpivot” such an array.
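In Snowflake, such an unpivot is typically done with LATERAL FLATTEN. Roughly, such an unpivot query (Query 01 in what follows) could look like this sketch, assuming each raw JSON lands in a single VARIANT column called payload (the column name is an assumption):

```sql
-- Sketch of Query 01: unpivot the items array into one row per (event, position).
SELECT
    e.payload:user_id::NUMBER        AS user_id,
    e.payload:event_ts::TIMESTAMP_TZ AS event_ts,
    i.index + 1                      AS position_in_list,  -- FLATTEN's index is 0-based
    i.value::NUMBER                  AS item_id
FROM RAW_Events AS e,
     LATERAL FLATTEN(input => e.payload:items) AS i;
```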

Imagine that analysts want to check how many users have seen item 1037287139 in the top position today. Here is how we can do it:
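A query answering this question (Query 02) could look roughly like this sketch, under the same payload column assumption:

```sql
-- Sketch of Query 02: how many users saw item 1037287139 in the top position today?
SELECT COUNT(DISTINCT e.payload:user_id::NUMBER) AS users_cnt
FROM RAW_Events AS e,
     LATERAL FLATTEN(input => e.payload:items) AS i
WHERE i.index = 0                                      -- top position in the list
  AND i.value::NUMBER = 1037287139
  AND e.payload:event_ts::TIMESTAMP_TZ::DATE = CURRENT_DATE();
```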

With the modern data stack, a query like the one above will work even if the RAW_Events table contains hundreds of billions of events, with up to 10,000 new ones arriving per second, as it does in the analytical platform of ManyChat that I currently work on.

So, back to the original question: why should we bother to think about modeling if everything is so perfect?

Query 02 answers this question. Its syntax is complex. To get the necessary data, we have to read literally all the JSONs from the RAW_Events table, unpack them, unpivot them, and perform a calculation, even if the answer is 42 and the RAW_Events table actually contains hundreds of billions of JSONs. Such a query can be long and expensive, especially in the modern data stack, whose tools internally prefer to solve performance problems by throwing more money at them.

Moreover, imagine that the analyst wants to continue their research: maybe they want to estimate the average position in the list of items that had value-added services activated on them at that moment.

To solve this problem, Query 01 has to be joined with other tables (if they are even available in our system!). Such a JOIN is not only complex for analysts to write; it also has to process all the JSONs from the RAW_Events table over and over again to get the result.
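Purely for illustration, such a query could look like the following sketch; the Item_Services table and its columns (item_id, activated_at, deactivated_at) are hypothetical:

```sql
-- Hypothetical sketch: average list position of items that had a value-added
-- service active at the moment of the impression.
WITH impressions AS (
    SELECT
        e.payload:event_ts::TIMESTAMP_TZ AS event_ts,
        i.index + 1                      AS position_in_list,
        i.value::NUMBER                  AS item_id
    FROM RAW_Events AS e,
         LATERAL FLATTEN(input => e.payload:items) AS i
)
SELECT AVG(imp.position_in_list) AS avg_position
FROM impressions AS imp
JOIN Item_Services AS s
  ON  s.item_id = imp.item_id
  AND imp.event_ts BETWEEN s.activated_at AND s.deactivated_at;
```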

The example above gives us an understanding of when data modeling can help, even in the world of the modern data stack: when the data are big enough, when queries against them are frequent, and when there are a few analysts and you want all of them to perform the same operation in the same efficient way and avoid producing contradictory results.

Consider the following data modeling step:
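The idea can be sketched as follows (Query 03); the column names and the loaded_at timestamp on RAW_Events are assumptions:

```sql
-- Sketch of the modeling step: a flattened Events table with one row
-- per (event, position) instead of one JSON per event.
CREATE TABLE IF NOT EXISTS Events (
    user_id          NUMBER,
    event_ts         TIMESTAMP_TZ,
    position_in_list NUMBER,        -- 1 = top of the list
    item_id          NUMBER
)
CLUSTER BY (event_ts);              -- keep the data roughly sorted by event time

-- Incremental load, e.g. once an hour: unpack only the newly arrived JSONs.
-- A real pipeline would track a proper high-water mark instead of this filter.
INSERT INTO Events (user_id, event_ts, position_in_list, item_id)
SELECT
    e.payload:user_id::NUMBER,
    e.payload:event_ts::TIMESTAMP_TZ,
    i.index + 1,
    i.value::NUMBER
FROM RAW_Events AS e,
     LATERAL FLATTEN(input => e.payload:items) AS i
WHERE e.loaded_at > DATEADD(hour, -1, CURRENT_TIMESTAMP());
```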

The Events table will contain all data from RAW_Events, already unpacked and sorted. It can be incrementally filled with new incoming events on a daily or hourly basis, which makes such operations extremely fast and cheap.

Analysts who want to find out how many users have seen item 1037287139 in the top position today need to issue the following query:
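Against the flattened table, it could be as simple as this sketch (Query 04):

```sql
-- Sketch of Query 04: the same question as Query 02, now against the Events table.
SELECT COUNT(DISTINCT user_id) AS users_cnt
FROM Events
WHERE position_in_list = 1
  AND item_id = 1037287139
  AND event_ts::DATE = CURRENT_DATE();
```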

Now let’s compare Query 02 and Query 04, which answer the same question. The experiment was conducted on a RAW_Events table of roughly 20 million JSONs.

Both queries were executed using a single Snowflake cluster of size X-Small. Even on such a small sample task, the difference is huge.

Real-life analytical tasks require the analysis of billions of JSONs, forcing each query to process at least 1,000 times more data and to spend correspondingly more time and/or money.

As the example above shows, a single run of Query 02 (the slow one, 47 s) is acceptable, but if a few analysts need to execute such queries multiple times a day, they will surely prefer Query 04, as it is simpler and faster.

To wrap up the article: 10–20 years ago, you could hardly imagine an analyst using queries like Query 01 or Query 02 to analyze hundreds of millions of rows without preliminary data preparation by a data engineer or data modeler. Today, with the modern data stack, an analyst can play with data on their own. Data preparation and data modeling can improve performance (Query 03 and Query 04), but not fundamentally.

Important: in this article, we played with a rather simple model. In the next article, I’ll show how a more complex business model can benefit from data modeling.


Nikolay Golov
Manychat Tech Blog

Head of Data Platform at ManyChat. Data Engineer. Researcher of data modeling techniques (Anchor Modeling, Data Vault). Lecturer at Harbour Space University.