Our journey to a new data warehouse

Ana Gulevskaia
Published in Omio Engineering
8 min read · Jul 20, 2020


Imagine a one-year-old travel startup that allows users to search for tickets. Some data is generated by our product, but we need to survive first and not focus on analytics, so let’s leave it for later. Time passes and now we’re a couple of years old. We gather data into some sort of data warehouse. It’s probably MySQL or Postgres, designed by some smart pals. There are two or three main tables that we use, and we don’t bother about the design. Who cares, it works, leave it there!

We’re about to turn four, we’re thinking about expansion and want to explore new markets. Users can book some tickets on our platform; it’s not only a search engine anymore! There is a data warehouse designed for specific needs, with the main concepts defined and described as tables. We need to keep in mind all the filters that have to be applied. But it’s fine: our analytics department is not that big, and we can explain all the edge cases and what this strange exclude_ field means.

Now we’re five years old. We allow users to book most of the available options on our site. We’re a trusted product with a strong base of loyal customers and growing engagement rates. We have gathered enough data to analyze our path to this point. At that stage, we started hiring many more people each month and our teams grew quickly with many new starters. On top of that, we need to answer more complicated analytical questions. We feel that something is wrong with our data warehouse. Designed some time ago, it holds a lot of information, but as the product evolves it cannot change accordingly: the changes needed are too complex. It’s high time to look at our data and try to model it differently.

Where should we start?

Before jumping right off the boat, we need to consider possible models. The first that comes to mind is the classical third normal form (3NF).

Example of modeling the source *.JSON log to follow 3rd normal form

The idea is to decouple entities and the relations between them into separate tables. It allows us to keep track of all the changes that could happen to the product by changing the corresponding table. It’s one of the basic models you learn together with SQL, and business analysts are used to it. The downside is the high number of tables and the complexity of queries when you need to understand the user journey. It’s a nightmare if your product has many steps and each of them is tracked in a separate table. Thus it’s not the best solution in our case: we would have way too many entities, and the relations between them would be even more complex. It would cost more than we would gain from it.
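
To make it concrete, here is a minimal sketch of how such a nested search log could be split up under 3NF. The table and column names are hypothetical and only illustrate the idea:

    -- Every entity and every relation gets its own normalised table.
    CREATE TABLE search (
        search_id      BIGINT PRIMARY KEY,
        departure_city VARCHAR(64) NOT NULL,
        arrival_city   VARCHAR(64) NOT NULL,
        travel_date    DATE        NOT NULL
    );

    CREATE TABLE journey (
        journey_id BIGINT PRIMARY KEY,
        search_id  BIGINT      NOT NULL REFERENCES search (search_id),
        provider   VARCHAR(64) NOT NULL,
        mode       VARCHAR(16) NOT NULL   -- bus, train, plane, ferry
    );

    CREATE TABLE journey_price (
        price_id   BIGINT PRIMARY KEY,
        journey_id BIGINT         NOT NULL REFERENCES journey (journey_id),
        fare_class VARCHAR(32)    NOT NULL,
        amount     NUMERIC(10, 2) NOT NULL
    );

Even in this tiny cut-out, reconstructing a single user journey already means joining three tables, and the real log has far more entities than that.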

Should it be a snowflake schema (SF)? SF defines key entities as fact tables and their parameters as dimensions: parameters are shared between entities and stored in separate dimension tables.

Example of modeling the same source *.JSON log to follow a snowflake schema

Well, we have several key entities and we can describe them using different dimensions. It sounds like the perfect way to go. But it is difficult to adapt SF fact tables when the product evolves: sometimes a new feature changes the definition of an entity so much that it becomes a completely different fact. A data engineer cannot simply introduce a new fact table every three months and expect analysts to repoint their dashboards. Our existing model has a snowflake schema, and it has proved difficult to maintain because our business is growing and new features are introduced quite often.
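
For comparison, the same log could be sketched as a fact table surrounded by shared dimensions; again, the names are made up purely for illustration:

    -- Dimensions hold the shared parameters, the fact table holds the event.
    CREATE TABLE dim_route (
        route_id       BIGINT PRIMARY KEY,
        departure_city VARCHAR(64) NOT NULL,
        arrival_city   VARCHAR(64) NOT NULL
    );

    CREATE TABLE dim_provider (
        provider_id BIGINT PRIMARY KEY,
        name        VARCHAR(64) NOT NULL,
        mode        VARCHAR(16) NOT NULL
    );

    CREATE TABLE fact_search_result (
        search_id   BIGINT    NOT NULL,
        route_id    BIGINT    NOT NULL REFERENCES dim_route (route_id),
        provider_id BIGINT    NOT NULL REFERENCES dim_provider (provider_id),
        min_price   NUMERIC(10, 2),
        searched_at TIMESTAMP NOT NULL
    );

Querying this is easy, but the moment “a search result” starts to mean something different, the fact table has to be rebuilt and every dashboard on top of it repointed.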

Could we have something that combines both of the above concepts? Some sort of hybrid that inherits their benefits and levels out their weaknesses? Here comes the data vault. It defines key business entities as hubs, describes relations between them as links, and aggregates information from several entities, using those relations, into satellites.

Example of modeling the same source *.JSON log to follow a data vault

It fits our needs best:

  • we know our entities and can define them in hubs
  • we know how entities relate to each other and describe those relations in links
  • business analysts can build dashboards on satellites that they helped to define

How do the benefits of 3NF and SF fit in here? We define hubs and links as close to 3NF as possible in order to keep the model flexible and robust. Data engineers can change these tables whenever they need to: it’s the raw data level, and it’s not that widely exposed. What is exposed are the satellite tables. It’s important to keep them updated in order to maintain access to the latest data for all data consumers. Satellites resemble the fact tables of SF: they are designed for hassle-free querying, can be denormalised, and share hubs, which allows us to treat hubs as dimensions.
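
A minimal sketch of the same entities in data vault terms, again with hypothetical names, could look like this:

    -- Hubs hold business keys, links hold relations between hubs,
    -- satellites hold the denormalised attributes that analysts query.
    CREATE TABLE hub_search (
        search_hk  CHAR(32)    PRIMARY KEY,   -- hash of the business key
        search_id  BIGINT      NOT NULL,
        load_ts    TIMESTAMP   NOT NULL,
        record_src VARCHAR(32) NOT NULL
    );

    CREATE TABLE hub_journey (
        journey_hk CHAR(32)    PRIMARY KEY,
        journey_id BIGINT      NOT NULL,
        load_ts    TIMESTAMP   NOT NULL,
        record_src VARCHAR(32) NOT NULL
    );

    CREATE TABLE link_search_journey (
        link_hk    CHAR(32)  PRIMARY KEY,
        search_hk  CHAR(32)  NOT NULL REFERENCES hub_search (search_hk),
        journey_hk CHAR(32)  NOT NULL REFERENCES hub_journey (journey_hk),
        load_ts    TIMESTAMP NOT NULL
    );

    CREATE TABLE sat_journey_details (
        journey_hk CHAR(32)  NOT NULL REFERENCES hub_journey (journey_hk),
        load_ts    TIMESTAMP NOT NULL,
        provider   VARCHAR(64),
        mode       VARCHAR(16),
        min_price  NUMERIC(10, 2),
        PRIMARY KEY (journey_hk, load_ts)
    );

A new relation between existing entities becomes a new link, a new analytical question becomes a new or extended satellite, and the hubs stay untouched.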

What to consider before implementation?

Great, we know how we will organise the data, so let’s start doing it. We decide to begin by gathering information from our data producers. Several teams handle different parts of the funnel, and we need to understand whether there is any new data they could provide. We choose the part with the largest data volume to begin with: the search platform. The search action is quite simple when you want to describe it as an entity: there are a couple of parameters that are specified when the customer describes where and when they want to travel. The complexity comes when we want to track the search results shown to a traveler. As we want customers to have the best experience with their journey, we provide many options to select from: they can travel by bus, train, plane, or ferry. There could be several stops, and the traveler might need to change the vehicle or even the departure station. Each journey might have several travel providers. Customers could travel one-way or select a round trip. The same route operated by a provider could have different prices depending on the class and fare selected. All these possibilities make the model more sophisticated and have to be thought through.

On top of that, imagine that customers perform a search every 30 seconds. Each search generates on average 150 journeys. Each journey has from zero to several stops and from one to several possible prices. Even at that rate alone, that is almost three thousand searches and over four hundred thousand journey rows per day, before stops and prices multiply it further. Now you can imagine the amount of data that needs to be processed. We need our model to handle all the mentioned difficulties as smoothly as possible.

Let’s start implementation! What do you mean by “don’t rush into it”?!

Now that we know the domain, we use the 3NF approach to define entities and relations. We think about analytics use cases to create satellites. We create a first version of the model and take out our imaginary crystal ball to predict the changes we could face in the future. Then we also think about what could destroy the model and force us to redesign it from scratch. We want it not only to reflect the real state of the product but also to be robust enough to handle change. We ask questions, test a solution, find weak spots, and redesign it with the new requirements in mind, over and over. After several iterations, we have a model that represents the product and follows the concepts of antifragile architecture.

The next step will be to get data onboard. There are two main options to consider:

  • to pull data from production logs. They are published by data producers for their own internal needs; the schema and data format are not always defined and can change according to the team’s needs
  • to consume data published by data producers specifically for BI needs. The schema and data format are always defined and can be changed only with the approval of the data engineering team.

The second option is called a data contract. From experience, we know that a properly defined contract helps both data engineers and data producers. Data engineers always know what data to expect and can build data pipelines based on this knowledge. Data producers introduce new features and add them to the logs only once they pass A/B testing. Of course, defining the contract is not that easy: you can’t force data producers to rewrite their internal tracking just because it doesn’t fit your expectations. At the same time, you can’t accept data that isn’t modelled and is just a snapshot of the production logs. You need to find a middle ground.
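
One way to pin such a contract down, shown here purely as an illustration with made-up field names, is to agree on the exact shape of the event the producer publishes for BI:

    -- Hypothetical contract for the search event published for BI needs:
    -- every field, its type and its nullability are agreed up front,
    -- and the schema version is bumped only with data engineering approval.
    CREATE TABLE bi_search_event (
        event_id       UUID        NOT NULL,
        event_ts       TIMESTAMP   NOT NULL,
        search_id      BIGINT      NOT NULL,
        departure_city VARCHAR(64) NOT NULL,
        arrival_city   VARCHAR(64) NOT NULL,
        travel_date    DATE        NOT NULL,
        round_trip     BOOLEAN     NOT NULL,
        schema_version SMALLINT    NOT NULL
    );

The exact format is secondary; what matters is that both sides treat the agreed schema as the single source of truth for what the data will look like.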

I’m intentionally avoiding the step of selecting the tech stack here. I feel that a comparison of the possible options should be left for another article.

Do expectations meet reality?

So, the model is there, the contract is defined, and data is flowing into the data warehouse. We’re building aggregations (satellites) that help the business answer its questions. We could finish here and say how proud we are to have completed this project. But let’s be honest, there is always something after “they lived happily ever after”. In our case, it took around five months for something new to come along. A new feature changed the way entities relate to each other. The first thought was that we would have to redesign the whole thing. Then we met with our data producers and discussed together what this feature actually means and what new things it brings to the table. A detailed analysis showed that we could keep the existing model by introducing a couple of fields in a link table and adding several filters to the queries that create satellites. We synchronised the changes with the data producers so that when the feature was rolled out we would be ready too. The key to success here was the shared plan and clear communication about the work in progress. Of course, along the way we also found a more elegant way to process the data and optimise our flows.

The migration to the new model is not finished yet. We’re still maintaining the legacy model while building the new one. We constantly learn from our stakeholders that some aggregations need to be changed to improve analysis. We talk to data producers to keep track of new features. We document every single change made to the model. All of this together helps us smooth things out for new colleagues and build trust in the data and in the data engineering team. We’re still working on the new model: planning the deprecation of the legacy one and discussing new schemas with other data producers. It is even more challenging with the COVID-19 situation. Nevertheless, it’s exciting to see the improvements and the trust, and to hear requests to speed up the process and deliver the completed model as soon as possible. Once we have used the new model in production for a few months, I’ll post an update about our learnings. Make sure to check back! :)


Ana Gulevskaia
Omio Engineering

Passionate about Designing Data Solutions, but can also talk for hours about Specialty Coffee, Fantasy Books and Hand Lettering