Building a scalable DWH model for user event funnels

Jerry Nwabuilo · Published in Team Taxfix · Mar 15, 2022

Senior Analytics Engineer Jerry Nwabuilo takes us through the iterative process of data modelling and how to balance development velocity, use case coverage, and scalability for user event funnels.

Where we started

At Taxfix, we believe that doing taxes should not be an uphill task. Every person should feel empowered to manage their finances on their own. Taxfix provides a platform where you can complete your tax declaration quickly and without paper, ensuring that your tax return is accurate and complete, and then submitted securely and instantly to your tax office.

With this business model, there are a lot of stakeholder questions that a user journey can answer. These requests often take forms like:

  • ‘I want to understand where our users have the biggest challenges in our app and are likely to drop.’
  • ‘I want to see the drop-off rate in the journey by platform.’

Note: in this article, ‘stakeholders’ and ‘analysts’ refer to more or less the same people. They are the ones who use the model within Taxfix, and they are well versed in their respective fields. The two terms are used interchangeably throughout.

‘Users’ refers to the Taxfix customers who go through the Taxfix app.

Drawbacks of the old data model

Initially, we analytics engineers created models on a request basis. For each request that required a funnel (which most of them did), we had to create a new data model. This was done by left-joining each event table to the first event in the funnel, which then acted as a cohort and ensured that the funnel progressed in an orderly manner, just like the funnel in the diagram above.
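
To make this concrete, here is a minimal sketch of such a per-request model. The table and event names (event_registration_started and so on) are hypothetical, and the real models were more involved:

```sql
-- One-off funnel model: every step is left-joined to the first event,
-- so only users who triggered step 1 appear in the funnel.
select
    s1.user_id,
    s1.event_time as registration_started_at,
    s2.event_time as registration_success_at,
    s3.event_time as submission_success_at
from event_registration_started s1
left join event_registration_success s2
    on s2.user_id = s1.user_id
    and s2.event_time >= s1.event_time  -- keep the steps in order
left join event_submission_success s3
    on s3.user_id = s1.user_id
    and s3.event_time >= s2.event_time
```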

The table that creates the above diagram looks like this:

Other drawbacks of this funnel include:

  • Poor scalability: we had to implement and maintain many separate SQL models and joins.
  • Querying the raw tables directly was rather slow.
  • The approach could not keep up with growing business demands, growing funnels, and an ever-increasing number of events.
  • The models were not incremental. We had to run full scans of the raw tables on every run of each model.

What was our first (good) approach?

We decided to build a generic event funnel model that leverages all the tracking events generated by the app. This generic model can be used in any B2C, e-commerce, or fintech organisation. It is pictured in the ERD below.

A basic outline of the model and data flow

The ERD shown above is the basic, not-so-detailed version of the model.

dim_events: contains the list of all stakeholder/business-friendly event names, determined by the stakeholders, for tracking events. An example of a business event name defined by a stakeholder is registration_success.

event_mapping: more than one tracking event can map to a single business event. This table holds the mapping between the tracking event tables and the stakeholder-agreed names. Using the example given under dim_events above, stakeholders might decide that two tracking events, sign_up_success and sso_success, both count as the registration_success business event. The event_mapping table houses these sorts of mappings.
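
For illustration, a couple of hypothetical event_mapping rows for that example would look like this:

```
tracking_event    business_event
--------------    --------------------
sign_up_success   registration_success
sso_success       registration_success
```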

funnel_mapping: this table holds the list of different funnels. A funnel is basically an app user journey from one point A to another point B. The events that make up each funnel, and the order in which they are supposed to be triggered, are based on what the analyst wants to see. Given that the Taxfix product has many data points, our stakeholders can decide to analyse different journeys at different stages and from different starting points in the app, hence the need to support multiple funnels.

fact_user_event_activities: this table holds every possible event triggered by an app user. It serves as the log of all activities the user performed in the app, which makes it a huge table.

dim_user: this table is separate from the model. It is a core model that holds several user attributes, the dimensions used to slice and dice the funnel during analysis.

agg_user_event_funnel: this serves as the summary table, which we pass into our BI tool for further analysis. It is used alongside the funnel_mapping table to filter for the app user journey the analyst wants to see.

This table aggregates the fact_user_event_activities table on several predefined attributes and shows when a user first/last triggered a specific event.
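
A simplified sketch of that aggregation, assuming the column names used here (the real model carries more attributes), could look like this:

```sql
-- Summarise the event log into one row per user and business event,
-- keeping the first and last time the event was triggered.
select
    user_id,
    business_event,
    platform,
    min(event_time) as first_triggered_at,
    max(event_time) as last_triggered_at
from fact_user_event_activities
group by user_id, business_event, platform
```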

The dim_events, event_mapping and funnel_mapping tables were derived from intermediate tables: tables that hold records temporarily, either for further transformation or for subsequent insertion into final tables in your DWH. These intermediate tables get their records from the analysts/stakeholders inputting the required values into Google Sheets, which are then loaded into Snowflake daily by an in-house Airflow job.

The dim_events and event_mapping tables are further joined with the raw data sources to create the fact table.
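
As a rough sketch, assuming a combined raw source called raw_tracking_events (a hypothetical name), that join could look like this:

```sql
-- Translate raw tracking events into business events and
-- log them in the fact table.
select
    r.user_id,
    r.event_time,
    r.platform,
    m.business_event
from raw_tracking_events r
join event_mapping m
    on m.tracking_event = r.event_name
join dim_events d  -- keeps only recognised business events
    on d.business_event = m.business_event
```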

Positives of the new model

The advantages this model brought include the following:

  • Stakeholders could maintain funnels on their own. It was easy for stakeholders (analysts and product managers) to create and maintain their funnels and to add or remove events from the dashboard. This eliminated the need for the Analytics Engineering team to create a new model every time a new funnel was required. All they needed to do was enter the funnel, and the order of events they wanted to see, into the input tool; it was then ingested into the model without any hardcoding or editing of the transformation code.

Below is a template of how the funnel entry is provided in Google Sheets by the stakeholders:
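
An entry of this shape, with hypothetical funnel and event names, might look like:

```
funnel_name        business_event        step_order
-----------        --------------        ----------
onboarding_funnel  onboarding_started    1
onboarding_funnel  registration_success  2
onboarding_funnel  submission_success    3
```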

  • More generic and robust: this model was a one-logic-fits-all solution. The logic was generic enough to work for all the types of funnels the stakeholders needed. It was also a powerful model because it was based on the tables in the data warehouse and not directly on raw events. Additionally, the model was incremental, so it scaled well and avoided full scans of the raw tables on every run (see the sketch after this list).
  • The model could be split, sliced, and diced by different app user/device demographics.
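
Incrementality in dbt (the transformation tool we use, introduced further below) can be sketched roughly as follows; the staging model name is hypothetical:

```sql
-- dbt incremental model: only process events newer than what already
-- exists in the target table, instead of scanning the raw tables fully.
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_time,
    business_event
from {{ ref('stg_tracking_events') }}

{% if is_incremental() %}
where event_time > (select max(event_time) from {{ this }})
{% endif %}
```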

Why did it fail to gain quick adoption?

As technically good and powerful as this model was, its intended users did not adopt it quickly or easily, for a few key reasons:

We didn’t fully understand customer (stakeholder) needs, which included the following:

  • Ensuring that there was a “cohorting journey”, as we call it: the new event funnel model did not cohort on the previous event, something that was implicit in the left joins of the old model. As a result, users who started the funnel outside a specified timeframe but continued with later funnel events inside that timeframe showed up partway down the funnel. These users appear as “ghosts” who came out of nowhere, causing “up and down” movements in the funnel, as seen in one type of funnel below:
  • Getting some type of “session” that can be defined by many factors: analysts wanted to be able to define a session, or see a journey, based on specific criteria; say, the journey starting from the first time an app user triggered an event A for a particular session-defining attribute X. This means that, further down the funnel, we look for subsequent events (B, C, D, etc.) that were triggered within a specific time frame and under the same session-defining attribute X.

Basically, analysts wanted to be able to define sessions in different ways and to see journeys through various criteria. The new event model did not initially solve this.

We needed to establish a common understanding of how the model works.

There were multiple use cases that the model could solve and plenty of others that it couldn’t. However, analytics engineering never established with the stakeholders which types of questions the model could answer, so it did not meet their full expectations.

The model chose flexibility over simplicity, and we didn’t get the strong need for usage documentation right.

We underestimated the need for proper documentation of the model. It was not easy for stakeholders to understand the whole flow of the model, as it had several components, and understanding each component was integral to adoption. We relied on an in-person walk-through and high-level technical documentation, but this was not very effective. Stakeholders had no layperson-friendly documentation to fall back on when they got stuck using the model.

How did we tackle this?

Process Change

We had to rethink how we went about building the model. We did many things well, including the initial requirement gathering from stakeholders, the data and process flow design, proper technical implementation and execution, high-level documentation, and data testing. Still, we needed to add further measures to the overall project process to make it more user-friendly. These included:

  • Really listening to the stakeholders: the stakeholders are the end-users of your product, and they determine the adoption rate of your model. If your solution cannot solve even their most minute problems in an easy way, with little input or effort on their end, they will most likely abandon the shiny model and go back to their ad-hoc approach, because that is what they know and trust.
  • Having regular check-ins multiple times a week: as a result of the above, we scheduled check-ins with the stakeholders to see how well they were using the models. We took time to understand the challenges they faced, manage expectations, and give timelines for when major challenges would be fixed. This built trust in the Analytics Engineering team and assured stakeholders that their problems would be resolved.
  • Playing around with the examples given and challenging them: an integral part of the process was looking at the stakeholders’ challenges and assessing whether they were easy to solve or pointed to an underlying issue in the model’s logic. This work also showed us that some of the challenges they faced were due to the documentation being too high-level and too technical.

Some handholding was also needed (do not underestimate the importance of this), as we found that we saw things from a technical point of view that differed from how a stakeholder sees them. Working through issues alongside stakeholders shows you how well your product is being used, understood, and adopted.

Model Change

The process changes we made above brought us to the realisation that the model itself needed changes, which can be seen below:

  • Adding new logic to the aggregate layer based on the examples given: we understood that the model wasn’t sufficient to answer every question the stakeholders had. These included defining sessions based on events determined by the stakeholders themselves, filtering a journey down to how a user behaves on a single platform, seeing what happens afterwards in the same journey, and so on.

With these questions came a need to extend the data model with more dimension tables. We added a journey table for app users that takes other session-defining attributes into consideration.

This table shows the nominal order in which an app user triggers an event for a particular session attribute, which implies the journey. For example: the journey based on app users triggering the onboarding_started event (the first event of a specific funnel) for the first time; the journey based on app users triggering the category_one_started event for the third time; and so on.
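
A minimal sketch of such a journey table, assuming a session-defining column named session_attribute (hypothetical), could be built with a window function:

```sql
-- Number each occurrence of a business event per user and
-- session-defining attribute, so analysts can anchor a journey on,
-- say, the first onboarding_started or the third category_one_started.
select
    user_id,
    session_attribute,
    business_event,
    event_time,
    row_number() over (
        partition by user_id, session_attribute, business_event
        order by event_time
    ) as nominal_order
from fact_user_event_activities
```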

Extra logic that handled the cohorting journey was added to the aggregate layer.
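
One way to express the cohorting idea (a sketch using the hypothetical column names from earlier, with timeframe filters left to the BI layer): a user only counts for step N of a funnel if they also triggered step N-1.

```sql
-- Cohorting sketch: keep a funnel step for a user only when the user
-- also completed the previous step, removing the "ghost" users
-- who enter the funnel partway through.
select
    cur.user_id,
    cur.business_event,
    cur.first_triggered_at
from agg_user_event_funnel cur
join funnel_mapping fm
    on fm.business_event = cur.business_event
left join funnel_mapping fm_prev
    on fm_prev.funnel_name = fm.funnel_name
    and fm_prev.step_order = fm.step_order - 1
left join agg_user_event_funnel prev
    on prev.user_id = cur.user_id
    and prev.business_event = fm_prev.business_event
where fm.funnel_name = 'onboarding_funnel'  -- hypothetical funnel
    and (fm.step_order = 1                  -- first step always qualifies
        or prev.user_id is not null)        -- later steps need the previous one
```

The updated data model looked like this: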

Extending documentation on Looker and DBT

DBT is the tool we use for the transformation of data in the data warehouse, while Looker is the visualisation tool used for descriptive analysis at Taxfix.

We realised that detailed, stakeholder-friendly documentation in both tools would go a long way towards helping stakeholders understand the model, the tables, and how they are represented in the BI tool. We also added technical documentation of the model to Notion (our documentation/project management/collaboration tool).

Here’s how we leveraged the table documentation feature on DBT for tables in our model:

Also, here’s how we leveraged the documentation feature on Looker using Quick Start Analyses and dimension descriptions for our model:

Conclusion

Data modelling is a continuous development process. What we described above is a healthy iteration process that balances velocity, use case coverage, and scalability, something that is very much encouraged here at Taxfix.

After the initial requirement gathering, engineers often put frequent stakeholder engagement at the bottom of the list when carrying out a project end-to-end. However, this is the difference between making your data model good and making it exceptional.

The success of every analytics engineer’s work lies in the usage of, and trust in, the data models being rolled out. This can only be achieved by recognising and acting on the feedback that your stakeholders give at every stage of the project.

Learn more about joining the Taxfix Data & Insights Team here.
