Building a Scalable Data Strategy for a Product — A Case Study With Flying Taxi

Sneha Mehrin
Published in The Startup

10 min read · Feb 18, 2021

This article is the third part of the series and focuses on the thought process involved in building a data strategy for your product.

Objective

Our product has been massively successful. While the massive growth has been exhilarating, it looks like the original data pipelines for receiving and processing data cannot keep up with current and future demand.

Our main goal right now is to understand the needs of the product, the scale of its growth, and how we can build for the future.

This article presents my findings, along with the analysis and reasoning behind the choices made to help the product succeed.

Key Deliverables

Let’s look at our deliverables as part of this project. As you can see, the project mainly involves identifying who needs data, its purpose, the current state of data collection, and analysis of the data to strategise our data infrastructure.


Part 1: Identifying Data Consumers

To build an end-to-end data pipeline, the first step is to understand who needs data and why they need it.

Let’s identify the primary data customers/stakeholders, why they are the primary data stakeholders, and how they want to use the data (primary use cases).

Let’s look at our primary stakeholders.

Primary stakeholders

Identifying the primary stakeholders and pinpointing a use case will help you plan your data model.

stakeholders and use cases

Part 2: Identify the type of data

The next thought experiment is to consider what data these teams would need.

At a high level, we have two kinds of data: entity data and event data.

Entity Data

  • An entity is a unique object such as a customer, product, or order.
  • Each entity can have attributes associated with it.

For example, a customer entity has attributes such as name, address, and phone number.

  • Tables are used to represent entity data.

Event Data

  • Event data mainly describes the actions performed by entities.
  • You might capture information such as the action, timestamp, or state in event data.

Example:

The Action is what happened: a customer booked a ride.

The Timestamp is when it happened: 6:15 pm, October 6th.

The State refers to other relevant context, such as the device (mobile, tablet, or computer) and its browser.

Difference between event and entity data

Now that we know what event and entity data are, let’s see if we can assign data types to these teams based on their use cases.

Teams and data types

Mapping the data type alone is not enough; we also have to decide the most critical tables and establish relationships.

Let’s take an example where marketing wants to do targeted campaigns. What information would they typically need for this?

  • They would need the booking details to understand the location, timing, and customer information.
  • They would also need the customer demographic information such as age or address.

As you recall, entity data is represented in tables in the normalised form to avoid redundancy.

So let’s assume we have four tables with the below attributes in our case.

CustomerId, RiderId, BookingId, and VehicleId are the primary keys of their respective tables.

CustomerId, RiderId, and VehicleId are foreign keys in the Booking table.

Now the marketing team can use CustomerId to join the customer table and the booking table to get the relevant details. This preserves the data integrity and avoids data duplication in tables.
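As a quick sketch of that join, here is what it might look like in pandas. The table and column contents below are hypothetical placeholders, not the product’s real schema:

```python
import pandas as pd

# Hypothetical, minimal versions of two of the four normalised tables
# (column names and values are assumptions for illustration).
customers = pd.DataFrame({
    "CustomerId": [1, 2],
    "Name": ["Ada", "Bo"],
    "Age": [34, 28],
})
bookings = pd.DataFrame({
    "BookingId": [10, 11, 12],
    "CustomerId": [1, 1, 2],
    "RiderId": [100, 101, 100],
    "VehicleId": [7, 8, 7],
})

# Marketing joins bookings to customer demographics via the CustomerId
# foreign key, preserving integrity without duplicating customer columns
# in the booking table.
campaign_view = bookings.merge(customers, on="CustomerId", how="left")
print(campaign_view[["BookingId", "Name", "Age"]])
```

The same join would typically be expressed in SQL in production; pandas is just convenient for this walkthrough.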

Part 3: Understand the current state of data collection

Now that we have the requirements from our stakeholders, we want to understand the current state of the data we collect. That is how we recognise what additional information is needed to reach the future state.

The engineering team can provide the data currently collected in the pipelines. This data could be events generated by the customer’s activities on the app.

Let’s say our engineering team shared the file with you. What are your steps for extraction?

  • The engineering team placed the file in the FTP folder, and we can use Python to extract it.
  • The file holds only a week’s worth of data, so Python is an appropriate choice for ad-hoc analysis.
  • We can load the file directly with pandas, with no file conversion needed.
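A minimal sketch of the loading step, assuming the file is a CSV with columns like the ones below (the schema here is an assumption, and in practice the file would first be pulled from the FTP folder, e.g. with ftplib):

```python
import io
import pandas as pd

# A hypothetical slice of the week's event file; the column names are
# assumptions, not the real schema.
raw = io.StringIO(
    "event_time,event_type,device,page,user_neighborhood,age\n"
    "2020-10-06 18:15:00,began ride,mobile,booking,Downtown,34\n"
    "2020-10-06 18:20:00,open,tablet,home,Midtown,28\n"
)

events = pd.read_csv(raw, parse_dates=["event_time"])

# Drop the columns that are irrelevant for assessing a data strategy.
events = events.drop(columns=["age", "user_neighborhood"])
print(events.head())
```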

Looking at the first few rows, we see that these are event details for a user at a particular stage. We also have age and user_neighborhood, which are irrelevant for assessing a data strategy.

Let’s answer the below questions to assess the growth of the data.

  1. How many events are recorded per day?
  2. How many events of each event type per day?
  3. How many events per device per day?
  4. How many events per page per day?
  5. How many events per location per day?
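All five questions reduce to grouping events by day plus one other dimension. A sketch with pandas, using hypothetical records and assumed column names:

```python
import pandas as pd

# Hypothetical event records (column names are assumptions).
events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2020-10-05 09:00", "2020-10-05 18:15", "2020-10-06 10:30",
    ]),
    "event_type": ["open", "began ride", "open"],
    "device": ["mobile", "mobile", "tablet"],
    "page": ["home", "booking", "home"],
    "location": ["Downtown", "Downtown", "Midtown"],
})

day = events["event_time"].dt.date  # group key shared by all five questions

events_per_day = events.groupby(day).size()
events_per_type_per_day = events.groupby([day, "event_type"]).size()
events_per_device_per_day = events.groupby([day, "device"]).size()
events_per_page_per_day = events.groupby([day, "page"]).size()
events_per_location_per_day = events.groupby([day, "location"]).size()
print(events_per_day)
```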

Part 4: Understand the growth of the data

After performing ETL, we understand how our app’s data is growing. However, to assess scalability and account for future data needs, categorising our data into three types will be more effective.

  1. Event data: the number of events generated by the app. This could be a user searching for a ride, booking a ride, or completing a ride.
  2. Customer data: how many users actually signed up on the app.
  3. Transactional data: not all event or customer data results in a transaction. A transaction only occurs when the customer actually completes a ride. You might sign up for a flying taxi but never actually use it.

Pro Tip: Always plan your questions before any analysis. This will keep you on track and save a lot of time in the long run.

Let’s ask some questions of our data.

  1. How much is the customer data increasing?

Let’s assume we gain a customer whenever someone interacts with the app. Since any event type can then result in a customer, we can answer this question by calculating the event count per day.

Analysing the data in Tableau, we see a spike on October 6th, but after that the data seems to be steadily decreasing.

Maybe an effect of a campaign?

2. How much is the transactional data increasing?

  • If we consider transactional data to be events where customers began a ride, we can filter for event type = "began ride" and calculate the event count per day.
  • The transactional data seems to increase steadily and then decrease from mid-week.
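That filter-then-count step can be sketched in pandas. The records below are hypothetical, and treating "began ride" as the transaction marker is the assumption described above:

```python
import pandas as pd

# Hypothetical events; "began ride" marks a transaction (an assumption
# carried over from the definition in the text).
events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2020-10-05 18:15", "2020-10-06 09:00", "2020-10-06 19:30",
    ]),
    "event_type": ["began ride", "open", "began ride"],
})

# Keep only transactional events, then count them per day.
rides = events[events["event_type"] == "began ride"]
transactions_per_day = rides.groupby(rides["event_time"].dt.date).size()
print(transactions_per_day)
```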

3. How much is the event data increasing?

Event data seems to follow the same pattern.

Clearly a week’s worth of data is not enough. We need more data to come up with a solid plan.

When making business decisions, stakeholders might want to have all the data available. However, in any ecosystem it is impossible to collect or provide everything the customers need. In most cases, we have to get creative and work with the minimal data set that gives the most output.

In our case, let’s ask the engineering team for the event data for the past month. We don’t need all the information here; we just need the event type, date, and number of events. So in the next iteration, we can ask for a summarised view instead of the raw data.
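One way the engineering team might produce that summarised view is a simple aggregation over the raw events. The rows and column names below are hypothetical:

```python
import pandas as pd

# Raw event rows (hypothetical); we pre-aggregate them into the summarised
# view requested: event type, date, and event count.
raw_events = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2020-10-05 09:00", "2020-10-05 10:00", "2020-10-06 11:00",
    ]),
    "event_type": ["open", "open", "search and choose car"],
})

summary = (
    raw_events
    .assign(date=raw_events["event_time"].dt.date)
    .groupby(["date", "event_type"], as_index=False)
    .size()
    .rename(columns={"size": "event_count"})
)
print(summary)
```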

Let’s look at a sample of what the summarised data looks like.

Sample of Summarised data

After loading the next iteration of data, let’s ask a few more questions.

  1. What is the story the data is telling you about the growth?
  • From the total number of events recorded, we see a compound growth rate of 9.72% over last month. That means more users are interacting and engaging with the app.
  • There seems to be a massive spike in October compared to September; we will need to analyse the reason for this spike and replicate it to drive growth.
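For reference, the month-over-month growth calculation is a one-liner. The event totals below are hypothetical placeholders chosen to reproduce the shape of the calculation; the article's 9.72% figure comes from the real summarised data:

```python
# Month-over-month growth in total events (hypothetical totals).
september_events = 100_000
october_events = 109_720

growth = (october_events - september_events) / september_events
print(f"{growth:.2%}")  # → 9.72%
```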

2. What is the fastest-growing data and why?

  • The fastest-growing data seems to be the event data, especially the open events. This suggests that users have heard of the app and are taking the first step of downloading and opening it. Our marketing efforts are paying off.
  • ‘Search and choose car’ seems rather flat, which could point to potential optimisation in the user funnel.
  • Compared to the open events, begin ride events are far fewer. We need to optimise our user flow to convert leads into actual customers.
  • Even though our app is successful in acquiring new customers, we are facing challenges in converting leads into actual customers.

3. What other observations can you formulate?

  • There seems to be a significant spike in October compared to September, which might suggest the impact of a marketing campaign. However, this needs further analysis to test for statistical significance.
  • If a campaign caused this spike (29.8% over September), there must have been huge traffic on the site that day, and the engineering team should ensure that the app is stable enough to handle all those requests.
  • Marketing, product, and engineering teams need to work together before rolling out any marketing strategy, to make sure campaigns do not degrade the performance of the site. If a campaign succeeds, the resulting traffic overload can be challenging to estimate; and if the site cannot handle the incoming traffic, all the effort goes to waste and users may develop a negative association with the app.

Now that you have:

  • Identified data stakeholders and their data needs.
  • Identified what data is currently being collected and what data needs to be collected.
  • Gathered data insights and growth trends.

It’s time to tie all these together and decide which Data Warehouse our product should invest in.

Before we go there, let’s try to understand if we should go for Cloud or On-Prem.

What do you think our startup should invest in based on the above key points?

Hybrid Approach to the rescue

Our startup should invest in a hybrid approach: a robust on-premise data warehouse for ingesting, storing, cleaning, and serving data rapidly, paired with a self-service front-end tool for exploration, discovery, and data mining. This allows centralised data access while leaving the data stores distributed as they are.

Apache Hadoop is an excellent on-premise solution for the reasons below:

  • Cheaper, with no licence fees, since it is an open-source framework.
  • Provides massive parallel processing.
  • Two vendors, Cloudera and Hortonworks, offer qualified support to build the service.
  • Hadoop is highly scalable, with the ability to add nodes/disks.
  • It also has excellent fault tolerance: if one node fails, the same job is triggered on a different node that holds a replica of the data.

Hadoop should be used to store structured and semi-structured data that needs faster real-time analytics or serves ML use cases.

Snowflake should serve as the front end for business users or data analysts.

  • Snowflake can store aggregated, transformed data in different data marts to suit the needs of different business users. This data can be visualised directly in Tableau.
  • Snowflake supports most common data formats, such as JSON and Parquet, and works especially well with JSON.
  • Compute and storage scale separately in Snowflake, so no upfront capacity sizing is required.
  • Snowflake offers security features according to the edition purchased, so our startup can tailor its security to its customised data strategy.

We have finally reached the end of this article. We have successfully assessed the data needs of our startup and suggested a data warehouse to suit our needs. Every step of the way is backed by data, which makes it easier to convince our stakeholders.

In the next article, we will use analytics and user research to optimize the user flow and propose a multivariate experiment.
