Building Reference Architectures for User-Facing Analytics

Building scalable analytics infrastructure to deliver low-latency user-facing analytics with Apache Pinot, Kafka, and Debezium

Photo by Mikail McVerry on Unsplash

This article is going to be a long one. So let me summarise the things that matter.

  • What is user-facing analytics?
  • What’s the business value of user-facing analytics
  • Why achieving real-time user-facing analytics is hard?
  • What are the options for building an analytics infrastructure to deliver user-facing analytics?

Airbeds — an Airbnb clone

Let me take a realistic example to set the stage for our discussion.

Assume we are building “Airbeds,” a clone of the popular vacation rental platform Airbnb. Airbeds is a typical two-sided marketplace where hosts list their properties on Airbeds (listings), allowing guests to search and book them before arriving.

Airbeds domain model

To measure how their listings perform on Airbeds, we need to provide hosts with an analytics dashboard displaying the following metrics for a particular time frame.

  • total_nightly_revenue: Total revenue generated from the bookings made in that duration.
  • total_nightly_revenue_by_listing: Total booking revenue by individual listing. E.g., which listing has brought in the highest and lowest income?
  • average_nightly_rate: dividing total nightly revenue by the number of booked listing nights.
  • average_length_of_stay: This is the average number of nights guests stay per booking.

These metrics should be plotted on a UI like this:

All hosts in the Airbeds platform are allowed to see this dashboard. So how do we build this?

Before getting to the technicals, let’s see the importance of “user-facing analytics.”

The business value of user-facing analytics

“Everyone is a Data Person, but only a select few have access to the insights data” — this is about to change!
— Kishore Gopalakrishna, StarTree Inc

Let’s clarify the term “user-facing analytics” further to understand its importance in a business context.

The great data divide

Historically, there have been two camps in a data-driven organization, the operational camp, and the analytics camp.

The great data divide

The operational camp consists of the machinery and operators interacting with customers and end-users. These interactions generated data as a byproduct, extracting and moving them into analytics systems via ETL tools.

The analytics camp consists of analysts, decision-makers, and C-suite executives who utilize analytics tools to gain insights from processed operational data.

So far, the analytics camp was the only entity with access to analytics. But not for long.

Internet-scale companies like Google, Facebook, and LinkedIn have paved the path to expose analytics to their customers and users as data products.

For example:

  • Google exposed its search and ranking of web pages as Google Analytics.
  • Facebook exposed its user engagement metrics to advertisers.
  • LinkedIn exposed “Who viewed my profile” to its users.

Customers and end-users, are they the same?

Yes, and no.

A customer directly benefits from a product by consuming and ultimately paying for the value it delivers. A user consumes the value but may or may not pay for it.

Sometimes, customers can be users — a consumer browsing amazon.com purchases a product with the highest stars, playing both roles simultaneously. But, consider a situation where a manager of a pizza joint uses the ERP system for inventory restocking. In that case, the manager is an end-user of the ERP system but not a direct customer.

From a business point of view, analytics should be accessible for both parties.

Analytics for users — what value does it create?

For simplicity, I will use the term users to refer to both customers and end-users from this point onward.

Now let’s get back to our Airbeds use case.

An analytics dashboard will be crucial for a host to measure the listings’ performance and take corrective actions to improve the overall business.

For example:

  • The metric total_nightly_revenue helps identify the listings that perform well and those that need attention.
  • The metric average_nightly_rate helps compare the current rate with the market median. If the market is booming, the host can increase the rate.
  • The metric average_length_of_stay helps determine why guests don’t like a particular listing.

In essence, having access to insights helps users do their business better, take timely actions to correct the course, and provide a better value.

If you are still not convinced, I recommend you read the following article.

Why delivering user-facing analytics hard?

When analytics was limited to the analytics camp, analysts were trained to deal with slow queries and dashboards with stale data. There were only a handful of decision-makers in the organization. These pain points didn’t matter because it is internal.

But, it should not follow the same when delivering analytics to the operational camp.

For example, the Airbeds platform can have millions of hosts, and a great majority of them would be accessing analytics concurrently. Also, the results should show up within sub-second latencies for a better user experience. The underlying analytics infrastructure must be scalable, performant, and reliable to withstand high query throughput (QPS) and deliver results with ultra-low latency.

In this article, I want to lay out several approaches for architecting analytics infrastructure to deliver analytics for the users. That may or may not apply to your organization directly. But at least you can use them as guiding principles to avoid re-inventing the wheel.

Option 1: Serving analytics from the OLTP database

Let’s start with the most straightforward option, where we serve analytics directly from the operational database.

The architecture consists of the following:

  • Frontend: The analytics dashboard is served to the user over the web and mobile channels.
  • API layer: Proxies the requests between the frontend and Microservice. Also, it handles API authentication (OAuth), protocol translation, and rate-limiting.
  • Microservice: Queries the database and responds to the API layer with analytics data.
  • Operational database: This OLTP database keeps the booking records in the reservations table.

Let’s assuming database is MySQL and the reservations table has the following schema.

Schema for reservations table

The following query generates the total_nightly_revenue metric.

SELECT sum(total) as total_revenue
FROM reservations
where date_in > '2021-01-01' and date_in < '2021-01-31' and host_id=1;

That doesn’t seem complicated apart from the aggregation and the filter predicate on date_in and host_id (the host who’s viewing analytics).

The following calculates the average_length_of_stay.

select avg(datediff(date_out, date_in)) as length_of_stay
FROM reservations
where date_in > '2021-01-01' and date_in < '2021-01-31' and host_id=1;

The query has an aggregate function (avg) and a function in the projection, apart from the filter clause.

Finally, this query returns total_nightly_revenue_by_listing.

SELECT listing_id, sum(total) as revenue_per_listing
FROM reservations
where date_in > '2021-01-01' and date_in < '2021-01-31' and host_id=1
group by listing_id
order by revenue_per_listing desc;

That is the most complicated query so far, having an aggregation, filter clause, group by, and ordering altogether.

Challenges

When the reservations table grows over time, these queries will become slower and slower, causing a bottleneck at the databases. That results in a sluggish performance at the front end. Intelligent indexing will take us to some extent. But soon, it will also hit a limit.

OLTP databases are designed to manipulate fewer records at a time, not to handle OLAP queries with aggregations, filtering, grouping, and sorting at scale.

Option 2: Feeding analytics from a NoSQL database

The major limitation of the previous one was the poor read performance of the OLTP database. Option 2 addresses that by having a read-optimized database to feed the analytics Microservice.

The new architecture will have two new elements, a NoSQL database, and an ETL pipeline.

Limited by data freshness

The ETL pipeline periodically extracts records from the reservations table (in MySQL), applies transformations, and loads the final result into a NoSQL database like MongoDB, Cassandra, or AWS Dynamodb.

We can use technology like Apache Spark, Beam, Hive, or even a Hadoop job to build the ETL pipeline. Its goal is to transform the source data into a read-optimized format, leveraging denormalization and pre-aggregation techniques. For example, we can pre-compute metrics for each host during a pipeline execution to prevent the NoSQL database from aggregating them on-demand.

Challenges

Option 2 brings two challenges; the first one relates to data freshness. The ETL pipeline operates in batch mode, causing stale data to be shown in the front end. As the data grows, the pipeline execution time also increases, forcing us to maintain long intervals between pipeline runs.

Secondly, we will have two distributed systems to manage, the NoSQL database and the ETL pipeline. That adds an operational burden.

Option 3: Feeding analytics from a real-time OLAP database

Option 2 is almost perfect, except for the data freshness issue. The third option addresses that by utilizing a real-time OLAP database with a streaming CDC pipeline.

Getting timely and consistent analytics is the goal of this architecture

These are the new components added to the architecture.

  • Change Data Capture (CDC) mechanism (Debezium)
  • Event streaming platform (Kafka)
  • Streaming ETL pipeline (Flink, Kafka Streams, etc. This is optional)
  • Real-time OLAP database (Apache Pinot)

Capturing OLTP changes in real-time with Debezium

Unlike a periodic batch ETL process, A CDC tool like Debezium enables capturing the changes made to a database as they happen. Debezium is deployed on Kafka Connect.

In our case, we can use the MySQL connector for Debezium to stream changes from the reservations table. Database changes are encoded as events and written to a configured topic in Kafka.

Synchronizing real-time change events into Apache Pinot

Apache Pinot is a real-time distributed OLAP datastore used to deliver scalable real-time analytics with low latency. It can ingest data from streaming sources like Kafka and batch data sources and provides a layer of indexing techniques that can be used to maximize the performance of queries. Also, stream ingestion takes place in real-time, making data available for querying within seconds, allowing us to maintain data freshness at a higher magnitude.

Pinot can be configured to ingest events from the Kafka topic where Debezium streams change events from the reservations table. The ingested events are immediately indexed, allowing the frontend to query while maintaining data freshness. Pinot also supports upserts, instrumental in reading the latest value over a stream of changes.

For example, suppose a guest makes a booking now. That will be reflected in the analytics dashboard within seconds, giving the host to see timely insights.

In-flight transformations with a stream processor

Sometimes, change events coming from Kafka need to undergo processing steps before landing on Pinot. That includes transformations, joins, and temporal aggregations.

We can achieve this by adding a stream processor like Apache Flink in between. Then Flink consumes from change events topics, applies the transformation, and writes to the final topic where Pinot ingests events from.

Streaming ETL

However, this is optional for most deployments. Pinot can also perform lookup-joins and transformations with UDF once data is ingested.

Challenges

The only challenge in this architecture is the complexity of the deployment, many components to manage and monitor. But when organizations scale, the benefits of having timely and consistent analytics outweigh the operational complexity.

Summary

The data-as-a-product concept will play a critical role in this decade. Everyone in the organization, including the staff and customers, will use data for decision making. Therefore, democratizing analytics is a must-have for today’s organizations.

When implementing user-facing analytics, there are no hard and fast rules. I recommend starting with the cheapest and most straightforward option and seeing where it fails. Even OLTP databases will be a good starting point if you start with little traffic. You can consider moving to options 2 and 3 based on your organization’s budget, skillset, and the need to have fresh analytics.

References

The Challenge of Getting Real-time Analytics at Scale

Data Mesh Principles and Logical Architecture

--

--

--

EdU is a place where you can find quality content on event streaming, real-time analytics, and modern data architectures

Recommended from Medium

The journey of a speculative Tweet

Lessons from a real Machine Learning project, part 1: from Jupyter to Luigi

Stock Return Prediction Using Transfer Learning on Textual Data

What I learned from my Data Science CodeAcademy course

12 common JMP charts you can make in Python with Plotly for free

Using Kalman filters to derive predictive factors from limit order book data

AI2FUTURE: How Data Could Save the World

Aspiring data scientist? Master these fundamentals.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Dunith Dhanushka

Dunith Dhanushka

Editor of Event-driven Utopia(eventdrivenutopia.com). Technologist, Writer, Developer Advocate at StarTree. Event-driven Architecture, DataInMotion

More from Medium

Understanding Reverse ETL

Why and How I Integrated Airbyte and Apache Hudi

An exciting example that shows the depths of CDC technology

Data Streaming with Flink