Building a Serverless Data Lake in AWS to Capture Product Engagement Metrics

What is product analytics? Why does it matter, and to whom? How to use Segment to instrument digital products and route metrics to an S3 data lake, where different AWS analytics services can analyze them.

This post gives you a high-level view of product analytics: what it is, why it matters, and to whom. We take a fictitious digital product to learn which user engagement metrics to measure, and then walk through a solution architecture for building a product analytics data lake on AWS.

What is product analytics, and why does it matter?

Product analytics is the process of analyzing how users engage with a product or service. It enables product teams to track, visualize, and analyze user engagement and behavior data. Teams use this data to improve and optimize a product or service.

— Sam Tardif, Atlassian

Although conventional wisdom states that “if you build it, they will come,” that might not work for digital products in the modern era. To gain a competitive edge in the market, you, as a product owner, must track and analyze how the product performs amongst its users. Measuring user engagement continuously reveals where the product falls short, which features users love most, and what needs improvement. You surface these metrics with product analytics and feed them back to engineering to drive the next improvement cycle.

Why is product analytics important?

Gone are the days when product owners built products based on gut instinct and experience alone. There was a lot of guesswork involved, and whether a product succeeded often came down to luck.

Today, data science and analytics give us smarter and more accurate decision-making support. Product teams depend heavily on analytics to understand how well they are meeting user needs. Sophisticated product analytics tools take much of the guesswork out of the equation. They can also reduce how often you need expensive user interviews when making decisions.

Who can benefit from product analytics?

Product analytics is not siloed into a single team or confined to the growth hacker’s realm. It is a cross-functional concern. Every team, from the C-suite to engineering, sales, marketing, and customer success, should own it and be aware of what’s going on with their product.

Amplitude identifies the following as the most common roles that regularly benefit from product analytics.

  • Product Managers
  • Marketers
  • Engineers
  • Analysts
  • UX/UI Designers and Researchers
  • Data Scientists
  • CEOs and Executive Leadership
  • Operations Leaders
  • Customer Success Managers
  • Sales Leaders

Airbeds: A fictitious online vacation rentals booking engine

Alright, now that you’ve learned why product analytics is essential for an organization, let’s expand our understanding by looking at a fictitious digital product.

Assume Airbeds is an online vacation rental booking engine. It is a two-sided marketplace where property owners (hosts) post listings that customers worldwide can search and book.

A typical user journey starts from the home page, where they search for specific keywords. Alternatively, a user may land on different promotional pages from search engines like Google.

Airbeds is a fictitious vacation rental search engine

Users browse different listings on the site and end up doing one or more of the following:

  • Make a booking
  • Abandon the search and leave
  • Ask hosts questions
  • Leave a review
  • Engage with the onsite customer support (via a tool like Intercom)

Airbeds is available mainly through mobile and web channels. Now, put yourself in a product manager’s shoes and develop a plan to track and analyze user engagement across both channels.

Deciding what metrics to measure

Measuring anything and everything leads to an unmanageable amount of data, which makes it difficult to separate the signal from the noise.

As a product manager, you should carefully choose which metrics to measure and understand how each one affects the business overall. Based on the Airbeds case above, we can list a few metrics to get started.

Funnel analysis

Funnel analysis is about tracking and visualizing how users progress through a series of steps. In the Airbeds case, we can measure how many users land on a promotional page, and then how many of them engage with the call to action (CTA) and how many drop out. If there is a severe drop between two steps, that is where you should focus your attention.
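To make this concrete, here is a minimal sketch that counts how many distinct users reach each funnel step in order and reports the drop-off between steps. The event names, the `userId`, `event`, and `timestamp` fields, and the `funnelCounts` helper are all assumptions made for illustration; they are not an Airbeds or Segment schema.

```javascript
// Hypothetical funnel for an Airbeds promotional page; event names are illustrative.
const FUNNEL_STEPS = ["Promo Page Viewed", "CTA Clicked", "Booking Completed"];

// Count the distinct users who reach each step, in order.
// A user only counts for step N after completing steps 0..N-1.
function funnelCounts(events, steps) {
  const progress = new Map(); // userId -> index of the next step to complete
  const sorted = [...events].sort((a, b) => a.timestamp - b.timestamp);
  for (const { userId, event } of sorted) {
    const next = progress.get(userId) ?? 0;
    if (next < steps.length && event === steps[next]) {
      progress.set(userId, next + 1);
    }
  }
  const counts = steps.map(() => 0);
  for (const reached of progress.values()) {
    for (let i = 0; i < reached; i++) counts[i] += 1;
  }
  return steps.map((step, i) => ({
    step,
    users: counts[i],
    // Share of users from the previous step who did not continue.
    dropOff: i === 0 || counts[i - 1] === 0 ? 0 : 1 - counts[i] / counts[i - 1],
  }));
}

// Made-up sample data: four users, only one of whom books.
const sampleEvents = [
  { userId: "u1", event: "Promo Page Viewed", timestamp: 1 },
  { userId: "u1", event: "CTA Clicked", timestamp: 2 },
  { userId: "u1", event: "Booking Completed", timestamp: 3 },
  { userId: "u2", event: "Promo Page Viewed", timestamp: 1 },
  { userId: "u2", event: "CTA Clicked", timestamp: 4 },
  { userId: "u3", event: "Promo Page Viewed", timestamp: 2 },
  { userId: "u4", event: "Promo Page Viewed", timestamp: 5 },
];

const funnelReport = funnelCounts(sampleEvents, FUNNEL_STEPS);
console.log(funnelReport);
```

In a real setup, the same logic would more likely run as a query over the data lake than in application code, but the counting rule stays the same.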

Conversion analysis

With a conversion analysis, you can look at users who complete all stages of your funnel within a “conversion window” and compare them to customers who do not convert. Funnels provide a basis for the conversion analysis.

For example, we can define the funnel as searched for listings -> viewed the listing page -> engaged with the host -> made the booking, and then count how many users complete all four steps within a conversion window of, say, seven days.
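As a rough sketch, the snippet below computes a conversion rate for that funnel. It makes two simplifying assumptions: the conversion window starts at a user's first search, and a user who restarts the funnel later is not counted again. The event names, the seven-day window, and the `conversionRate` helper are hypothetical.

```javascript
// Hypothetical booking funnel; event names are illustrative.
const BOOKING_FUNNEL = ["Listing Searched", "Listing Viewed", "Host Contacted", "Booking Made"];
const DAY_MS = 24 * 60 * 60 * 1000;

// A user converts when they complete every step, in order, within
// `windowMs` of their first occurrence of the first step.
function conversionRate(events, steps, windowMs) {
  const byUser = new Map();
  for (const e of events) {
    if (!byUser.has(e.userId)) byUser.set(e.userId, []);
    byUser.get(e.userId).push(e);
  }
  let entered = 0;
  const converted = new Set();
  for (const [userId, userEvents] of byUser) {
    userEvents.sort((a, b) => a.timestamp - b.timestamp);
    let next = 0;
    let startedAt = null;
    for (const e of userEvents) {
      // Stop looking once the window has elapsed.
      if (startedAt !== null && e.timestamp - startedAt > windowMs) break;
      if (e.event !== steps[next]) continue;
      if (next === 0) startedAt = e.timestamp;
      next += 1;
      if (next === steps.length) {
        converted.add(userId);
        break;
      }
    }
    if (startedAt !== null) entered += 1;
  }
  return { entered, converted, rate: entered === 0 ? 0 : converted.size / entered };
}

// Made-up sample data; timestamps are milliseconds since an arbitrary start.
const bookingEvents = [
  { userId: "u1", event: "Listing Searched", timestamp: 0 },
  { userId: "u1", event: "Listing Viewed", timestamp: 0.5 * DAY_MS },
  { userId: "u1", event: "Host Contacted", timestamp: 1 * DAY_MS },
  { userId: "u1", event: "Booking Made", timestamp: 2 * DAY_MS },
  { userId: "u2", event: "Listing Searched", timestamp: 0 },
  { userId: "u2", event: "Listing Viewed", timestamp: 1 * DAY_MS },
  { userId: "u2", event: "Host Contacted", timestamp: 2 * DAY_MS },
  { userId: "u2", event: "Booking Made", timestamp: 10 * DAY_MS }, // outside the window
  { userId: "u3", event: "Listing Searched", timestamp: 0 },
  { userId: "u3", event: "Listing Viewed", timestamp: 1 * DAY_MS },
];

const conversion = conversionRate(bookingEvents, BOOKING_FUNNEL, 7 * DAY_MS);
console.log(conversion.rate, [...conversion.converted]);
```

Comparing the users in `converted` with everyone else who entered the funnel is the starting point for asking what the two groups did differently.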

Cohort analysis

Cohort analysis lets you segment your users into groups that share common characteristics.

For example, you can group users by country or by the month they signed up. From there, you can start to understand which customers are high value and which may need a nudge to become high-value customers. Perhaps users who signed up at the beginning of the summer tend to make more bookings.
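A minimal sketch of that grouping might look like the following. The user records, their field names, and the `cohortSummary` helper are made up for illustration.

```javascript
// Hypothetical user records; field names and values are illustrative.
const airbedsUsers = [
  { userId: "u1", country: "US", signedUpAt: "2021-06-03", bookings: 4 },
  { userId: "u2", country: "DE", signedUpAt: "2021-06-21", bookings: 2 },
  { userId: "u3", country: "US", signedUpAt: "2021-11-10", bookings: 1 },
  { userId: "u4", country: "LK", signedUpAt: "2021-11-15", bookings: 0 },
];

// Group users by a cohort key and compute the average bookings per user.
function cohortSummary(users, keyFn) {
  const cohorts = new Map();
  for (const user of users) {
    const key = keyFn(user);
    const cohort = cohorts.get(key) ?? { users: 0, bookings: 0 };
    cohort.users += 1;
    cohort.bookings += user.bookings;
    cohorts.set(key, cohort);
  }
  return Object.fromEntries(
    [...cohorts].map(([key, c]) => [
      key,
      { users: c.users, avgBookings: c.bookings / c.users },
    ])
  );
}

// The same helper works for any characteristic: here, signup month and country.
const bySignupMonth = cohortSummary(airbedsUsers, (u) => u.signedUpAt.slice(0, 7));
const byCountry = cohortSummary(airbedsUsers, (u) => u.country);
console.log(bySignupMonth, byCountry);
```

Passing the cohort key as a function keeps the grouping logic separate from the choice of characteristic, so adding a new cohort definition is a one-line change.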

Retention analysis

A retention analysis reveals how many of your customers return to your product over time: on Day 1, Day 2, and weeks or months later. You can also run a retention analysis using your cohorts to understand what behaviors lead to overall product retention.

For example, you can track the search keywords or the most-browsed listing categories of repeat users to understand what keeps them coming back to Airbeds. Maybe it’s the price, or the variety of listings, or something else.

You can use this data to develop retention strategies and enhance retention marketing efforts.
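One common way to quantify this is Day-N retention. The sketch below uses one possible definition: the share of users who were active again exactly N calendar days (in UTC) after their first recorded activity. The event shape and the `dayNRetention` helper are assumptions for illustration.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Day-N retention: of all users, the share who were active again exactly
// N calendar days (UTC) after their first recorded activity.
function dayNRetention(events, n) {
  const activeDays = new Map(); // userId -> Set of UTC day numbers
  for (const { userId, timestamp } of events) {
    const day = Math.floor(Date.parse(timestamp) / MS_PER_DAY);
    if (!activeDays.has(userId)) activeDays.set(userId, new Set());
    activeDays.get(userId).add(day);
  }
  let retained = 0;
  for (const days of activeDays.values()) {
    const firstDay = Math.min(...days);
    if (days.has(firstDay + n)) retained += 1;
  }
  return activeDays.size === 0 ? 0 : retained / activeDays.size;
}

// Made-up activity log with ISO 8601 timestamps.
const activity = [
  { userId: "u1", timestamp: "2021-06-01T10:00:00Z" },
  { userId: "u1", timestamp: "2021-06-02T09:00:00Z" },
  { userId: "u2", timestamp: "2021-06-01T12:00:00Z" },
  { userId: "u2", timestamp: "2021-06-08T18:00:00Z" },
  { userId: "u3", timestamp: "2021-06-03T08:00:00Z" },
  { userId: "u3", timestamp: "2021-06-04T20:00:00Z" },
  { userId: "u4", timestamp: "2021-06-05T14:00:00Z" },
];

const day1Retention = dayNRetention(activity, 1);
const day7Retention = dayNRetention(activity, 7);
console.log(day1Retention, day7Retention);
```

Running the same function over each cohort separately shows whether some groups of users come back more reliably than others.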

Building a serverless data lake

Now that you understand what product analytics is, why it matters, and what metrics to measure in your product, the next step is to figure out how to capture and analyze those metrics.

Why build when you can buy one?

The answer is always “it depends.”

Many niche SaaS products exist in the product analytics space. Segment, Mixpanel, Heap, Amplitude, and Snowplow are some key players.

These products differ in features, licensing models, and instrumentation strategies, but they all take the burden of collecting, storing, and measuring product metrics across different channels away from you.

Go for an off-the-shelf product analytics platform if your organization cares more about getting metrics quickly than about spending time building a metrics platform.

This post is for everyone else: people who already work for a digital product company and want to go beyond the metrics offered by existing SaaS products. You have a talented team of engineers who are passionate about building their own solutions to problems, and they take pride in doing that.

The high-level solution architecture

The solution in a nutshell. S3 acts as the central data lake here.

Let’s keep it simple to start with. I want to take a crawl, walk, run approach where we celebrate small wins first and then move on to bigger things.

This architecture has two primary design goals:

  1. Re-use a stable and popular instrumentation layer to capture product metrics.
  2. Use a serverless platform for metrics storage and analytics.

Using Segment for product instrumentation

We can use Segment to instrument the Airbeds web and mobile applications so that they emit different metrics. Segment is a hosted platform that provides a variety of client SDKs and APIs to track and collect metrics across different platforms.

Using Segment eliminates the need to build and maintain an instrumentation layer yourself. You only need to add the Segment SDK to the application and emit an event whenever the user performs a specific action.

Here’s an example sign-up event tracked by Segment:

{
  "type": "track",
  "event": "User Registered",
  "properties": {
    "plan": "Pro Annual",
    "accountType": "Facebook"
  }
}

And here’s the corresponding JavaScript call that would generate the above payload:

analytics.track("User Registered", {
  plan: "Pro Annual",
  accountType: "Facebook"
});
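To keep event names consistent across the web and mobile code bases, one option is a thin wrapper that checks each event against an agreed list before sending it. The sketch below assumes a hypothetical tracking plan for Airbeds and a hypothetical `trackEvent` helper. Only the `analytics.track(event, properties)` call shape comes from Segment, and the client is passed in as a parameter so the snippet runs without the SDK.

```javascript
// Hypothetical tracking plan: event name -> required property names.
const TRACKING_PLAN = new Map([
  ["Listing Searched", ["keywords"]],
  ["Listing Viewed", ["listingId"]],
  ["Booking Made", ["listingId", "totalPrice"]],
]);

// Validates an event against the plan, then forwards it to Segment.
// `analytics` is any object exposing a track(event, properties) method.
function trackEvent(analytics, event, properties = {}) {
  const required = TRACKING_PLAN.get(event);
  if (!required) {
    throw new Error(`Unknown event: ${event}`);
  }
  const missing = required.filter((name) => !(name in properties));
  if (missing.length > 0) {
    throw new Error(`Event "${event}" is missing properties: ${missing.join(", ")}`);
  }
  analytics.track(event, properties);
}

// Stand-in for the Segment client so the sketch runs on its own.
const sentEvents = [];
const fakeAnalytics = {
  track: (event, properties) => sentEvents.push({ event, properties }),
};

trackEvent(fakeAnalytics, "Listing Searched", { keywords: "beach house" });
console.log(sentEvents);
```

Failing loudly on an unknown event is a deliberate choice for this sketch; in production you might prefer to log a warning so that a typo never breaks the user's session.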

Segment takes care of routing the collected metrics to a destination of your choice. Popular choices include data lakes, data warehouses, and analytics platforms.

In our case, we route to an S3 bucket in AWS.

Serverless analytics infrastructure in AWS

Storing and analyzing a vast amount of metrics can be a daunting task. It requires considerable storage and computational power, which often brings management overhead with it. Hence, we choose a serverless platform, which frees us from provisioning and managing infrastructure.

I chose AWS since I’m most familiar with it, but you can use another cloud vendor like GCP or Azure instead. The key thing is to learn the process and the technologies on AWS and then apply them to the platform you know best.

From clicks to insights: the complete journey

Detailed storage and processing architecture

In a nutshell, the complete journey to surface metrics would look like this:

  1. The product manager decides which metrics to capture from a business standpoint and communicates that to the engineering team.
  2. The engineering team adds the necessary instrumentation to the Airbeds web and mobile applications with the Segment SDK. They also create a new S3 bucket and grant Segment the access it needs to write metrics to that bucket.
  3. The engineering team releases the newly instrumented applications.
  4. Users engage with the web and mobile applications. The SDK captures engagement events and emits them to the Segment backend, which forwards them to the S3 bucket hourly.
  5. Airbeds’ data engineers build ETL pipelines using a managed ETL service like AWS Glue to clean the metrics and transform them into a format optimized for analytics. Currently, Segment writes the metrics in JSON format.
  6. The first ETL process reads the raw JSON from the S3 bucket, transforms it into a columnar format like Parquet or ORC, and writes the result to a different bucket.
  7. A second ETL process reads the transformed data and writes it to the Amazon Redshift data warehouse.
  8. Airbeds’ data analysts and data scientists can then analyze the processed data in two ways:
     • Use a BI tool like Tableau to connect to Redshift and perform exploratory analysis, generate dashboards, etc.
     • Use Amazon Athena to query the data in S3 directly to run experiments, train ML models, etc.
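The cleaning part of steps 5 and 6 would normally live in a Glue job, but the core idea fits in a few lines. The sketch below assumes the raw objects contain one JSON-encoded event per line, and that each event carries `userId` and `timestamp` fields in addition to the `type`, `event`, and `properties` fields shown earlier. The output column names and the `toRows` helper are hypothetical, and writing the rows out as Parquet is left out because it needs a dedicated library.

```javascript
// Assumed raw input: one JSON-encoded Segment event per line.
const rawLines = [
  '{"type":"track","event":"User Registered","userId":"u1","timestamp":"2021-06-01T10:00:00Z","properties":{"plan":"Pro Annual","accountType":"Facebook"}}',
  "",
  "not valid json",
  '{"type":"track","event":"Listing Viewed","userId":"u2","timestamp":"2021-06-01T11:30:00Z","properties":{"listingId":"l42"}}',
].join("\n");

// Parse newline-delimited JSON, skip blank lines, count malformed ones,
// and flatten each event into a fixed set of columns.
function toRows(ndjson) {
  const rows = [];
  let rejected = 0;
  for (const line of ndjson.split("\n")) {
    if (line.trim() === "") continue;
    let event;
    try {
      event = JSON.parse(line);
    } catch {
      rejected += 1;
      continue;
    }
    if (event === null || typeof event !== "object") {
      rejected += 1;
      continue;
    }
    rows.push({
      event_type: event.type ?? null,
      event_name: event.event ?? null,
      user_id: event.userId ?? null,
      // Date part of the ISO timestamp, handy as a partition key.
      event_date: typeof event.timestamp === "string" ? event.timestamp.slice(0, 10) : null,
      // Nested properties stay a JSON string so every row has the same columns.
      properties_json: JSON.stringify(event.properties ?? {}),
    });
  }
  return { rows, rejected };
}

const etlResult = toRows(rawLines);
console.log(etlResult.rows, etlResult.rejected);
```

Flattening to a fixed set of columns is what makes a columnar format like Parquet pay off later, since Athena and Redshift can then read only the columns a query touches.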

Takeaways and where next?

  • Product analytics is essential if you build a digital product and want to understand how well it meets user needs.
  • Building a product analytics platform is not for everyone. But if you do build one, consider using a proven instrumentation layer and a serverless platform as the backend.
  • This post only gave you a conceptual overview. I will walk you through a detailed example, with working samples, in a future series of posts.

Further reading

The Amplitude Guide to Product Analytics

What every product manager needs to know about product analytics

What is Product Analytics?

Dunith Dhanushka