Data Engineering
Unveiling User Behavior: How We Build Products in a Data-Driven Manner
Explore how we leverage Snowplow, a robust website and app tracking tool, to comprehend user behavior here at Afya.
Have you ever wondered how users engage with your product? What are their preferences, needs, pains, and desires? How can you improve user experience to boost retention and conversion rates?
These questions are crucial for every product or business team to address, ensuring that solutions meet customer expectations. And to achieve this, data is essential — a lot of it!
However, it’s not just about collecting data. It’s about translating it into actionable insights. Insights that inform strategic decisions and tactical actions through clear and precise analysis of user behavior.
That’s where Snowplow comes in! It’s our preferred Platform-as-a-Service (PaaS) solution at Afya, allowing us to map user behavioral data and track interactions across our sites and applications. In this article, we’ll delve into what Snowplow is, how we capture tracking events, and the data architecture we’ve implemented.
What is Snowplow?
Snowplow is a platform designed to track user behavioral data within web applications or while browsing websites, created to empower product teams. Launched in 2012, Snowplow initially started as an open-source project with the goal of liberating data professionals from the limitations of proprietary solutions, which often act as “black boxes,” providing limited transparency or control over data. This contrasts with widely used tools like Google Analytics. The platform operates in full compliance with data privacy laws such as GDPR and CCPA, ensuring that data handling adheres to these regulations.
Snowplow was first implemented here in 2019 by the former PEBMED, a company that was acquired and integrated into Afya in 2020. One of its key features is the flexibility to customize the pipeline to specific needs and the freedom to run it on any cloud provider, in our case AWS.
For instance, if you want to understand user behavior in your digital product, you can create events for each action they take, such as clicking a button, filling out a form, or watching a video. These events are then sent to Snowplow via an API. Snowplow handles organizing, cleaning, validating, and even enhancing these events before sending them to a Data Warehouse, in our case, AWS Redshift.
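To make this concrete, here is a minimal sketch of how an event could be sent from a web application using Snowplow’s browser tracker. The collector endpoint, app ID, and schema URI below are illustrative placeholders rather than our actual configuration.

// Minimal sketch using Snowplow's browser tracker (@snowplow/browser-tracker).
// Collector URL, appId, and schema URI are illustrative placeholders.
import { newTracker, trackSelfDescribingEvent } from '@snowplow/browser-tracker';

// Point the tracker at the Stream Collector endpoint.
newTracker('sp', 'collector.example.com', {
  appId: 'my-app',
  platform: 'web',
});

// Track a custom, self-describing event: the user clicked the "subscribe" button.
trackSelfDescribingEvent({
  event: {
    schema: 'iglu:com.example/button_click/jsonschema/1-0-0',
    data: {
      buttonId: 'subscribe',
      screen: 'home',
    },
  },
});

Each call like this results in an HTTP request to the collector, which is the entry point of the pipeline described later in this article.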
Data-Driven Product Development
Here at Afya, we offer several digital products such as iClinic, Medical Harbour, Glic, among others mentioned in this post. In this section, we’ll spotlight Whitebook, a product that makes extensive use of Snowplow at Afya. Whitebook is a medical reference application and ranks among the most downloaded apps in major app stores.
Before delving into the lifecycle, I want to stress that this process only thrives when there’s mutual commitment among the Product, Technology, and Data teams. They must collaborate to define objectives, hypotheses, metrics, experiments, and actions. Each team plays a pivotal role, contributing their skills, knowledge, and perspectives.
What distinguishes the Data-Driven approach at Whitebook, something I hadn’t witnessed in previous companies, is that each team recognizes the significance of the other two. The approach shifts away from isolated departmental work with distinct expectations to a relay race where focus, goals, and outcomes are shared among all.
With this in mind, we collectively devised a workflow for implementing events, which I’ll outline below.
1. Conception of new events
When the Product team is in the ideation phase, with no clear vision yet of what will be developed or which hypotheses will be tested, both the data analysis and technology teams are already involved. This early involvement is our differentiator: taking part in these discussions lays the groundwork for understanding what information will be collected (Data) and how it will be collected (Technology). The type of information varies greatly from case to case, but it all shares a common goal: making informed decisions for the business.
For each scenario, we have a workflow clearly defined by the ProductOps team, as depicted in the figure below.
Where: P — Product, D — Development, A — Data Analysis, and E — Data Engineering.
2. Events Deployment
Once the information to be captured is defined, the development team checks whether schemas for these trackings already exist within Snowplow. If not, a new schema needs to be created or adapted, a task handled by the data engineering team.
To check the existing schemas and their versions, we use Snowplow BDP, a tool that presents this information visually. On the Snowplow page, you can review the concepts of events and entities, which are the two types of schemas available.
When we open any schema, we see a JSON that not only tells us what data is expected but also applies very current Data Governance concepts, such as Data Contracts. These contracts define what we call Good Data and Bad Data. Below is an example of this JSON.
{
  "description": "Event describing a user login. Default value for date-time is: Timezone UTC, format YYYY-MM-DDTHH:MM:SS.mmmZ",
  "properties": {
    "time": {
      "type": "string",
      "format": "date-time",
      "description": "Date and time when the login occurred, ISO formatted"
    }
  },
  "additionalProperties": false,
  "type": "object",
  "required": [
    "time"
  ],
  "self": {
    "vendor": "Whitebook",
    "name": "login",
    "format": "jsonschema",
    "version": "2-0-1"
  },
  "$schema": "http://<API>/schema/jsonschema/2-0-0#"
}
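For reference, the self block above is what makes the event “self-describing”: its vendor, name, format, and version are combined into an Iglu URI that travels with every event. A hypothetical login event conforming to this contract could be sent like this (tracker setup as in the earlier sketch):

// Hypothetical login event conforming to the schema above.
// The Iglu URI is built from the schema's self block: vendor/name/format/version.
import { trackSelfDescribingEvent } from '@snowplow/browser-tracker';

trackSelfDescribingEvent({
  event: {
    schema: 'iglu:Whitebook/login/jsonschema/2-0-1',
    data: {
      // ISO 8601 in UTC, as the contract requires.
      time: new Date().toISOString(),
    },
  },
});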
3. Capture, Enrichment, and Availability
As mentioned earlier, events are captured through API calls embedded in the code of our web pages and applications. These requests can also come from third-party applications that send events via webhook. Below is an overview of the solution’s architecture.
Iglu Server
The Iglu Server serves as a repository of the schemas that define the structure of the events you want to track. It ensures that all events are validated against the defined schema before being processed. Remember the schemas we registered in Snowplow BDP? This is where they are stored and served from.
Stream Collector
The Stream Collector is responsible for receiving events from various sources, serializing these events, and then forwarding them to a real-time processing platform. In our architecture, EC2 instances are exposed through an Elastic Load Balancer and then route this data to Kinesis.
Stream Enrichment
After raw events are collected, they’re sent to Stream Enrichment. This component validates, cleans, and enhances each event with additional information (such as geographical data, user data, etc.) before forwarding them to the next stage.
At this stage, each event read from Kinesis is validated against the schema we registered in Snowplow BDP. If it’s invalid, it’s sent to a Bad Data S3 bucket. If it meets the defined rules, it becomes Good Data and is then sent to a Spark cluster on EMR for further enrichment.
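Conceptually, this split can be pictured with the small sketch below. It is not Snowplow’s actual enrichment code, which is handled by the managed pipeline against the Iglu Server; it only illustrates, using the ajv JSON Schema validator, how an event either satisfies the contract (Good Data) or is routed aside (Bad Data).

// Conceptual illustration only: the real validation is done by Snowplow's
// managed enrichment step against the schemas served by the Iglu Server.
import Ajv from 'ajv';
import addFormats from 'ajv-formats';

const ajv = new Ajv({ allErrors: true });
addFormats(ajv); // enables the "date-time" format used by the login schema

const loginSchema = {
  type: 'object',
  properties: {
    time: { type: 'string', format: 'date-time' },
  },
  required: ['time'],
  additionalProperties: false,
};

const validate = ajv.compile(loginSchema);

function route(event: unknown): 'good-data' | 'bad-data' {
  // Good Data goes on to enrichment; Bad Data goes to the bad-rows bucket.
  return validate(event) ? 'good-data' : 'bad-data';
}

console.log(route({ time: '2024-05-01T12:30:00.000Z' })); // good-data
console.log(route({ time: 'yesterday', extra: true }));   // bad-data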
S3 Loader
Finally, the S3 Loader loads the enriched events into an S3 bucket. At this point, the data is available in the Data Lake, ready for analysis and for generating business insights.
There’s also a process called Shredder, which runs multiple times a day and is responsible for copying these files to our AWS Redshift.
The entire flow, from capture to loading into Redshift, is managed and supported by Snowplow, a Platform-as-a-Service.
4. Validation and Observability
This is the stage where Whitebook’s data engineering team holds the most responsibility, ensuring that everything happens as expected.
As mentioned earlier, engineering is the custodian of the schemas; therefore, together with the technology team, we are responsible for validating whether events are reaching the data lake as expected and for addressing schema-related questions. One of the primary tools we use for this validation is Kibana.
However, there are cases where we need to better understand the behavior of Bad Data; for these instances, we use AWS Athena.
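As an illustration of the kind of inspection this involves, the sketch below starts an Athena query over a hypothetical bad-data table; the region, database, table, and results bucket are placeholders, not our real resources.

// Hypothetical sketch: counting bad events with Athena via the AWS SDK v3.
// Region, database, table, and output bucket names are placeholders.
import {
  AthenaClient,
  StartQueryExecutionCommand,
} from '@aws-sdk/client-athena';

async function countBadEvents(): Promise<void> {
  const athena = new AthenaClient({ region: 'us-east-1' });

  const { QueryExecutionId } = await athena.send(
    new StartQueryExecutionCommand({
      QueryString: 'SELECT count(*) AS bad_events FROM snowplow_bad_rows',
      QueryExecutionContext: { Database: 'data_lake' },
      ResultConfiguration: { OutputLocation: 's3://example-athena-results/' },
    })
  );

  console.log(`Started Athena query ${QueryExecutionId}`);
}

countBadEvents();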
Regarding observability, we have several dashboards built in Grafana with integrated alerts in Slack. We primarily monitor EC2 instances, Kinesis, Redshift, and data quality. For instance, one of our main alerts is related to the percentage of events going into the Bad Data queue.
Once this data is in our Data Warehouse, the data analysis team, closely aligned with the product team, can generate insights, models, and dashboards to continually deliver value to our customers — the most crucial part of this workflow.
I appreciate your attention and hope this article has contributed to your knowledge on the subject. If you have any suggestions, criticisms, or compliments, please feel free to contact me via LinkedIn.
Now, if you want to be part of the country’s largest medical ecosystem, with cutting-edge data technologies, in an innovation-friendly environment, come join Afya!