Feature Segmentation Infrastructure Leveraging Kafka

Jordan Bragg
apree health (Castlight) Engineering
6 min read · Apr 20, 2021
[Image: Feature segmentation across Castlight’s product features, showing how different product features relate to unique segments of the population]

As Castlight moved beyond transparency and health plan benefits to providing a comprehensive health navigation platform, we needed more advanced user segmentation capabilities to create appropriate user experiences for large employers with diverse populations.

The Problem

While many teams leverage this segmentation infrastructure, this post will use Challenges as an example. The Challenges feature allows customers to run competitions that encourage employees to be more active. The core problems encountered were:

  1. We needed to offer different challenges to users based on various attributes that could originate from an eligibility file, personalization, or user interaction.
  2. When the criteria for a challenge change, or specific attributes for a user change, we require the experience to reflect that change in near real time.
  3. Challenges require a consistent view of which users are eligible across the application, outbound communication, reporting, and analytics. For example: Which challenges can a user participate in? Which coworkers can a user invite? How many people are eligible for the challenge?
  4. Excellent runtime performance, traceability, and observability are critical.

Guiding Principles

During evaluation and design, we established guiding principles:

  1. Pre-calculate as much as feasible. For this application infrastructure, that meant account-to-feature mappings and a consolidated view for accounts. By pre-calculating, we simplify runtime logic, improve performance, provide observability before a user ever interacts, enable accurate bulk outbound notifications, and provide a single answer for all teams to leverage.
  2. Calculate only what is required. If customer Acme runs a “Walk with Friends” challenge for the month of February, we should not continue calculating complex segmentation rules once the challenge is no longer in use. Tying the calculation to its usage creates levers for improving performance and scalability while keeping the system’s state and long-term maintainability under control.
  3. Enforce the single-responsibility principle to keep the domain boundary around the association of accounts and features clear. Teams demand various ways to access and leverage the data as application infrastructure, but domain-specific logic and administration live with the owning team.
  4. Leverage standard workflows, technology, and tools wherever possible and avoid home-grown solutions for commodity problems.

While breaking down the problem, there were two notable sub-projects:

  1. A real-time data pipeline for sourcing changes to account-associated attributes across data sources and teams to build an account aggregate.
  2. Given an account aggregate, provide application infrastructure to calculate and expose an account-to-feature mapping.

Account Aggregate Data Pipeline

The data pipeline system ended up being the more challenging piece for the team. Our initial dataset comprised around 11 database tables across three data sources and processes, totaling roughly two billion records and continuing to grow. We now process a few additional tables in support of new segmentation requirements.

To accomplish this and to provide near real-time updates, the team leveraged the Kafka ecosystem (Kafka, Connect, Streams). Database replicas were set up, and Debezium was used to source data into Kafka. From there, we created a large topology of joins, filters, and aggregates to generate a single account aggregate.
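
To give a feel for the shape of that topology, here is a heavily simplified Kafka Streams sketch. The topic names, serdes, and join logic are hypothetical, and the real pipeline joins and aggregates far more tables; the point is only how Debezium-sourced change topics can be combined into one consolidated record per account.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;

import java.util.Properties;

public class AccountAggregateTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical CDC topics populated by Debezium; for simplicity the values
        // are plain strings and every topic is keyed by account id.
        KTable<String, String> accounts = builder.table("cdc.accounts",
                Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> demographics = builder.table("cdc.demographics",
                Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, String> benefits = builder.table("cdc.benefits",
                Consumed.with(Serdes.String(), Serdes.String()));

        // Join the per-table change streams into a single aggregate per account.
        KTable<String, String> accountAggregate = accounts
                .join(demographics, (account, demo) -> account + "," + demo)
                .leftJoin(benefits, (partial, ben) -> ben == null ? partial : partial + "," + ben);

        // Publish the consolidated view for downstream indexing (e.g., Elasticsearch).
        accountAggregate.toStream().to("account-aggregate",
                Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "account-aggregate-builder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        new KafkaStreams(builder.build(), props).start();
    }
}
```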

[Image: Representation of the data pipeline, in which 11 data sources are joined into a single account aggregate entity]

The output is a single aggregated record per account, which we indexed in an Elasticsearch cluster to flexibly handle requests from our segmentation sub-system. We set up numerous checkpoints and monitors to guarantee data quality at every step of the pipeline.
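
One common way to land an aggregate topic in Elasticsearch is a Kafka Connect sink, which we also lean on elsewhere in the pipeline (see Learnings). Below is a rough sketch that registers a Confluent Elasticsearch sink connector over Connect’s REST API; the connector name, topic, hosts, and task count are assumptions, not our actual configuration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterElasticsearchSink {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector definition using the Confluent Elasticsearch sink.
        String connector = """
            {
              "name": "account-aggregate-es-sink",
              "config": {
                "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
                "topics": "account-aggregate",
                "connection.url": "http://elasticsearch:9200",
                "key.ignore": "false",
                "schema.ignore": "true",
                "tasks.max": "2"
              }
            }
            """;

        // Kafka Connect exposes a REST API; POST /connectors creates the connector.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connector))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```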

Segmentation of Accounts

The segmentation portion involved developing a data contract and tooling that let teams leverage the segmentation infrastructure, while internally keeping our answer up to date.

In keeping with calculating only what is required, and to allow independent scaling and minimize noisy-neighbor effects, the segmentation criteria and calculation live and die with the feature configuration. Some duplicative calculation exists in this system, but the benefits of simplicity and decoupled setup easily outweigh the costs of the duplication. We produced tooling and libraries to make this an automated process for the teams and equipped them with full control and monitoring of their integration with the segmentation infrastructure.

Feature teams communicate segmentation criteria via Kafka. While a team may expose richer domain information to other Kafka consumers, the segmentation infrastructure requires only a subset of that data, conforming to our Avro schema. Once a team starts publishing, we consume the topic and enforce backward compatibility using Schema Registry. Below is an example for the Challenges feature team.

[Image: An example “Walk with Friends” challenge configuration in JSON, with eligibility criteria of Acme employees who live in the US]
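
As a rough sketch of what publishing such criteria could look like, the snippet below builds a small Avro record and sends it through a producer backed by Schema Registry, which enforces backward compatibility as the contract evolves. The topic name, record name, and field names are hypothetical, chosen only to mirror the “Walk with Friends” example above.

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PublishSegmentationCriteria {
    public static void main(String[] args) {
        // Hypothetical Avro schema for the subset of fields the segmentation
        // infrastructure requires from a feature team's configuration.
        Schema schema = SchemaBuilder.record("SegmentationCriteria")
                .fields()
                .requiredString("featureGuid")   // e.g. "abc"
                .requiredString("featureType")   // e.g. "challenge"
                .requiredString("employer")      // e.g. "Acme"
                .requiredString("country")       // e.g. "US"
                .endRecord();

        GenericRecord criteria = new GenericData.Record(schema);
        criteria.put("featureGuid", "abc");
        criteria.put("featureType", "challenge");
        criteria.put("employer", "Acme");
        criteria.put("country", "US");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        // Schema Registry registers the schema and checks backward compatibility.
        props.put("schema.registry.url", "http://schema-registry:8081");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("segmentation-criteria", "abc", criteria));
        }
    }
}
```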

For the above challenge, a scheduled job examines the challenge configuration, queries an Elasticsearch index to determine which accounts are eligible, and persists records associating those accounts with the challenge’s guid, ‘abc’.
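
A concrete illustration of that eligibility query is sketched below using the Elasticsearch high-level REST client. The index name, field names, page size, and the way results are persisted are all assumptions for the sake of the example.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ChallengeEligibilityJob {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient es = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Hypothetical index and field names for the account aggregate.
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.boolQuery()
                            .filter(QueryBuilders.termQuery("employer", "Acme"))
                            .filter(QueryBuilders.termQuery("country", "US")))
                    .size(1000); // real code would scroll/paginate through all matches

            SearchResponse response = es.search(
                    new SearchRequest("account-aggregate").source(source),
                    RequestOptions.DEFAULT);

            for (SearchHit hit : response.getHits()) {
                // Persist the account-to-feature mapping, e.g. (accountId, challengeGuid).
                System.out.printf("account %s is eligible for challenge %s%n", hit.getId(), "abc");
            }
        }
    }
}
```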

However, account attributes change often; tomorrow a user’s home country may change because they moved to, say, Canada. The second part of our scheduled job processes the Kafka messages for all account attribute changes (the same stream that seeds Elasticsearch) and recalculates eligibility accordingly.

Exposing the data

[Image: The avenues through which eligibility data is exposed for the Challenges example, including REST, MySQL, Elasticsearch, and Kafka]

The segmentation infrastructure, as mentioned above, has many clients and therefore must serve diverse needs for how the data is accessed. Usability was an essential requirement, so we expose our data via:

  1. REST endpoints
  2. Kafka topics
  3. Elasticsearch indices
  4. Database replication

Tradeoffs

As with most designs, we end up with a list of tradeoffs:

  1. Since we provide output tables and jobs per integrating feature, we can scale and support feature-specific SLAs and needs independently. The downside is that each integrating team must incorporate and maintain the eligibility criteria as the source of truth. Tooling and automation mitigate these issues somewhat, but they still exist.
  2. Because of the system’s asynchronous nature and the timing and ordering of batch jobs, it can take up to a few minutes for changes to propagate. Changes to accounts and updates to their features are near real-time per feature or attribute change, but the segmentation calculations run in a fixed order, which introduces some delay. This problem is solvable but has yet to warrant the cost.
  3. The segmentation infrastructure’s only source of truth is the account-to-feature mapping. Since we intentionally limited its responsibility, some concerns have to be solved by the feature team or by separate infrastructure (e.g., ensuring that a user is not eligible for a challenge before its start date).

Learnings

Ensure observability, monitoring, and tooling are first-class concepts

Despite our investment in support tooling from day 1, we’ve found that complex data pipelines in Kafka are difficult to debug. Examples include:

  1. Kafka topics are not optimized for direct queries, and we found it best to leverage Kafka Connect sink connectors to PostgreSQL or Elasticsearch at key points in the pipeline.
  2. A tip to conserve disk space and I/O bandwidth is to sink only the record key, topic, partition, and offset, so you can quickly find where to look in Kafka for the entire message. This works much like a secondary index in a relational database (a sketch follows this list).
  3. We tried KSQL and found the overhead and limitations to be too large at that time for monitoring (it’s built on Kafka Streams).
  4. Most of the default Kafka consumer and producer configurations need tweaking, so ensure you understand your use case and delivery requirements.
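
In practice we did this sinking with Connect, but a minimal sketch of the key/offset-index idea, written here as a plain consumer that records only each message’s coordinates in PostgreSQL, looks like the following. The topic, table, and connection details are hypothetical.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class KeyOffsetIndexer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "debug-key-offset-indexer");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/debug", "debug", "debug")) {

            consumer.subscribe(List.of("account-aggregate"));
            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO kafka_key_index (record_key, topic_name, topic_partition, record_offset) "
                            + "VALUES (?, ?, ?, ?)");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Store only the coordinates, not the payload, so the table stays small
                    // and points straight back to the full message in Kafka.
                    insert.setString(1, record.key());
                    insert.setString(2, record.topic());
                    insert.setInt(3, record.partition());
                    insert.setLong(4, record.offset());
                    insert.executeUpdate();
                }
            }
        }
    }
}
```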

Take the time to do thorough capacity planning

While we did a capacity planning exercise, we ended up requiring additional system resources and having to re-plan topics. Examples include:

  1. Partition sizing for stream applications needs to be planned carefully. Topic partition counts are difficult to change after the fact (lowering them loses data; increasing them puts data out of order). For stream applications, join operations require the same partitioning on both sides, and even mostly stateless operations run into write issues when reading from, say, 32 partitions and writing into 8 (a repartitioning sketch follows this list).
  2. Unless you are doing purely stateless operations, Kafka Streams requires flash storage, since aggregation and join operations are backed by RocksDB reads and writes.
  3. As part of the project, we had to scale up, harden, and add sufficient Kafka monitoring (Cruise Control, Prometheus, Burrow).
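
To make the co-partitioning point concrete, the sketch below explicitly repartitions one stream before a join so that both sides end up with the same partition count. The topic names, partition counts, and join window are hypothetical.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

import java.time.Duration;

public class CopartitionExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topics: "account-attributes" has 32 partitions, "feature-criteria" has 8.
        KStream<String, String> attributes = builder.stream("account-attributes",
                Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> features = builder.stream("feature-criteria",
                Consumed.with(Serdes.String(), Serdes.String()));

        // Joins require co-partitioning: same key and same partition count on both sides.
        // Explicitly repartition the wider stream down to 8 partitions before joining.
        KStream<String, String> repartitioned = attributes.repartition(
                Repartitioned.with(Serdes.String(), Serdes.String())
                        .withNumberOfPartitions(8));

        repartitioned.join(features,
                (attr, feature) -> attr + "|" + feature,
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        // Inspect the resulting topology, including the generated repartition topic.
        System.out.println(builder.build().describe());
    }
}
```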

This segmentation infrastructure has been live in production and iterated on for the past few years. Some big shout-outs to folks who made this possible: Roopa Venkatesha, Rajiv Kutty, Edward Wang, Hina Ramani, Balaji Kolli, Richard Berger, Bhumi Shah, Robert Stewart, the resilience team, infrastructure team, and Castlight R&D leadership.
