Eligibility at Guild — Adventures in Event Processing on AWS

Extra Credit: A Tech Blog by Guild
Oct 15, 2020

Introduction

Guild Education is a marketplace that partners with employers to provide education benefits to their employees. Guild’s value proposition derives from our domain expertise and the infrastructure that we’ve built around administering these benefits. Operationally, some of the most important questions we strive to answer effectively and accurately are:

What is available to employee X as part of their benefit?

Are employee X’s current courses covered under their benefit?

We group these and related questions under a domain that we, at Guild, refer to as eligibility.

Eligibility

In practice, the aforementioned questions end up looking more like:

Based on Employee X’s job title, tenure at the company, and cumulative GPA at University Y on November 16th, 2019 — how much money does Company Z owe to University Y to cover Employee X’s tuition for the term starting that day?

To answer these questions, Guild aggregates data daily from both sides of the marketplace (employer partners and academic partners), and evaluates that data against rules defined as part of that employer’s benefit policy.

This process of aggregating and persisting eligibility data has gone through a number of iterations since Guild’s inception. Today, our engineering organization consists of over 100 engineers and many squads focused on solving different technical problems, most of which either provide data that is an input to eligibility or utilize eligibility data in some way. On the eligibility squad, we set out to create a single source of truth where operations, engineering, business intelligence, and our students can find the answers to these questions.

Problem

If all of this data lived in a single relational database, a query joining a few tables with some date filters and boolean logic would be fairly straightforward. At Guild, however, the data we need to derive eligibility decisions is spread across several different systems and databases. Our goal was to find a way to utilize data from these various platforms to provide current-day eligibility information to frontends and student teams, and historical eligibility snapshots to batch processes such as invoicing.

Architecture

Guild engineers have done some great work over the past year that has enabled us to scale rapidly and decouple systems and teams. One major advancement in our engineering infrastructure was the addition of our AWS Kinesis-powered Event Bus. Several systems from which we needed data were publishing events to the event bus already, so we decided to fully commit to an event-driven architecture and rely on those events to determine eligibility.

The high-level idea was to listen to events from a number of systems, recalculate a person’s eligibility whenever a relevant event occurred that might impact it, and persist that calculation to an append-only history of eligibility in Postgres. We built this system entirely on AWS using Kinesis, Lambda, SQS, Aurora Serverless, and AppSync, with CDK as our tool of choice for configuring and deploying infrastructure.
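To make the flow concrete, here is a rough sketch of what a normalized recalculation job might look like as it moves through the pipeline. The field names are illustrative assumptions, not our actual message schema.

```typescript
// Hypothetical shape of a normalized eligibility-recalculation job.
// Field names are illustrative, not the real message format.
interface EligibilityJob {
  employeeId: string;     // whose eligibility should be recalculated
  employerId: string;     // which employer partner's policy applies
  sourceEvent: string;    // the upstream event that triggered the job, for tracing
  effectiveDate: string;  // ISO date the recalculation applies to
  receivedAt: string;     // ISO timestamp when the event reached sink
}
```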

Eligibility Data Persistence Pipeline

Components

sink

sink is the main entry point into the eligibility processing pipeline. sink listens to multiple Kinesis streams as an event consumer within the event bus. It provides these functions:

  • Fan-in events from multiple upstream event publishers
  • Filter out events we don’t care about from those publishers
  • Take events that we do care about, and normalize data from those events into a standard message format that gets pushed onto our queue for processing

In sink we defined an abstract class that could be used to implement listeners for various streams. Using CDK for our infrastructure-as-code allowed us to do some cool things like dynamically introspect all Stream subclasses and set up the required AWS resources to enable sink to pull events off those Kinesis streams.
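As a minimal sketch of that pattern (hypothetical names, simplified from what we actually run, and reusing the EligibilityJob shape above): each upstream stream gets its own listener subclass, and the CDK stack attaches a Kinesis event source to the sink Lambda for each one. In the sketch the listeners are passed in explicitly, whereas our real code discovers the subclasses dynamically.

```typescript
import { Stack, StackProps, Construct } from '@aws-cdk/core';
import * as lambda from '@aws-cdk/aws-lambda';
import * as kinesis from '@aws-cdk/aws-kinesis';
import { KinesisEventSource } from '@aws-cdk/aws-lambda-event-sources';

// Runtime-side abstraction: each listener declares which stream it reads,
// which events it cares about, and how to normalize them into a job.
abstract class StreamListener {
  abstract readonly streamArn: string;
  abstract isRelevant(event: unknown): boolean;
  abstract normalize(event: unknown): EligibilityJob; // job shape sketched earlier
}

// Infrastructure side: wire every registered listener's Kinesis stream
// as an event source on the sink Lambda.
class SinkStack extends Stack {
  constructor(scope: Construct, id: string, listeners: StreamListener[], props?: StackProps) {
    super(scope, id, props);

    const sinkFn = new lambda.Function(this, 'SinkFn', {
      runtime: lambda.Runtime.NODEJS_12_X,
      handler: 'sink.handler',
      code: lambda.Code.fromAsset('dist/sink'),
    });

    listeners.forEach((listener, i) => {
      const stream = kinesis.Stream.fromStreamArn(this, `Upstream${i}`, listener.streamArn);
      sinkFn.addEventSource(new KinesisEventSource(stream, {
        startingPosition: lambda.StartingPosition.LATEST,
        batchSize: 100,
      }));
    });
  }
}
```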

queue

queue is a streaming collection of eligibility calculation jobs that organizes our workload using AWS SQS. When weighing architecture options, we considered driving our workload directly off of Kinesis streams but decided to add this intermediate buffer for the following reasons (a rough CDK sketch of the setup follows this list):

  • Observability — a single queue of normalized messages representing our workload simplified metric collection, dashboard creation, and alerting. We know that invocations of the downstream writer Lambda map 1:1 to jobs on the queue. This makes it easy to trace eligibility calculations as they flow through the system.
  • Replay — having one single queue for our workload with a single associated dead-letter queue made it very straightforward to retry failed jobs by just pushing them back onto the queue.
  • Decoupling — as long as sink received an event from an upstream system and the associated job made its way onto the queue, processing and scaling from that point forward were entirely within our domain, and we could tweak our internals as needed without having to change configuration on the centralized event bus.
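Here is that rough CDK sketch, again with hypothetical names; our actual queue configuration differs, but the queue-plus-dead-letter-queue arrangement and the one-message-per-invocation consumer are the important parts.

```typescript
import { Stack, StackProps, Construct, Duration } from '@aws-cdk/core';
import * as sqs from '@aws-cdk/aws-sqs';
import * as lambda from '@aws-cdk/aws-lambda';
import { SqsEventSource } from '@aws-cdk/aws-lambda-event-sources';

class QueueStack extends Stack {
  constructor(scope: Construct, id: string, writerFn: lambda.IFunction, props?: StackProps) {
    super(scope, id, props);

    // Failed jobs land here and can simply be pushed back onto the main
    // queue to be replayed.
    const deadLetterQueue = new sqs.Queue(this, 'EligibilityJobsDLQ', {
      retentionPeriod: Duration.days(14),
    });

    const jobsQueue = new sqs.Queue(this, 'EligibilityJobs', {
      visibilityTimeout: Duration.minutes(5),
      deadLetterQueue: { queue: deadLetterQueue, maxReceiveCount: 3 },
    });

    // One message per invocation keeps the 1:1 mapping between jobs on the
    // queue and writer Lambda invocations that we rely on for observability.
    writerFn.addEventSource(new SqsEventSource(jobsQueue, { batchSize: 1 }));
  }
}
```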

writer

writer's responsibility is to process jobs from queue. writer Lambdas pick up jobs from queue as they come in, run the data from those jobs through the eligibility rules Lambda, and append the new eligibility data to the database. If the result differs from the latest eligibility we have recorded for that user, writer publishes an event back onto the event bus to notify other systems that are interested in the change.
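A simplified sketch of what a handler like writer's could look like, assuming the EligibilityJob shape above and hypothetical helpers standing in for the rules invocation, the Aurora append, and the event-bus publish:

```typescript
import { SQSEvent } from 'aws-lambda';

// Minimal stand-ins; the real decision shape and helpers differ.
interface Decision { eligible: boolean; }
declare function evaluateEligibility(job: EligibilityJob): Promise<Decision>;
declare function appendDecision(job: EligibilityJob, decision: Decision): Promise<Decision | null>;
declare function publishChangeEvent(job: EligibilityJob, decision: Decision): Promise<void>;

export async function handler(event: SQSEvent): Promise<void> {
  for (const record of event.Records) {
    const job: EligibilityJob = JSON.parse(record.body);

    // Run the job through the eligibility rules to get a fresh decision.
    const decision = await evaluateEligibility(job);

    // Append the decision to the history; rows are never updated in place.
    const previous = await appendDecision(job, decision);

    // Only notify other systems when the decision actually changed.
    if (!previous || previous.eligible !== decision.eligible) {
      await publishChangeEvent(job, decision);
    }
  }
}
```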

poller

poller is a Lambda that sits outside of the event bus data stream, and whose job is to invoke an eligibility calculation in circumstances where we won't receive an event to trigger one. An example is tenure-based eligibility: typically we only receive hire date information from our employer partners, so we need to make sure that once N days have passed since an employee's hire date, their eligibility is re-evaluated.
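A hedged sketch of how a poller like this could be scheduled with CDK, using an EventBridge rule that fires daily; the schedule, names, and handler wiring are assumptions for illustration.

```typescript
import { Stack, StackProps, Construct, Duration } from '@aws-cdk/core';
import * as events from '@aws-cdk/aws-events';
import * as targets from '@aws-cdk/aws-events-targets';
import * as lambda from '@aws-cdk/aws-lambda';

class PollerStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const pollerFn = new lambda.Function(this, 'PollerFn', {
      runtime: lambda.Runtime.NODEJS_12_X,
      handler: 'poller.handler',
      code: lambda.Code.fromAsset('dist/poller'),
      timeout: Duration.minutes(5),
    });

    // Run once a day; the handler finds employees whose tenure threshold has
    // just been crossed and enqueues recalculation jobs for them.
    new events.Rule(this, 'DailyTenureCheck', {
      schedule: events.Schedule.rate(Duration.days(1)),
      targets: [new targets.LambdaFunction(pollerFn)],
    });
  }
}
```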

eligibility

eligibility is the brains of the pipeline. It houses employer eligibility policy rules and evaluates them against user state to produce an eligibility determination, along with metadata explaining why the output is what it is.
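This is a toy sketch of the general shape of policy evaluation, not Guild's actual rule engine: rules are small predicates with identifiers, and the result carries metadata about which rules failed.

```typescript
// Illustrative only; the real rules and state are richer than this.
interface EmployeeState {
  hireDate: Date;
  jobTitle: string;
  cumulativeGpa?: number;
}

interface PolicyRule {
  id: string;
  description: string;
  evaluate(state: EmployeeState): boolean;
}

interface EligibilityDecision {
  eligible: boolean;
  failedRules: string[];  // metadata on why the output is what it is
  evaluatedAt: Date;
}

function evaluatePolicy(rules: PolicyRule[], state: EmployeeState): EligibilityDecision {
  const failedRules = rules
    .filter((rule) => !rule.evaluate(state))
    .map((rule) => rule.id);
  return {
    eligible: failedRules.length === 0,
    failedRules,
    evaluatedAt: new Date(),
  };
}

// Example rule: at least 90 days of tenure (the threshold is illustrative).
const tenureRule: PolicyRule = {
  id: 'tenure-90-days',
  description: 'Employee has been with the employer for at least 90 days',
  evaluate: (state) => (Date.now() - state.hireDate.getTime()) / 86_400_000 >= 90,
};
```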

Learnings

Data is hard

Our work was only half done once we built the system, deployed it, and backfilled millions of eligibility decisions from two years’ worth of data. The process of testing and QAing data, verifying data integrity against both policy logic and the legacy systems still in use, and triaging discrepancies and deploying fixes as we discovered them was long and tedious.

We developed a couple of data quality dashboards that evolved with the help of our business intelligence team. We also did a healthy amount of querying in our data warehouse to track down tricky issues stemming from race conditions, upstream bugs, and the like.

Time travel is hard

If we only had to worry about starting from today and updating eligibility as we marched forward into the future, things would be pretty straightforward. However, we occasionally receive updated or corrected historical data and need to change a past eligibility decision. Other service teams also have internal logic for validating and “promoting” data after it’s received, which we had to account for; it’s not always as simple as “a new event happened.”

When historical data changes, we still need to keep the previous version of that day’s eligibility around, as there’s a chance someone may have seen it or it may have gotten used in a business process. Doing all this in the context of a database table that is effectively “append-only” proved to be tricky in certain edge cases.
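One general way to handle a correction under an append-only constraint, sketched below with a hypothetical table and column names, is to insert a superseding row for the same effective date with a higher version number, so the previously visible decision is preserved:

```typescript
// A hypothetical sketch: insert a superseding row (higher version) for the
// same effective date instead of mutating the previously recorded decision.
// Table and column names are illustrative.
async function appendCorrectedDecision(
  db: { query(sql: string, params: unknown[]): Promise<void> },
  employeeId: string,
  effectiveDate: string,                              // ISO date, e.g. '2019-11-16'
  decision: { eligible: boolean; failedRules: string[] },
): Promise<void> {
  await db.query(
    `INSERT INTO eligibility_history
       (employee_id, effective_date, version, eligible, failed_rules, recorded_at)
     SELECT $1, $2, COALESCE(MAX(version), 0) + 1, $3, $4, NOW()
       FROM eligibility_history
      WHERE employee_id = $1 AND effective_date = $2`,
    [employeeId, effectiveDate, decision.eligible, JSON.stringify(decision.failedRules)],
  );
}
```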

Uptime is hard…when you’re a needy service

Prior to the full persistence pipeline being complete and data being available in our database, the eligibility API calculated an employee’s eligibility inline during a request. This meant our uptime was the product of all of our downstream services’ uptimes, and we were directly impacted by any outages or degradations they experienced. While in this world, we struggled to figure out what we wanted to be alerted about (and how), what we were responsible for versus what issues we delegated to our downstream services to troubleshoot, and how demanding to be on those services in terms of performance and SLAs.

AWS X-Ray visualization of Eligibility Dependency Graph

To ease the addition of new API clients into our codebase while also protecting us from breaking changes, we introduced a Downstream abstraction that required API clients in our code to implement functionality that automatically added them to our health checks and reported the latency of their requests to a dashboard.
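A hedged sketch of what an abstraction like Downstream might look like; the interface and the metrics helper are illustrative assumptions, not our actual implementation.

```typescript
// Hypothetical metrics helper standing in for whatever reporting is in use.
declare function reportLatencyMetric(downstream: string, operation: string, ms: number): void;

abstract class Downstream {
  // A human-readable name used in health checks and dashboards.
  abstract readonly name: string;

  // Each client must expose a cheap probe so it can be rolled into the
  // service's health check automatically.
  abstract healthCheck(): Promise<boolean>;

  // Wrap outbound calls so per-downstream latency is reported centrally.
  protected async timed<T>(operation: string, call: () => Promise<T>): Promise<T> {
    const start = Date.now();
    try {
      return await call();
    } finally {
      reportLatencyMetric(this.name, operation, Date.now() - start);
    }
  }
}
```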

Conclusion

While this system is still nascent, we have tens of millions of eligibility records persisted for several million employees and process thousands of events per day to keep our data up-to-date. We have used this data to enable new UIs and scale our backend financial processes.

In addition to achieving our goal of creating a centralized source of truth for data within the eligibility domain, this processing pipeline and paradigm have since been extended and generalized to serve other use cases in the organization that require persisted, auditable data derived from events in our ecosystem.

I look forward to seeing this system evolve to meet new challenges and help the business scale as we add more partners on both sides of our learning marketplace.
