Get on the Bus! | Guild’s Road to Event-Driven Architecture 🚌

travishaby · Extra Credit: A Tech Blog by Guild · 6 min read · Jun 25, 2020

Guild’s technology landscape started like lots of startups do: with a monolith. Over time, we took out our proverbial scalpel and carved off chunks of the monolith, creating standalone services to own domains like our academic program catalog, the student application process, employee registration, etc. We also built MVPs of some systems in less-than-ideal ways in the service of short delivery timelines, and often ended up needing to move those systems onto new infrastructure, into new codebases, and so on.

As we built on our initial MVPs and our engineering org matured, we often ran into problems whenever two systems shared a direct integration and we went to change or replace one of them. And as the number of discrete services in the Guild ecosystem grew, we kept running into problems communicating from one system to many other systems. With a direct integration model, that means N × (N − 1) integrations (e.g., for N = 5 services, 5 × 4 = 20), one from each service to every other service. Clearly, this is not a way to live.

What a tangled web we weaved.

Enter an event-driven architecture. While not a panacea, the team landed on an architecture in which systems communicate events through a centralized “Event Bus,” which would allow us to be more flexible and scale our communications much more effectively. This architecture is of course not unique to Guild, and we ourselves had already been testing out architectural concepts, like event streams and event consumers, that would be used in an event-driven world. Furthermore, we had already stood up a few web services that leveraged some of the technologies that would become the bedrock of the Event Bus.

Sample diagram of how user-triggered events flow onto our event stream, where they’re available for any consumer listeners.

Working off input from many engineers across a number of teams, one of our principal engineers, Rich Haase, spearheaded bringing the Event Bus to life. He created an engineering requirements document to put a stake in the ground for what we would build. With that in hand, another engineer, Johnny Coster, and I began work on what would become our MVP.

Over the course of about two months, we built a very lean, no-frills system that did one thing and did it well: validate published events and send them to event streams. It provided three main features:

  • An interface for publishing events that supported both HTTPS and direct invocation
  • An event schema registry and event validation system
  • Event streams that allowed for multiple consumers of the same event, with a one-event-to-many-streams mapping configuration (meaning event A, when published to the event bus, would be sent to stream A, stream B and stream C)

While we were ruthlessly focused on building something simple in a short time frame, we also knew that this system would be relied on as a core piece of infrastructure for many years to come. It needed to be able to scale with our business and technology usage as Guild’s active student numbers continue to increase, and as more systems rely on the Event Bus to communicate information out to consumers in the Guild ecosystem. To this end, we relied on a few AWS technologies that our team already had production experience with and that are built for scale: API Gateway for our HTTP interface, Lambda for running our service’s code, and Kinesis for our event streams. Each of these services is managed by AWS and provides a layer of abstraction, which means we don’t have to worry about problems like how many compute resources the Event Bus is using or how much network traffic is hitting it.
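To give a rough picture of that wiring, here is a minimal sketch of how those three pieces could be stood up together, assuming the AWS CDK in TypeScript; the construct names, runtime, and packaging details are illustrative assumptions rather than Guild’s actual infrastructure code.

```typescript
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as kinesis from 'aws-cdk-lib/aws-kinesis';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// Illustrative sketch only: an API Gateway fronting the Event Bus lambda,
// which writes validated events to a Kinesis stream.
export class EventBusStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // A Kinesis stream that downstream consumers read events from
    // (the stream name here is taken from the example later in the post)
    const userProfileStream = new kinesis.Stream(this, 'UserProfileStream', {
      streamName: 'user-profile-service-prod',
    });

    // The lambda that validates published events and forwards them to Kinesis
    const eventBusFn = new lambda.Function(this, 'EventBusFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('dist'),
      timeout: Duration.seconds(10),
    });
    userProfileStream.grantWrite(eventBusFn);

    // The HTTPS interface: API Gateway proxies publish requests to the lambda
    new apigateway.LambdaRestApi(this, 'EventBusApi', { handler: eventBusFn });
  }
}
```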

“Streamlined”

Let’s take a look at how the system works from the perspective of a “user-verification” event published to the Event Bus. This event is an important one in the Guild ecosystem as it allows the user to confirm their identity based on information that we received from their employer. In effect, it’s the first touch point on the Guild platform where a user’s identity can be trusted and many different systems can start taking action to set up data and features to support that user’s interaction with Guild’s platform.

First, a system publishes an event via an HTTP POST request to the Event Bus’s API Gateway URL. API Gateway receives the request and passes the payload transparently to the Event Bus lambda function. (Alternatively, systems can skip the API Gateway step and invoke the lambda directly.)
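To make the publish step concrete, here is a small sketch of what a producer-side call might look like over HTTPS; the endpoint, event-name wrapper, and payload fields are illustrative assumptions, since the post doesn’t show the actual publish contract.

```typescript
// Hypothetical publisher making an HTTPS call to the Event Bus
// (Node 18+, built-in fetch). Endpoint and field names are assumptions.
async function publishUserVerification(): Promise<void> {
  const response = await fetch('https://event-bus.example.com/publish', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      eventName: 'user-verification',
      version: 1,
      payload: {
        userId: 'user-123',
        employerId: 'employer-456',
        verifiedAt: new Date().toISOString(),
      },
    }),
  });

  if (!response.ok) {
    // Validation failures come back with detailed errors (more on that below)
    throw new Error(`Event rejected: ${await response.text()}`);
  }
}
```

Producers running inside AWS could instead invoke the lambda directly with the AWS SDK, skipping the API Gateway hop.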

Then, a few important pieces of configuration determine what happens next. Based on the name of the event, and optionally a version, a schema is chosen to validate the shape and contents of the event. Here’s an example of this configuration mapping for our event:

Example of the Event Bus’s config that connects event names to schemas and event stream destinations
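Since the original config screenshot isn’t reproduced here, the sketch below shows roughly what such a mapping could look like; the field names and the second stream are assumptions, and only the user-verification event and the user-profile-service-prod stream are taken from the post.

```typescript
// Illustrative shape of the mapping from event name to schema and streams.
// Keys and structure are assumptions, not the Event Bus's actual config format.
const eventBusConfig = {
  'user-verification': {
    schema: 'schemas/user-verification.v1.json', // JSON Schema used for validation
    streams: [
      'user-profile-service-prod',
      'hypothetical-analytics-prod', // one event can fan out to many streams
    ],
  },
};
```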

To validate the events sent to the Event Bus, we rely on the JSON Schema standard. We considered a number of other schema definition technologies, but JSON Schema is accessible to a wide range of engineers, and we already had some JSON Schema definitions in use in production. If it ain’t broke, don’t fix it! Here’s a snippet from the user verification event schema definition used to validate the corresponding event:

Example JSON Schema definition for the user verification event
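The original schema snippet was shown as an image, so here is an illustrative stand-in; the property names and constraints are assumptions rather than Guild’s actual user verification schema.

```typescript
// Illustrative JSON Schema for the user-verification event.
// Property names and constraints are assumptions.
const userVerificationSchema = {
  $schema: 'http://json-schema.org/draft-07/schema#',
  title: 'user-verification',
  type: 'object',
  required: ['userId', 'employerId', 'verifiedAt'],
  additionalProperties: false,
  properties: {
    userId: { type: 'string', minLength: 1 },
    employerId: { type: 'string', minLength: 1 },
    verifiedAt: { type: 'string', format: 'date-time' },
  },
} as const;
```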

Validation against the schema ensures the event’s JSON has the necessary fields, that values are of the expected type and length, and more. This validation process is very important: it means that, further downstream, event consumers don’t have to be programmed defensively, because they can trust that the shape and contents of the events they consume are as expected.

If the event doesn’t pass validation, detailed errors on what parts were invalid are sent back to the publisher. Assuming it does pass validation, the final step is to put the event on the Kinesis stream specified in the configuration file shown above. In the case of our user verification event, and in our production environment, the event will be put on the Kinesis stream called “user-profile-service-prod.”
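Putting the validation and forwarding steps together, here is a rough sketch of what the core of such a handler could look like, assuming Ajv for JSON Schema validation and the AWS SDK v3 Kinesis client; the config lookup, function shape, and error format are assumptions, not Guild’s actual implementation.

```typescript
import Ajv, { SchemaObject } from 'ajv';
import { KinesisClient, PutRecordCommand } from '@aws-sdk/client-kinesis';

// Minimal inline config for illustration (see the mapping sketch above).
const config: Record<string, { schema: SchemaObject; streams: string[] }> = {
  'user-verification': {
    schema: {
      type: 'object',
      required: ['userId'],
      properties: { userId: { type: 'string' } },
    },
    streams: ['user-profile-service-prod'],
  },
};

const ajv = new Ajv({ allErrors: true });
const kinesis = new KinesisClient({});

// Validate an event against its schema, then fan it out to the configured streams.
export async function handleEvent(eventName: string, payload: unknown) {
  const entry = config[eventName];
  if (!entry) {
    return {
      statusCode: 400,
      body: JSON.stringify({ error: `Unknown event: ${eventName}` }),
    };
  }

  const validate = ajv.compile(entry.schema);
  if (!validate(payload)) {
    // Send detailed validation errors back to the publisher
    return { statusCode: 400, body: JSON.stringify({ errors: validate.errors }) };
  }

  // Put the event on every stream configured for this event name
  await Promise.all(
    entry.streams.map((streamName) =>
      kinesis.send(
        new PutRecordCommand({
          StreamName: streamName,
          PartitionKey: eventName,
          Data: Buffer.from(JSON.stringify(payload)),
        })
      )
    )
  );

  return { statusCode: 200, body: JSON.stringify({ accepted: true }) };
}
```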

That’s it!

While there are certainly more details to the system that I glossed over, you really did get the gist of it. Because we want this to be a reliable, core piece of infrastructure that we can use for many years, it’s engineered to be simple. In the spirit of the Unix philosophy, it does one thing, and it does it well. In recent months, new features have been added to the Event Bus, which we will be sure to update you on. We also hope to share more about how we’ve used the Event Bus to communicate between systems, as well as patterns we’ve adopted for event consumption. Be on the lookout for updates here on Extra Credit for those topics and more!
