Realtime Monitor for EventBridge Traffic

Ken Robbins
CloudPegboard
Aug 18, 2020 · 9 min read

Continuously monitor and pretty print Amazon EventBridge traffic.

Given its simplicity, Amazon EventBridge is a deceptively effective tool for building event-driven serverless architectures. I find that by just choosing to incorporate it in a serverless architecture, you are naturally guided to design a decoupled and extensible solution.

Decoupling is great, but it also adds a challenge for debugging. How do you know what’s getting posted on the bus? For example, say the result of my target never occurred. Did the target function fail, or did it never get the expected event? Do I look at the target logs (not all target types even have logs) or the source logs? Or, if like me, you like to create sequence diagrams to design and visualize important or complex event flows across multiple actors in a distributed event-driven system, how do you verify that your system is behaving according to your sequence diagram? And when you know that it’s not, how do you narrow down the error?

Rule #3, “Quit thinking and look” in Debugging by Dave Agans (one of my favorite tech books) means that you need to be able to see into the black box.

Observability is always key in app development and operations, but it’s even more crucial in decoupled and distributed event-driven systems. Useful as it is, we still have a blind spot for the critical information that is flowing across our event bus.

To solve this, I created a simple and effective pattern that makes it easy to see what’s happening on your bus. I think you’ll find it surprisingly useful.

My goal is to describe what I’ve done so that you can implement your own version that matches your environment. That is, this is a pattern, not a fixed solution. I have provided code that you can either use directly with local modifications, or just use as inspiration to build your own bus monitoring solution that matches your needs and technology choices.

There are four main elements to this design:

  • Your Amazon EventBridge environment
  • A snoop rule: An EventBridge rule to route a copy of all events to an AWS Lambda function (called ‘snoop’)
  • A snoop function: A Lambda function that dumps events to Amazon CloudWatch logs
  • A log viewer: A local script that continuously pretty prints events from the snoop function’s CloudWatch logs

Figure: Architectural pattern for monitoring EventBridge traffic for debugging

Let’s walk through each of these in sequence following the figure above.

Your EventBridge environment

Your EventBridge deployment is abstractly described as a collection of Event Sources, an Event Bus, routing Rules on that bus, and Event Targets. The bus may be the ‘default’ bus or could be a custom bus (our pattern will work with both cases). Events travel across the bus and selectively route to various targets (or may be ignored).

In operation, the actual events traveling on the bus are not observable until and unless they are logged by one of potentially many targets. Since some events are ignored by all targets, and typically no single target subscribes to all events, we need a means to observe what’s happening on the bus to aid in our application development and debugging.

To gain this observability, we’ll introduce a special ‘snoop’ target.

The snoop rule

A snoop rule is a kind of wiretap that lets us passively watch all or some of the event traffic on the bus. This is simply an EventBridge rule that subscribes a special ‘snoop’ function to all events that we might want to watch. For simple environments, you can create a rule that captures all events. Here’s one way you can specify a “watch everything” rule (or you could just check for the existence of the “account” or “version” attribute):

{
  "account": [
    "123456789012"
  ]
}

In plain language, this pattern reads: match all events where the "account" attribute equals my account number 123456789012.
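To build intuition for that plain-language reading, here is a minimal sketch of how such a flat pattern is evaluated: every key in the pattern must exist in the event, and the event’s value must equal one of the listed values. (Real EventBridge patterns support much more, such as nesting, prefix matching, and anything-but filters; this only models the simple equality case. The function and event values here are illustrative, not from the article’s code.)

```python
import json

def matches(pattern: dict, event: dict) -> bool:
    # Every pattern key must be present in the event, and the event's
    # value must be one of the allowed values listed for that key.
    return all(key in event and event[key] in allowed
               for key, allowed in pattern.items())

watch_everything = json.loads('{"account": ["123456789012"]}')

mine = {"account": "123456789012", "source": "my-app"}
theirs = {"account": "999999999999", "source": "my-app"}
```

With this sketch, `matches(watch_everything, mine)` is true while `matches(watch_everything, theirs)` is false, which is exactly why an account-equality pattern acts as a “watch everything in my account” rule.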

This is most likely applicable if you are using a custom event bus where all traffic is likely relevant. However, if you are using the default bus, or a custom bus with more traffic than is relevant, you’ll want a more targeted rule that ignores traffic that’s not relevant to your debugging or monitoring.

There are lots of ways to write rules. In my case, since I have a source naming convention, I find it effective to just watch all traffic from the sources that I care about (are part of my solution). You can write your wiretap rule however you like as long as it captures all the events that you want to see and ideally not too many more (we don’t want to add noise to obscure our signal). Don’t be too restrictive though, since the idea here is that the rule is fairly static and we can do any extra filtering in our snoop function as a localized change. This is especially important if running this in production since we don’t want to be changing production rules if we can avoid it — and maybe due to separation of duties, we don’t even have an easy way to quickly change rules.

Here is an example event pattern for my snoop rule for a Slack application:

{
  "source": [
    "slackbot-events-dev",
    "slackbot-auth-dev",
    "slackbot-news-dev",
    "slackbot-home-dev"
  ]
}

In plain language, this pattern reads: match all events with a "source" attribute that equals any of the listed items.
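If you prefer to script the rule rather than click through the console, a hypothetical boto3 helper could look like the following. The rule name, target id, and ARN are made-up placeholders, and the events client is passed in so you can hand it `boto3.client("events")` or a test stub; treat this as a sketch, not the article’s deployment code.

```python
import json

def create_snoop_rule(events_client, sources, lambda_arn,
                      rule_name="snoop-rule", bus_name="default"):
    # Build the source-list event pattern shown above.
    pattern = {"source": sources}
    # Create (or update) the rule on the chosen bus.
    events_client.put_rule(
        Name=rule_name,
        EventBusName=bus_name,
        EventPattern=json.dumps(pattern),
        State="ENABLED",
    )
    # Attach the snoop Lambda function as the rule's target.
    events_client.put_targets(
        Rule=rule_name,
        EventBusName=bus_name,
        Targets=[{"Id": "snoop-target", "Arn": lambda_arn}],
    )
    return pattern
```

Note that the Lambda function also needs a resource policy allowing events.amazonaws.com to invoke it (e.g., via `aws lambda add-permission`); infrastructure-as-code tools such as the Serverless Framework typically handle that wiring for you.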

The target for our wiretap rule is a special Lambda function that we’ll call ‘snoop’.

Snoop function

This function is barely more complex than Hello World! It simply captures all events routed to it and prints them to a CloudWatch Logs group. However, instead of being completely dumb about the printing, we can apply some simple logic to decide which events to save to the logs. As noted above, this lets you be more permissive with your wiretap bus rule, since you can always ignore a particular event using the snoop function’s logic for the small cost of a Lambda invocation (which is insignificant in the development and debugging scenario where this pattern is typically used). In my case, I use an environment variable (SNOOP_EVENTS) to control which events are logged so that I can change it dynamically without having to redeploy the function. The variable can take on values of ALL, NONE, or a pipe-delimited list of event names (the value that appears in the event’s detail-type property).
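A minimal sketch of such a handler, assuming the SNOOP_EVENTS convention just described (ALL, NONE, or a pipe-delimited list of detail-type values). The “SNOOPED|” log prefix is my assumed marker for the log-viewer contract, not a fixed part of the pattern:

```python
import json
import os

def should_log(detail_type):
    # SNOOP_EVENTS gates which events get written to the logs.
    setting = os.environ.get("SNOOP_EVENTS", "ALL")
    if setting == "ALL":
        return True
    if setting == "NONE":
        return False
    return detail_type in setting.split("|")

def handler(event, context=None):
    detail_type = event.get("detail-type", "")
    if not should_log(detail_type):
        return False
    # In Lambda, anything printed ends up in CloudWatch Logs.
    print("SNOOPED|%s|%s" % (detail_type, json.dumps(event)))
    return True
```

Changing SNOOP_EVENTS in the Lambda console takes effect on the next invocation, with no redeploy.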

The only other bit of logic that occurs in the snoop function is to format the event into a standardized form to make it easy to parse later. This can be a zero or nearly zero transformation, or more complex depending on your needs. The key is to remember that the log format is a contract with the log viewer that we’ll discuss in the next section.

In my case, I’m using Python and the ‘logging’ module configured with this format string:

'%(asctime)s|%(levelname)s|go2:%(module)s|%(lineno)d|%(message)s'
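To see what lines in this format look like, and why the pipe delimiters make them easy to parse later, here is a self-contained demo that writes one record to an in-memory buffer instead of CloudWatch (the logger setup and sample message are illustrative only; in a Lambda you would configure the existing logger once):

```python
import io
import logging

FORMAT = '%(asctime)s|%(levelname)s|go2:%(module)s|%(lineno)d|%(message)s'

# Demo-only setup: attach a handler that writes to an in-memory buffer.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(FORMAT))
log = logging.getLogger("snoop-format-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info('SNOOPED|OrderPlaced|{"orderId": 42}')

line = buf.getvalue().strip()
fields = line.split("|")  # timestamp, level, go2:module, lineno, message...
```

Every field is separated by a pipe, so the downstream viewer can recover the level, module, and message with a single split.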

You can grab the annotated code as a starting point from here.

At this point we have a system that passively monitors events on the bus and prints them to log streams in a common log group. That’s swell, but using the CloudWatch logs console to watch realtime events is a horrible and inefficient developer experience. The console is just not designed for this usage pattern.

What we really want is something that lets us filter the logs for the specific statements printed by our snoop function in near realtime, and then formats matching statements in a developer friendly way so that we can focus on understanding the state of our application and not squint at a jumble of raw log text.

If you happen to have a third-party or custom log solution that does all that, then by all means, use that. If not, then the next and final element in our pattern is a local script to surface the signal that we’ve stuffed into our logs in a way that lets us focus on the development and debugging task at hand with maximum efficiency.

Log Viewer

For many event-driven designs, we can’t just look in a debugger to understand the state since it exists outside of the scope of any single process context. However, just like in-app state, we have a need to see the state in realtime as part of the development and debugging cycle. Therefore, conceptually, what we’d like is a way to do the equivalent of a “tail -f” to continuously display our logged events as they occur. And this doesn’t just mean all log statements, we only want to see the “events” (by comparison, if you put a watch on a variable, you don’t want to see all variables and all memory state). Finally, we want to gain a semantic understanding of the behavior of our system. Therefore, we’d ideally like to see these events formatted in a way that focuses on that goal.

To make this happen, I created a Python script (busmon.py) that continuously pulls recent entries from our snoop function’s log group (without needing to know which log stream an event is in). The script pulls only the specific event log entries that our snoop function created (e.g., those that contain the string “SNOOPED”), filtering out all other log statements (those are useful, but not the purpose of this solution). Finally, since we know the format (the “contract” mentioned above), we can write a simple function to pretty print the information so that we see only what we need, in a way that works efficiently with our development flow. For example, the event type, the most important aspect, is printed on its own line in a bold color, while the rest of the attributes remain available in an indented JSON format. You can of course customize the pretty printing for your unique needs.
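The pretty-printing step can be as simple as splitting on the contract’s delimiters. Here is a hypothetical sketch; the “SNOOPED|” marker and the ANSI styling are my own assumed choices, not fixed parts of the pattern:

```python
import json

BOLD_GREEN = "\033[1;32m"
RESET = "\033[0m"

def pretty(raw_line):
    """Turn one raw snoop log line into a developer-friendly display,
    or return None for lines that aren't ours."""
    if "SNOOPED|" not in raw_line:
        return None
    # Everything after the marker is "<detail-type>|<event JSON>".
    detail_type, payload = raw_line.split("SNOOPED|", 1)[1].split("|", 1)
    body = json.dumps(json.loads(payload), indent=2)
    indented = "\n".join("    " + ln for ln in body.splitlines())
    # Event type on its own bold line; full event indented below it.
    return "%s%s%s\n%s" % (BOLD_GREEN, detail_type, RESET, indented)
```

The viewer loop then just reads raw lines from the log source, calls `pretty()` on each, and prints any non-None results.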

Figure: Sample output for CloudPegboard.com’s Slack app for reading AWS news

I could have used the SDK to pull logs, but for expedience, I use the awslogs open source project and call that from the busmon.py script. To reduce one extra dependency, I also attempted to use the aws logs tail command from the AWS CLI v2. However, the --follow flag that is required to get continuous output does not work if you try to pipe the output to another process (I’ve submitted an AWS Support request and will update this post when resolved).

Note that there is latency both in the time it takes for events to show up in CloudWatch Logs and in the polling for new events using awslogs. Empirically, the total seems to be about 12 seconds.

As you can see, this is all quite straightforward. Wrap it up with whatever infrastructure-as-code approach you use and adjust the code slightly to meet your individual needs (the sample code shows specifics for the Serverless Framework, which I happen to use and love, but anything will work). The sample code annotates the areas you will likely need to customize. If you don’t use Python, porting to your runtime of choice should only take minutes.

Once deployed, when you are developing or debugging your event-based application, simply open a terminal window and run busmon.py to see an easy to read display of your bus activity.

Some final tips

  • It’s usually a good idea to set an expiration on the log group used by the snoop function.
  • Remember to adjust the SNOOP_EVENTS environment variable to limit or disable event reporting without having to change your EventBridge rule.
  • For production, you probably want to disable the snoop wiretap rule until and unless you need it. You could set SNOOP_EVENTS to NONE, but in a high traffic production environment, that could be a lot of noop Lambda invocations and log streams that have no value.
  • I haven’t tried this, but I suspect that the latency can be reduced by using the AWS SDK to (aggressively) pull recent events instead of using awslogs or the AWS CLI v2. Twelve seconds is not that long, but half that would be much better. Please share if you decide to make this enhancement.

Summary

Amazon EventBridge can be an important capability in your serverless and event-driven toolbox. However, it provides no means to observe the content of the traffic on the bus. This is a critical blind spot for efficient debugging of distributed event-driven systems. To close this gap, consider deploying your own version of the wiretap snooping pattern described here at the start of your development cycle.

By the way, the app development that drove my need for this pattern is a Slack app I’m developing to provide a better, personalized way to consume AWS news. If you are interested in providing early feedback and shaping the solution, you can DM me on Twitter, @CloudPegboard.

About Cloud Pegboard

Cloud Pegboard helps AWS practitioners be highly effective and efficient even in the face of the tremendous complexity and rate of change of AWS services. We want to keep AWS fun and feel more like dancing in a sprinkler on a hot summer day and less like being blasted with a firehose. We make the AWS information you need amazingly easy to access, and with personalization features, we help you stay up to date on the AWS services and capabilities that are important to you and your specific projects.

Free sign-up at CloudPegboard.com



I’m the founder of CloudPegboard.com, a powerful tool for any AWS practitioner trying to keep up with the complexity and rate of change of AWS services.