Auto-Generated Monitoring of Event Data with Annotations

Automating the creation of monitors for eventing data with Auto-generated Event Monitors at Udemy

Salih Can
Udemy Tech Blog
9 min read · Apr 28, 2022


With over 49M learners tallying up over 680M course enrollments, it’s imperative that no errors in our platform’s code disrupt learners’ ability to access educational content. Meanwhile, as developers, we work with an abstract representation of what our code does. We spend so much time inside our heads that it’s very easy to make mistakes when mapping our thinking to reality. As we continuously ship code to production, we don’t actually know whether users encounter problems on their side. One way to overcome this blind spot is to monitor user events such as clicks and page views, which lets you see the granular details of your user activity and, in turn, any errors users encounter while navigating these events.

Photo by Matt Noble on Unsplash

In this blog post, we will look at how we make life easier for both event developers and designers by creating Auto-generated Event Monitors. We will also examine why monitoring is necessary to provide a better user experience and explore how we collect our monitoring data, aka our metrics.

Monitor Anything

At Udemy, we are always improving our system by adding new features and functionalities, so our system is becoming increasingly complex and distributed. Consequently, end-to-end monitoring of our systems is essential to providing better service to our users. You may be asking, “Why don’t you use testing strategies like unit or integration tests?” We do use a range of testing strategies, but they may not be successful at preventing:

  • Missing or incorrect translations
  • CSS issues (e.g., a button rendered off-screen)
  • Interaction between A/B experiments
  • Service outages whose interfaces are mocked in our tests

This is where event monitors come into play. Setting up event monitors helps us minimize these issues. To set them up, we supply the event monitors with metrics (for the purposes of this blog post, metrics are the total number of user events by type) as input from our servers. On top of this, we define a set of alarms on the monitors that alert us when any threshold is exceeded. If an alarm is triggered, we take immediate action to fix the issue so that no one using our platform has a bad experience.

[Image Caption] Notification of monitor alert via Slack or PagerDuty depending on its severity

We use Datadog as a SaaS-based monitoring tool for the functionalities mentioned above. Datadog has plenty of features for monitoring, alerting, notifying, and even collecting metrics in various ways. Since metrics are essential data for monitoring, let’s take a step back and see how we collect our metrics at Udemy.

Metrics in Action

Let’s make an analogy: imagine you go to the hospital for your annual health check-up. Most likely, one of the exams will be an electrocardiogram (ECG) test, in which the electrical signals of your heart are recorded. These heart signals are equivalent to our metrics, and the graphs on the test report are our monitors.

[Image caption] Sending metrics to the cloud with the Datadog Agent

The illustration above, in which Datadog Agents run as pods in our Kubernetes clusters, is similar to the ECG test. Each pod created by the DaemonSet acts as an ECG machine: it collects metrics from the other pods and sends them to the Datadog cloud. Different types of metrics can be collected, such as server health metrics, application metrics, or even our event metrics, which the event tracking system creates automatically after a new event is registered.

We will walk through an example of how AddToCartEvent is created, which service sends its metrics, and finally, how we use the relevant event monitor:

[Image caption] Journey of an event for monitoring
  1. Creation of a new event schema: Events in the event tracking system are managed through Avro schemas, which hold the specific properties of each event. Avro provides efficient binary encoding, validation, and schema evolution. In this case, we created AddToCartEvent, which holds a Course ID indicating which course the user added to the cart, along with its price.
  2. User clicks a course in the UI: User adds one of the courses to the cart and as a result of this, our JavaScript tracker library sends an AddToCartEvent to the Event Collector. We also have different tracker libraries to send events from different systems.
  3. Event collector receives the event: Event Collector is our internal service that provides a simple endpoint to publish events to the event tracking system. In this example, Event Collector receives the event, publishes the event to internal systems, and sends a timestamp metric to Datadog Agent without Course ID and price — because we are only interested in the total number of AddToCartEvent received in a given period of time to create our monitors.
  4. Datadog Agent forwards the metric to Datadog Cloud: Its whole purpose is to collect metrics from the pods and send them to the Datadog Cloud.
  5. Event monitor is updated in Datadog: If any monitor is defined for the event, Datadog will update it in the UI, check thresholds, alert on anomalies, and so on. If any threshold is exceeded, it will notify the related Slack channel or the on-call person defined in PagerDuty, depending on severity.
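As a rough sketch of step 3, the Event Collector could emit a count metric over the DogStatsD wire format (the plain-text format and default port 8125 come from the DogStatsD protocol; the metric and tag names are assumptions, not Udemy’s actual names):

```python
import socket

def format_count_metric(name, value, tags):
    """Build a DogStatsD count metric in its plain-text wire format:
    metric.name:value|c|#tag1,tag2"""
    tag_part = f"|#{','.join(tags)}" if tags else ""
    return f"{name}:{value}|c{tag_part}".encode()

def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to the local Datadog Agent's DogStatsD port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

# One AddToCartEvent received; no Course ID or price is attached,
# since we only care about the total count per event type.
payload = format_count_metric("events.received", 1, ["event_type:AddToCartEvent"])
```

In production, the Datadog client library would handle buffering and aggregation; the point here is only that the collector ships a bare count, stripped of business fields.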

Still, having these metrics in Datadog Cloud does not necessarily mean that required event monitors are created; someone needs to go to the UI and create them manually.

P.S. You can read more about our event tracking system’s architecture in this post.

Encourage Teams to Event Monitor

At Udemy, each team is responsible for creating and managing its own event data. Although there are some generic monitors applicable to all types of events, such as event serialization exception rates, in some cases each team should create its own event monitors. One example is an event monitor focused on traffic-pattern anomalies, since each team consists of subject-matter experts in their domain who know what is and isn’t expected in their data.

On the other hand, creating a new event monitor requires extra effort since it involves a manual process. First, you need to be aware that such event monitors can be created; then you need to know which metric to choose and how to configure it in the UI. This may be easy for those familiar with Datadog monitoring, but it may not be so simple for others. Either way, it remains a manual process.

To encourage teams to create their own monitors, we had to think about how to make this process easier. As a requirement, everyone should be able to create monitors with minimal effort.

Automate the Manual Part — Auto-generated Event Monitors

In the event tracking system, event schemas are stored in a single repository on GitHub as Avro IDL files. These schemas may change over time according to business requirements. To reflect those changes in the system, we have an internal service called ESM (Event Schema Manager), which works with GitHub webhooks. GitHub calls these webhooks when a new comment is added to a pull request that includes a schema change. Commenting esm register on the pull request takes care of the registration process.
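A minimal sketch of how such a webhook handler might recognize the command (the command strings come from this post; the parsing logic and the default environment are assumptions):

```python
import re
from typing import Optional

# Commands the post mentions: "esm register" and "esm prod register".
ESM_COMMAND = re.compile(r"^esm\s+(?:(?P<env>\w+)\s+)?register\s*$")

def parse_esm_command(comment_body: str) -> Optional[dict]:
    """Return the registration action requested by a PR comment, or None
    if the comment is not an ESM command. The environment defaults to
    'test' (an assumption; the post only shows 'prod' explicitly)."""
    match = ESM_COMMAND.match(comment_body.strip())
    if not match:
        return None
    return {"action": "register", "env": match.group("env") or "test"}
```

A real handler would also verify the webhook signature and the commenter’s permissions before acting on the command.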

With this background, we have two options for monitor automation:

  1. Define a new annotation type for each monitor on the schema, then register changes with the existing ESM command
  2. Implement a new set of ESM commands to create, edit or delete monitors

After reviewing these options, we chose the first one, since annotation-based configuration is the more intuitive approach for developers. The disadvantage of the latter is that developers wouldn’t be able to tell whether any monitors exist for their schemas just by looking at the schema structure. We like to think of our schemas as the source of truth in our system, meaning they should track everything related to an event, including its monitor changes.

[Image caption] An example of registering an event monitor change on the schema with our self-managed ESM service

Add a monitor

Now let’s explain how the annotation-based approach works by following the above illustration step-by-step:

  1. Add a new @traffic_anomaly_monitor({}) annotation to AddToCartEvent
  2. Comment esm prod register on the PR to register the changes
  3. Our lovely ESM tool parses the schema into a Python dictionary and creates a monitor via the Datadog API
  4. A new monitor is created with the provided parameters (on top of default parameters) in the Datadog UI
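Steps 3 and 4 could be sketched like this, using Datadog’s public v1 monitor-create endpoint (the endpoint and body fields are from Datadog’s API; the helper names, field values, and naming convention below are illustrative):

```python
import json
import urllib.request

def build_monitor_body(event_name, query, channel):
    """Assemble the request body for Datadog's monitor-create endpoint.
    The name/tag pair mirrors the convention described in this post."""
    return {
        "name": f"Traffic Anomaly with {event_name}",
        "type": "query alert",
        "query": query,
        "message": f"Traffic anomaly detected for {event_name} @slack-{channel}",
        "tags": ["esm-prod"],
    }

def create_monitor_request(body, api_key, app_key):
    """Prepare the POST request to Datadog's v1 monitor API (not sent here)."""
    return urllib.request.Request(
        "https://api.datadoghq.com/api/v1/monitor",
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )
```

In practice you would use Datadog’s official client library rather than raw HTTP, but the shape of the request body is the same.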

Note: An anomaly monitor is an algorithmic feature that identifies when a metric is behaving differently than it has in the past. Each annotation we define maps to one specific Datadog monitor type, such as an anomaly, log, or forecast monitor. Check out the Datadog documentation for more information.

Edit a monitor

Developers sometimes need to change existing event monitors to adjust and tune their parameters. For this reason, the monitor annotation accepts an Avro map in which we can set parameters that override the defaults. As an example, say our traffic_anomaly_monitor triggers too many alarms and we need to raise its alert threshold. We can do this by increasing the deviations parameter.
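A sketch of how ESM might pull these override maps out of a schema file (the annotation syntax shown in the comment is inferred from this post, and the regex-based extraction is an assumption about the implementation):

```python
import json
import re

# Hypothetical annotation syntax based on this post, e.g. in a schema file:
#   @traffic_anomaly_monitor({"deviations": "3"})
ANNOTATION = re.compile(r"@(?P<name>\w+_monitor)\((?P<params>\{.*?\})\)")

def extract_monitor_annotations(schema_text):
    """Map each monitor annotation found in the schema text to its override
    parameters. An Avro map is string -> string, hence the string values."""
    return {
        m.group("name"): json.loads(m.group("params"))
        for m in ANNOTATION.finditer(schema_text)
    }
```

The real ESM service parses the full Avro IDL rather than scanning text, but the output would be a similar dictionary of per-monitor overrides.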

Although we updated the parameter in the schema’s monitor annotation, the event monitor in Datadog hasn’t been updated yet. When registering the new schema changes with ESM, we need to get the monitor’s ID and pass the parameters to it so the changes are reflected in Datadog as well. Instead of storing these monitor IDs in a database, we chose to use a name and tag pair that is unique to each event monitor. This lets us search for the monitor by that pair and retrieve its ID. For example:

AddToCartEvent Traffic Anomaly Monitor → Name: Traffic Anomaly with AddToCartEvent, Tag: esm-prod
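Resolving the ID from that pair could look like this (a minimal sketch over the list of monitors a Datadog search would return; the structure of the monitor dictionaries is an assumption):

```python
from typing import Optional

def find_monitor_id(monitors, name, tag) -> Optional[int]:
    """Resolve a monitor's ID by its unique name/tag pair instead of
    storing monitor IDs in a database, as described above."""
    for monitor in monitors:
        if monitor["name"] == name and tag in monitor.get("tags", []):
            return monitor["id"]
    return None
```

Once the ID is known, the update is a PUT against the same monitor endpoint used for creation.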

Delete a monitor

After a while, a monitor may no longer be needed. To delete a monitor, you only need to remove the monitor annotation from the schema and register it again. After registration, the monitor is deleted from the Datadog UI as well.
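Putting the three operations together, a registration run can diff the monitor annotations before and after the schema change (a sketch; ESM’s actual implementation is not shown in this post):

```python
def diff_annotations(old, new):
    """Compare monitor annotations before and after a schema registration.
    Keys only in the new schema -> monitors to create, shared keys ->
    monitors to update, keys only in the old schema -> monitors to delete."""
    old_keys, new_keys = set(old), set(new)
    return (
        new_keys - old_keys,   # create
        new_keys & old_keys,   # update
        old_keys - new_keys,   # delete
    )
```

This is why removing an annotation and re-registering is enough to delete the monitor: the diff puts it in the delete set.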

More Details

Default parameters: When creating the example event monitor above, we didn’t pass any parameters to the annotation. Requiring every parameter did not seem like an easy-to-use approach for developers, so each monitor annotation takes only the parameters to be overridden; the ESM service fills in the rest and builds a Datadog monitor definition, which is used as the body of our POST request. Most of the logic lies in the monitor’s query parameter.
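Merging overrides onto defaults and rendering a query for @traffic_anomaly_monitor might look roughly like this (the `anomalies()` query syntax is Datadog’s; the default values and the metric name are assumptions):

```python
# Assumed defaults; the actual values used at Udemy are not shown in this post.
DEFAULTS = {
    "deviations": "2",
    "algorithm": "agile",
    "window": "last_4h",
}

def build_anomaly_query(event_name, params):
    """Merge annotation overrides onto the defaults and render a Datadog
    anomaly-monitor query over the event's count metric."""
    p = {**DEFAULTS, **params}
    metric = f"sum:events.received{{event_type:{event_name}}}.as_count()"
    return (
        f"avg({p['window']}):"
        f"anomalies({metric}, '{p['algorithm']}', {p['deviations']}) >= 1"
    )
```

With the deviations override from the earlier example, the rendered query widens the anomaly band from 2 to 3 standard deviations while the other defaults stay in place.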

Notification of teams for alerting: By default, Datadog requires you to specify channels in the monitor’s message parameter, such as @slack-{channel_name}. For convenience, we also override this parameter using the @alert_channel annotation already present in the schema.
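A sketch of that override (the @slack-{channel} mention syntax is Datadog’s; the message text and function name are illustrative):

```python
def build_message(event_name, alert_channel):
    """Render the monitor's message parameter, routing the notification to
    the team's Slack channel taken from the schema's @alert_channel
    annotation."""
    return f"Traffic anomaly detected for {event_name}. @slack-{alert_channel}"
```

Because the channel lives in the schema alongside the monitor annotations, the notification target stays in sync with the team that owns the event.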

Although Auto-generated Event Monitors work great for creating predefined event monitors, they are not a silver bullet for every monitoring need. More complex monitoring requirements can still be handled in the Datadog UI or with other approaches such as Terraform.

Conclusion

At Udemy, we are consistently working to be more data-driven in our decisions and we want to use our data to create more resilient and robust systems. Monitoring is one way of doing this and we hope this blog post clearly covered why event monitoring is so important, how we supply data to these monitors, and how to create monitors for eventing data easily with annotation-based Auto-generated Event Monitors.

Thank you for your time, and hopefully this blog post is useful for your projects. One final word: we encourage you to use automation for monitoring wherever it applies!

Acknowledgments

This work was made possible by the efforts of the Udemy Event Tracking Team. Additionally, huge thanks to the Udemy Customer Development Guild for coming up with the idea!
