Building a Highly Scalable Synthetic Monitoring Solution

Vikas Kumar
8 min read · Aug 18, 2023


By Vikas Kumar, Lead Software Engineer @Maersk

Introduction

As Maersk continues on its path to digital transformation, the reliability of its systems has taken center stage. Reliability is a multi-faceted term, but one central part of it is observability. As part of Maersk’s effort to standardize and centralize all of its observability offerings, we needed to build a highly scalable and performant Synthetic Monitoring solution.

I am writing this blog to share Maersk’s journey toward building this API Synthetic Monitoring solution. I will touch upon the design considerations, key features, and challenges we faced while designing and developing it.

What is Synthetic Monitoring?

Synthetic Monitoring is the practice of continuously testing, with synthetic (mock) data, the workflows a user can take while interacting with your system. It can cover APIs as well as end-to-end UI workflows. In this blog, we are going to focus on Synthetic Monitoring for APIs.

Maersk’s Synthetic Monitoring Landscape

At Maersk, we were using various paid tools to cater to Synthetic Monitoring use cases. In an effort to standardize Synthetic Monitoring practices, we wanted to offer a centralized solution as part of our Observability offerings, so we started defining the scope of this alternative. The major features expected of any Synthetic Monitoring solution at Maersk were:

  • Health Checks for APIs
  • Ability to schedule tests — Simple and CRON
  • Assertions
  • Test History and Reporting
  • Multi-Step Testing or Workflow Testing
  • Environment and Variable Support
  • Global Location Support
  • Easy Migration Support

We explored many open-source options for this use case, including but not limited to Gatus, Uptime Robot, Kuberhealthy, Elastic Synthetics, and Grafana Synthetic Monitoring. The conclusion of this exploration was that no completely free tool covered all of our requirements, and none could be easily extended to do so. Thus, we decided to build a custom tool that would cater to all the requirements, provide easy migration from the paid vendor tools, and integrate easily with the other platforms at Maersk.

What we wanted to achieve was a highly scalable, fault-tolerant system that could handle the Synthetic Monitoring load at Maersk not just today but also in the future, as Maersk’s continued digitalization will lead to more and more APIs being built and, eventually, monitored by this solution.

Resource Hierarchy

We had the concept of a Team, to which you could add team members and assign them roles; we provided simple RBAC capabilities based on these roles. Test Suites are collections of Tests. A Test represents a flow, such as an order-purchase flow, and its Steps represent the individual steps (API calls) of that flow. In the order-purchase flow above, the steps could be searching for an item, adding the item to the cart, making the payment, verifying the payment, dispatching the order, etc.

We also had the concept of Environments, which were essentially data templates that enabled reusing the same tests in multiple environments with different data. This was powered by variables that were part of an environment, where you would put different values for different environments. When running a test or scheduling it for a run, you had to choose an environment, and the variables would be substituted at test run time. You could also choose to mark these variables as secrets. The environment is also where you configured which locations you wanted the test to run from. Environments belonged to a Test Suite.
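To make the hierarchy concrete, here is a minimal sketch of how these resources relate to each other, expressed as Java records. The class and field names are illustrative, not our actual data model.

import java.util.List;
import java.util.Map;

record Team(String name, Map<String, String> memberRoles, List<TestSuite> suites) {}

record TestSuite(String name, List<Test> tests, List<Environment> environments) {}

record Test(String name, List<Step> steps) {}           // e.g. an order-purchase flow

record Step(String name, String method, String url,     // one API call of the flow
            List<String> assertions) {}

record Environment(String name,                         // e.g. "staging", "production"
                   Map<String, String> variables,       // values substituted at run time
                   List<String> secretVariableNames,    // resolved from the secret store
                   List<String> locations) {}           // locations the test runs from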

Architecture

We wanted to design this system to be highly pluggable and loosely coupled. We also wanted to build abstractions over most of the components we used as dependencies, so that they could easily be swapped for an alternative if the need arose. For example, if we wanted to use some other solution than Vault for storing secrets, the only changes needed would be in the Secret Manager component, and the rest of the system would keep working as is.
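As an illustration of this pluggability, here is a minimal sketch of what such an abstraction could look like for the secret store. The interface and the in-memory stand-in are hypothetical, not the actual component; the real implementation would call the Vault API behind the same contract.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Abstraction over the secret backend; only its implementations know about Vault (or any alternative).
interface SecretStore {
    String read(String key);
    void write(String key, String value);
}

// Stand-in implementation to illustrate the contract; swapping providers means
// adding another implementation of SecretStore, nothing else changes.
class InMemorySecretStore implements SecretStore {
    private final Map<String, String> secrets = new ConcurrentHashMap<>();

    public String read(String key) { return secrets.get(key); }

    public void write(String key, String value) { secrets.put(key, value); }
}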

This architecture, with all these considerations, enabled us to build a highly performant, scalable, and distributed system while at the same time catering to use cases such as Global Location testing and Internal Endpoint testing, which I will talk about in a bit.

We built a UI, which was mostly used for configuring tests, visualizing the results, and checking the details of any failures that occurred while a test ran. The API was consumed by the UI as well as through direct API calls to persist test configurations and the schedules at which tests were supposed to run. For persisting the test data, we used PostgreSQL.

The Scheduler, the brain of this system, used Quartz Scheduler internally to manage the schedules and fire the triggers at the scheduled times. We supported both simple time-based and CRON-based triggers, which Quartz supports out of the box. We also ran the Quartz Scheduler in clustered mode, which enabled automatic load distribution and fault tolerance across the scheduler instances.
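For readers unfamiliar with Quartz, here is a minimal, self-contained sketch of registering a CRON-triggered job. The job class, identifiers, and CRON expression are illustrative, not our actual code.

import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class ScheduleTestExample {
    public static void main(String[] args) throws SchedulerException {
        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.start();

        // The job that, when fired, pushes the test onto Kafka for an agent to execute.
        JobDetail job = JobBuilder.newJob(RunSyntheticTestJob.class)
                .withIdentity("test-42", "synthetic-tests")
                .usingJobData("testId", "42")
                .build();

        // CRON trigger: run every 5 minutes. Simple triggers use SimpleScheduleBuilder instead.
        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("test-42-trigger", "synthetic-tests")
                .withSchedule(CronScheduleBuilder.cronSchedule("0 0/5 * * * ?"))
                .build();

        scheduler.scheduleJob(job, trigger);
    }

    // Hypothetical job implementation; the real one would fetch/cache test details and publish to Kafka.
    public static class RunSyntheticTestJob implements Job {
        @Override
        public void execute(JobExecutionContext context) {
            String testId = context.getJobDetail().getJobDataMap().getString("testId");
            System.out.println("Would publish test " + testId + " to Kafka for execution");
        }
    }
}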

Whenever a trigger fired for the first time, the test details were fetched from the database and cached before the test was sent to run via Kafka. This minimized DB calls, as most of the time the data being pulled for a test does not change. Whenever it did change, the API evicted the cache entry related to that test, forcing the scheduler to pull and cache the updated data.
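Here is a minimal sketch of this cache-then-evict pattern, using a plain in-memory map as a stand-in for the real cache and DB lookup; names and the loader are illustrative.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class TestDetailsCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loadFromDb;   // stand-in for the real DB lookup

    TestDetailsCache(Function<String, String> loadFromDb) { this.loadFromDb = loadFromDb; }

    // First trigger for a test hits the DB; subsequent triggers are served from the cache.
    String getTestDetails(String testId) {
        return cache.computeIfAbsent(testId, loadFromDb);
    }

    // Called (via the API) whenever a test is updated, so the next trigger reloads fresh data.
    void evict(String testId) {
        cache.remove(testId);
    }
}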

The properties we used for Quartz are listed below:

quartz:
  job-store-type: jdbc
  wait-for-jobs-to-complete-on-shutdown: true
  overwrite-existing-jobs: true
  properties:
    org:
      quartz:
        scheduler:
          instanceName: Scheduler
          instanceId: AUTO
        jobStore:
          driverDelegateClass: org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
          tablePrefix: qrtz_
          useProperties: false
          isClustered: true
          clusterCheckinInterval: 1000
        threadPool:
          class: org.quartz.simpl.SimpleThreadPool
          threadCount: 100
          threadPriority: 5

The details of these properties, and how you can tune them to best fit your use case, are very well documented in the official Quartz Scheduler documentation.

The Agent is what provided the ability to run tests from multiple locations (a.k.a. global locations). The job of an agent was to run the test, perform the assertions, and create and publish the results to another Kafka topic. Keeping the agent separate from the scheduler also enabled us to cater to another use case, Internal Endpoint testing, which is testing endpoints deployed on Maersk’s internal network: we could simply place one agent in the internal cluster that consumed from Kafka in the cloud and published results back to Kafka in the cloud.
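To illustrate the agent’s role, here is a minimal sketch of a consume-run-publish loop using the Kafka clients API. The topic names, consumer group, and runTest helper are assumptions for the example, not the actual agent code.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class AgentLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "synthetic-agent-eu-west");   // one group per agent location (assumed)
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            consumer.subscribe(List.of("synthetic-test-runs"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    String result = runTest(record.value());   // execute steps and assertions
                    producer.send(new ProducerRecord<>("synthetic-test-results", record.key(), result));
                }
            }
        }
    }

    // Stand-in for the real execution engine: calls the API steps and evaluates assertions.
    private static String runTest(String testPayload) {
        return "{\"status\":\"PASSED\"}";
    }
}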

The Secret Manager component was an abstraction on top of HashiCorp Vault to persist any secret data that might be required to run a test, such as tokens, credentials, etc.

The Test Data Manager component’s job was to persist test result data into MongoDB, which we chose as the historical database for test results. It also served requests from the API for historical or aggregated data used for visualization and reporting. Another of its responsibilities was to provide the required data to the Integration Service for further integrations.
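As an illustration, here is a minimal sketch of persisting one result document with the MongoDB Java driver. The database, collection, and field names are assumptions, not our actual schema.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.time.Instant;

public class TestResultWriter {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> results =
                    client.getDatabase("synthetics").getCollection("test_results");

            // One document per test run, holding the outcome reported by the agent.
            results.insertOne(new Document("testId", "42")
                    .append("status", "PASSED")
                    .append("location", "eu-west")
                    .append("durationMs", 312)
                    .append("executedAt", Instant.now().toString()));
        }
    }
}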

The Data Cleaner gave us the ability to retain historical data for at most X days or the latest Y entries, whichever retains more. It cleaned up all the data that fell outside these configurable limits.
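Here is a minimal sketch of that retention rule, selecting for deletion only the results that fail both criteria. The types and method names are illustrative.

import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

record StoredResult(String testId, Instant executedAt) {}

class RetentionPolicy {
    private final int maxDays;      // X
    private final int maxEntries;   // Y

    RetentionPolicy(int maxDays, int maxEntries) {
        this.maxDays = maxDays;
        this.maxEntries = maxEntries;
    }

    // Returns the results to delete: only those that fail BOTH retention criteria.
    List<StoredResult> selectForDeletion(List<StoredResult> resultsForOneTest, Instant now) {
        Instant ageCutoff = now.minus(Duration.ofDays(maxDays));
        List<StoredResult> newestFirst = resultsForOneTest.stream()
                .sorted(Comparator.comparing(StoredResult::executedAt).reversed())
                .toList();
        return newestFirst.stream()
                .skip(maxEntries)                                    // outside the latest Y entries
                .filter(r -> r.executedAt().isBefore(ageCutoff))     // and older than X days
                .toList();
    }
}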

The Integration Service consumed the historical data and used it for further integrations, such as alerting when test failures matched configurable criteria, for example 5 consecutive failures or 5 failures in the last 10 results, which the user could configure while creating a test.
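A small sketch of how such criteria can be evaluated against the most recent results (newest first); the method names and sample data are illustrative.

import java.util.List;

class AlertCriteria {
    // True if the newest n results are all failures.
    static boolean consecutiveFailures(List<Boolean> newestFirstPassed, int n) {
        return newestFirstPassed.size() >= n
                && newestFirstPassed.stream().limit(n).noneMatch(passed -> passed);
    }

    // True if at least `failures` of the newest `window` results failed.
    static boolean failuresInWindow(List<Boolean> newestFirstPassed, int failures, int window) {
        long failed = newestFirstPassed.stream().limit(window).filter(passed -> !passed).count();
        return failed >= failures;
    }

    public static void main(String[] args) {
        List<Boolean> results = List.of(false, false, false, false, false, true, true, false, true, true);
        System.out.println(consecutiveFailures(results, 5));   // true: the 5 newest results all failed
        System.out.println(failuresInWindow(results, 5, 10));  // true: 6 failures in the last 10
    }
}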

Other Important Features

Workflow Testing

We implemented workflow testing by giving the user the option to extract data from the response of a previous step and use it in subsequent steps of the Test. This was essential for testing a complete workflow. It was also essential for dealing with secured endpoints, where you might first want to generate a token using a service principal and then pass that token to access the secured endpoint.

We also provided an option to extract the variable to a global level (environment level), which enabled the reuse of things like token generation and consumption across tests.
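To illustrate the idea, here is a minimal sketch of extracting a value from a previous step’s response into the variable context, using Jackson’s JSON Pointer support. The response shape, pointer, and variable names are assumptions, not the actual extraction engine.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.HashMap;
import java.util.Map;

public class StepExtractionExample {
    public static void main(String[] args) throws Exception {
        // Step 1 (e.g. token generation) returned this response body.
        String step1Response = "{\"access_token\":\"abc123\",\"expires_in\":3600}";

        // Extract a value from the previous step's response into the variable context...
        Map<String, String> variables = new HashMap<>();
        JsonNode body = new ObjectMapper().readTree(step1Response);
        variables.put("auth_token", body.at("/access_token").asText());

        // ...so a subsequent step can reference it, e.g. as {{auth_token}} in a header.
        System.out.println("Authorization: Bearer " + variables.get("auth_token"));
    }
}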

Variables and Secrets Support

As mentioned above, we had the option to create variables in an environment. Once a variable was added to the environment, it could be used anywhere in the test configuration with the syntax {{variable_name}}. Internally, we used Handlebars to substitute the variables at runtime. While creating a variable, you could also mark it as a Secret; for secret variables, the value was pulled from Vault at runtime and substituted into the test data.
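A minimal sketch of what this substitution looks like with Handlebars.java; the template and variable values are illustrative.

import com.github.jknack.handlebars.Handlebars;
import com.github.jknack.handlebars.Template;

import java.util.Map;

public class VariableSubstitutionExample {
    public static void main(String[] args) throws Exception {
        Handlebars handlebars = new Handlebars();
        Template template = handlebars.compileInline("{{base_url}}/orders/{{order_id}}");

        // For secret variables, the value would be fetched from Vault before this step.
        Map<String, String> environmentVariables = Map.of(
                "base_url", "https://api.example.com",
                "order_id", "12345");

        System.out.println(template.apply(environmentVariables));
        // prints: https://api.example.com/orders/12345
    }
}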

In the UI, we also provided variable autocomplete, which lists all the variables you have defined as soon as you start typing the variable syntax, {{.

Post-Response Scripts

This is another feature we plan to include in the system. We will provide an option for users to write custom scripts that are executed after the response of an API call is received. These scripts can be used to perform custom assertions or extractions: a user will write JavaScript code to run complex assertions on the response and extract any data they need.

We plan to implement this using GraalVM, which gives us the capability to run JavaScript code from within Java code.
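A minimal sketch of what this could look like with the GraalVM polyglot API; the response JSON, the script, and the binding name are illustrative assumptions.

import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class PostResponseScriptExample {
    public static void main(String[] args) {
        String responseBody = "{\"status\":\"CONFIRMED\",\"orderId\":12345}";

        // A user-authored post-response script performing a custom assertion.
        String script = "const body = JSON.parse(response);" +
                        "body.status === 'CONFIRMED' && body.orderId > 0;";

        try (Context context = Context.create("js")) {
            // Expose the API response to the script under the name "response".
            context.getBindings("js").putMember("response", responseBody);
            Value assertionPassed = context.eval("js", script);
            System.out.println("Assertion passed: " + assertionPassed.asBoolean());
        }
    }
}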

Final Result

We started with the goal of creating a system that was highly scalable, durable, and performant. With the design choices above, we were able to achieve that goal and benchmark the tool against not just current but future loads as well.

We tested the tool at around 200 tests per second, and since the system scales easily both horizontally and vertically, we can extend it to cater to any load as the need arises.

We also did a cost analysis: we were able to run the complete stack above, including the networking cost, for under $700 per month, which is remarkably low for the load mentioned above.

Conclusion

Thanks for staying with me so far. That is it for the API Synthetic tests. We are in the process of extending this tool to also cater to UI Synthetic capabilities, and once we do, I will write another blog explaining how we did it.

We also plan to open-source this solution as part of Maersk’s initiative to give back to the open-source community, and I will update this blog with links to both once they are available.

Thanks for reading, and I hope this helps. Please feel free to reach out to me if you need any more insights into this.

Contact: vikas.k@maersk.com; krvikas1011@gmail.com; https://www.linkedin.com/in/thisissvikas/

Vikas Kumar

Passionate engineer with 7+ years of experience in the design and implementation of enterprise applications.