SRE Culture in API Governance Using Anypoint Platform

Key concepts of API governance using MuleSoft’s Anypoint Platform

Anandasankar Joardar
Another Integration Blog
18 min read · Oct 17, 2022


Introduction

An API, or Application Programming Interface, is at the core of any digital value chain. Digital application developers rely on APIs to create better customer experiences. MuleSoft’s Anypoint Platform can be used to manage the complete API lifecycle, from design to deployment to management. The platform enables organizations to create API-led solutions for digital initiatives with scale and agility.

Organizations want to unlock core business offerings through performant and resilient APIs. The longevity of an API also depends on its ease of use, high availability, scalability under increased demand, freedom from operational disruptions, and sound core functionality. Since APIs constitute the nervous system of any digital transformation, setting up a robust strategy for production operations is key to the success of such transformation initiatives.

MuleSoft has out-of-the-box offerings to support your operations team in managing APIs at scale. MuleSoft also supports seamless integration with many standard third-party tools and applications to ease mission-critical operational activities. In addition to these tools, a culture is needed to establish a production-critical operational strategy that helps organizations treat APIs as another product offered to end customers.

Site Reliability Engineering (SRE) was originally introduced by Google to maintain the reliability of Google's own services. This blog is an attempt to showcase how SRE techniques, blended with MuleSoft's out-of-the-box capabilities, can create a robust API governance strategy that treats APIs as products. Below, I discuss common SRE techniques within the context of MuleSoft API governance.

Break the Silos Between Development and Operations

There is always a need to release new features on API offerings to stay ahead of the competition. However, stability of the API environment is also a primary goal, so that business continues without disruption. Implementing changes can break stability, but not introducing any change can make the organization irrelevant in its market. SRE culture can be applied to strike a balance between stability and new releases.

  1. Involvement of the operations team in the API design phase is absolutely necessary. While design discussions often focus on business functionality, the operations team ensures that non-functional requirements are also taken into consideration. They can share how to design APIs for easy tracing, faster error debugging, and maintaining the overall SLA (Service Level Agreement). Input from the operations team during the design phase often helps create more resilient, scalable, and observable APIs. In addition, the operations team will understand the proposed change within the context of the business and can prepare for its impact on the production environment.
  2. Set up an error budget and have it approved by all stakeholders. An error budget is the acceptable percentage of error within which the API is considered to be operating stably. 100% error-free is not a practical target to aim for; there should be an acceptable threshold for every KPI. Only if a KPI stays below that threshold for a sustained period should the API be considered unstable. The operations team can allow new releases to be rolled out as long as the API platform is functioning within the error budget. For example, if the availability SLO (Service Level Objective) is 99.99% for a business month, then roughly 4.3 minutes of outage (0.01% of a 30-day month) is acceptable. If APIs are down for longer than that during the month, the development team should focus on making the existing APIs stable and hold off on new feature releases. A process needs to be established where error budgets, along with their consumption rates, are presented to all stakeholders during every release planning meeting. Based on the error budget consumption rate, go/no-go decisions for a release should be made. It is important to remember that most end users experience only the digital applications, not the API itself. Therefore, the error budget should be measured across the digital value chain and not only at the API layer.
  3. Introduce changes in phases rather than everything in one go. It is recommended (and most suitable with MuleSoft) to adopt an agile delivery method. A sprint-based delivery method should be adopted in which development aims for a minimum viable product (MVP) in every sprint. Since failure is inevitable (an SRE principle we need to accept), gradual introduction of change reduces the cost of failure. It is important to fail fast to keep the overall digital transformation timeline on track. MuleSoft's out-of-the-box capabilities allow you to share mock services with end customers and application developers so that API users are involved in the early phases. This makes user feedback available before the actual development and allows you to rectify the API design accordingly. Establish an agile release train with seamless coordination between the development, security, QA, and operations teams to move changes in a controlled and regulated fashion. Incremental change also enables faster rollback, which in turn helps achieve faster recovery times. MuleSoft advocates an API-led connectivity approach to foster reusability, composability, and overall agility in delivery.
  4. When introducing new features into an existing API, it is recommended to add a new method or endpoint to expose the new feature. Make the API backward compatible so that the change is transparent to existing consumers. If additional fields are added to existing data models, it is best to keep them optional as much as possible (see the sketch after this list). Alternatively, you could add new resources to the API with an updated data model to isolate existing consumers from changes that are intended for new consumers only. After discussing with the stakeholders, share a deadline with all existing consumers for the deprecation of old features. Many organizations make API versioning a common practice; however, API versioning needs to be used judiciously, as too many versions of one API often lead to maintenance overhead and redundant infrastructure resource allocation. Instead, it is better to add endpoints or methods incrementally to release incremental features for an API (while keeping the number of endpoints manageable).
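To make the last point concrete, here is a minimal sketch (in Python, with a hypothetical Order model, since the original does not name one) of how keeping a newly added field optional preserves backward compatibility for existing consumers:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical "Order" model already exposed by an existing API endpoint.
@dataclass
class Order:
    order_id: str
    amount: float
    currency: str
    # New feature: loyalty points. Added as an optional field with a default,
    # so payloads produced for (or by) existing consumers remain valid.
    loyalty_points: Optional[int] = None

# An existing consumer's payload, with no knowledge of the new field,
# still deserializes without any change on the consumer side.
legacy_payload = {"order_id": "A-1001", "amount": 250.0, "currency": "USD"}
order = Order(**legacy_payload)
print(order)  # Order(order_id='A-1001', amount=250.0, currency='USD', loyalty_points=None)
```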

Measure and Reduce Toil

A key success factor is to plan new features at the right time. If features are not released when customers are ready (either too early or too late), you will likely lose business. Most digital transformation projects are a race against the timeline. Therefore, it is absolutely necessary to remove as much toil as possible from the release pipeline. Toil is a repetitive activity that requires little engineering skill and can be automated. For example, on Anypoint Platform, common toils include deploying changes to the runtime, promoting APIs to higher environments, restarting APIs, and periodically scanning and measuring infrastructure capacity consumption. To optimize the time-to-market cycle, it is recommended to identify toil continuously, implement automation, and measure the overall time-to-market cycle periodically.

When measuring toil and planning for automation, start with toils that can be automated easily using out-of-the-box Anypoint Platform capabilities. It is not a good strategy to try to automate everything in one go. Prioritize automation based on ROI, effort, complexity, and the maturity of the development team. MuleSoft recommends that companies establish a C4E (Center for Enablement) team to analyze toil and plan automation to reduce the mundane activities of the delivery team. As part of the C4E charter, the most prominent and common toils should be targeted first. Plan to establish a CI/CD pipeline that will help control MuleSoft releases with minimal manual intervention.

Establishing a CI/CD pipeline is an excellent strategy to expedite the overall release cycle through automation. The Mule Maven plugin integrates the packaging and deployment of MuleSoft applications with the Maven lifecycle. Any standard CI/CD build server can be used, such as Jenkins, GitLab, or Azure DevOps, and GitHub, Azure Repos, Bitbucket, or any other standard source code versioning tool can serve as the code repository. A proper branching strategy should be defined so developers can manage code versions across multiple releases. For example, it is common to use a release branch to manage releases, a development branch to deploy MuleSoft implementations to the development environment, and feature branches to maintain individual features while changes are under development.

Once a developer has completed local testing of a feature from Anypoint Studio, they merge the feature branch into the development branch, triggering the CI pipeline. MUnit is plugged into the pipeline to complete unit testing before deployment; if the tests pass and meet a predefined test coverage percentage, the deployment is allowed to continue. Once development testing is complete, the development branch is merged into the release branch. On approval of the pull request, the CD pipeline is activated: the MuleSoft implementation (code) is deployed to the QA environment, and the build package is created and pushed to an artifact repository such as Nexus, JFrog Artifactory, or Azure Artifacts, tagged with the release version. Once the QA team completes testing and approves the move to UAT, the release pipeline deploys the same build package from the artifact repository to the UAT environment. Upon successful UAT testing, the same build package is released to production. MuleSoft supports seamless integration with many third-party tools to make the release pipeline more robust, for example with automated code review or security scanning.

Indicative code branching

Manual review of code quality and standards is often time consuming and elongates the development life cycle. Reviewers or SMEs need to manually review all code from every developer, often creating a bottleneck. Skipping the review process, however, risks rework at a later stage and unnecessary delays to the production rollout. One way to improve this situation is to introduce a template-based development strategy. For every integration pattern, a template project can be created with best practices and standards embedded. Templates are published to Anypoint Exchange for easy discovery by developers and come prefilled with standard exception handling, logging, and other common best practices for the specific integration pattern. Using a Maven archetype, these templates are converted into actual projects so developers can focus on implementing the functional requirements. Reviewers or SMEs can then concentrate on the core functional implementation, since the common standards are already provided by the template framework. This can reduce the overall review effort by 40% to 50% and expedite the overall delivery timeline.

Another approach to reduce review effort is to use a static code analyzer, such as SonarQube. Using configurable rules based on your coding standards, these analyzers can check whether an implementation deviates from the standard, and they can be integrated with the CI/CD pipeline to ensure that quality code is deployed without much manual effort. API Governance is the out-of-the-box solution on Anypoint Platform for applying governance rules to API specifications. It ensures standardization across the organization and easy detection of deviations from the standard API specifications, and it allows you to publish a common ruleset to Anypoint Exchange so that the same standards are reused across the board.

These are just a few ways to introduce automation to expedite the release life cycle. There is no limit to how much toil can be automated away from maintenance and operational responsibilities: workloads need to be analyzed continuously and toil eliminated continuously. The Anypoint CLI (Command Line Interface) and the Anypoint Platform APIs are the out-of-the-box options for automating platform management. In addition to the Anypoint Platform capabilities, MuleSoft has expanded its automation offerings with MuleSoft Composer and MuleSoft RPA (Robotic Process Automation).

Intelligent Monitoring and Distributed Observability

Continuous monitoring of APIs is a key success factor for API adoption. However, APIs cannot realistically be monitored manually, and if outages remain undetected by the API provider, consumers of the API are impacted adversely, reducing API subscriptions down the line. Therefore, an intelligent monitoring scheme is an integral and critical part of overall API management.

We often use the term SLA (Service Level Agreement) to indicate the API performance committed to by the provider. In other words, the SLA is an agreement between the API provider and the API consumer, and it is the provider's responsibility to honor it. To maintain the SLA appropriately, it is important to understand two measurement parameters: the SLI (Service Level Indicator) and the SLO (Service Level Objective). It does not matter whether the API is paid or free: missed SLAs reduce API consumption and, in turn, damage the provider's reputation.

The Service Level Indicator (SLI) is the direct measurement of service performance (and user happiness). An SLI is measured with respect to successful probes of the API, and it is important to decide on SLIs during the design phase. A simple way of measuring an SLI is as follows:

SLI = (Number of Good Events/Number of Valid Events) * 100

For example, if an API is observed to return a 200 OK response 80 times out of 100, then the SLI is 80%. Depending on the SLIs identified (API availability, API response time, etc.), the associated events should be continuously monitored. Often one SLI specification leads to multiple SLI implementations: for example, a valid-response SLI specification might require measuring both the number of HTTP 200 responses the API provides and the number of responses it provides within the accepted time limit. One note of caution: it is advisable to monitor a manageable number of SLIs, since measuring too many increases overhead and can be misleading.
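As an illustration, the SLI formula above can be sketched in a few lines of Python (all counts and thresholds here are assumptions, not real measurements):

```python
def sli(good_events: int, valid_events: int) -> float:
    """SLI = (Number of Good Events / Number of Valid Events) * 100."""
    if valid_events == 0:
        return 100.0  # no valid traffic observed; treat as meeting the objective
    return (good_events / valid_events) * 100

# Example from the text: 80 successful (200 OK) responses out of 100 valid requests.
availability_sli = sli(good_events=80, valid_events=100)
print(f"Availability SLI: {availability_sli:.2f}%")  # 80.00%

# A second SLI implementation for the same specification: responses that
# completed within an assumed latency threshold (e.g. 300 ms).
fast_responses = 92
latency_sli = sli(good_events=fast_responses, valid_events=100)
print(f"Latency SLI: {latency_sli:.2f}%")
```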

The Service Level Objective (SLO) is the internal target that the API provider maintains so that the API never breaks the SLA. The SLO needs to be set tighter than the SLA: if the SLO for API response time is 200ms, the SLA might be published as 300ms. If the API response time rises above 200ms but stays below 300ms, the operations team should start a root cause analysis (RCA) and take corrective measures before the response time exceeds the SLA level. MuleSoft alerts can be set at the SLO thresholds to notify the internal DevOps group if an SLO limit is crossed, and another set of alerts can be set at the SLA level to notify the level-2 DevOps team when the SLA limit is crossed.

SLO vs SLA
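The two-tier alerting described above can be sketched roughly as follows (a simplified Python illustration; the 200ms and 300ms thresholds come from the example, and the notifier functions are hypothetical stand-ins for real alert channels such as Anypoint Monitoring alerts, email, or Slack):

```python
SLO_RESPONSE_MS = 200  # internal objective
SLA_RESPONSE_MS = 300  # published commitment

def notify_internal_devops(message: str) -> None:
    print(f"[L1 DevOps] {message}")   # placeholder for the internal alert channel

def notify_level2_devops(message: str) -> None:
    print(f"[L2 DevOps] {message}")   # placeholder for the escalation channel

def evaluate_response_time(observed_ms: float) -> None:
    if observed_ms > SLA_RESPONSE_MS:
        notify_level2_devops(f"SLA breached: {observed_ms} ms > {SLA_RESPONSE_MS} ms")
    elif observed_ms > SLO_RESPONSE_MS:
        notify_internal_devops(
            f"SLO breached (SLA still intact): {observed_ms} ms > {SLO_RESPONSE_MS} ms; start RCA"
        )

evaluate_response_time(245)  # triggers only the internal SLO alert
evaluate_response_time(320)  # triggers the SLA-level alert
```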

The error budget (as discussed earlier) is another key concept in SRE. An API performing within its error budget is considered reliable. The error budget can be defined as below:

%Error Budget = 100 − %SLO

Error budgets are defined for a fixed time window, such as a business month or quarter. Consider an API whose availability SLO is set at 99.9%. The API has a scheduled outage on the 3rd weekend of every month, so the business month is effectively 28 days (30 days minus 2 days of planned outage). As per the error budget (100 − 99.9 = 0.1%), the API cannot be down for more than roughly 40 minutes (0.1% of 28 days) in a business month. The error budget must be agreed upon by all stakeholders, and the SLO should be a reasonable target for the engineering team. If planned outages count against the overall error budget, it is of utmost importance that the outage window is used to release new features or to perform work that contributes to the reliability of the API. SLOs and error budgets need to be measured continuously for the API platform. Before a planned release, the error budget consumption should be presented to the release manager and other stakeholders; one key criterion in the go/no-go call for a release is how much of the error budget has been consumed. As part of good SRE practice, periodically analyze the SLIs and SLOs to confirm they are still relevant in the current context.
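To make the arithmetic concrete, here is a small Python sketch of the error budget and its consumption for the example above (the observed downtime figure is purely hypothetical):

```python
HOURS_PER_DAY = 24

slo_availability = 99.9                      # agreed SLO, in percent
business_month_days = 30 - 2                 # 2 days of planned outage excluded
error_budget_pct = 100 - slo_availability    # 0.1%

budget_minutes = business_month_days * HOURS_PER_DAY * 60 * (error_budget_pct / 100)
print(f"Error budget: {budget_minutes:.1f} minutes per business month")  # ~40.3 minutes

# Release-gate check: compare observed downtime against the budget.
observed_downtime_minutes = 25.0             # hypothetical measurement
consumption = observed_downtime_minutes / budget_minutes * 100
print(f"Error budget consumed: {consumption:.0f}%")
print(f"Go for release: {consumption < 100}")
```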

Monitoring and end-to-end observability are also important for a successful API management policy. The four golden signals of monitoring are:

  1. Latency: the time taken to respond to the consumer
  2. Traffic: the number of requests per second
  3. Errors: the rate of failed requests
  4. Saturation: the rate of capacity consumption

MuleSoft's Anypoint Monitoring provides out-of-the-box dashboards for APIs with graphical representations of the measurements for the signals mentioned above. Both the development and operations teams can monitor these signals and balance reliability work against new releases. Anypoint Platform also has out-of-the-box capabilities to set up alerts and notifications in case any of the four signals passes a set threshold during a specific period of time. For example, high CPU utilization and high memory usage relate to saturation and can trigger alerts; similarly, message response time alerts (latency), message count alerts (traffic), and message error count alerts (errors) are supported. Alerts can be set at various levels, such as the application, server, or individual API level, and the available metrics depend on the source type. Severity can be preconfigured so that the operations team can prioritize alert responses and avoid false alarms.

With the Titanium subscription, MuleSoft also supports setting up custom dashboards and alerts if the out-of-the-box alerts are not sufficient. For example, a notification to the operations team can be configured to fire when a preconfigured threshold is crossed. Email notifications can be configured with an appropriate severity, a meaningful message, and links to the relevant dashboard, so the operations team can click through from the body of the email and immediately start root cause analysis. MuleSoft's out-of-the-box Slack Connector can also be used to improve notifications to the operations team during an incident; Slack enables better collaboration with meaningful dashboards and smart communication channels. Using the connector, a common notification API can be designed to capture error or incident events across the platform and push them to Slack. The Slack Connector also supports two-way communication, so MuleSoft can also listen to events from the Slack platform. A common framework can be designed to generate notifications on operational KPI changes, service disruptions, or error events.
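For illustration only, the shape of such an incident notification can be sketched with a plain Slack incoming webhook; this is a generic webhook call in Python, not the MuleSoft Slack Connector itself, and the webhook URL and message fields are placeholders:

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_incident(api_name: str, severity: str, dashboard_url: str) -> None:
    # Slack incoming webhooks accept a simple JSON payload with a "text" field.
    payload = {
        "text": (
            f":rotating_light: {severity.upper()} incident on {api_name}. "
            f"Dashboard: {dashboard_url}"
        )
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()

# Example (requires a real webhook URL):
# notify_incident("orders-exp-api", "critical", "https://example.com/dashboard")
```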

Anypoint Monitoring aggregates logs so that you can manage, search, filter, and analyze them. Log aggregation, in conjunction with monitoring, further helps the operations team quickly identify the root cause of each issue. Anypoint Monitoring provides a log search capability with log search queries, UI-based filtering, and query creation to easily find log messages of interest. Additionally, MuleSoft supports seamless integration with third-party log aggregators and APM tools, such as Splunk, ELK, and New Relic; third-party application performance monitoring enables distributed observability. Some design considerations are needed to tie transactions together (as they flow from source to target applications via the MuleSoft layers) and to create indexes that support log search queries. Adding custom HTTP headers to APIs for observability is a common practice: for example, you can include headers like x-correlation-id and x-client-name for better observability and traceability. Properly defining these custom headers in the API documentation (on Exchange or ACM), and sharing a mock service that includes them with end consumers, helps consumers quickly understand how to use the headers to enhance end-to-end observability of transactions.

Distributed Observability on Hybrid deployment architecture
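As a simple illustration of the custom headers mentioned above, a consumer call might look like the following (a Python sketch; the endpoint URL, client name, and header names are placeholders to be aligned with whatever the API documents on Exchange):

```python
import uuid
import urllib.request

API_URL = "https://api.example.com/orders/A-1001"  # placeholder endpoint

def call_with_tracing_headers(client_name: str) -> bytes:
    correlation_id = str(uuid.uuid4())  # unique id that ties logs together end to end
    request = urllib.request.Request(
        API_URL,
        headers={
            "x-correlation-id": correlation_id,
            "x-client-name": client_name,  # identifies the calling application
        },
    )
    # Downstream Mule flows can log and forward the same correlation id,
    # so the transaction can be traced across every layer.
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()

# Example (requires a real endpoint):
# body = call_with_tracing_headers("mobile-storefront")
```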

MuleSoft professional services have created a JSON logger as a readily available logging component that enhances logging capability without adding effort for developers. The JSON logger can publish log messages to a destination queue, making it very easy to apply custom analytics to the log messages. It also enriches log entries with information such as the application name, flow name, severity, trace point, and elapsed time. This can empower the operations team to shorten the turnaround time when an incident is declared.

High-level logging framework (JSON Logger at the core) to aggregate logs into a data lake
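For context, the kind of enriched, structured entry the JSON logger emits can be approximated as follows (a Python sketch; the field names are illustrative and the real component's output may differ):

```python
import json
import time
import uuid

def build_log_entry(app: str, flow: str, trace_point: str,
                    severity: str, started_at: float) -> str:
    # Assemble a structured log entry with the enrichment fields described above.
    entry = {
        "correlationId": str(uuid.uuid4()),
        "applicationName": app,
        "flowName": flow,
        "tracePoint": trace_point,          # e.g. START, BEFORE_REQUEST, END
        "priority": severity,
        "elapsedMs": round((time.time() - started_at) * 1000, 2),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    return json.dumps(entry)

start = time.time()
# ... flow logic would run here ...
message = build_log_entry("orders-sys-api", "get-order-flow", "END", "INFO", start)
print(message)
# In the framework sketched above, this message would be published to a
# destination queue and aggregated into the data lake for custom analytics.
```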

The log points feature in Anypoint Monitoring also supports real-time log generation for MuleSoft applications and APIs, without any additional coding or re-deployment. MuleSoft publishes a list of connectors that support this feature (note: it is only available with the Titanium subscription). This capability helps the operations team debug issues further without restarting the API. Enabling it consumes additional capacity, so it is recommended to use it judiciously.

Proactively monitoring API health, and becoming aware of any outage before the end customer does, is essential for the API provider to stay ahead of the competition. A health check endpoint can be added to an API at implementation time; these endpoints can be used to ping the API and obtain its health status as the response. This can be automated through a scheduled script (for example, using the curl command) that sends a notification if the API health is not OK. API health check endpoints need to be implemented explicitly. They can also be used by the operations team to ping and validate the correctness of an API deployment as part of post-deployment validation in production.

Sample Health check flow
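A minimal client-side probe for such a health endpoint could look like this (a Python sketch of the scheduled-script idea described above; the URL is a placeholder and the notification step is a stub for a real email, Slack, or pager integration):

```python
import urllib.request

HEALTH_URL = "https://api.example.com/orders/v1/health"  # placeholder

def check_health() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
            return response.status == 200
    except OSError:  # covers URLError, connection failures, and timeouts
        return False

def notify_operations(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real notification channel

# Intended to run on a schedule (cron, CI job, etc.).
if not check_health():
    notify_operations(f"Health check failed for {HEALTH_URL}")
```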

Anypoint Monitoring also has an out-of-the-box API functional monitoring feature that helps assure the reliability of MuleSoft APIs. Critical endpoints running in a VPC, or exposed to the public network, can be scheduled for periodic monitoring and can generate notifications if the expected response is not received. API functional monitoring supports notifications via Slack, Sumo Logic, PagerDuty, and email.

Performance or load testing is one of the key test strategies to adopt to verify API behavior under workload. Apache JMeter, an open-source application, is widely used for MuleSoft performance testing and presents load test results in multiple views, including the result tree view, summary report, and graph results. To mimic production-like load on the API, the number of threads and the ramp-up period can be configured and the performance results observed. In conjunction, CPU utilization, heap utilization, garbage collection, and many other parameters can be analyzed. Other third-party test automation tools, such as Selenium, can also be used for continuous functional testing. This helps you understand resiliency during request bursts and plan the infrastructure capacity needed to back the API.

Apache JMeter — Summary Report View
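For intuition only, the thread-count and ramp-up concepts can be sketched in a few lines of Python; this is not a replacement for JMeter, and the target URL and numbers are placeholders:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

TARGET_URL = "https://api.example.com/orders/v1/orders"  # placeholder
THREADS = 20          # concurrent virtual users
RAMP_UP_SECONDS = 10  # spread thread start-up over this period

def fire_request(delay: float) -> Optional[float]:
    time.sleep(delay)  # stagger thread start to simulate ramp-up
    start = time.time()
    try:
        with urllib.request.urlopen(TARGET_URL, timeout=30) as response:
            response.read()
        return (time.time() - start) * 1000  # latency in milliseconds
    except OSError:
        return None  # count as a failed request

with ThreadPoolExecutor(max_workers=THREADS) as pool:
    delays = [i * (RAMP_UP_SECONDS / THREADS) for i in range(THREADS)]
    results = list(pool.map(fire_request, delays))

ok = [latency for latency in results if latency is not None]
print(f"Successful: {len(ok)}/{len(results)}")
if ok:
    print(f"Average latency: {sum(ok) / len(ok):.1f} ms")
```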

MuleSoft has many other out-of-the-box capabilities that provide insight into API performance. Anypoint Runtime Manager provides dashboards for all deployed APIs with many KPIs and graphs related to performance, failures, the JVM, and infrastructure. Runtime Manager also provides a troubleshooting tool called Insight for in-depth visibility into the business transactions that an API carries. Another important out-of-the-box offering is Anypoint Visualizer, which empowers the operations team with a topology view of the API network, transaction flows for easy troubleshooting, and an easy way to spot gaps in API policy implementation.

Finally, monitoring and observability is a process that should be periodically reviewed and updated as requirements change. As the API network grows and the API management platform architecture evolves, the monitoring approach will need further adjustment. Robust initial planning that involves the C4E early on, however, often results in a flexible monitoring and observability strategy that can adapt in the future, without disruption, if the situation demands.

Effective Incident Management

Effectively managing incidents is also key in SRE culture and is important when managing APIs. The impact of an incident needs to be neutralized as soon as possible: a patch or a quick rollback of the latest release can be implemented to mitigate it. Once the incident is mitigated, a permanent solution should be developed and a detailed RCA completed before the incident is declared over. Effective and periodic communication to all stakeholders is also required. Once the incident is over, the team should focus on a blameless postmortem. The report should focus on the process or system gaps that caused the incident and should not point fingers at any team or individual. Postmortem reports should also contain detailed steps on how to prevent similar incidents in the future, and those preventive steps should be included in the release checklist. The postmortem report should also be published on the organization's portal or in a shared location accessible to all stakeholders.

Conclusion

Site Reliability Engineering (SRE) is a cornerstone of an API-as-a-product culture. Adopting SRE culture early in the MuleSoft API implementation phase can make the overall API offering robust, scalable, and resilient. Establishing a Center for Enablement (C4E) and including SRE in its charter to evangelize the concept across the board is a key strategy for building a successful API economy. The goal of this blog is to share how to use out-of-the-box Anypoint Platform capabilities together with a robust SRE culture to create and publish reliable APIs at scale. I hope this blog helps you implement MuleSoft APIs with a product philosophy at the core.


