Setting up Toll SDK Dashboard Metrics

By MapUp Team, published in TollGuru. 10 min read, Oct 13, 2023.
Set up a dashboard to monitor the TollGuru SDK

Use this guide to set up dashboards that monitor the performance and overall health of the TollGuru SDK and its associated endpoints.

The TollGuru SDK stores logs in AWS CloudWatch. If you are unfamiliar with CloudWatch, you can learn more about CloudWatch dashboards from the official Creating a CloudWatch dashboard resource. You can also learn how to create CloudWatch widgets in the Guide to Create CloudWatch Widgets from Metrics section of this document.

Our analysis dashboards can be divided into three sections:

  • Performance Metrics
  • Reliability Metrics
  • Data Updation Metrics

Before moving forward, note that the dashboards use metrics from the standard AWS namespaces as well as custom metrics published under the tollguru-sdk namespace.

Custom namespace — tollguru-sdk

Let’s start with setting up our analysis dashboards. In the first section, we will create widgets for performance metrics.

Performance Metrics

Performance metrics are quantitative measures of how well a system or process is functioning. We monitor the following metrics in our performance dashboard:

  1. Latency (ELB metric)
  2. Traffic (ELB metric)
  3. Calculation Times (custom metric)
  4. Transaction Count (custom metric)

Latency

Latency measures the total time between the initiation of the request and the response received at the source of the request. It includes latency in the network, processing, queuing, etc. MapUp uses p95 and p50 as measures for latency.

  • p95 response time is the time within which 95% of requests complete; only the slowest 5% take longer.
  • p50 response time is the median response time, which means that 50% of requests will take longer and 50% of requests will take less time.
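To make the two definitions concrete, here is a minimal Python sketch that computes p50 and p95 from a list of response times using the nearest-rank method. The sample values are hypothetical, and CloudWatch may compute percentiles with a different interpolation rule, so treat this as an illustration of the idea rather than a reproduction of the dashboard numbers.

```python
import math


def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds for 20 requests.
latencies_ms = [120, 95, 110, 105, 130, 98, 101, 99, 115, 108,
                102, 97, 125, 111, 104, 96, 109, 900, 103, 107]

p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
print(f"p50={p50} ms, p95={p95} ms")
```

Note how the single 900 ms outlier does not affect p50 or p95 here, which is one reason percentiles are often preferred over averages for latency.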

Latency metrics can be directly found in the CloudWatch metrics for ELB. You may select the exact metric by searching for the name of your load balancer along with the Latency metric name.

Calculation Times

Calculation time is the amount of time it takes our API to complete a request. We monitor the p95 and p50 response times.

To get calculation time widgets in the dashboard, we’ve published a custom Performance metric called CalculationTime from our code. The custom metric offers real-time insights into calculation times.

Traffic

Traffic is the number of API calls made per minute (requests/minute). Like latency, this metric is available directly from the ELB metrics.

Here is a line plot for the Request Count widget:

You can go a step further and break these requests down by plotting other ELB metrics such as HTTPCode_Backend_2XX, HTTPCode_Backend_4XX, HTTPCode_Backend_5XX, and HTTPCode_ELB_5XX alongside Request Count. The resulting widget shows all of these series on one graph.
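If you prefer to define the widget as dashboard source instead of clicking through the console, the sketch below builds the widget as a Python dictionary and serializes it to dashboard body JSON. It assumes a Classic Load Balancer (namespace AWS/ELB with the LoadBalancerName dimension); the load balancer name, region, and title are placeholders, not values from our setup.

```python
import json


def request_breakdown_widget(load_balancer_name, region):
    """Build a dashboard widget dict plotting RequestCount with HTTP code metrics."""
    metric_names = [
        "RequestCount",
        "HTTPCode_Backend_2XX",
        "HTTPCode_Backend_4XX",
        "HTTPCode_Backend_5XX",
        "HTTPCode_ELB_5XX",
    ]
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "view": "timeSeries",
            "region": region,
            "stat": "Sum",   # sum per period gives a count
            "period": 60,    # one-minute buckets, i.e. requests/minute
            "title": "Requests per minute by response class",
            "metrics": [
                ["AWS/ELB", name, "LoadBalancerName", load_balancer_name]
                for name in metric_names
            ],
        },
    }


# Placeholder load balancer name and region (assumptions).
widget = request_breakdown_widget("my-load-balancer", "us-east-1")
dashboard_body = json.dumps({"widgets": [widget]}, indent=2)
print(dashboard_body)
```

The printed JSON can be pasted into the dashboard source editor in the CloudWatch console and adjusted from there.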

Transaction Count

Each request has a certain number of transactions (tx) associated with it. This number indicates how much calculation is being done in the backend for that request. Users are billed based on the number of transactions each month.

The transaction metric can be found under the tollguru-sdk namespace.

Reliability Metrics

Reliability metrics help us evaluate the stability and trustworthiness of the system. They measure the consistency of our services and are crucial for running operations smoothly. We analyze Errors and Warnings as a part of reliability metrics. Details of Errors and Warnings can be found in the TollGuru SDK Runbook for SDK Issues document.

Errors

Errors represent the requests that fail to process successfully. By monitoring errors, you can identify potential reliability issues and take steps to prevent them from impacting the performance and availability of your systems. As also described in the SDK Runbook, the SDK reports the following error types:

  • INPUT_ERROR: Errors caused by invalid input provided to the SDK.
  • ROUTING_ERROR: Errors that arise during route creation, typically indicating difficulty mapping the provided GPS or polyline data to a complete and accurate route.
  • SERVICE_ERROR: Errors indicating that the server is not working as expected.
  • TOLLING_ERROR: Errors that arise when the system cannot calculate toll fees for the specified route because of issues with the input route.
  • 504 GATEWAY_TIMEOUT: Errors that arise when either the pod crashes or there is an unhandled input error.

Apart from the error metrics, you can run a CloudWatch Logs Insights query to extract more information about your errors. Here is a sample query that counts errors by status, code, and value.

fields @timestamp, @message
| filter @message like /ERROR/
| parse @message 'Status: *, Code: *, Value: *"}' as Status, Code, Value
| stats count() as count by Status, Code, Value
| display count, Status, Code, Value
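To show what this query computes, here is a rough Python equivalent that applies the same filter, parse, and count steps to a list of messages. The sample log lines are hypothetical and only shaped to fit the parse pattern; the real SDK log layout and status values may differ.

```python
import re
from collections import Counter

# Mirrors the parse pattern 'Status: *, Code: *, Value: *"}' from the query.
ERROR_PATTERN = re.compile(r'Status: (.*?), Code: (.*?), Value: (.*?)"\}')


def count_errors(messages):
    """Count ERROR messages grouped by (Status, Code, Value)."""
    counts = Counter()
    for message in messages:
        if "ERROR" not in message:   # filter @message like /ERROR/
            continue
        match = ERROR_PATTERN.search(message)
        if match:
            counts[match.groups()] += 1
    return counts


# Hypothetical log lines, invented for illustration.
sample_messages = [
    '{"level": "ERROR", "msg": "Status: 400, Code: INPUT_ERROR, Value: invalid polyline"}',
    '{"level": "ERROR", "msg": "Status: 400, Code: INPUT_ERROR, Value: invalid polyline"}',
    '{"level": "ERROR", "msg": "Status: 500, Code: SERVICE_ERROR, Value: upstream unavailable"}',
    '{"level": "INFO", "msg": "request completed"}',
]

error_counts = count_errors(sample_messages)
for (status, code, value), count in error_counts.most_common():
    print(count, status, code, value)
```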

Error Rate

Error rate can be measured separately for each error metric. Refer to Troubleshooting 101: A Comprehensive Guide to Addressing Toll SDK Issues for error rate thresholds covering service errors, tolling errors, and gateway timeouts. To set up a widget that tracks error percentages and to establish alarms for your error metrics, you can follow the instructions in this blog post, which walks through creating a metric math alarm using Amazon CloudWatch.
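As a sketch of the calculation behind an error rate widget, the snippet below shows the percentage formula in plain Python and the corresponding metric math entry for a widget's metrics array. The custom metric name passed in, its lack of dimensions, and the load balancer name are assumptions for illustration; use the names you actually see under the tollguru-sdk namespace.

```python
def error_rate_percent(error_count, request_count):
    """Return errors as a percentage of requests; 0.0 when there was no traffic."""
    if request_count == 0:
        return 0.0
    return 100.0 * error_count / request_count


def error_rate_widget_metrics(error_metric, load_balancer_name):
    """Build the 'metrics' array of a widget that plots an error percentage."""
    return [
        # Metric math expression referencing the two hidden series below by id.
        [{"expression": "100 * (errors / requests)",
          "label": "Error rate (%)", "id": "rate"}],
        # Assumed custom metric name; dimensions are omitted here.
        ["tollguru-sdk", error_metric,
         {"id": "errors", "stat": "Sum", "visible": False}],
        ["AWS/ELB", "RequestCount", "LoadBalancerName", load_balancer_name,
         {"id": "requests", "stat": "Sum", "visible": False}],
    ]


print(error_rate_percent(5, 200))
metrics = error_rate_widget_metrics("SERVICE_ERROR", "my-load-balancer")
```

Hiding the two input series keeps only the computed percentage on the graph, which is usually what you want to alarm on.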

Warnings

Warnings are messages that indicate potential issues or considerations related to the data or request sent to the SDK.

The TollGuru SDK generates the following warnings:

  • locTimes_error
  • points_straight_line
  • duplicate_locTimes
  • missing_id_from_pair
  • missing_old_toll_data

To create a widget for the count of each warning in the dashboard, we’ve published a separate custom metric from our code for each of the warnings above.

Alerts and notifications

504 GATEWAY_TIMEOUT errors are critical in nature. We recommend setting up an alarm for this error type. Additionally, you should configure alarms for high-priority errors such as ROUTING_ERROR and SERVICE_ERROR. Refer to Troubleshooting 101: A Comprehensive Guide to Addressing Toll SDK Issues for guidance on thresholds and SLA.

Warnings may not require alarms or email alerts, but we advise watching for a high warning count, since it can point to problems with your input format.

Data Updation Metrics

MapUp logs data update information on the backend. These logs are also part of the CloudWatch log stream for the TollGuru SDK. We use them to track the time taken to update toll data in different regions. These logs are for continent-size data. Because the countries of interest to the Client in each region cover less than the full dataset, data update delays are expected to be lower for the TollGuru SDK available to the Client.

When data updates are processed, API latencies may temporarily increase due to decryption, decompression, and parsing tasks. This might cause latency for each SDK instance to rise by about 40 to 45 percent during these updates. Even so, our experience shows that the processing time for the largest updates on an EC2 c6a.4xlarge instance typically falls within about 20 seconds. We see this brief latency increase as a part of ensuring our data remains secure and up-to-date.
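As a quick worked example of the 40 to 45 percent figure, the sketch below computes the latency range you might expect during an update from a baseline value. The 200 ms baseline is a hypothetical number chosen for illustration, not a measured one.

```python
def latency_range_during_update(baseline_ms, low_pct=40, high_pct=45):
    """Return (low, high) expected latency in ms while a data update is processed."""
    low = baseline_ms * (100 + low_pct) / 100
    high = baseline_ms * (100 + high_pct) / 100
    return low, high


# Hypothetical baseline latency of 200 ms.
low, high = latency_range_during_update(200)
print(f"Expected latency during update: {low:.0f} to {high:.0f} ms")
```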

When an increase in latency impacts SDK instances to the point where the load on an EC2 instance rises above 50%, Auto Scaling is configured to automatically add more instances. This manages the extra load, allowing us to accommodate more requests and maintain a consistent level of service.

To monitor such updates, we provide custom metrics such as the count of tolldatazones being updated and the amount of data being downloaded. These metrics have two subcategories, pre-startup and post-startup: pre-startup covers data updates before server startup, and post-startup covers routine data updates. Below are the graphs for the discussed metrics.

Data Updation Time

MapUp uses AWS CloudWatch to track the cadence, delay, and volume of data downloads. The CloudWatch records contain the timestamps and toll data update size, which are then aggregated to determine the total amount of data downloaded per interval. The timestamps of these logs reveal information about the cadence or frequency of updates.

  • Cadence: US express lane tolls are updated at least every 5 minutes, indicating a high update frequency.
  • Delay: Updates are propagated to all SDK instances within 2 minutes.
  • Amount of data downloaded: Toll data is highly compressed and encrypted, reducing the volume of data users need to download during updates. Updates for US express lane tolls are up to 256 KB, while updates for Europe are around 20 MB. This does not include map data updates (for origin-destination endpoints), which can be much larger.
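The aggregation described above can be sketched in a few lines of Python: sum update sizes into fixed time buckets to get the volume per interval, and look at the gaps between timestamps to check the cadence. The (timestamp, size) record shape and the sample values are assumptions for illustration; the bucketing assumes an interval that divides 60 minutes evenly.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bytes_per_interval(records, interval_minutes=5):
    """Sum update sizes into fixed-width time buckets keyed by bucket start."""
    totals = defaultdict(int)
    for timestamp, size_bytes in records:
        minute = timestamp.minute - timestamp.minute % interval_minutes
        bucket = timestamp.replace(minute=minute, second=0, microsecond=0)
        totals[bucket] += size_bytes
    return dict(totals)


def max_gap(records):
    """Return the longest gap between consecutive updates."""
    times = sorted(timestamp for timestamp, _ in records)
    return max((b - a for a, b in zip(times, times[1:])), default=timedelta(0))


# Hypothetical (timestamp, size in bytes) records.
records = [
    (datetime(2023, 10, 13, 12, 1, 10), 120_000),
    (datetime(2023, 10, 13, 12, 4, 30), 80_000),
    (datetime(2023, 10, 13, 12, 8, 50), 200_000),
]

totals = bytes_per_interval(records)
longest_gap = max_gap(records)
print(totals)
print("Longest gap between updates:", longest_gap)
```

A longest gap under 5 minutes in this kind of check is consistent with the cadence stated above.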

The following query extracts timestamps from log entries related to the completion of processing for “tolldatazone,” parses the processing time, and computes the average processing time for each instance of “tolldatazone.”
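As a rough Python analog of the steps this paragraph describes (filter completion logs, parse the processing time, average per tolldatazone), consider the sketch below. The log line format and zone names are invented for illustration and are not the SDK's actual message layout; substitute the pattern that matches your log stream.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical message layout: "... tolldatazone <zone> processed in <seconds>s"
COMPLETION_PATTERN = re.compile(
    r"tolldatazone (?P<zone>\S+) processed in (?P<seconds>[\d.]+)s"
)


def average_processing_time(lines):
    """Return the average processing time in seconds for each tolldatazone."""
    durations = defaultdict(list)
    for line in lines:
        match = COMPLETION_PATTERN.search(line)
        if match:
            durations[match["zone"]].append(float(match["seconds"]))
    return {zone: mean(values) for zone, values in durations.items()}


# Invented sample lines (assumption, not real SDK output).
sample_lines = [
    "2023-10-13T12:00:05Z tolldatazone north-america processed in 4.0s",
    "2023-10-13T12:05:07Z tolldatazone north-america processed in 6.0s",
    "2023-10-13T12:05:20Z tolldatazone europe processed in 12.5s",
    "2023-10-13T12:05:21Z unrelated message",
]

averages = average_processing_time(sample_lines)
print(averages)
```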

Region-wise data updation bar graph

(Same query as above, displayed as a log table instead of a bar graph.)

Region-wise data updation log table

Dashboard Overview

We keep many of these metrics in one dashboard. It allows our customers and us to measure the overall health of the SDK endpoints.

Are there any other metrics you want to see in the dashboard? Email api@tollguru.com with your request and the reason those metrics matter to you.

Guide to Create CloudWatch Widgets from Metrics

Let’s go through each step to create a widget in the CloudWatch dashboard.

Step 1: Click the Add Widget option in your CloudWatch dashboard.

Step 2: Select Line as the widget type and Metric as the data source.

Step 3: Under Custom namespaces, you will find the tollguru-sdk namespace.

Here you can find all the custom metrics that are currently published.

Step 4: Click on Performance and then select CalculationTime.

Under the Graphed metrics tab, you can see that by default we use Average as our statistic.

In this case, we need the p99, p95, and p50 statistics. To get them on the same widget, you can duplicate the metric by clicking the copy icon under the Actions column.

Step 5: Create one more copy so that you have three instances of the metric, then change their Statistic to p99, p95, and p50 respectively. You can also change each Label to something meaningful.
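For readers who prefer dashboard source over the console, the steps so far roughly correspond to the widget definition sketched below. The dimension layout of CalculationTime is not documented here, so dimensions is left as a parameter you would fill in with the name and value pairs shown in your console; the region and title are placeholders.

```python
import json


def calculation_time_widget(region, dimensions=()):
    """Build a line widget plotting CalculationTime at p99, p95 and p50.

    dimensions: flat sequence of alternating dimension names and values
    (an assumption; check the metric's actual dimensions in the console).
    """
    stats = ["p99", "p95", "p50"]
    metrics = [
        ["tollguru-sdk", "CalculationTime", *dimensions,
         {"stat": stat, "label": f"CalculationTime {stat}"}]
        for stat in stats
    ]
    return {
        "type": "metric",
        "width": 12,
        "height": 6,
        "properties": {
            "view": "timeSeries",
            "region": region,
            "period": 60,
            "title": "Calculation time",
            "metrics": metrics,
        },
    }


# Placeholder region (assumption).
calc_widget = calculation_time_widget("us-east-1")
print(json.dumps({"widgets": [calc_widget]}, indent=2))
```

Each of the three entries is the same metric with a different statistic, which mirrors duplicating the metric row in the Graphed metrics tab.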

Step 6: Click Create widget at the bottom right corner. You will end up with a similar-looking widget on your dashboard.

Widgets for other metrics can be created using similar steps.

Step 7: After creating your widget, you can add alarms to it. Hover over the metric on the graph and click the bell icon to add an alarm.

Step 8: A new page opens. Specify your conditions and click Next.

Step 9: Add your email address, create a topic, and click Next.

Step 10: Add an alarm name and description, click Next, and create your alarm.
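The same alarm can also be described as parameters for the CloudWatch PutMetricAlarm API. The sketch below only builds the parameter dictionary so that it runs without AWS credentials; the metric name, alarm name, threshold, and SNS topic ARN are all placeholders we made up for illustration, so replace them with your own values and the thresholds from the troubleshooting guide.

```python
def gateway_timeout_alarm_kwargs(sns_topic_arn, threshold,
                                 metric_name="GATEWAY_TIMEOUT"):
    """Build keyword arguments for a CloudWatch metric alarm.

    metric_name is an assumed custom metric name under tollguru-sdk.
    """
    return {
        "AlarmName": "tollguru-sdk-gateway-timeout",
        "AlarmDescription": "504 gateway timeouts reported by the TollGuru SDK",
        "Namespace": "tollguru-sdk",
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 60,                      # evaluate one-minute sums
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no data means no errors
        "AlarmActions": [sns_topic_arn],     # SNS topic that emails you
    }


# Placeholder ARN and threshold (assumptions).
alarm_kwargs = gateway_timeout_alarm_kwargs(
    "arn:aws:sns:us-east-1:123456789012:tollguru-alerts", threshold=1
)
print(alarm_kwargs)

# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
```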
