How we found application bottlenecks with AWS X-Ray

Published in

affinityanswers-tech

4 min read · Jul 19, 2022

When our enterprise application OnTrack came of age, we had some idea of the load it would need to handle to give our enterprise users an optimal experience. It was time for load testing.

Load testing means putting demand on a system and recording how it responds. We found a nice tool for this called Locust (maybe a topic for the next blog?). It turned out to be very useful for seeing how our application behaves when many users hit it simultaneously. When we ran the load test with Locust, the application handled the (estimated) number of concurrent users gracefully, but the API latency surprised us a bit; there was room for improvement. The challenge was that we didn’t know where to start! An API is built from many components (in our case MongoDB, an API server on Gunicorn, MySQL and a few other microservices), and Locust records the latency of the API as a whole, not the time taken by each component. So we did not know what to fix.
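To make the problem concrete, here is a minimal sketch of what we were missing: per-component timings alongside the end-to-end number. It uses only the Python standard library; the functions query_mongo and query_mysql are hypothetical stand-ins that just sleep, not our real code, and the timed helper is something we made up for illustration.

```python
import time
from contextlib import contextmanager

timings = {}  # component name -> elapsed seconds


@contextmanager
def timed(name):
    """Record how long the wrapped block takes under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[name] = timings.get(name, 0.0) + elapsed


def query_mongo():
    # Hypothetical stand-in for a slow MongoDB call.
    time.sleep(0.05)


def query_mysql():
    # Hypothetical stand-in for a fast MySQL call.
    time.sleep(0.005)


def handle_request():
    # "total" is all a load-testing tool sees; the inner timers show the breakdown.
    with timed("total"):
        with timed("mongodb"):
            query_mongo()
        with timed("mysql"):
            query_mysql()


handle_request()
slowest = max((name for name in timings if name != "total"), key=timings.get)
print("slowest component:", slowest)
print(timings)
```

Hand-rolled timers like this work for one process, but they do not follow a request across services, which is the gap a tracing tool fills.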

That is when we found this AWS service called X-Ray which totally solved our problem.

What is AWS X-Ray?

AWS X-Ray collects information about the requests your application serves, and the downstream calls it makes, and provides a nice tool to view, filter and gain insight into that data.
It also helps identify issues in your application and gives you an opportunity to optimise it.

The data it can record includes:

The host — host-name, alias or IP address
The request — method, client address, path, user agent
The response — status, content
The work done — start and end times, sub-segments
Issues that occur — errors, faults and exceptions, including automatic capture of exception stacks.
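As a rough illustration of how these pieces fit together, the sketch below builds a segment document by hand as a Python dictionary. The field names follow the X-Ray segment document format as we understand it; every value (service name, URL, IP address, timings) is made up for the example, and in practice the SDK generates all of this for you.

```python
import json
import os
import time


def new_trace_id(now):
    # Trace ID layout: version "1", 8 hex digits of epoch seconds, 24 random hex digits.
    return "1-{:08x}-{}".format(int(now), os.urandom(12).hex())


start = time.time()
segment = {
    "name": "OnTrack-Dev",              # service name (illustrative)
    "id": os.urandom(8).hex(),          # 16 hex digit segment ID
    "trace_id": new_trace_id(start),
    "start_time": start,                # the work done: start and end times
    "end_time": start + 0.250,
    "http": {
        "request": {                    # the request
            "method": "GET",
            "url": "https://example.com/api/items",
            "client_ip": "203.0.113.10",
            "user_agent": "locust",
        },
        "response": {"status": 200, "content_length": 512},  # the response
    },
    "fault": False,                     # issues that occur
    "subsegments": [                    # one downstream call inside the request
        {
            "name": "mongodb",
            "id": os.urandom(8).hex(),
            "namespace": "remote",
            "start_time": start + 0.010,
            "end_time": start + 0.200,
        }
    ],
}

print(json.dumps(segment, indent=2))
```

In this made-up example the mongodb subsegment accounts for most of the 250 ms, which is exactly the kind of breakdown we were after.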

How does it work?

Basically, X-Ray works as a wrapper around your application code and collects basic information about each request. We use Python, and AWS X-Ray has an SDK for Python. Suppose you use PyMongo in one of your components: X-Ray has a listener that traces all the PyMongo database commands and records the information PyMongo makes available. The X-Ray console presents the collected data nicely, as traces and as a service map.

Our application is written in Flask, and we used the X-Ray SDK for Python to generate trace data and send it to the X-Ray daemon.

Step 1:

pip install aws-xray-sdk

Step 2:

from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all, patch

xray_recorder.configure(service='My app')
patch_all()

After installing aws-xray-sdk, we import it in the code and give the service a name (e.g. ‘Ontrack-Dev’). The patch_all() function instruments every library that the X-Ray SDK supports, so that calls made through them are recorded. You can also instrument specific libraries only, in which case the code changes a bit.

from aws_xray_sdk.core import xray_recorder
from aws_xray_sdk.core import patch_all, patch

xray_recorder.configure(service='')
libraries = (['botocore'])
patch(libraries)

To start recording, we add a few lines of code just before and after the code we want to trace. And that’s how easy it is!

subsegment = xray_recorder.begin_subsegment(<name of the component>, 'remote')
subsegment.put_annotation('function_name', function)
          <your code goes here>
xray_recorder.end_subsegment()

Service Map by AWS X-Ray (representational, from one of our stages of optimisation)

The service map (like the one shown above) helped us determine where the bottleneck was. For example, one of the (simple) bottlenecks was MongoDB being hosted in a different region, which was very easy to see in the service map. We “co-hosted” MongoDB in the same region, which dramatically improved the overall latency but exposed new bottlenecks. After eliminating two more bottlenecks in the same way, we had improved the performance of the application by at least 5X.

How does X-Ray prevent itself from creating new bottlenecks?

One might think: if multiple users are making requests to an application, AWS X-Ray has to collect information for all of those requests, which could slow the application down and create new bottlenecks!

X-Ray handles this with a software application called the “AWS X-Ray daemon”. The daemon works in conjunction with the AWS X-Ray SDKs and must be running so that data sent by the SDKs can reach the X-Ray service. Instead of sending trace data directly to X-Ray, each client SDK sends JSON segment documents to a daemon process. The X-Ray daemon buffers segments in a queue and uploads them to X-Ray in batches.
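The buffering idea can be sketched in a few lines of plain Python. This is a toy model, not the daemon's actual implementation: the BatchingBuffer class, the batch size of 3 and the in-memory "upload" are all assumptions for illustration. The one detail taken from the X-Ray documentation, as far as we know, is that each payload sent to the daemon is a small JSON header line followed by the segment document.

```python
import json

# Header line that precedes each segment document sent to the daemon.
HEADER = b'{"format": "json", "version": 1}\n'


def encode_for_daemon(segment):
    """Build the payload an SDK would send: header line, then the segment JSON."""
    return HEADER + json.dumps(segment).encode("utf-8")


class BatchingBuffer:
    """Toy stand-in for the daemon's queue: collect segments, upload in batches."""

    def __init__(self, upload, batch_size=3):
        self.upload = upload          # callable that receives a list of segments
        self.batch_size = batch_size  # illustrative value, not the real daemon's
        self.pending = []

    def receive(self, payload):
        # Split off the header line and keep the parsed segment document.
        _header, _, body = payload.partition(b"\n")
        self.pending.append(json.loads(body))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.upload(self.pending)
            self.pending = []


uploads = []  # each entry represents one batched upload to the X-Ray service
daemon = BatchingBuffer(upload=uploads.append, batch_size=3)
for i in range(7):
    daemon.receive(encode_for_daemon({"name": "OnTrack-Dev", "id": f"{i:016x}"}))
daemon.flush()  # send whatever is left over

print([len(batch) for batch in uploads])  # → [3, 3, 1]
```

Seven requests turn into three uploads here. The application only pays for handing a small payload to a local process, while the slower network calls to the X-Ray service happen outside the request path.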

Conclusion

AWS X-Ray is definitely a good tool for tracing the flow inside an application across several components. The next time we start a project with several components from scratch, we will instrument the code with AWS X-Ray from the very beginning instead of waiting for a problem that needs tracing. If you have found a better way to trace across components, or have some best practices for using X-Ray, feel free to leave a comment.

Tailpiece: Locust and AWS X-Ray teach us a thing or two about naming deeply technical products so that they relate to the common man!

Written by Debolina