IDE-first application metrics to provide actionable feedback for developers

Norbert Podsadowski
May 26

Developers tend to rely on local unit testing to ensure the correctness of isolated, independent modules, but the data combinatorics of production environments are very different from anything tests can simulate locally or in staging environments [1].

Any modern application is therefore continuously monitored by a combination of tools that allow operations and development teams to inspect various production metrics on their deployed code:

  • Request and response times
  • Exception rates
  • Logging lines

One example of a tool which aims to fulfil the above requirements is an Application Performance Monitoring (APM) tool such as New Relic. The main issue with APM tools is that they are largely targeted at operations people and system administrators rather than developers. Runtime information delivered in the form of centralised, external dashboards is simply not actionable for software developers.

Glancing at such a dashboard, you might get a snapshot of how your deployed application is currently doing. You might be able to see how long a particular endpoint takes to respond on average, and how many exceptions were thrown in the past hour. However, developers tend to look at such a dashboard only when a problem arises, or when someone from the operations team (if your company still has one) tells them to. These dashboards do not fully leverage the information provided by the code they instrument, and their clunky UIs are not something developers are likely to interact with on a daily basis.

A current trend in software development is the blurring of the boundaries between building and operating software products — DevOps. The next logical step in the evolution of this paradigm is to integrate runtime monitoring data from production deployments into the tools that developers utilise in their daily workflows.

This blog post will demonstrate the initial iteration of a new take on how developers should interact with and utilise the vast amount of production metrics available on their applications, as they run in production environments. The approach taken here focuses primarily on developer-friendliness and actionable feedback displayed on top of code in the IDE.

The Problem

In order to illustrate the kinds of actionable feedback we are talking about, imagine that you work in a small team of developers responsible for creating the core microservices that power the technology backbone of a small-scale private cab hire startup (think of a scaled-down Uber). Your team works in Python, regularly merging to master in a continuous delivery fashion.

As you arrive at the office in the morning, your team finds out that your customers have been complaining about having to wait a very long time for the app to find rides. These concerns surfaced overnight and are coming from customers in many different locations; your team has double-checked that driver availability in those zones is normal. An APM tool which your team runs to monitor request-response times shows that an array of services have experienced a spike in response times.

What do you do next? You fire up the IDE in order to identify the change which introduced this performance regression. The APM tool may have given you an initial hint, but not much more than that. Perhaps you can figure out how to navigate its interface in order to drill down to more concrete details, but that requires regular context switching between your browser and your IDE. Even then, the APM tool is unlikely to provide you with an easy way to understand and compare performance in a way which is aware of version control and displayed in a context-aware manner. Eventually you ditch the APM tool, start up mocked-up instances of the affected services and run a profiler to see if you can pinpoint the root cause of the performance regression.

The local instances you run are unfortunately far from the truth. Your profiler doesn’t pick up anything out-of-the-ordinary — certainly not anything that your extensive test suite wouldn’t have picked up. But that was expected — how close is your local environment going to be to the full-blown environment that the production instances run on, with key inter-service interactions mocked-out, big differences in configuration, hardware, etc? The point is — no matter how much you try, your application will never behave in the same way as it does in a production environment, making issues like this a lot harder to debug.

Ideally, you’d like something that fulfils the following requirements:

  • Collects profiling information continuously, with very low overhead and per-line granularity
  • Integrates with version control in order to allow your team to make meaningful comparisons across versions
  • Displays this information in a context-sensitive way, in the place you care about the most: your code, not an external dashboard with an overblown UI

The Solution

Our solution is based around leveraging the place developers care about the most, where they spend most of their time writing code: their IDE. This results in an instrumentation layer which actively collects feedback from deployed application instances and makes it available to an IDE plugin for intelligent display.

Let’s pretend that your ride-hailing company has recently given our tool a try. You have configured the instrumentation layer to collect various production metrics on all of your deployed services, and are ready to try out the IDE plugin in order to help you solve the performance regression negatively affecting your customers.

Set-Up

The set-up is the following — when a customer requests a new ride, a pipeline of inter-microservice interactions kicks off in order to serve this request:

  1. A customer chooses a start location and an end location on a mobile app, and taps Request Ride
  2. A load balancer forwards this request on to an instance of a ride service
  3. The ride service sends the start location and end location to the pricing service and asks for a price estimate of this ride
  4. The pricing service uses the map/location service to estimate a price for the ride, which in turn uses a Google Maps client
  5. The ride service sends the start location and end location to the driver matching service
  6. The driver matching service works with the map/location service and various driver-related services to return a list of candidate drivers, sorted based on certain criteria, such as proximity to the customer, the driver’s own preference zones, star-rating of the driver, etc.
  7. A driver is returned to the ride service which communicates with the driver operations service to notify the driver of the request
  8. If the ride is declined by the driver, the ride service asks the next-best driver candidate to accept the ride using the driver operations service
  9. If the ride is accepted by the driver, the ride service notifies the mobile app, periodically sending driver proximity updates (with the help of the map/location service)
  10. … etc.

You fire up the IDE and navigate to the ride service, the service which acts as the primary back-end your customers’ mobile phones talk to. It looks like this:

Here, we have some logic that implements the main endpoint for the ride service: find_ride. It does exactly what I’ve described above: estimates the cost for the ride, gets a list of candidate drivers, picks the first one which accepts, and returns a response to the customer’s mobile application, which displays the ride information to the customer. If we fail to find a driver, we throw an exception, which is translated into an error response the mobile application can handle (we shouldn’t ever use exceptions for flow control, especially from a RESTful API, but bear with me for this contrived example).
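The screenshot of the service isn’t reproduced here, so below is a hedged sketch of what find_ride might look like. The service clients are passed in as plain objects, and every helper and parameter name is hypothetical; the real code is only implied by the description above.

```python
import logging

logger = logging.getLogger("ride_service")


class NoDriverFoundException(Exception):
    """Raised when no candidate driver accepts the ride (contrived flow control)."""


def find_ride(start, end, pricing, matching, driver_ops):
    # Ask the pricing service (which talks to the map/location service) for an estimate.
    price = pricing.estimate_price(start, end)

    # The driver matching service returns candidates sorted by proximity,
    # preference zones, star rating, etc.
    candidates = matching.find_driver_candidates(start, end)

    declines = 0
    for driver in candidates:
        # The driver operations service notifies the driver and reports back.
        if driver_ops.request_ride(driver, start, end):
            logger.info("Ride matched after %d declines", declines)
            return {"driver_id": driver, "price": price}
        declines += 1

    # No driver accepted: the contrived example raises instead of returning an error.
    raise NoDriverFoundException("No driver available for this ride")
```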

Displaying feedback

The first thing we can do is open up a new panel provided by our freshly-installed IDE plugin: Production Feedback.

The plugin has overlaid a wealth of feedback directly from our live production servers. It allows us to set an auto-refresh interval and observe how this feedback changes in real-time. Let’s see if we can use any of this information in order to solve our performance regression problem.

Investigating performance

The first thing that stands out to us is that the compute_distance_km call is taking, on average across all requests, a whopping 4.5 seconds! That doesn’t sound right. If we hover over this line, we get a helpful chart:

The plugin tells us how the performance of this line has changed over time. At about 7:20pm, the average execution time of this call skyrocketed. Let’s continue investigating by jumping to the declaration of the function being called:

The compute_distance_km call sends a request to the map/location service. In the figure above, we’ve jumped to the relevant portion of the map/location service which computes distances between two points. This service uses the Google Maps API in order to fetch a distance, in kilometers, between two locations. Before it can do that, it has to authenticate with the Google Maps service and receive an auth_token. The authentication takes a long time — on average 4 seconds — so the author of the service created a simple bit of logic in order to re-use auth tokens across multiple requests. We simply authenticate once and use that same token until X amount of time has passed, where X is a requirement imposed on us by the Google Maps API. Once we have a token, any calls for the next X amount of time will not have to wait 4 seconds to authenticate, and will instead complete in just 1 second (the time it takes for Google to compute a distance for us):
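The relevant figure isn’t shown here, but the caching logic described above might look roughly like the sketch below. The gmaps_client calls, the module-level cache and TOKEN_TTL_SECONDS are all my own assumptions; the point is simply that the if-statement should only be taken when the cached token has expired.

```python
import time

TOKEN_TTL_SECONDS = 3600  # the "X amount of time" imposed by the Google Maps API

_auth_token = None
_token_acquired_at = 0.0


def compute_distance_km(start, end, gmaps_client):
    global _auth_token, _token_acquired_at

    # Re-authenticate only when the cached token has expired. If this branch
    # runs on every request, each call pays the ~4 second auth penalty,
    # which is exactly what the broken deployment was doing.
    if _auth_token is None or time.time() - _token_acquired_at > TOKEN_TTL_SECONDS:
        _auth_token = gmaps_client.authenticate()   # ~4 seconds
        _token_acquired_at = time.time()

    # ~1 second once we already hold a valid token.
    return gmaps_client.distance_km(start, end, auth_token=_auth_token)
```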

The IDE plugin is able to tell us that the if-statement which governs this logic executed 100.0% of the time for this version of the service. There is something fishy going on here — our auth_token caching seems to be broken. It doesn’t take very long to figure out that Sloppy Simon was doing some late evening coding yesterday and accidentally changed the time interval from 1 hour to 1 second, which meant that we were asking Google Maps for a new authentication token for pretty much every ride request.

Fixing performance

We can revert this change, make a pull request, merge to master and trigger a new deployment of our service:

Once the deployment completes, our customers will again start kicking off the chain of interactions between our suite of microservices. The instrumentation layer collects fresh production metrics and associates them with our newly deployed version. The IDE plugin immediately displays the newly collected feedback (for the same version we currently have checked out, i.e. HEAD of master). We can see that the if-statement, which was previously taken 100% of the time, is now quickly trending towards a minimal percentage as more customers request rides. A single point is visible on our graph this time, as this call was made only once in the past hour, as expected.

If we go back to our main find_ride function, we are able to see the new performance characteristics of the call to compute_distance_km, and the performance graph looks happy again. The gutter on the left also displays a big decrease for the find_driver_candidates call (which also uses the map/location service to find nearby drivers), shaving off multiple seconds from the find_ride call.

This may be a slightly silly and contrived example, but it’s the only way to showcase what this tool is trying to achieve without unnecessary details. The key takeaway is that, unlike any APM tool, the IDE plugin together with an always-on instrumentation layer provides developers with a tool for analysing performance regressions, mapped on top of real code artefacts. It provides meaningful feedback on a line-by-line basis and integrates extremely well with existing IDE workflows such as version control, jump-to-declaration, etc. No browser-based APM tool is able to provide this level of developer-friendliness and actionable feedback.

Exceptions and logs

The instrumentation layer collects a lot more than just profiling information. We’ve already seen the IDE plugin display an estimated branch probability for the if-statement in our compute_distance_km function; it derives this estimate from the same profiling samples, roughly as sketched below. Why stop here?
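As a rough illustration of how such a branch probability could fall out of the sampling data alone (this is my reading of the idea, not the plugin’s documented algorithm): for each profiled request that reached the function, check whether any sample landed on a line inside the if-block.

```python
# Hedged sketch: request_profiles is assumed to be a list of per-request
# sample sets, each a set of (filename, lineno) pairs; branch_body_lines is
# the set of line numbers inside the if-block being estimated.
def estimate_branch_probability(request_profiles, branch_body_lines):
    if not request_profiles:
        return 0.0
    hits = sum(
        1
        for samples in request_profiles
        if any(line in branch_body_lines for _, line in samples)
    )
    return hits / len(request_profiles)
```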

Remember how we said that if we couldn’t find a driver for our customer, we would throw an exception? Let’s hover over this line:

The plugin is able to show us the list of exceptions which have been thrown by this line for the current version of the application. Although we have already fixed the likely culprit behind our performance regression, it’s very easy for us to hover over this line to double-check that everything is okay. In this case, only two occurrences of the exception are seen, and combining this information with our own knowledge of the wider business context satisfies us that this is the expected behaviour. If we saw that this exception was suddenly being thrown on every find_ride call, we could immediately start investigating why eligible drivers aren’t being found.

All of the lines in the entire stack trace are highlighted, such that if we navigated to the caller of find_ride, the same information would be displayed. The aim is for future versions of the plugin to include rich filtering features that would enable developers to answer questions like “How often does the code I’m currently looking at throw exceptions under scenario X?” and later help them answer even more important questions like “Why is this code throwing exceptions in scenario X?” Furthermore, the effects of any deployment are relayed back to the developer as soon as the first customer starts interacting with the new version of the application. This massively decreases the amount of time developers have to wait to see whether their most recent deployment fixed the issues they were working on, with the added confidence that it’s the real system running in production, and not just an automated test suite. This can not only increase the rate at which we continuously deliver software, but also make us more confident about its correct operation.

Why stop at exceptions? If you have been paying attention, we have a logging call right at the end of our find_ride function. It logs the number of drivers who declined a ride before we found one who accepted. Hovering over this line shows the exact time, type and message of each logging call made in this deployment:

As with exceptions, having such information readily available in the IDE eliminates the need for developers to constantly switch between a terminal, or perhaps a log-aggregator tool, and their code. The logs they’re interested in are positioned in the exact place where they occur and inform the developer when making various development decisions.

Smart IDE suggestions

The features demonstrated above tie together various first-class metrics, such as performance, exceptions and logging data, in a sleek IDE plugin that allows us to diagnose a wide array of issues. We are also not limited to using this information for debugging; we can envision this kind of information helping developers write better code, rather than it simply being a post-mortem tool.

Any regular user of the IntelliJ IDEs should be familiar with inspections. These are helpful hints which the IDE gives us on anything from code style to auto-refactoring suggestions. What if our production feedback plugin implemented a similar suite of inspections that could warn the developer that they’re about to call a function which has recently experienced performance issues, started throwing exceptions, or otherwise misbehaved? There are a lot of possibilities to explore beyond this initial v0.0.1 prototype.

What you’ve seen in the above only scratches the surface of what is possible in terms of integrating rich production metrics with developers’ daily workflows. There are many other questions which can be answered through the use of such a plugin — we explore some of these at the end of this post.

Technical Details

There are three main components to this entire pipeline:

  • A language and framework-specific instrumentation layer which hooks into our web application and collects profiling information, logging calls and exceptions
  • A metric backend which serves a RESTful API with endpoints that the instrumentation layer hits with raw, instrumented data. This data is processed by the metric backend and stored in a database
  • An IntelliJ plugin which asks the metric backend for feedback information on the files we currently have open and manages the UI where we display our processed feedback. It also handles integration with Git in order to properly version our feedback and display rich feedback histories

I will go over each of the above briefly — you can also refer to the below diagram:

Instrumentation layer

From the perspective of a user of this system, it all starts with a simple configuration file which looks like this:

This tells the instrumentation layer a few things (a sketch of the file’s contents follows the list below):

  • A unique name to identify the production metrics for this application
  • A URL to the metric backend where feedback will be stored
  • The path to the root of the git repository. This is necessary in order to tag each bit of feedback with the current running version
  • A base path for sources — likely the same as the git base path, but not necessarily. All paths (such as those collected from our Python profiler) are relativized against this, because the full path of the running files (e.g. in /opt/ride_company/ride_service) will be different from how the IDE sees them (e.g. in /Users/simon/git_repos/ride_company_repo).
  • A list of file globs which should be instrumented. We might not care about all the things going on in our application, such as calls to various libraries, so the config file allows you to define which directories should be profiled.
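The screenshot of the configuration file isn’t included above, and its exact on-disk format isn’t important for this discussion. Modelled as a Python mapping, the fields described in the list would look roughly like this; all key names and values are made up for illustration.

```python
# Illustrative only: mirrors the fields described above, not the real format.
FEEDBACK_CONFIG = {
    "application_name": "ride_service",                     # unique name for this app's metrics
    "metric_backend_url": "http://feedback.internal:8080",  # where raw feedback is sent
    "git_repo_path": "/opt/ride_company/ride_service",      # used to tag feedback with the running version
    "source_base_path": "/opt/ride_company/ride_service",   # profiler paths are relativized against this
    "instrumented_globs": ["ride_service/**/*.py"],         # only these files are profiled
}
```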

The user then opens up the file where they initialise their Flask application (or any other supported* web framework), imports the instrument_flask function from the instrumentation package, and wraps their Flask app in a call to it, passing in the feedback configuration file:

*currently supporting Python + Flask, but the tool is readily extensible to other frameworks and languages.
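The wrapping itself is a one-liner. The post only names the instrument_flask function, so the import path, exact signature and config filename below are placeholders of my own:

```python
from flask import Flask

# Hypothetical import path; only the function name comes from the post.
from feedback_instrumentation import instrument_flask

app = Flask(__name__)

# Wrap the app so every request is profiled and its logs and exceptions are
# captured, using the feedback configuration file described earlier.
app = instrument_flask(app, config_path="feedback.yaml")
```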

This call does a few things (a minimal sketch follows the list below):

  • Hooks into the Python root logger in order to capture logging information using a custom handler
  • Hooks into Flask’s internal exception handling functions in order to capture exception stacks
  • Injects a before_request and after_request function into every defined endpoint in order to launch the Python profiler
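Here is a minimal sketch of how those three hooks could be wired up in Flask. This is not the project’s actual implementation; record_exception, start_profiler_for_current_process and send_to_metric_backend are stand-in helpers for pieces the post doesn’t show.

```python
import logging

from flask import g, got_request_exception


class FeedbackLogHandler(logging.Handler):
    """Captures log records emitted while a request is being handled."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink

    def emit(self, record):
        self.sink.append((record.created, record.levelname, record.getMessage()))


def instrument_flask(app, config_path):
    captured_logs = []

    # 1. Custom handler attached to the Python root logger.
    logging.getLogger().addHandler(FeedbackLogHandler(captured_logs))

    # 2. Flask signal fired whenever a request raises an unhandled exception.
    def _capture_exception(sender, exception, **extra):
        record_exception(exception)  # assumed helper: stores the stack trace

    # weak=False keeps a strong reference to the locally defined receiver.
    got_request_exception.connect(_capture_exception, app, weak=False)

    # 3. Profiler started before, and stopped after, every request.
    @app.before_request
    def _start_profiler():
        g.profiler = start_profiler_for_current_process()  # assumed helper

    @app.after_request
    def _stop_profiler(response):
        profile_blob = g.profiler.stop()  # raw folded-stack text
        send_to_metric_backend(profile_blob, captured_logs, config_path)  # assumed helper
        return response

    return app
```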

The last thing left for the user to do is to open up a configuration dialog in IntelliJ and point the IDE to the same feedback configuration file:

After that, the feedback toolbar and all of its associated actions will become available.

The profiler we’re using is PyFlame, a very low-overhead sampling profiler which is safe to run on every request in production. It runs as a separate process and uses ptrace to inspect Python stacks and infer execution times. It runs for the duration of the whole request, after which it is terminated for that request. I am currently working on a way to run it as a parent of the Flask process in order to allow ptrace to work without any elevated privileges. Another to-do item is replacing it with an even better and more actively maintained profiler, py-spy.
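For concreteness, the per-request profiling could be driven from the before_request/after_request hooks along the lines below, filling in the start_profiler_for_current_process placeholder from the earlier sketch. The flags are from my recollection of the PyFlame command-line interface (-p to attach to a pid, -r for the sampling rate, -s for a maximum duration, -o for the output file), and I am assuming that interrupting PyFlame makes it flush its output; treat this as a sketch rather than the project’s actual integration.

```python
import os
import signal
import subprocess
import tempfile


class PyflameSession:
    """Handle for a PyFlame process attached for the duration of one request."""

    def __init__(self, proc, output_path):
        self.proc = proc
        self.output_path = output_path

    def stop(self):
        # Assumption: interrupting PyFlame makes it stop sampling and write
        # its folded-stack output before exiting.
        self.proc.send_signal(signal.SIGINT)
        self.proc.wait()
        with open(self.output_path) as f:
            return f.read()


def start_profiler_for_current_process(max_seconds=60, rate=0.001):
    """Attach PyFlame to this process; sampling ends when the request does."""
    output_path = tempfile.NamedTemporaryFile(suffix=".folded", delete=False).name
    proc = subprocess.Popen([
        "pyflame",
        "-p", str(os.getpid()),   # attach to the running Flask process
        "-r", str(rate),          # sampling interval, in seconds
        "-s", str(max_seconds),   # upper bound; we normally interrupt earlier
        "-o", output_path,        # where the folded stacks are written
    ])
    return PyflameSession(proc, output_path)
```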

All instrumentation is collected on a per-request basis. At the end of each request, the collected profiling data, in the form of a blob of text, is sent to the metric backend for parsing, together with current version information, logging records captured during the request, any thrown exceptions, start and end timestamps, etc. The tool can easily be adapted to work outside of a web-application context; however, I’ve decided to focus on web applications, as they allow individual requests to become first-class citizens with a well-defined start and end.
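The wire format isn’t specified in the post, but the per-request payload presumably bundles something along these lines; the field names and values below are invented for illustration.

```python
# Hypothetical shape of the data shipped to the metric backend per request.
feedback_payload = {
    "application_name": "ride_service",
    "version": "9f2c1ab",                        # git commit of the running code
    "request_started_at": "2019-05-25T19:20:01Z",
    "request_finished_at": "2019-05-25T19:20:06Z",
    "profile": "<raw folded-stack text from PyFlame>",
    "log_records": [
        {"time": "2019-05-25T19:20:06Z", "level": "INFO",
         "message": "Ride matched after 2 declines"},
    ],
    "exceptions": [],                            # captured stack traces, if any
}
```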

Metric Backend

The metric backend acts primarily as a parser of instrumentation data such as that produced by PyFlame and stores this versioned information in a PostgreSQL database.

The hardest thing it does is parse PyFlame output into meaningful profiling information. The output from PyFlame is essentially a mapping between a particular stack trace and the number of samples collected for it during the period the profiler ran. Combined with the information sent by the instrumentation layer, the backend is able to translate each profiled line into an execution time.
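To make that translation concrete, here is a rough sketch of the idea: each output line maps a semicolon-separated stack to a sample count, and given the request’s wall-clock duration from the instrumentation layer, a line’s share of the samples becomes an estimated execution time. The frame format shown is an assumption, and real PyFlame output has details (such as idle frames) that the real parser has to handle.

```python
from collections import defaultdict


def estimate_line_times(folded_output, request_duration_s):
    """Turn folded-stack samples into inclusive per-line time estimates.

    Input lines are assumed to look like:
        "ride.py:find_ride:31;maps.py:compute_distance_km:12 42"
    i.e. a stack of file:function:line frames followed by a sample count.
    """
    samples_per_line = defaultdict(int)
    total_samples = 0

    for line in folded_output.strip().splitlines():
        stack, count_str = line.rsplit(" ", 1)
        count = int(count_str)
        total_samples += count
        # Credit every frame on the stack, so a call site such as the
        # compute_distance_km line accumulates the full time spent below it.
        for frame in stack.split(";"):
            if frame.count(":") < 2:   # skip idle or unparsable frames
                continue
            filename, func, lineno = frame.rsplit(":", 2)
            samples_per_line[(filename, func, int(lineno))] += count

    if total_samples == 0:
        return {}
    return {
        key: request_duration_s * n / total_samples
        for key, n in samples_per_line.items()
    }
```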

Another part of the API allows the IntelliJ plugin to request feedback for any filename, returning versioned feedback for the last N requested versions.

IntelliJ Plugin

The bulk of the code is actually in the front-end plugin. The plugin essentially creates a feedback loop, both conceptually and literally. Every X seconds, it asks the metric backend for feedback on the currently opened files. Once it receives this feedback, it performs line translation in order to build a single view of each logical line.

To illustrate what I mean by this, think back to our earlier focus on the line which called the compute_distance_km function. As code changes, this line will be moved around the file, or possibly be changed/deleted. However, we would like to display a performance history (as well as power other features) based on the previous N versions of the same line available in the metric backend. If during version A the line existed on line 35, and in version B the line was moved down to line 39, the profiler would collect profiling information for two different lines. The IDE plugin uses textual differences between consecutive versions of the file to figure out that line 35 is logically the same as line 39. It is then able to translate the feedback collected from line 35 in the earlier version and combine it with the current feedback collected for line 39. This allows us to track changes in performance (as well as other things) even as code evolves.
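Python’s difflib can do the heavy lifting for this kind of translation; the sketch below is my approximation of the idea rather than the plugin’s actual implementation. Any line that fails to map (because it changed or was deleted) simply starts life as a new logical line.

```python
import difflib


def build_line_mapping(old_source, new_source):
    """Map 1-indexed line numbers in the old file version to the new one,
    for lines that are textually unchanged. Feedback recorded against old
    line 35 can then be re-anchored to, say, new line 39."""
    old_lines = old_source.splitlines()
    new_lines = new_source.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)

    mapping = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping
```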

Some might question the validity of this approach. Is it really possible to perform this translation in 100% of cases? Shouldn’t we be doing some form of static analysis to confirm that line 35 indeed equals line 39? The answer is based on a few facts:

  • This actually works very well in most cases. If a line looks exactly the same, it can be considered the same line from the developer’s perspective, as it was called with the same arguments, etc. There may be some edge cases to cover, but these can be dealt with in other, simpler ways than static analysis
  • As soon as the line changes even slightly, it becomes a new logical line — and this does not hinder the usefulness of the plugin, as feedback is collected on a regular basis from a constant stream of requests.
  • It would be extremely complex to implement a static analyser to give us 100% certain line-to-line mappings across versions. Even if one could use a very sophisticated static analyser to do this, code execution can depend on many things other than code: think environment variables, external configuration, etc.

All of this line translation means that as developers actively edit code in their editor, they can still view existing feedback in the place where they expect it, despite the fact that the instrumentation layer has no awareness of how code is being changed and later viewed by developers. It also allows us to track how feedback on a particular logical line evolves through multiple deployments, allowing teams which adhere to continuous delivery practices to immediately see the results of their changes.

Future Work

What we are proposing here is an entirely new method of creating a very quick and effective feedback loop that aids developers in their daily decision-making. Some other features being planned include:

  • Integration with request payloads: being able to look at a function and filter based on the payload which kicked off its execution would allow developers to learn how their function behaves with respect to different inputs. This information can be smartly overlaid on top of lines, with the plugin making intelligent suggestions without the developer needing to think of how to filter in the first place
  • Individual request inspection: one should be able to isolate any single request in order to see performance, exceptions and logs which occurred for just that request. This would allow one to debug any problematic request out of the box
  • Code inspections: informing the developer that the code they are currently writing may be problematic based on current feedback
  • Richer profiling information: e.g. displaying the number of times certain loops get executed
  • Support for try-catch blocks: seeing not only the exceptions which bubble up to the top, but also those that get caught
  • UI improvements: e.g. showing how many versions behind a piece of feedback is, allowing the user to check out that version with the click of a button, etc.
  • A lot more!

Furthermore, none of this work is limited to Python, or the IntelliJ platform — these were picked for reasons of simplicity and familiarity. The metric backend and IDE plugin have been written in a generic* way to allow extension to any other language, web framework, or profiling tool.

*as generic as one would expect for v0.0.1.

Closing Thoughts

This work is being done as my final-year Masters project, supervised by Dr. Robert Chatley at Imperial College London. This blog post is intended as a self-contained demonstration, but also as a way to reach out to developers and gauge interest. Given enough interest, I will continue its development as an open-source alternative to current proprietary APM solutions. Feel free to star the GitHub repository for updates on a potential stable release:

Thanks for reading.

[1] J. Cito. Developer targeted analytics: Supporting software development decisions with runtime information. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 892–895, Sep. 2016.
