IDE-first application metrics to provide actionable feedback for developers

Norbert Podsadowski
May 26, 2019

Developers tend to rely on local unit testing to ensure correctness of isolated, independent modules; but data combinatorics in production environments are very different to what tests can simulate locally or in staging environments [1].

Any modern application is therefore continuously monitored by a combination of tools that allow operations and development teams to inspect various production metrics on their deployed code:

  • Request and response times

One example of a tool which aims to fulfil the above requirements is an Application Performance Monitoring (APM) tool such as New Relic. The main issue with APM tools is that they are largely targeted at operations teams and system administrators, not at developers. Runtime information delivered in the form of centralised, external dashboards is simply not actionable for software developers.

New Relic [2], a proprietary APM tool. The data is there, but does it help while writing code?

Glancing at the above, one might get a snapshot of how their deployed application is currently doing. You might see how long a particular endpoint takes to respond on average, and how many exceptions were thrown in the past hour. However, developers tend to look at such a dashboard only when a problem arises, or when someone from the operations team (if your company still has one) tells them to. These dashboards do not fully leverage the information provided by the code they instrument, and the result is clunky UIs which developers aren't likely to interact with on a daily basis.

A current trend in software development is the blurring of the boundaries between building and operating software products — DevOps. The next logical step in the evolution of this paradigm is to integrate runtime monitoring data from production deployments into the tools that developers utilise in their daily workflows.

This blog post will demonstrate the initial iteration of a new take on how developers should interact with and utilise the vast amount of production metrics available on their applications, as they run in production environments. The approach taken here focuses primarily on developer-friendliness and actionable feedback displayed on top of code in the IDE.

The Problem

In order to illustrate the kinds of actionable feedback we are talking about, imagine that you work in a small team of developers responsible for creating the core microservices that power the technology backbone of a small-scale private cab hire startup, in the vein of Uber. Your team works in Python, regularly merging to master in a continuous delivery fashion.

As you arrive at the office in the morning, your team finds out that customers have been complaining about having to wait a very long time for the app to find rides. These concerns surfaced overnight and are coming from customers in many different locations; your team has double-checked that driver availability in those zones is normal. The APM tool which your team runs to monitor request-response times shows that an array of services have experienced a spike in response times.

What do you do next? You fire up the IDE in order to identify the change which introduced this performance regression. The APM tool may have given you an initial hint, but not much more than that. Perhaps you can figure out how to navigate its interface in order to drill down to more concrete details, but that requires regular context switching between your browser and your IDE. Even then, the APM tool is unlikely to provide you with an easy way to understand and compare performance in a way which is aware of version control and displayed in a context-aware manner. Eventually you ditch the APM tool, start up mocked-up instances of the affected services and run a profiler to see if you can pinpoint the root cause of the performance regression.

The local instances you run are unfortunately far from the truth. Your profiler doesn’t pick up anything out-of-the-ordinary — certainly not anything that your extensive test suite wouldn’t have picked up. But that was expected — how close is your local environment going to be to the full-blown environment that the production instances run on, with key inter-service interactions mocked-out, big differences in configuration, hardware, etc? The point is — no matter how much you try, your application will never behave in the same way as it does in a production environment, making issues like this a lot harder to debug.

Ideally, you’d like something that fulfils the following requirements:

  • Collects profiling information continuously with very low overhead, with a per-line granularity

The Solution

Our solution leverages the place developers care about most, where they spend most of their time writing code: their IDE. The result is an instrumentation layer which actively collects feedback on deployed application instances and makes it available to an IDE plugin for intelligent display.

Let’s pretend that your ride-hailing company has recently given our tool a try. You have configured the instrumentation layer to collect various production metrics on all of your deployed services, and are ready to try out the IDE plugin in order to help you solve the performance regression negatively affecting your customers.


The set-up is the following — when a customer requests a new ride, a pipeline of inter-microservice interactions kicks off in order to serve this request:

  1. A customer chooses a start location and an end location on a mobile app, and taps Request Ride

You fire up the IDE and navigate to the ride service, the service which acts as the primary back-end your customers’ mobile phones talk to. It looks like this:

Our find_ride function, a key portion of business logic which glues all of our microservice interactions.

Here, we have some logic that implements the main endpoint for the ride service: find_ride. It does exactly what I’ve described above: estimates the cost for the ride, gets a list of candidate drivers, picks the first one which accepts, and returns a response to the customer’s mobile application which displays the ride information to the customer. If we fail to find a driver, we throw an exception which can be caught by the mobile application (we shouldn’t ever use exceptions for flow-control, especially from a RESTful API, but bear with me for this contrived example).
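The code itself appears only as a screenshot above, so here is a minimal sketch of what find_ride might look like. The helper functions (estimate_cost, offer_ride) and the exception name are hypothetical stand-ins; find_driver_candidates and the overall flow mirror what the post describes.

```python
# Hypothetical sketch of the find_ride endpoint; helper names are
# illustrative stubs, not the real service code.

class NoDriverFoundError(Exception):
    """Raised when no candidate driver accepts the ride."""

def estimate_cost(start, end):
    return 10.0  # stub: the real code calls the map/location service here

def find_driver_candidates(start):
    return ["driver_a", "driver_b"]  # stub: real code queries nearby drivers

def offer_ride(driver, start, end, cost):
    return driver == "driver_b"  # stub: the first driver declines

def find_ride(start, end):
    cost = estimate_cost(start, end)
    declined = 0
    for driver in find_driver_candidates(start):
        if offer_ride(driver, start, end, cost):
            # the logging call described later in the post
            print(f"{declined} driver(s) declined before a match was found")
            return {"driver": driver, "cost": cost}
        declined += 1
    raise NoDriverFoundError("no eligible driver accepted the ride")
```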

Displaying feedback

The first thing we can do is open up a new panel provided by our freshly-installed IDE plugin: Production Feedback.

Kicking off the feedback loop.

The plugin has overlaid a wealth of feedback directly from our live production servers. It allows us to set an auto-refresh interval and observe how this feedback changes in real-time. Let’s see if we can use any of this information in order to solve our performance regression problem.

Investigating performance

The first thing that stands out to us is that the compute_distance_km call is taking, on average across all requests, a whopping 4.5 seconds! That doesn’t sound right. If we hover over this line, we get a helpful chart:

Hovering over a profiled line gives you rich per-request history information.

The plugin tells us how the performance of this line has changed over time. At about 7:20pm, the average performance of this call skyrocketed. Let's continue investigating by jumping to the declaration of the function being called:

Something looks fishy.

The compute_distance_km call sends a request to the map/location service. In the figure above, we’ve jumped to the relevant portion of the map/location service which computes distances between two points. This service uses the Google Maps API in order to fetch a distance, in kilometers, between two locations. Before it can do that, it has to authenticate with the Google Maps service and receive an auth_token. The authentication takes a long time — on average 4 seconds — so the author of the service created a simple bit of logic in order to re-use auth tokens across multiple requests. We simply authenticate once and use that same token until X amount of time has passed, where X is a requirement imposed on us by the Google Maps API. Once we have a token, any calls for the next X amount of time will not have to wait 4 seconds to authenticate, and will instead complete in just 1 second (the time it takes for Google to compute a distance for us):

Key logic in the Map/Location service.

The IDE plugin is able to tell us that the if-statement which governs this logic executed 100.0% of the time for this version of the service. There is something fishy going on here — our auth_token caching seems to be broken. It doesn’t take very long to figure out that Sloppy Simon was doing some late evening coding yesterday and accidentally changed the time interval from 1 hour to 1 second, which meant that we were asking Google Maps for a new authentication token for pretty much every ride request.

Sloppy Simon does it again.

Fixing performance

We can revert this change, make a pull request, merge to master and trigger a new deployment of our service:

Reverting the offending commit. While we wait for the deployment to succeed, we can have a stern conversation with Simon.

Once the deployment completes, our customers will again start kicking off the chain of interactions between our suite of microservices. The instrumentation layer collects fresh production metrics and associates them with our newly deployed version. The IDE plugin immediately displays the newly-collected feedback (for the same version we currently have checked out, i.e. HEAD of master). We can see that the if-statement, which was previously being executed 100% of the time, is now quickly decreasing towards a minimal amount as more customers request rides. A single point is visible on our graph this time, as this call was made only once in the past hour, as expected.

We’ve fixed the performance regression.

If we go back to our main find_ride function, we are able to see the new performance characteristics of the call to compute_distance_km, and the performance graph looks happy again. The gutter on the left also displays a big decrease for the find_driver_candidates call (which also uses the map/location service to find nearby drivers), shaving off multiple seconds from the find_ride call.

A decrease in response times clearly seen as a result of our change.

This may be a slightly silly and contrived example, but it’s the only way to showcase what this tool is trying to achieve, without the unnecessary details. The key takeaway is that, unlike any APM tool, the IDE plugin together with an always-on instrumentation layer is able to provide developers with a tool for the analysis of performance regressions, mapped on top of real code artefacts. It provides meaningful feedback on a line-by-line basis, and integrates with existing IDE workflows extremely well, such as version control, jump-to-declaration, etc. No browser-based APM tool is able to provide this level of developer-friendliness and actionable feedback.

Exceptions and logs

The instrumentation layer collects a lot more than just profiling information. We’ve already seen the IDE plugin display an estimated branch probability for the if-statement in our compute_distance_km function, but it actually uses the same profiling information to figure this out. Why stop here?
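As a rough illustration of how a branch probability could fall out of the same sampling data (this is my simplification, not necessarily the tool's actual computation): treat each request's profile as a map from line number to sample count, then compare how often the branch body was observed against how often the condition was.

```python
def branch_probability(profiles, condition_line, body_line):
    """Estimate how often an if-branch was taken across requests.

    Each profile is a {line_number: sample_count} dict for one request;
    a line counts as executed if it received any samples at all.
    """
    entered = sum(1 for p in profiles if condition_line in p)
    taken = sum(1 for p in profiles if body_line in p)
    return taken / entered if entered else 0.0
```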

Remember how we said that if we couldn't find a driver for our customer, we would throw an exception? Let's hover over this line:

The plugin displays exceptions across the whole stack trace, directly in our editor.

The plugin is able to show us the list of exceptions thrown by this line for the current version of the application. Although we have already fixed the likely culprit behind our performance regression, it's very easy to hover over this line and double-check that everything is okay. In this case, only two occurrences of the exception are seen, and combining this information with our knowledge of the wider business context satisfies us that this is the expected behaviour. If we saw this exception suddenly being thrown on every find_ride call, we could immediately start investigating why eligible drivers aren't being found.

All of the lines in the entire stack trace are highlighted, so that if we navigated to the caller of find_ride, the same information would be displayed. The aim is for future versions of the plugin to include rich filtering features that would enable developers to answer questions like "How often does the code I'm currently looking at throw exceptions, under scenario X?", and later help them answer even more important questions like "Why is this code throwing exceptions in scenario X?" Furthermore, the effects of any deployment are relayed back to the developer as soon as the first customer starts interacting with the new version of the application. This massively decreases the time developers have to wait to see whether their most recent deployment fixed the issues they were working on, with the added confidence that it's the real system running in production, not just an automated test suite. This can not only increase the rate at which we continuously deliver software, but also make us more confident about its correct operation.

Why stop at exceptions? If you have been paying attention, you'll have noticed the logging call right at the end of our find_ride function, which records the number of drivers who declined a ride before we found one. Hovering over this line shows the exact time, type and message of each logging call made in this deployment:

The plugin displays logging records across our entire application.

As with exceptions, having such information readily available in the IDE eliminates the need for developers to constantly switch between a terminal, or perhaps a log-aggregator tool, and their code. The logs they’re interested in are positioned in the exact place where they occur and inform the developer when making various development decisions.

Smart IDE suggestions

The features demonstrated above tie various first-class metrics, such as performance, exceptions and logging data, into a sleek IDE plugin that allows us to diagnose a wide array of issues. We are also not limited to using this information for debugging: we can envision this kind of information helping developers write better code, rather than simply serving as a post-mortem tool.

Any regular user of the IntelliJ IDEs should be familiar with inspections. These are helpful hints which the IDE gives us on anything from code style to auto-refactoring suggestions. What if our production feedback plugin implemented a similar suite of inspections, that could warn the developer that they’re about to call a function which has recently experienced performance issues, or has recently started throwing exceptions or otherwise misbehaving? There are a lot of possibilities to explore beyond this initial v0.0.1 prototype.

Example of an IntelliJ inspection telling us that this isn’t a very Pythonic way of testing booleans. What if it told us things about how our code is likely to behave in production, before we even push?

What you’ve seen in the above only scratches the surface of what is possible in terms of integrating rich production metrics with developers’ daily workflows. There are many other questions which can be answered through the use of such a plugin — we explore some of these at the end of this post.

Technical Details

There are three main components to this entire pipeline:

  • A language and framework-specific instrumentation layer which hooks into our web application and collects profiling information, logging calls and exceptions

I will go over each of the above briefly — you can also refer to the below diagram:

Overall system architecture.

Instrumentation layer

From the user's perspective, it all starts with a simple configuration file which looks like this:

A feedback.yaml file.

This tells the instrumentation layer a few things.

  • A unique name to identify the production metrics for this application
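The actual schema only appears as a screenshot, but a feedback.yaml might plausibly look something like this (the key names below are my guesses, not the tool's real fields):

```yaml
# Hypothetical feedback.yaml; key names are illustrative only
app_name: ride-service                      # unique name for this app's metrics
backend_url: http://metrics.internal:8000   # where the metric backend lives
```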

The user then opens up the file where they initialise their Flask application (or any other supported* web framework), imports the instrument_flask function from the instrumentation package, and wraps their Flask app in a call to it, passing in the feedback configuration file:

Minimal setup required.

*currently supporting Python + Flask, but the tool is very extendable to other frameworks and languages.

This call does a few things:

  • Hooks into the Python root logger in order to capture logging information using a custom handler
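A minimal sketch of what hooking the root logger with a custom handler could look like (the class and field names here are mine, not the tool's):

```python
import logging

class FeedbackLogHandler(logging.Handler):
    """Buffers log records so they can be attached to the current request."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        # capture the fields the IDE plugin later displays per line
        self.records.append({
            "time": record.created,
            "level": record.levelname,
            "message": record.getMessage(),
        })

feedback_handler = FeedbackLogHandler()
logging.getLogger().addHandler(feedback_handler)  # hook the root logger
```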

The last thing left for the user to do is to open up a configuration dialog in IntelliJ and point the IDE to the same feedback configuration file:

The only IDE set-up required is choosing the path to the feedback configuration file.

After that, the feedback toolbar and all of its associated actions will become available.

The profiler we're using is called PyFlame, a very low-overhead sampling profiler which is safe to run on every request in production. It runs as a separate process and uses ptrace to inspect Python stacks and infer execution times. It runs for the duration of the whole request, after which it is terminated. I am currently working on a way to run this as a parent of the Flask process so that ptrace works without elevated privileges. Another to-do item is replacing it with an even better and more actively maintained profiler, py-spy.

All instrumentation is collected on a per-request basis. At the end of each request, the collected profiling data, in the form of a blob of text, is sent to the metric backend for parsing, together with current version information, logging records captured during the request, any thrown exceptions, start and end timestamps, etc. The tool can be easily adapted to work outside of a web-application context; however, I've decided to focus on web applications, as they allow individual requests to become first-class citizens with a well-defined start and end.
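Putting those pieces together, the per-request blob shipped to the metric backend might be assembled roughly like this (the field names are illustrative, not the real wire format):

```python
def build_request_payload(app_name, version, profile_blob,
                          log_records, exceptions,
                          started_at, finished_at):
    """Assemble the per-request data sent to the metric backend at the
    end of each request. Field names are illustrative, not the real schema."""
    return {
        "app": app_name,
        "version": version,        # e.g. the deployed git commit hash
        "profile": profile_blob,   # raw PyFlame text for this request
        "logs": log_records,
        "exceptions": exceptions,
        "started_at": started_at,
        "finished_at": finished_at,
        "duration_ms": (finished_at - started_at) * 1000.0,
    }
```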

Metric Backend

The metric backend acts primarily as a parser of instrumentation data such as that produced by PyFlame and stores this versioned information in a PostgreSQL database.

The hardest thing it does is the parsing of PyFlame outputs into meaningful profiling information. The output from PyFlame is essentially a mapping between a particular stack trace and the number of samples it collected in the period of time that it ran. From the information sent by the instrumentation layer, it is able to translate each profiled PyFlame line into an execution time.
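PyFlame emits flame-graph-style folded stacks: each output line is a semicolon-separated stack of file:function:line frames followed by a sample count. A simplified sketch of turning such a blob into per-line time estimates (the real backend's parsing is more involved):

```python
def parse_folded_stacks(blob, duration_seconds):
    """Turn a folded-stack profile (each line: 'frame;frame;... count',
    frames formatted as 'file:function:lineno') into rough per-line
    execution-time estimates, by attributing samples proportionally."""
    samples_per_line = {}
    total_samples = 0
    for line in blob.strip().splitlines():
        stack, count_str = line.rsplit(" ", 1)
        count = int(count_str)
        total_samples += count
        for frame in stack.split(";"):
            path, func, lineno = frame.rsplit(":", 2)
            key = (path, int(lineno))
            samples_per_line[key] = samples_per_line.get(key, 0) + count
    if not total_samples:
        return {}
    # convert sample counts into approximate wall-clock time per line
    return {
        key: duration_seconds * n / total_samples
        for key, n in samples_per_line.items()
    }
```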

Another part of the API lets the IntelliJ plugin request feedback for any filename, returning versioned feedback for the last N requested versions.

IntelliJ Plugin

The bulk of the code is actually in the front-end plugin. The plugin essentially creates a feedback loop, both conceptually and literally. Every X seconds, it asks the metric backend for feedback on the currently opened files. Once it receives this feedback, it performs line translation in order to create a view of a single logical line.

To illustrate what I mean by this, think back to our earlier focus on the line which called the compute_distance_km function. As code changes, this line will be moved around the file, or possibly be changed/deleted. However, we would like to display a performance history (as well as power other features) based on the previous N versions of the same line available in the metric backend. If during version A the line existed on line 35, and in version B the line was moved down to line 39, the profiler would collect profiling information for two different lines. The IDE plugin uses textual differences between consecutive versions of the file to figure out that line 35 is logically the same as line 39. It is then able to translate the feedback collected from line 35 in the earlier version and combine it with the current feedback collected for line 39. This allows us to track changes in performance (as well as other things) even as code evolves.
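In Python, a sketch of this translation using textual diffing might look like the following (the plugin itself targets IntelliJ, so this is illustrative only):

```python
import difflib

def translate_lines(old_source, new_source):
    """Map line numbers in the old version of a file to line numbers in
    the new version, for lines that are textually unchanged. A sketch of
    the plugin's 'logical line' translation between deployed versions."""
    old_lines = old_source.splitlines()
    new_lines = new_source.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines,
                                      autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            # 1-based line numbers, as an editor would display them
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping
```

Feedback recorded against an old line number can then be looked up through this mapping and merged with feedback for the corresponding line in the current version.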

Some might question the validity of this approach. Is it really possible to perform this translation in 100% of cases? Shouldn’t we be doing some form of static analysis to confirm that line 35 indeed equals line 39? The answer is based on a few facts:

  • This actually works very well in most cases. If a line looks exactly the same, it can be considered the same line from the developer's perspective, as it was called with the same arguments, etc. There may be some edge cases to cover, but these can be dealt with in other, simpler ways than static analysis

All of this line translation means that as developers actively edit code in their editor, they can still view existing feedback in the place where they expect it, despite the fact that the instrumentation layer has no awareness of how code is being changed and later viewed by developers. It also allows us to track how feedback on a particular logical line evolves through multiple deployments, allowing teams which adhere to continuous delivery practices to immediately see the results of their changes.

Future Work

What we are proposing here is an entirely new method of creating a very quick and effective feedback loop that aids developers in their daily decision making. Some other features which are being planned:

  • Integration with request payloads: being able to look at a function and filter based on the payload which kicked off its execution would allow developers to learn how their function behaves with respect to different inputs. This information can be smartly overlaid on top of lines, with the plugin making certain intelligent suggestions without the developer needing to think of how to filter in the first place

Furthermore, none of this work is limited to Python, or the IntelliJ platform — these were picked for reasons of simplicity and familiarity. The metric backend and IDE plugin have been written in a generic* way to allow extension to any other language, web framework, or profiling tool.

*as generic as one would expect for v0.0.1.

Closing Thoughts

This work was done for my final-year Master's project, supervised by Dr. Robert Chatley at Imperial College London. This blog post is written as a self-contained demonstration, but also as a way to reach out to developers and gauge interest. Given enough interest, I will continue its development as an open-source alternative to current proprietary APM solutions. Feel free to star the GitHub repository for updates on a potential stable release.

Thanks for reading.

[1] J. Cito. Developer targeted analytics: Supporting software development decisions with runtime information. In 2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 892–895, Sep. 2016.
