Telemetry driven releases with Application Insights and VSTS (part 1)
Gathering and comparing telemetry in staging and production environments
With Application Insights you can gather a lot of telemetry on your running application, and with VSTS you can easily set up a pipeline that delivers your changes to production in a fully automated way. What if you combined the two and used the telemetry from Application Insights to verify that your changes do not have unexpected side effects in production before you release them to all of your users?
Recently Marcel de Vries and I looked at what it takes to set up a simplified release pipeline that does this, using VSTS to orchestrate the release, Application Insights to gather telemetry and Azure to host our example application.
Note that we wait only 15 minutes after deploying to staging, which is likely too short for a real production scenario. We also perform only a basic check before we allow our changes to be released to our users. Still, this example gives you a basic pipeline that you can extend for your particular scenario.
We are going to send the telemetry of both our staging and production environments to the same Application Insights instance. We do this so that we can compare the data with charts, queries and alerts using Application Insights Analytics. We will then compare the performance of our home page between staging and production to spot performance degradation.
Setting up Application Insights
First, let's set up Application Insights so that we can separate the data. I'm going to assume you already have Application Insights configured for your application. In the example we used the famous MVCMusicStore, but you can apply these techniques to your own application as well. We chose the MVCMusicStore specifically to prove that these concepts can easily be applied to older technology stacks too. To be able to separate the data from each environment, we are going to “tag” all telemetry sent to Application Insights with a custom “SlotName” property by implementing an ITelemetryInitializer.
Then we add this into our Application_Start method (line 25):
Using Application Insights Analytics
Now that our telemetry is tagged, we can use this custom property in queries to separate requests to “staging” and “production”. First log in to the Azure portal, find the correct Application Insights instance and select Analytics.
You will be redirected to Application Insights Analytics. Here you can create all kinds of queries and charts. For example, this query:
will give you a stacked bar chart with the requests to both staging and production over the past 14 days, summarized per 5 minutes.
Because this is an application we used for testing, the chart looks a bit strange. But you can clearly see that we started a load test around 13:45 and that we diverted some traffic from our production environment to our staging environment, so we can see how it behaves with real traffic.
Another, and in this case more interesting, query you can run is the following:
This query shows the duration of requests to the home page on both staging and production. In our example we added a feature that made loading the home page really slow, and you can see that in this chart starting around 14:15. You can also see that this feature made it into production around 15:15, because the blue dots representing request duration on production move up to the level of the green dots representing request duration on staging.
Using the data that we have, we could have prevented this. We have the data to calculate what the “normal” request duration is on production, and we can compare that to the current request duration on staging after deploying a new feature there. We can then define a rule: if staging gets, for example, more than 10% slower than the average request duration on production over the last 24 hours, we do not want to continue.
We can use a query like this to do just that:
let AvgDuration = toscalar(
    requests
    | where timestamp > ago(24h) and customDimensions.SlotName == "production" and name == "GET Home/Index"
    | summarize percentile(duration, 95));
requests
| where timestamp > ago(5m) and customDimensions.SlotName == "staging" and name == "GET Home/Index" and duration > AvgDuration * 1.10
| summarize count()
First we calculate a baseline request duration on production over the past 24 hours. Although the variable is called AvgDuration, we take the 95th percentile rather than a plain average, to filter out the high outliers and get a more stable value. Then we count the number of requests to staging over the past 5 minutes that are more than 10% slower than that baseline.
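To make the logic of this query easier to follow, here is a minimal Python sketch of the same calculation on in-memory lists. It is not how Application Insights evaluates the query: the function names, the sample durations and the nearest-rank percentile method are all assumptions made for illustration, and the Analytics percentile() function may use a different (approximate) estimate.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (an assumption; Analytics may estimate differently)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Smallest rank such that at least pct percent of the values are at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def count_slow_staging_requests(production_ms, staging_ms, pct=95, tolerance=1.10):
    """Count staging requests slower than `tolerance` times the production baseline."""
    baseline = percentile(production_ms, pct)
    threshold = baseline * tolerance
    return sum(1 for duration in staging_ms if duration > threshold)


# Hypothetical request durations in milliseconds, made up for this example.
production = [100] * 19 + [200]   # past 24 hours on production
staging = [105, 110, 120, 300]    # past 5 minutes on staging

slow = count_slow_staging_requests(production, staging)
print(slow)
```

With this sample data the baseline is 100 ms, so only the 120 ms and 300 ms staging requests count as too slow. A gate built on this idea could, for example, allow the release to continue only while that count stays at zero.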
Note that looking back 24 hours might be too short to get good averages in production. In our test environment we did not have long-term data, so looking back over the past day gave the most usable results for us. In staging we look back 5 minutes because we wanted to be able to experiment and see results quickly. Again, in a production scenario you might want to take more time and look back further to spot anomalies in your staging environment. To summarize: do not just blindly copy this query or others, but spend some time querying Application Insights Analytics to figure out which time frames and percentiles work for you.
You can use the same queries to set up alerts in Azure Monitor by querying the Application Insights logs and setting thresholds.
What worked really well for us was creating the query in Application Insights Analytics first, because it is easy to visualize it in different ways. The visualizations make it a lot easier to interpret the results. It is also easier to modify the query quickly in the Application Insights Analytics environment. Once you have a query you are happy with, you can turn it into an alert or use it in other places, such as the Application Insights REST API, which in turn can be used as a release gate in VSTS.
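As a rough idea of what calling the REST API with such a query could look like, here is a small Python sketch. The endpoint shape, the x-api-key header and the tables/rows response layout reflect my understanding of the public Application Insights REST API and should be checked against its documentation. The application id and API key are placeholders, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Application Insights REST API.
API_BASE = "https://api.applicationinsights.io/v1/apps"


def build_query_request(app_id, api_key, query):
    """Build (but do not send) a GET request that runs an Analytics query."""
    url = f"{API_BASE}/{app_id}/query?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(url, headers={"x-api-key": api_key})


def first_scalar(response_body):
    """Return the first cell of the first row of the first result table."""
    result = json.loads(response_body)
    return result["tables"][0]["rows"][0][0]


def run_query(app_id, api_key, query):
    """Send the query and return its single scalar result (needs network access)."""
    with urllib.request.urlopen(build_query_request(app_id, api_key, query)) as resp:
        return first_scalar(resp.read())
```

Calling run_query with your own application id, an API key and the query above would return the number of slow staging requests, which is the value a release gate could evaluate.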
Part 2 will focus on how to do just that by integrating this query into a release pipeline in VSTS.