Stackdriver: Enabling deep code-level insights

We need to get Nagios up and running. No, we need to get Splunk up and running. I’ve used Zabbix at my previous assignment. Let’s create an ELK stack to perform monitoring.

Sigh. So recognisable, and yet so annoying…


Introduction

As a DevOps team, we are responsible for writing good-quality code and running our test suites, whether those are unit tests, code coverage checks or regression tests. After that, we ship our lovely code into production, ready to be received by our customers.

It all sounds so simple, but between the moment we start developing our code and the moment it lands in production, we come across at least 6 different debugging, logging, monitoring, profiling and communication tools.

Those tools give us the insights we need to keep delivering a good-quality customer journey online.

Ask 6 different DevOps teams to list their favorite tools for debugging code, logging errors, monitoring applications and communicating, and you will probably end up with a selection of 24 products that can all be used to achieve that same goal: to keep delivering a good-quality customer journey online.

Figure 1.1 An example of a ‘few’ tools to choose from in the DevOps field

Monitoring: Cause or Effect?

Now imagine that something really goes wrong in production.

Our monitoring, profiling and logging tools would report red and send out multiple notifications, triggers and pieces of information about something happening in production. But what are we actually looking at? Because now we are playing the game for real, the way you would see it when entering a war room.

Are we looking at the root cause of the incident our customers are facing, or are about to face, or are we simply looking at an effect?

From the moment we are in a crisis situation, we need all the information about the incident at hand. That means we have to log in to all of our different monitoring, profiling and logging tools just to get a clear overview of what happened: which tool flagged the first anomaly, which one was triggered by reaching a threshold, and so on.
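
To make the 'which tool flagged it first' question concrete, here is a minimal Python sketch of what we end up doing by hand in the war room: merging the alerts of several tools into one timeline. The Alert type, the tool names and the sample alerts are all made up for illustration; none of this is a Stackdriver API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Alert:
    """One alert from one tool. All fields are hypothetical."""
    tool: str
    raised_at: datetime
    message: str


def build_timeline(*alert_feeds):
    """Merge the alerts of several tools into one list, oldest first."""
    merged = [alert for feed in alert_feeds for alert in feed]
    return sorted(merged, key=lambda alert: alert.raised_at)


# Made-up sample data: three tools, each with its own view of the incident.
uptime_alerts = [
    Alert("uptime-check", datetime(2018, 1, 1, 10, 4), "checkout page returns HTTP 500"),
]
log_alerts = [
    Alert("log-monitor", datetime(2018, 1, 1, 10, 2), "database connection pool exhausted"),
    Alert("log-monitor", datetime(2018, 1, 1, 10, 5), "error rate above threshold"),
]
metric_alerts = [
    Alert("metrics", datetime(2018, 1, 1, 10, 3), "request latency above threshold"),
]

timeline = build_timeline(uptime_alerts, log_alerts, metric_alerts)
first_alert = timeline[0]  # the earliest signal, our best candidate for the cause

for alert in timeline:
    print(alert.raised_at.isoformat(), alert.tool, alert.message)
```

The earliest alert is not guaranteed to be the root cause, but it is usually the best place to start looking. With every tool keeping its own timeline, even this simple ordering has to be pieced together manually.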

Figure 1.2 Are we looking at Cause or Effect when using multiple tools?

In practice, using all these tools means wasting our valuable time on:

  • Getting the tool experts in place for each tool, to explain to us how it works
  • Notifying the administrators that we are missing credentials or policies
  • Digging the insights we need out of the different tools, to get a clear understanding of what is actually happening
  • Resetting passwords to be able to log in at all

How awesome would it be if we had all the information in just one tool? Naah, not gonna happen, would probably be your first thought.

But what if I told you Google Cloud Platform offers Stackdriver, enabling you to debug, diagnose, log, communicate and fix software problems, even in production? Although it is a Google product, it supports not only Google Cloud Platform but also Amazon Web Services.


Deep code-level insights

With the default Stackdriver profile you already get a lot of measurements and metrics for free, supporting a wide range of web and application servers. When onboarding into Stackdriver you automatically receive insights into your application's uptime, as health monitoring is enabled by default. From that first view you can easily extend the metrics and monitoring configurations you would like to see on the dashboards.

You can also define error thresholds for your application infrastructure with only your mouse, and integrate with tools such as PagerDuty, Slack, SMS notifications, email etc.
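
For readers who have never set one up: an error threshold boils down to a simple rule. The sketch below shows that rule in plain Python. It is not the Stackdriver API; the function names, the 5% default and the channel names are assumptions for illustration only.

```python
def error_rate(errors, requests):
    """Fraction of failed requests in a time window (0.0 when there is no traffic)."""
    return errors / requests if requests else 0.0


def breaches_threshold(errors, requests, max_error_rate=0.05):
    """True when the error rate in the window is above the allowed maximum."""
    return error_rate(errors, requests) > max_error_rate


def channels_to_notify(errors, requests, channels, max_error_rate=0.05):
    """Return the (hypothetical) notification channels that should be alerted."""
    if breaches_threshold(errors, requests, max_error_rate):
        return list(channels)
    return []


# Made-up numbers: 12 failed requests out of 100 in the last window.
to_notify = channels_to_notify(12, 100, ["pager", "chat", "email"])
print(to_notify)
```

In Stackdriver you click this rule together in the console instead of writing it yourself, which is exactly the point.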

When it really comes to problems in production, you can also convert monitoring alerts into tasks or tickets in your ticketing system of choice, whether that is Trello, Jira, Zendesk etc.

The best part is yet to be revealed: live diagnosis in production (with minimal delay) of the actual root cause of the issues your customers are facing. Stackdriver profiling enables you to look at the traces of customers, also revealing the complete stack with the errors in the code.

No key needs to be configured and no agent needs to be installed on the application server to start diving into debugging your code. It is set up completely from within Google Cloud IAM. We will keep you posted with more detailed insights into Stackdriver!


Want to know more?
Visit our website at http://levarne.nl for more information, or contact us!