Towards Observability-Driven Development…

Sumit M
Published in mStakx
Jul 31, 2018 · 5 min read

It was 6:30 in the evening, and I was about to wrap up a pretty normal day at work. I closed my laptop, packed my backpack, and on the way home started browsing Twitter. That is when I was hit by a wonderful surprise: a tweet by Jason Dixon highlighting that Monitorama now acknowledges ‘Observability’, having updated its tagline from ‘An Open Source Monitoring Conference & Hackathon’ to ‘An Inclusive Event for Monitoring and Observability Practitioners’.

Before | After taglines of Monitorama

This development seems congruent with the strategy updates at almost every tech company that builds products around IaC. ‘Observability teams’ are now as legit as ‘DevOps teams’ in tech organizations…and they’re here to stay (at least until a better philosophy hits the automation tech world).

Observability catching up!

In my earlier post, I had mentioned how the philosophy of the Observability stack emphasises the contribution of development teams, in addition to the DevOps/SysAdmin teams, in building effectively ‘Monitor’able systems. For the better part of this decade we have relied on Operations teams to pitch in with development efforts to make systems more ‘Monitor’able. However, with AI and Data Science piercing through every aspect of software development with their worthwhile insights, it is only natural for developers to own the task of making systems better at being context-aware, self-healing, and hence intelligent.

Here is a recent firsthand experience of how implementing an Observability stack can make developers more productive and effective. It was mid-July, and a friend of mine buzzed me about an issue they were facing at their start-up. This friend had founded a FinTech start-up a few months ago and had an energetic set of developers building software on the Python/Django platform. The application had a couple of Selenium scripts running to automate some of their business logic.

Python Django Application with Selenium Automation
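To give a flavour of the setup, the Selenium scripts were essentially browser-driven workflows: load a page from the Django application, spend a few seconds on business logic, then move on to the next page. The sketch below is purely illustrative; the URLs and the pause are placeholders, not the team’s actual code:

# selenium_flow.py — illustrative sketch of a browser-driven business workflow
import time
from selenium import webdriver

driver = webdriver.Chrome()
try:
    # Hypothetical pages; the real scripts drove the team's own application
    driver.get("https://app.example.com/transactions/new")
    time.sleep(5)  # business logic between page loads while connections stay open
    driver.get("https://app.example.com/transactions/confirm")
finally:
    driver.quit()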

The problem the team presented to me was this: when they tested everything on their local setup, everything worked fine. However, the moment they deployed the application on AWS, the Selenium scripts would fail, resulting in failure of the business logic.

This was a good opportunity to take our boilerplate #1 (link) for a spin. I was excited to see if our Observability stack would catch the issue at hand!

Boilerplate #1 on AWS EC2

We instrumented the code and deployed it on AWS EC2. Once all the Docker containers were up and running, we checked the Kibana dashboard, and voila! There it was, clear as day: we could actually observe the failure logs and trace them from Application -> nginx -> gunicorn.
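For context, the instrumentation itself can start out very simple: have the Django application emit structured (JSON) logs so that a log shipper (Filebeat or Logstash, for example) can forward them to Elasticsearch and Kibana. Here is a minimal sketch using only the Python standard library; the field names are illustrative:

# logging_setup.py — minimal structured-logging sketch (standard library only)
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Emit one JSON object per log line so a shipper can parse it
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)

logging.getLogger("myproject").info("request served")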

Gunicorn (Green Unicorn) is a Python WSGI HTTP Server for UNIX. The application was using gunicorn to serve the requests coming in to the server. A typical invocation of gunicorn is as simple as:

gunicorn --bind 0.0.0.0:8000 myproject.wsgi:application

With such an invocation, the keep-alive value (the number of seconds to wait for requests on a Keep-Alive connection) defaults to 2 seconds. However, since the Selenium scripts made certain assumptions, the keep-alive value for this particular application needed to be higher. We experimented and found that 11+ seconds allowed the Selenium scripts to execute successfully. Hence, the new invocation became:

gunicorn --bind 0.0.0.0:8000 --keep-alive 11 myproject.wsgi:application
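If you would rather keep such settings in version control than on the command line, the same values can live in a gunicorn configuration file. A minimal sketch (the file name below is the conventional one):

# gunicorn.conf.py — the same settings, kept alongside the application code
bind = "0.0.0.0:8000"
keepalive = 11  # seconds to wait for requests on a Keep-Alive connection

gunicorn would then be started with gunicorn -c gunicorn.conf.py myproject.wsgi:application.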

Until this point, the developers had never checked the gunicorn config. The furthest they had debugged was the application logs and the nginx logs. Once we knew that the first failure happened at the gunicorn level, we found that gunicorn’s default keep-alive was too small for the Selenium scripts to complete their execution; the server would close the connection, and the subsequent requests would fail with error 502.

Since there was no monitoring/Observability framework in place, everything was happening manually in their dev-world! Before trying out our boilerplate #1, the development team had spent quite a lot of time (days!) debugging the issue and were unable to figure it out. However, with proper Instrumentation + Stack + Visualization in place, the issue was detected and rectified in a matter of minutes. Moreover, we also found out that the local dev environment and the production/test environment were not identical. Thus, we found our first real-world use-case for our boilerplate!

We’re now experimenting further with how we could remedy this class of issue by turning the system into a self-healing one. This would involve setting up a rule-based approach to increase/decrease the server timeout/keep-alive values (gunicorn/nginx etc.) based on request type, in order to avoid unnecessary 502s and other server failures!
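As a rough illustration of the idea (everything here is hypothetical: the metric source, the thresholds and the file paths are placeholders, not our actual implementation), a rule-based adjuster could periodically look at an idle-time/error signal, rewrite the gunicorn config, and ask the gunicorn master to reload it:

# keepalive_tuner.py — hypothetical rule-based adjuster, not a production implementation
import os
import signal

def choose_keepalive(observed_idle_seconds: float) -> int:
    # Rule: keep-alive should comfortably exceed the idle gap the clients need,
    # while staying within sane bounds.
    return max(2, min(30, int(observed_idle_seconds) + 3))

def apply_keepalive(seconds: int, conf_path: str, pidfile: str) -> None:
    # Rewrite the gunicorn config with the new value...
    with open(conf_path, "w") as conf:
        conf.write('bind = "0.0.0.0:8000"\n')
        conf.write(f"keepalive = {seconds}\n")
    # ...and ask the gunicorn master to reload it (SIGHUP reloads the config
    # and gracefully restarts the workers).
    with open(pidfile) as f:
        master_pid = int(f.read().strip())
    os.kill(master_pid, signal.SIGHUP)

if __name__ == "__main__":
    # The idle-time signal would come from the Observability stack;
    # the value and paths below are placeholders.
    apply_keepalive(choose_keepalive(8.0), "gunicorn.conf.py", "/run/gunicorn.pid")

In practice we would start by only alerting on such a rule, and automate the config change once we trust it.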

Looking back at this experience, I wonder: if Observability can impact a small group of developers’ lives in such a positive manner, how valuable a philosophy would it be for tech giants! With technology getting more complex with each passing day, it is imperative to inculcate an Observability-centric culture in engineering teams.

In upcoming blogs, the intention is to go deep into every aspect of Observability in order to promote the adoption of Observability-Driven Development practices.

Keep reading, keep exploring, keep sharing… :)
