I attended Charity Majors's keynote at Serverless Computing London, where she highlighted observability-driven development as a way to create highly available and resilient systems. Her speech inspired me to write about what observability-driven development is and what it means for serverless.
What is Observability-Driven Development?
Observability-driven development (ODD) does not replace test-driven development. It adds another layer on top. Some developers believe that writing lots of tests is the best way to prevent application failures. Tests are great for verification, but they only cover the failures a developer can anticipate. They do nothing about unknown-unknowns: failures that developers are not aware are happening and cannot predict. Developers who rely on tests alone neglect to implement observability, which can help detect failures of both kinds, the predictable and the unpredictable.
Today, developers are expected to take full responsibility for their code: not just for the design of the code itself, but also for making sure it runs well in production and works well as one piece of a complex system. Therefore, developers cannot stop at writing tests that confirm their portion of the code executes as planned. They have to keep in mind that the code may run under influences and circumstances they weren’t able to predict. Building in observability, or taking an ODD approach, produces code that can be released into unknown, unpredictable production conditions and then evolved to perform reliably even after the developer has finished writing the core logic.
What is the value of ODD in serverless applications?
Employing ODD while developing serverless applications is especially important because of the highly distributed nature of serverless architectures. A distributed system has many more moving parts than a traditional one, so the chance that some part of it fails is much higher.
Because of this, as developers, we should make observability a top concern when developing serverless applications. This is especially true for anyone who wants to create highly available and resilient systems.
Your System, Your (Monitoring) Choice
At Thundra, our mission is to help bring observability into the serverless world and make it easier for you to adopt serverless applications into your systems. But we do not make assumptions about which metrics or data are the most valuable for you to monitor. That decision is yours, and Thundra is designed so you can choose according to what your application (or business) needs.
We provide all the tools to make it easy to add observability to your system, but after that it’s important for you to determine:
- What metrics and information are most valuable for you to observe
- How that information needs to be presented so you can ask questions, identify problems, and take action
Let me explain with a couple of examples.
Thundra provides automated instrumentation, which allows you to see an AWS Lambda function’s execution flow end-to-end. However, Thundra cannot determine all the metrics you need to understand the behavior of your system. By default, AWS and Thundra collect some metrics. However, we recommend that you also create custom metrics to enrich the default data that Thundra collects. For instance, if you have a Lambda function that charges money to your customers, you might want to include the transactionId of that payment in your span. Later, if a customer asks why they were charged twice for the same item, you can use Thundra to quickly discover which transaction was affected by which failed Lambda invocation.
Let’s talk about another example. Thundra gives you the option to perform asynchronous monitoring so you do not add any overhead to your system. With this approach, your system sends all the data to AWS CloudWatch Logs. Then, we grab that raw monitoring data and convert it into valuable information by displaying it in our visualizations and giving you the ability to query the data to get more precise answers.
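To make the payment example concrete, here is a minimal sketch of attaching a transaction ID to a span. The `Span` class, the `RECORDED_SPANS` list, the `amountCents` field, and the event shape are all hypothetical stand-ins written for this illustration; they are not Thundra's API, so check the Thundra documentation for the real calls.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """Hypothetical stand-in for a tracing span (not Thundra's actual API)."""
    name: str
    tags: Dict[str, Any] = field(default_factory=dict)
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    error: Optional[str] = None

    def set_tag(self, key: str, value: Any) -> None:
        self.tags[key] = value

    def finish(self) -> None:
        self.end = time.time()


# Hypothetical in-memory sink; a real tracer would ship spans to a backend.
RECORDED_SPANS: List[Span] = []


def charge_customer(transaction_id: str, amount_cents: int, span: Span) -> None:
    # Attach the business context first, so it is recorded even if the charge fails.
    span.set_tag("transactionId", transaction_id)
    span.set_tag("amountCents", amount_cents)
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    # ... the call to the payment provider would go here ...


def handler(event, context=None):
    span = Span(name="charge_customer")
    try:
        charge_customer(event["transactionId"], event["amountCents"], span)
        return {"status": "charged"}
    except Exception as exc:
        # Record the failure on the span so it can be tied to the transaction later.
        span.error = repr(exc)
        return {"status": "failed"}
    finally:
        span.finish()
        RECORDED_SPANS.append(span)
```

Because the tag is set before the charge is attempted, a failed invocation still carries the transaction ID, which is what makes the double-charge question answerable later.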
However, sending a log with a timestamp, log level and a log message is probably not going to help you quickly answer tough questions about your system behavior. We recommend using a structured logging approach in order to extract quality information from your logs and to make it easily “queryable.” Think first about the questions you might want to answer about your system behavior and then what logs you need to send in order to logically answer those questions.
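As a rough illustration of structured logging, the sketch below writes each log entry as one JSON object per line and then filters entries by field. The helper names (`log_event`, `find_events`) and the field names (`transactionId`, `durationMs`) are assumptions made up for this example, not part of any Thundra or AWS API.

```python
import json
import sys
import time


def log_event(level, message, stream=sys.stdout, **fields):
    """Write one JSON object per line so every field can be queried later."""
    record = {"timestamp": time.time(), "level": level, "message": message}
    record.update(fields)  # extra context, e.g. transactionId or durationMs
    line = json.dumps(record, sort_keys=True)
    stream.write(line + "\n")
    return line


def find_events(lines, **criteria):
    """Return the parsed records whose fields match every given criterion."""
    matches = []
    for line in lines:
        record = json.loads(line)
        if all(record.get(key) == value for key, value in criteria.items()):
            matches.append(record)
    return matches
```

A call such as `log_event("ERROR", "charge failed", transactionId="tx-2", durationMs=3000)` produces a line that can later be found with `find_events(lines, transactionId="tx-2")`, instead of searching free-form message text.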
Learn, Practice, Observe, and Repeat
Building a robust observable system isn’t something that happens overnight or on your first try. However, it’s certainly worth always keeping in mind as you develop your applications. Educate yourself on how to develop observable systems.
Take your best shot at adding observability to your application during the development phase. When you test and deploy it, see what information is helpful and what you could instrument differently to get your answers more quickly. Then, go back and adjust your approach. Learning how to create an observable system is like training a muscle. It will become easier to implement as you continue to exercise it. And, in the end, you’ll be a hot shot at developing highly reliable, resilient systems.
When you create an amazing observable system, you can rest assured that if anything goes wrong you’ll immediately know about it and know exactly how to fix it. Zero guessing involved! Jerks or “innocent helpers” can always go in and break stuff, but at least it’ll never be a mystery what’s breaking and where to go to get it resolved.
Because they are highly distributed and often part of complex systems, applications built on serverless architectures should not be left alone in the dark. Applying observability is a necessity for proper serverless development.
Building an observable system begins during the design and development phase. Observability technology like Thundra only goes so far; the developer must also decide what metrics are most important to collect and how the information needs to be presented. Practicing and establishing great observability development skills is essential to building highly available and resilient systems.