Sofia’s Observability Odyssey: The Culture of Observability

7 min read · Nov 27, 2023

Josh: Hello, everyone! This is your host, Josh “Not Null” Macintosh, and welcome to the XYZ Tech Podcast! Tonight, we have something special in store for you all. We’re thrilled to have not just one, but two special guests to enlighten us about observability. Please give a warm welcome to Sofia Wang and Lauren Johanssen! Welcome, both of you!

Sofia: Thank you for having us… I mean, me! Haha!

Lauren: Hey, everyone. Thank you for having us!

Josh: The pleasure is all mine, folks! So, could you please introduce yourselves to our audience?

Lauren: Absolutely! I’m Lauren Johanssen, an SRE with four years of experience. I transitioned from a development path in Java and have been focusing on observability as an SRE for some time now. It’s been a fun journey!

Sofia: My name is Sofia Wang. I’ve been working with infrastructure for over 5 years now, and I’ve been on the SRE path for 2-ish years, working with a fantastic observability team alongside my dear friend here.

Josh: And here we are, folks! Tonight’s all about observability with Sofia and Lauren! I’m pumped, and I hope you are too!

Lauren & Sofia: Absolutely! _Laughs._

What is observability?

Josh: So, could you guys define observability for our audience?

Lauren: Indeed, observability means understanding a system’s internal state by observing its external outputs. Essentially, collecting surface-level data allows you to comprehend what’s happening inside a system.

Josh: Sofia, could you elaborate on that?

Sofia: Sure! Picture observability like understanding a party from outside the room. You hear music, see people coming and going, notice changing lights, catch bits of conversations through windows. Despite not being inside, you get a good idea of what’s happening inside. Similarly, in tech, observability is about understanding a system’s internal workings by observing external signals — like error messages, performance metrics — without directly accessing its internals.

Lauren: She’s the analogy queen, no doubt _laughs_. Observability isn’t simply about installing a tool to monitor a service or two. It involves monitoring every data source and mapping out everything, from serverless functions to a vast legacy database. This approach allows you to understand how all components interrelate. For instance, you can anticipate how the database might fail following a substantial traffic spike and identify the cause far more quickly than was previously possible.

How did it all start?

Josh: So, how about you guys tell us about your first experiences with observability? What was your first step into this world?

Lauren: Well, my story with observability isn’t that fun _chuckles_. I had a task to fix a problematic Elasticsearch installation, which led me into the world of infrastructure. From there, I delved into observability — collecting metrics, writing alerts, creating dashboards — and the rest is history.

Sofia: For me, working with infrastructure was a long-term gig, and observability became the path I wanted to follow. Some friends pursued security, management, etc., but for me, observability was the choice.

Real Life Scenario

Josh: That was enlightening. Now, could you guys share a real-life scenario where you had to handle observability?

Sofia: Of course. Recently, we were tasked with assisting a team that operates a crucial service. The issue lay in their entire observability setup, or lack thereof — you could barely call it a stack, just a single logging service.

Josh: Oof! Oh boy, that sounds troublesome.

Lauren: Oh, it definitely was.

Sofia: We spent about a month helping them with various tasks, such as instrumenting their Java application, collecting metrics, creating dashboards and alerts, and revamping their logging system to focus solely on necessary information. Now they’re on track, with no more war rooms for a while _chuckles_.
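As a rough illustration of what "instrumenting" an application and "collecting metrics" can look like, here is a minimal sketch. It is written in Python rather than the Java the team actually used, and everything in it is an assumption made for the example: the `instrumented` decorator, the in-memory `REQUEST_COUNT` and `REQUEST_SECONDS` stores, and the `checkout` function are hypothetical. A real setup would hand these numbers to a metrics library rather than keep them in dictionaries.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory metric stores, used only for this sketch.
REQUEST_COUNT = defaultdict(int)      # (operation, outcome) -> number of calls
REQUEST_SECONDS = defaultdict(float)  # operation -> total time spent, in seconds


def instrumented(operation):
    """Count calls by outcome and accumulate duration for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            outcome = "error"  # assume failure until the call returns normally
            try:
                result = func(*args, **kwargs)
                outcome = "ok"
                return result
            finally:
                # Runs on both success and failure, so every call is recorded.
                REQUEST_COUNT[(operation, outcome)] += 1
                REQUEST_SECONDS[operation] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total):
    # Hypothetical business logic: reject empty carts, add 10% on top.
    if cart_total <= 0:
        raise ValueError("empty cart")
    return round(cart_total * 1.1, 2)
```

Calling `checkout` a few times and then reading `REQUEST_COUNT` gives the success and error counts that a dashboard or alert could be built on.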

Lauren: And that was just the beginning. We have more discussions ahead about tracing and the concept of OpenTelemetry. It’s going to be an interesting case to discuss once we’re done with it.

Josh: OTel? Really? That sounds cool! Also, let me send a quick shoutout to the OpenTelemetry community out there: you guys are doing wonders _chef’s kiss_.

Sofia: A collector to rule them all _chuckles_.

APM Services

Josh: Indeed. Now, talking about some other tools, what’s your take on APM services like Datadog, New Relic, Dynatrace, etc.? Are they any good? Expensive? Necessary? Can you do something like that using open source? I once had a coworker who would say out loud, on a daily basis, “Why APM? Let’s build our own APM,” and people just laughed, like, “Yeah mate, let’s do it with hookers and blackjack.”

Lauren: _Laughs._ Well, it depends.

Josh: That’s what a senior engineer would say, right? Haha.

Lauren: I guess so _chuckles_. I share Sofia’s perspective on this. In my view, if you’re running a startup that is small, with few engineers and developers, but growing rapidly, it’s money well spent. From day one, you might not have the right number of people to handle observability, unless you’re an exceptionally seasoned CTO, which most aren’t. By the time you gather enough personnel to address this, it might be too late: not impossible, but significantly more challenging. So, if you have the funds, it’s worth investing. Nowadays, APM services handle everything: metrics, tracing, logging, and the APM itself, all in one package. Setting up alerts becomes effortless, with painless instrumentation. As your company expands and hires more staff, you can then instill a culture of observability and consider open-source options to reduce costs along the way.

Josh: Awesome. I have some experience with the purple dog one, and it is fantastic indeed: expensive, but really helpful.

Sofia: Mhm. I have a lot of experience with New Relic; it is an awesome “observability suite” indeed.

Q&A

Josh: Girls, now it’s time for some Q&A! Let me fetch a few questions from our adorable chat. Please say hi to our chat!

Lauren & Sofia: Hi chaaaat haha!

Josh: Happy now, chat? Now let me see, got a good one here from SuperUser321 asking “What should I do to apply some observability on a super distributed system?”

Sofia: I’d prioritize an end-to-end monitoring approach. It involves collecting data from every element within the structure, spanning from infrastructure to serverless components. The key is integrating and synthesizing this data to create a comprehensive end-to-end monitoring system. Sometimes I talk like a robot, so don’t mind me _chuckles_.
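One way to picture the "integrating and synthesizing" step is to join signals from different components on a shared request identifier. The sketch below assumes, purely for illustration, that each component emits events carrying a `trace_id`, the time spent inside that component in `ms`, and an `error` flag. The event shape, the source names, and the sample values are all made up.

```python
from collections import defaultdict

# Hypothetical events as they might arrive from three separate sources.
# "ms" is assumed to be the time spent inside that component only.
events = [
    {"source": "gateway", "trace_id": "t1", "ms": 20, "error": False},
    {"source": "function", "trace_id": "t1", "ms": 80, "error": False},
    {"source": "database", "trace_id": "t1", "ms": 60, "error": False},
    {"source": "gateway", "trace_id": "t2", "ms": 30, "error": True},
    {"source": "database", "trace_id": "t2", "ms": 850, "error": True},
]


def correlate(all_events):
    """Group events by trace_id so one request can be followed end to end."""
    by_trace = defaultdict(list)
    for event in all_events:
        by_trace[event["trace_id"]].append(event)
    return dict(by_trace)


def summarize(trace_events):
    """Reduce one request's events to the facts an on-call engineer asks first."""
    return {
        "sources": [e["source"] for e in trace_events],
        "slowest": max(trace_events, key=lambda e: e["ms"])["source"],
        "failed": any(e["error"] for e in trace_events),
    }


traces = correlate(events)
for trace_id, trace_events in traces.items():
    print(trace_id, summarize(trace_events))
```

With the sample data, the failed request `t2` points at the database as the slowest hop, which is the kind of cross-component answer a single logging service cannot give.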

Josh: That was kinda creepy indeed, haha. I’m taking some notes here for myself. Now, one from Calico1988, asking: “Can I migrate my Zabbix alerts to Prometheus?”

Lauren: Well, it’s better to “migrate” the logic, not the alerts. Just recreate the same logic in Prometheus and you’ll be good.
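To show what recreating the logic might mean in practice, here is a minimal sketch that starts from a plain-language description of a check ("load average above 5 for 10 minutes") and builds a dictionary in the shape of a Prometheus alerting rule, using the rule fields `alert`, `expr`, `for`, `labels` and `annotations`. The check itself, the `threshold_rule` helper, the alert name, and the group name are hypothetical. The metric `node_load1` assumes the Prometheus node exporter is in use. In a real setup the result would be written out as a YAML rule file.

```python
def threshold_rule(name, metric, threshold, minutes, severity):
    """Build one alerting rule as a dict shaped like a Prometheus rule entry."""
    return {
        "alert": name,
        "expr": f"{metric} > {threshold}",   # PromQL condition that must hold
        "for": f"{minutes}m",                # how long it must hold before firing
        "labels": {"severity": severity},
        "annotations": {
            "summary": f"{metric} above {threshold} for {minutes} minutes",
        },
    }


# Hypothetical legacy check: "load average above 5 for 10 minutes".
rule = threshold_rule("HighLoad", "node_load1", 5, 10, "warning")

# Rules live inside named groups in a rule file; the group name is made up.
rule_file = {"groups": [{"name": "migrated-from-legacy", "rules": [rule]}]}
```

The point of the helper is that only the intent of the old check (metric, threshold, duration, severity) is carried over, not its original syntax.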

Josh: Unless they got like, 10k Zabbix alerts to move to Prometheus lmao.

Lauren: I hope that’s not the case _laughs_.

Josh: I want to see the answer for this one, JackieWelles1997 asks “Let’s say I’m gonna try to bring observability culture to the company I’m working for, but they won’t listen and follow the best practices and implementations, what else can I do?”

Sofia: Great question, Jackie! So, when it comes to building an observability culture, one big thing you need is support from management. I mean, everyone from tech leads to the big bosses has to get on board with how important observability is for our work environment. It’s all about gathering data and putting together a report that shows just how much better things could be with observability. You know, like spotting outages super fast, catching those weird glitches — stuff like that. Numbers speak volumes, so showing them some hard data can really drive the point home.
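As an example of the kind of hard number such a report could contain, the sketch below computes a mean time to detect from a list of incidents. The incident records are invented for illustration only, and the helper names `minutes_to_detect` and `mean_time_to_detect` are hypothetical. In practice the timestamps would come from your own incident history.

```python
from datetime import datetime
from statistics import mean

# Made-up incident records for illustration only: (started, detected).
incidents = [
    ("2023-10-02 09:00", "2023-10-02 09:40"),
    ("2023-10-15 14:10", "2023-10-15 14:30"),
    ("2023-11-03 22:05", "2023-11-03 23:05"),
]


def minutes_to_detect(started, detected):
    """Minutes between the start of an incident and its detection."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(detected, fmt) - datetime.strptime(started, fmt)
    return delta.total_seconds() / 60


def mean_time_to_detect(records):
    """Average detection delay, in minutes, across all incident records."""
    return mean(minutes_to_detect(s, d) for s, d in records)


print(f"Mean time to detect: {mean_time_to_detect(incidents):.0f} minutes")
```

Tracking the same figure before and after an observability rollout gives management a before-and-after comparison instead of an opinion.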

Josh: Moving on, let me see, here is a good one from member dARKMerlin88 asking “Nice talk guys! So you’re telling us that observability has the same weight as security for example when it comes to software development and systems in general?”

Lauren: Yeah, totally! You know things are going well when the whole tech team at your company prioritizes observability right from the start of any new project — that’s seriously cool. That’s when you can see the culture making a real impact. I mean, come on, it’s 2023! If you stumble upon a developer creating a basic API that’s gonna handle tons of traffic without any visibility at all, that’s just not okay anymore…

Sofia: Totally agree.

Josh: Well, that’s a wrap for tonight’s show! I had a lot of fun, what about you girls?

Lauren: It was fun!

Sofia: Indeed!

Josh: Haha, that’s the spirit! What about you, chat? I see, I see, haha. But before I let you guys go, let me talk a little about our dear sponsor, VampireShark VPN! With VampireShark VPN, you’re totally safe and sound to surf the net! Use the code XYZPODCAST and get a limited 15% discount on your first 3 months of subscription!

Josh: Now, a final question for you both. What’s the future for observability?

Lauren: Oh, that’s a cool question! Lately, I’ve been checking out some new community projects like OTel, and you know what? Prometheus and VictoriaMetrics are stepping up their game too. And hey, can’t forget Grafana! I’ve been leaning more towards open source lately. But, I gotta say, APM services are sprinting ahead with anomaly detection and AI features. Is the future looking bright? I’m hoping for it, haha!

Sofia: I’m crossing my fingers for more tools in our toolbox, simple as that _chuckles_. Let’s keep spreading the observability gospel so folks catch on to its importance sooner rather than later.

Josh: That’s it people! I am Josh “Not Null” Macintosh and this is your XYZ Tech Podcast! Have a good night!

Lauren & Sofia: Bye byyyye!

Sources:

https://www.splunk.com/en_us/blog/learn/observability-culture.html


Written by Adso