What do logging, monitoring, and observability mean in the world of DevOps?
In this blog post, we will learn the essential concepts of Logging, Monitoring, and Observability in terms of software development and delivery.
Unless you’ve been living under a rock, you have probably heard your DevOps/SRE teams use these words constantly. If your organization is lean and does not have a dedicated team, a software architect or senior developer has likely briefed you on them instead.
Let us define these principles/practices in simple, easy-to-understand, and all-inclusive language.
As a saying often attributed to Socrates goes,
The beginning of wisdom is the definition of terms.
Logging: the act of collecting, storing, and analyzing log data generated by software systems and applications. Logs provide valuable insight into the performance, reliability, and security of these systems and help teams identify and resolve issues quickly.
This information can be used to diagnose and troubleshoot problems and understand how a system is being used. Common types of log data include system logs, application logs, and access logs.
The primary purpose of logging is to record what happened in the system and at what timestamp.
Good logging makes auditing and troubleshooting faster, which in turn improves the efficiency of team workflows and the quality of the software.
Logging tools also typically provide features such as:
- Real-time search and analysis
- Integration with other DevOps tools and platforms
- The ability to set up alerts and notifications based on specific log data patterns.
Some popular logging tools are Graylog, Sentry, Splunk, and the ELK stack (Elasticsearch, Logstash, and Kibana).
To draw a real-life parallel, think of a security guard or helpdesk registration desk that notes basic details about every visitor, such as
- Entry time
- Exit time and
- Purpose of visit and location.
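The visitor register above maps naturally onto timestamped log records. Here is a minimal sketch using Python's standard `logging` module; the logger name `visitor_register`, the helper `build_visitor_logger`, and the visitor fields are hypothetical, chosen only to mirror the analogy.

```python
import io
import logging


def build_visitor_logger(stream):
    """Return a logger that writes timestamped records to `stream`."""
    logger = logging.getLogger("visitor_register")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream)
    # Every record carries a timestamp, a severity level, and a message.
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


# An in-memory buffer stands in for a log file or log shipper.
buffer = io.StringIO()
log = build_visitor_logger(buffer)

log.info("entry visitor=%s purpose=%s location=%s", "alice", "interview", "floor-2")
log.info("exit visitor=%s", "alice")

log_lines = buffer.getvalue().splitlines()
print("\n".join(log_lines))
```

Each printed line starts with the time the event was recorded, which is exactly the "what happened and when" that logging is meant to capture. In a real system the stream would usually be a file or a forwarder to a central log store rather than an in-memory buffer.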
Monitoring: actively observing and analyzing the performance and behavior of systems and applications in real time. This allows teams to proactively identify and address issues before they become critical.
Common types of metrics to monitor include
- System resource usage
- Application performance
- Error rates.
In practice, this means setting up monitoring tools and dashboards that provide real-time visibility into the metrics mentioned above.
Examples of monitoring tools include New Relic, Datadog, and Prometheus.
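To make the idea of watching an error-rate metric concrete, here is a minimal sketch of a sliding-window monitor. The class name `ErrorRateMonitor`, the window size, and the alert threshold are all assumptions for illustration; the tools listed above have their own, far richer, mechanisms.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the error rate over the most recent `window` requests."""

    def __init__(self, window=100, threshold=0.05):
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=window)  # True = error, False = success
        self.threshold = threshold

    def record(self, is_error):
        self.outcomes.append(bool(is_error))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        # Alert only when the recent error rate exceeds the threshold.
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=10, threshold=0.2)
for status in [200, 200, 500, 200, 503, 500, 200, 200, 200, 200]:
    monitor.record(status >= 500)  # treat HTTP 5xx responses as errors

print(monitor.error_rate(), monitor.should_alert())
```

Three of the ten recorded requests are server errors, so the rate is above the 0.2 threshold and the monitor would raise an alert. Because the window only keeps recent requests, the alert clears on its own once enough successful requests push the old errors out.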
Note: Logging and Monitoring are closely related, and DevOps professionals sometimes use the terms interchangeably. To tell them apart quickly, remember the following.
Logging: historical analysis, plus auditing and record keeping for compliance reasons, whereas
Monitoring: real-time visibility and proactive alerting to catch issues before they lead to outages.
Continuing the earlier security guard/helpdesk example, monitoring can be thought of as 24/7 closed-circuit television (CCTV) coverage of the premises.
Note that entries in a register or diary give extremely limited visibility, whereas CCTV can capture much richer details, such as
- The appearance of a subject,
- Clothes/accessories that they are wearing,
- Their walk/gait and other mannerisms
Such features are not readily accessible or obvious in the case of logging, but they can be analyzed with monitoring.
Observability: the ability to understand the state of a system and its components using various techniques and tools. This includes monitoring, logging, tracing, profiling, and testing techniques.
In simpler terms,
Observability = Logging + Monitoring + Tracing + Profiling, i.e., a superset of all the above concepts.
It can also be thought of as a holistic, bird’s-eye view of the system from 10,000 feet.
Observability is critical in modern, complex environments that rely on microservices, containers, and other distributed technologies. It allows teams to gain visibility into the interactions and dependencies between different components and systems.
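Tracing is the piece that reveals those interactions between components. As a rough sketch of the idea, and not of any particular tracing product, the example below tags every step of a request with a shared trace ID using Python's standard `contextvars` module. The service functions (`checkout`, `inventory_service`, `payment_service`) and the in-memory `spans` list are hypothetical stand-ins for real services and a real tracing backend.

```python
import contextvars
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

spans = []  # in-memory stand-in for a tracing backend


def record_span(service, operation):
    """Record one unit of work, tagged with the current trace ID."""
    spans.append(
        {"trace_id": trace_id_var.get(), "service": service, "operation": operation}
    )


def inventory_service(item):
    record_span("inventory", f"reserve {item}")


def payment_service(amount):
    record_span("payment", f"charge {amount}")


def checkout(item, amount):
    # Start a new trace for this request; downstream calls inherit the ID.
    token = trace_id_var.set(uuid.uuid4().hex)
    try:
        record_span("checkout", "start")
        inventory_service(item)
        payment_service(amount)
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request


checkout("book", 12)
checkout("pen", 2)

for span in spans:
    print(span["trace_id"][:8], span["service"], span["operation"])
```

All three spans from one `checkout` call share a trace ID, so filtering by that ID reconstructs the path a single request took through the system. In a distributed setup the ID would travel between services in request headers rather than in a process-local variable.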
“Observability is the foundation for building reliable, scalable software. Without it, you’re flying blind and hoping for the best.” ~ Torkel Ödegaard
To understand Observability with an example, let’s look at how military snipers operate.
Many people assume that snipers are lone wolves who operate in isolation, but that is not the case. A sniper usually works with a spotter, a support operative on the ground who actively observes the target and relays information such as its location and surrounding movements. If the spotter decides the target is a threat, they convey this to their partner, the sniper lying low a few kilometers away.
Based on this information, the sniper makes the final decision on whether to neutralize the threat. Observability plays the spotter's role for an engineering team: it supplies the context needed to make a sound decision about the system.
Now that we have defined these terms, let’s look at some challenges involved in traditional Logging, Monitoring, and Observability systems:
- Complexity — Traditional approaches to logging and monitoring can be complex and resource-intensive, requiring the use of multiple tools and technologies to collect and analyze data from different systems and applications. This can make it difficult for teams to gain a comprehensive view of their systems and can lead to inefficiencies in their workflows and processes.
- Limited visibility — Traditional approaches to logging and monitoring often provide limited visibility into the behavior and performance of systems and applications. This can make it difficult for teams to find and resolve issues quickly and can lead to longer mean time to resolution (MTTR) for issues.
- Silos — Traditional approaches to logging and monitoring can create silos between development and operations teams, as these teams may use different tools and technologies to collect and analyze data. This can make it difficult for teams to collaborate and can hinder the flow of information between teams.
These challenges can be mitigated by the following practices:
- Centralized logging — involves collecting log data from various sources and storing it in a central location for easier analysis and visualization.
- Continuous monitoring — involves using automated tools and processes to continuously monitor the performance and availability of applications and infrastructure.
- Performance monitoring — involves monitoring key performance indicators (KPIs), such as response times, CPU utilization, and memory usage, to ensure that applications and systems are meeting their performance targets.
- Event correlation — involves analyzing log data and other monitoring data to identify patterns and relationships between different events and issues. This can help DevOps teams identify the root cause of problems and take more targeted actions to fix them.
- Automated incident response — involves setting up automated processes to respond to specific events or issues identified through monitoring and logging. This can include automatically restarting failed services, scaling up infrastructure to meet increased demand, or sending notifications to the appropriate team members.
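The last two practices can be sketched together. The example below assumes events have already been centralized and carry a shared `request_id`; it groups them per request, treats the earliest error as the probable root cause (a simplifying assumption, not a general rule), and looks up a response for it. The event fields, the `RESPONSES` table, and the action labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, already-centralized events from several services.
events = [
    {"ts": 3, "service": "api", "request_id": "r1", "level": "ERROR", "msg": "upstream timeout"},
    {"ts": 1, "service": "db", "request_id": "r1", "level": "ERROR", "msg": "connection pool exhausted"},
    {"ts": 2, "service": "cache", "request_id": "r1", "level": "INFO", "msg": "miss"},
    {"ts": 4, "service": "api", "request_id": "r2", "level": "INFO", "msg": "ok"},
]

# Hypothetical response rules: failing service -> action label.
RESPONSES = {"db": "restart-db-pool", "api": "notify-on-call"}


def correlate(events):
    """Group events by request ID and order each group by timestamp."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["request_id"]].append(event)
    return {rid: sorted(evts, key=lambda e: e["ts"]) for rid, evts in grouped.items()}


def first_error(timeline):
    """Return the earliest ERROR event in a timeline, or None."""
    return next((e for e in timeline if e["level"] == "ERROR"), None)


def plan_responses(events):
    """Map each failing request to the action for its earliest error."""
    actions = {}
    for rid, timeline in correlate(events).items():
        err = first_error(timeline)
        if err is not None:
            # Fall back to paging a human when no rule matches.
            actions[rid] = RESPONSES.get(err["service"], "notify-on-call")
    return actions


actions = plan_responses(events)
print(actions)
```

For request `r1`, the database error precedes the API timeout, so the planned action targets the database rather than the API that surfaced the symptom. Request `r2` has no errors and gets no action. A real setup would execute the action through an orchestration or paging tool instead of returning a label.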
I hope that after reading this post, you now understand the nuances and importance of Logging, Monitoring, and Observability in the modern software development life cycle.
Stay tuned for more stories like this, and please share your feedback, comments, or suggestions if any in the comments section.