Logging & Monitoring to keep track of large distributed systems

SystemDesign.us Blog
6 min read · Nov 19, 2022

Visit systemdesign.us for System Design Interview Questions tagged by companies and their Solutions. Follow us on YouTube, LinkedIn, Twitter, Medium.

![The three pillars of observability: logs, metrics, and traces](https://www.skedler.com/blog/wp-content/uploads/2022/03/Three_Pillars_of_Observability.png)

Logs, metrics, and traces are often called the three pillars of observability. While merely having access to logs, metrics, and traces doesn’t by itself make a system more observable, they are powerful tools that, understood well, unlock the ability to build better systems.

Logs give us a record of what actually happened in our system. By their very nature, they are typically chronological and provide context for understanding other data sources. For example, we might use logs to help understand why a metric spiked or to troubleshoot a broken trace.

Metrics provide a way to measure the health of our system at any given moment. By querying metrics, we can answer questions such as “is the system up?” or “how many requests are currently being processed?”.

Traces give us visibility into how individual requests flow through our system. By following a trace, we can see where bottlenecks might be forming and identify which parts of the system are taking the longest to process requests.

Collecting logs, metrics, and traces can be daunting, but there are a number of tools and services that can help. At its core, observability is about understanding your system well enough to be able to answer any question that might come up about its behavior. By leveraging the power of logs, metrics, and traces, you can build systems that are more observable and easier to troubleshoot.

Event Logging

Event logging is the process of tracking, storing, and monitoring events that happen in a computer system or network. Its purpose is to provide a record of activity that can be used to track down and diagnose problems, as well as to monitor for security issues.

Event logs can contain a wealth of information about what is happening on a system or network. They can include details such as when a user logged in or out, what resources were accessed, and what changes were made to files or other data. By analyzing event logs, administrators can get a better understanding of how their systems are being used and identify potential problems early on.

While event logging is a valuable tool for understanding and troubleshooting systems, it can also be used for malicious purposes. Attackers can use event logs to cover their tracks and avoid detection. For this reason, it is important to properly secure event logs and limit who has access to them.

Event logging is a key part of any security strategy, and it can be used to help meet compliance requirements. When used correctly, event logs can be an invaluable resource for understanding and improving the security of your systems.

There are a number of different types of event logs, but they all share the same basic structure: a timestamp and a record of what happened. The most common type of event log is the system log, which tracks events that happen on a computer or network. System logs can be generated by operating systems, applications, and devices such as routers and switches.
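As a minimal sketch of that shared structure, Python's standard `logging` module can emit timestamped event records. The logger name, format string, and example events here are illustrative choices, not prescribed by any particular system:

```python
import logging

# Configure a logger that prefixes every record with a timestamp,
# severity level, and source name -- the basic "timestamp plus a
# record of what happened" shape shared by most event logs.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    level=logging.INFO,
)
log = logging.getLogger("auth")

# Typical security/access events: logins, failed attempts, changes.
log.info("user 'alice' logged in from 10.0.0.5")
log.warning("failed login attempt for user 'bob'")
log.info("user 'alice' modified /etc/app/config.yaml")
```

Each line this produces carries the timestamp, the severity, and the event itself, which is exactly what makes logs useful for later correlation with metrics and traces.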

Other types of event logs include application logs, security logs, and access logs. Application logs track events that happen within a particular application, while security logs track events related to security, such as failed login attempts. Access logs provide a record of who accessed what resources and when.

Metrics

Metrics are numeric representations of data measured over intervals of time. Because they are numbers, metrics lend themselves to mathematical modeling and prediction, making it possible to reason about a system’s behavior both in the present and into the future.

Since numbers are optimized for storage, processing, compression, and retrieval, metrics enable longer retention of data as well as easier querying. This makes metrics perfectly suited to building dashboards that reflect historical trends. Metrics also allow for gradual reduction of data resolution. After a certain period of time, data can be aggregated into larger buckets (for example, hourly instead of every minute) to save on storage and processing costs.

Metrics are a vital part of any observability strategy, and they can be used to track a wide variety of performance indicators. Common metrics include request latency, CPU usage, and memory consumption. Monitoring these reveals where resources are saturating and which components are slowest to respond.

In addition to monitoring performance, metrics can also be used to track the health of a system. For example, we might monitor the number of failed requests over time to look for patterns that could indicate a problem. We can also use metrics to set up alerts that notify us when something unusual is happening so we can investigate and take action if necessary.
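A simple alerting rule of the kind described above might compare the failure ratio over a window against a threshold. The 5% threshold and the helper name here are arbitrary illustrations, not a standard:

```python
def error_rate_alert(failed, total, threshold=0.05):
    """Return True when the failure ratio over a window exceeds the
    threshold -- the shape of check an alerting rule typically encodes."""
    if total == 0:
        return False  # no traffic in the window, nothing to alert on
    return failed / total > threshold

# 30 failures out of 400 requests is a 7.5% error rate: above 5%, alert.
print(error_rate_alert(30, 400))   # True
# 3 failures out of 400 is 0.75%: within tolerance, stay quiet.
print(error_rate_alert(3, 400))    # False
```

Production alerting systems add nuance (sustained windows, burn rates, deduplication), but the core comparison is this simple.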

Tracing

Tracing is a technique for understanding the execution flow of a distributed system. It involves recording information about each step in the processing of a request as it travels from its source to its destination. This information can then be used to reconstruct the path taken by the request and identify any bottlenecks or other problems.

Tracing is an important part of observability, and it can be used to understand the behavior of systems in production. By analyzing traces, we can see exactly how requests are flowing through the system and identify potential problems. Tracing can also be used to monitor the performance of individual services and compare them against each other.

There are a number of different tracing tools available, but they all share the same basic features: the ability to record information about each step in a request’s execution and the ability to replay the trace to see exactly what happened.

Common features of tracing tools include the ability to take snapshots of the state of the system at various points in time, the ability to search and filter traces, and the ability to generate reports. Some tools also provide visualizations of trace data, which can be helpful for understanding complex flows.

Snapshots

A snapshot is a copy of the state of a system at a particular point in time. Snapshots can be used to understand what was happening in a system at the time they were taken and to compare the state of the system at different points in time.

Snapshots are an important part of observability, and they can be used to diagnose problems that occur sporadically or that are difficult to reproduce. By taking snapshots of a system before, during, and after a problem occurs, we can identify the root cause of the problem and take steps to prevent it from happening again in the future.

There are a number of different snapshotting tools available, but they all share the same basic features: the ability to take snapshots of the state of a system at regular intervals and the ability to compare snapshots to each other.

Common features of snapshotting tools include the ability to schedule snapshots, the ability to take manual snapshots, and the ability to exclude certain parts of the system from snapshots (such as sensitive data). Some tools also provide visualizations of snapshot data, which can be helpful for understanding the state of a system over time.

Conclusion

Logs, metrics, and traces are all important tools for observability. They each serve a unique purpose and are complementary to each other. By monitoring logs, metrics, and traces, we can gain a better understanding of the behavior of distributed systems and identify potential problems.

