The DevOps Journey - Medium

Observability & monitoring — Part 04

Fathima Dilhasha — Sat, 29 Dec 2018 15:58:31 GMT

This post is a continuation of my series of posts on Observability & monitoring.

You can read Part 01, Part 02 and Part 03 of this series to keep up with the discussion. The series is split into several parts to make it easier to read and understand. It sure helps the writing process ;)

As promised, we will discuss the implementation of the metrics monitoring tool kit. Since this is a PoC implementation, I decided to limit the monitoring to a minimal set of metrics. If you need to extend this solution, it’s a matter of adding more exporters to expose additional metrics.

This PoC implements a monitoring tool kit for a deployment of Ubuntu servers in AWS that hosts a WSO2 server cluster. The deployment diagram is shown below. The implementation can be plugged into any other deployment; you might only need to change a few metrics exporters to collect metrics specific to your system.

Deployment in AWS

Now, we’ll discuss each tool covering the life cycle of a metric. Let’s keep the below diagram for reference.

Basic Architecture with tools

Metrics Exposure — Prometheus Exporters

Following is the initial set of metrics this monitoring toolkit covers, along with the Prometheus exporters used. Since these are common metrics, the readily available exporters[1] were able to cater to the requirement.

https://medium.com/media/19ae50e456ff9f09561ad69397a48fa6/href

Node Exporter

Metrics covered by node exporter

The node exporter[2] exports metrics to monitor the status of a node. Metrics for load average, memory usage and the total number of running processes are readily available.

Node exporter also supports a text file collector, which exports the metrics written to files (ending in .prom) in a previously configured directory.

In our PoC, the login test is done using the text file collector. A separate component performs a login test against the WSO2 server running on a node. A cron job deployed on the Ubuntu servers runs this component every minute and writes the result to a text file in the node exporter’s text file collector directory.

If you are using a service other than WSO2, you might want to design the login test following the above sample.
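As a minimal sketch of the writing side of such a login test, the snippet below renders one gauge in the Prometheus text exposition format and writes it into the collector directory. The metric name wso2_login_success, the server label and the file name login.prom are hypothetical choices for illustration, not names from the PoC, and the login check itself is left out.

```python
import os
import tempfile


def render_login_metric(success: bool, server: str) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    value = 1 if success else 0
    return (
        "# HELP wso2_login_success Whether the last login test succeeded (1) or failed (0).\n"
        "# TYPE wso2_login_success gauge\n"
        f'wso2_login_success{{server="{server}"}} {value}\n'
    )


def write_metric_file(directory: str, text: str, name: str = "login.prom") -> str:
    """Write the file via a temporary file and a rename, so that a
    half-written file is never picked up by the collector."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    final_path = os.path.join(directory, name)
    os.replace(tmp_path, final_path)  # atomic on the same file system
    return final_path
```

A cron job could call these two functions once a minute with the result of the login test. Replacing the file on every run, rather than appending to it, keeps a single current value per metric.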

JMX Exporter

Metrics covered by JMX exporter

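WSO2 servers are Java based and have JMX enabled by default. JMX exporter[3] is a Java agent which can be added to JAVA_OPTS in the server. The agent argument takes the form agent-jar=port:config-file, where the port is the one on which the metrics are exposed. In the case of WSO2 servers, you can add it to wso2server.sh as follows ($Jmx_exporter_port is a placeholder for the port of your choice).

-javaagent:$Jmx_exporter_agent_path=$Jmx_exporter_port:$Jmx_exporter_config_path \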

You can refer to the post[4] by one of my colleagues for the specifics of configuring this exporter for WSO2 servers.

Blackbox Exporter

Metrics covered by Blackbox exporter

The Blackbox exporter[5] can be used to probe endpoints over protocols such as TCP and HTTPS and export the success or failure as a metric. In this case, we are using the Blackbox exporter to check the accessibility of port 9443, which is the default port on which WSO2 servers run. The Blackbox exporter configuration only contains the modules (probe types) to be used. The other details, such as the targets, are added in the Prometheus configuration.

Following is the Blackbox configuration needed for our requirement.

modules:
  tcp_connect:
    prober: tcp
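To make the idea of a TCP probe concrete, here is a rough Python sketch of what such a check boils down to: try to open a connection and report 1 or 0, similar in spirit to the probe_success metric the Blackbox exporter exposes. This is only an illustration, not the exporter's actual implementation, and the function name is made up.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 5.0) -> int:
    """Return 1 if a TCP connection can be established, else 0."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:  # refused, unreachable, timed out, ...
        return 0
```

For the WSO2 case, a call such as tcp_probe("localhost", 9443) would report whether the default port accepts connections.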

MySqld Exporter

Metrics covered by Mysqld exporter

The Prometheus mysqld exporter[6] can monitor MySQL server health based on many metrics. In our case, we only use the metric that results from running a select query against a particular MySQL server. Connecting to your databases requires credentials, which can be passed as environment variables or in a hidden file. Exporters are also available for many other database solutions[7].

Metrics Collection— Prometheus Server

Prometheus server requires configurations for discovering the targets that need to be scraped.

Scrape configs

To scrape the metrics exposed by the exporters in the previous section, we need configurations that specify different jobs (tasks) for metric collection. Following is a sample job configuration for the node exporter in the Prometheus config file.

- job_name: 'Node Exporter'
  ec2_sd_configs:
    - region: "eu-central-1"
      profile: "DemoISSetup"
      port: 9100

To discover nodes in a given AWS deployment, many properties[8] such as region, availability zone, instance type or instance tag can be used. AWS API keys or a named AWS profile are used to authenticate the discovery requests.

In our example, the region property is used. When this configuration is present, Prometheus will scrape metrics from port 9100 of all the nodes in the AWS ‘eu-central-1’ region that are visible to the configured profile.

Prometheus also supports target discovery in many[9] other widely used deployments such as Azure, Kubernetes and OpenStack. In generic cases it also supports file based and static target discovery.

Tag based configs

Prometheus also supports tag based[10] target selection. That is, based on the instance tags in AWS, you can either keep or drop an instance from monitoring. See the examples below.

relabel configs
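The keep/drop behaviour can be modelled in a few lines of Python. This is a simplified sketch of the semantics only, not Prometheus code: it handles a single source label, and it uses Python's re module although Prometheus itself uses RE2 syntax. EC2 instance tags appear as labels of the form __meta_ec2_tag_<tagkey>; the tag key Environment and the addresses are hypothetical.

```python
import re


def relabel(targets, source_label, regex, action):
    """Apply a simplified 'keep' or 'drop' rule to a list of label dicts."""
    pattern = re.compile(regex)
    result = []
    for labels in targets:
        # The regex is anchored, so the whole label value must match.
        matched = pattern.fullmatch(labels.get(source_label, "")) is not None
        if (action == "keep" and matched) or (action == "drop" and not matched):
            result.append(labels)
    return result


targets = [
    {"__address__": "10.0.0.1:9100", "__meta_ec2_tag_Environment": "production"},
    {"__address__": "10.0.0.2:9100", "__meta_ec2_tag_Environment": "staging"},
]

kept = relabel(targets, "__meta_ec2_tag_Environment", "production", "keep")
print([t["__address__"] for t in kept])
```

With action "keep", only instances whose tag matches stay in the scrape list; with "drop", the matching instances are removed and everything else stays.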

Metrics Storage— InfluxDB

You need to create a database in InfluxDB[11] to store the metrics collected by Prometheus. Then, the Prometheus server needs to be configured to read from and write to this remote storage as follows.

remote_write:
  - url: "http://localhost:8086/api/v1/prom/write?db=prometheus"

remote_read:
  - url: "http://localhost:8086/api/v1/prom/read?db=prometheus"

Metrics Visualization—Grafana

As the first step, we need to add InfluxDB as a data source in Grafana as follows.

Then, based on the metrics from the above data source, we can create dashboards to monitor the system, with a separate panel for each metric we need to monitor.

One of the panels in Grafana dashboard

Alerting — Pagerduty

We need to create a service in Pagerduty as follows to send the notifications to the on call person(s) via multiple channels.

Then, we need an integration key to identify the service.

For the PoC, I have used Grafana’s built-in alert configurations to check the threshold levels and trigger alerts to Pagerduty. So, when alert configurations are defined on the panels, Grafana evaluates the metric values and triggers notifications to Pagerduty. For this, we need to configure Pagerduty as a notification channel, using the integration key we obtained from the service in Pagerduty so that the alerts are sent to that service.

Notification channels in Grafana

With these configurations in place, we have our basic monitoring tool kit ready. If your organization already uses other tools for any part of the metrics life cycle, you can plug them into this tool kit without much effort.

I will discuss more tips and tricks for improving this tool kit, as well as automating the effort, to get the best out of this solution. Stay tuned :)

References

[1] https://prometheus.io/docs/instrumenting/exporters/

[2] https://github.com/prometheus/node_exporter

[3] https://github.com/prometheus/jmx_exporter

[4] https://medium.com/@lashan/monitoring-wso2-products-with-prometheus-4ace34759901

[5] https://github.com/prometheus/blackbox_exporter

[6] https://github.com/prometheus/mysqld_exporter

[7] https://prometheus.io/docs/instrumenting/exporters/#databases

[8] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config

[9] https://prometheus.io/docs/prometheus/latest/configuration/configuration/

[10] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

[11] https://docs.influxdata.com/influxdb/v1.6/supported_protocols/prometheus

Observability & monitoring — Part 04 was originally published in The DevOps Journey on Medium, where people are continuing the conversation by highlighting and responding to this story.

Observability & monitoring — Part 03

Fathima Dilhasha — Mon, 24 Dec 2018 10:52:04 GMT

This post is a continuation of the posts Observability & monitoring — Part 01 and Observability & monitoring — Part 02.

These previous posts discussed the basics of observability, metric monitoring and defined the life cycle of a metric.

In this post, we will discuss how the tools covering the main stages of the metrics monitoring toolkit were evaluated and selected.

When designing the metrics monitoring toolkit, the goal was a system suitable for dynamic deployments, considering modern design paradigms such as microservice and container based architectures. The toolkit should also be adaptable to any infrastructure, whether it is a standard on-premises deployment, a deployment hosted in the cloud or a container based deployment.

Based on the basic requirements, a few tool stacks were shortlisted and evaluated. Following are the stacks.

Prometheus with Grafana and Alert manager
TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor)
Icinga2
Sensu and Graphite based stack

These tool stacks were compared against several criteria. Following is a summary of the comparison.

https://medium.com/media/05fadb6658eebc522e66a7f335f3b387/href

Metrics collection usually follows either the push or the pull model. In the push model, the metrics collection tool waits to receive data while the plugins (agents) do the collection and send it. In the pull model, the metrics collector is responsible for scraping/retrieving the metrics from the defined endpoints. While Prometheus, Telegraf and Icinga2 support the pull model, Sensu does not, because it has no inbuilt collecting capabilities. Telegraf and Prometheus also support the push model. However, Prometheus is best used in the pull model and Telegraf in the push model.
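The difference between the two models is mostly about who initiates the transfer. The toy classes below sketch that in Python; they are an illustration of the idea only and are not modelled on how any of the compared tools is implemented. All names are made up.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, float]  # (metric name, value)


class PullCollector:
    """The collector owns the target list and scrapes each target itself."""

    def __init__(self, targets: Dict[str, Callable[[], List[Sample]]]):
        self.targets = targets
        self.store: List[Tuple[str, str, float]] = []

    def scrape_all(self) -> None:
        for name, scrape in self.targets.items():
            for metric, value in scrape():
                self.store.append((name, metric, value))


class PushCollector:
    """The collector only receives; the agents decide when to send."""

    def __init__(self) -> None:
        self.store: List[Tuple[str, str, float]] = []

    def receive(self, source: str, samples: List[Sample]) -> None:
        for metric, value in samples:
            self.store.append((source, metric, value))
```

Note that the pull collector needs to know its targets up front, which is why node discovery matters for pull based tools, while the push collector does not.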

When it comes to storing the metrics, Prometheus has an inbuilt time series database, but it can also store metrics in external storage[1]. The TICK stack uses InfluxDB[2] for metrics storage. InfluxDB supports data retention policies per database and supports querying the metrics too. On the other hand, Icinga2 does not save metrics and only saves the aggregated values as per the requirement. Graphite uses a numeric time series database called Whisper[3].

Even though Prometheus and the TICK stack provide their own dashboarding and alerting solutions, we can also integrate other solutions with these systems. In a production setup, the node discovery capabilities of the solution are vital. Telegraf and Sensu do not require node discovery in the metric collector as they use the push model. Icinga2 can take PuppetDB as an import source, manage hosts with Ansible or use Foreman[4] for host discovery. Prometheus handles node discovery using configs[5], with configuration support for many infrastructures.

Prometheus supports a wide list of exporters and also allows writing your own exporters if required. Telegraf plugins can likewise be written if the provided plugins are not sufficient. Nagios check commands[6] can be used to write checks in Icinga2.

Based on the above comparison, it was decided to use the below stack as the metric tool kit. It combines solutions from across the compared stacks to achieve the best outcome. The Pagerduty trial was chosen as the alerting solution, as Pagerduty is already used by many companies for incident management.

Prometheus as metric collector
InfluxDB as metric storage solution
Grafana as metrics visualization solution
Pagerduty as alerting solution

Following is the basic architecture of the metric monitoring toolkit.

Architecture of Metric monitoring tool kit

Prometheus exporters are responsible for exposing existing metrics from third party systems. Many of them are lightweight processes written in Go. The Prometheus server follows the pull model for metrics collection and is designed for reliability. Prometheus can be easily integrated with many supporting tools.

InfluxDB is a time series database (TSDB) which provides an SQL-like query language. In InfluxDB, a metric corresponds to a measurement in the database. Grafana is a very widely used visualization tool which supports a wide range of data sources, including InfluxDB databases. Grafana also includes a built-in query parser and provides basic alert configurations as well. Pagerduty allows incident classification and supports multi channel notifications.

I will be discussing the implementation in detail in my next post. Stay tuned :)

Update:

Next post at https://medium.com/the-devops-journey/observability-monitoring-part-04-8742a06caff4

References

[1] https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage

[2] https://www.influxdata.com/time-series-platform/influxdb/

[3] https://graphite.readthedocs.io/en/latest/whisper.html

[4] https://www.icinga.com/2017/04/26/automated-monitoring-icinga-meets-foreman/

[5] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file

[6] https://www.nagios.org/projects/nagios-plugins/

Observability & monitoring — Part 03 was originally published in The DevOps Journey on Medium, where people are continuing the conversation by highlighting and responding to this story.

Observability & monitoring — Part 02

Fathima Dilhasha — Fri, 14 Dec 2018 17:57:32 GMT

This post is a continuation of the post Observability & monitoring — Part 01.

As promised, I will discuss metrics monitoring in this post.
Metrics monitoring is one of the pillars of Observability.

Pillars of Observability

When it comes to monitoring metrics, we can categorize them based on the following approaches.

Service level vs. Instance level
Blackbox vs. Whitebox

Service level vs. Instance level

Service level metrics monitoring is focused on monitoring service level objectives (SLOs) for the system. Four signals (latency, traffic, errors and saturation) are identified as the golden signals of service level monitoring, and the SLOs can be established based on them. Login results and port availability are examples of service level metrics.

While service level metrics monitoring keeps track of SLOs, instance level metrics monitoring is needed for diagnosing the root causes. Some sample instance level metrics are load average, disk usage and JVM heap usage.

Black box vs. White box

Black box monitoring treats a system as a black box and refers to monitoring the system from outside. This type of monitoring indicates the availability of a system and is symptom oriented. So, leveraging operating system level metrics and network level communication metrics can be considered as black box techniques.

White box monitoring allows detection of future problems and depends on the ability to inspect the internals. In a multi-layered system (e.g. WSO2 Public Cloud), a symptom in one layer can be the cause of an issue in another layer. For example, in database monitoring, slow database reads are a symptom. For the application layer, however, the same slow database access is a cause, as it leads to latency in invocations.
So, the white box metrics should be chosen such that the cause of an issue is identifiable across the involved layers. It is advisable to define thresholds for the metrics so that an anomaly is distinguishable.

Life cycle of a metric

During metrics monitoring, we can define five main stages of a metric.
Any toolkit that is being used for monitoring should cover these stages in order to provide meaningful insights on the metrics.

Life cycle of a metric

Metrics exposure: A mechanism to expose metrics to an external monitoring tool from the system that is being monitored
Metrics collection: A mechanism for collecting the exported metrics
Metrics storage: A mechanism to store the collected metrics to gain insights on the trends of the metrics
Metrics visualization: A mechanism to visualize, track and identify trends in the metrics over time
Alerting: A mechanism to notify the system administrators on any anomalies in the metrics

I will discuss a metrics monitoring tool kit whose tools cover the above five stages in my next post. Stay tuned!

Update:

Next post at https://medium.com/the-devops-journey/observability-monitoring-part-03-35a4601c0380

Observability & monitoring — Part 02 was originally published in The DevOps Journey on Medium, where people are continuing the conversation by highlighting and responding to this story.

‘chmod’ — For Linux/Unix users

Fathima Dilhasha — Mon, 10 Dec 2018 16:42:08 GMT

The ‘chmod’ command is used by Linux/Unix users to change the permissions of a file or directory.

This command can be used in 2 different forms.

1. Symbolic permission notation

chmod who=permissions filename

‘who’ refers to :

u: user who owns the file
g: group that the file belongs to
o: other users
a: all of the above

‘permissions’ refers to:

r: read
w: write
x: execute

So, an example ‘chmod’ command to give,

read, write and execute permissions to the user who owns the file
read and execute permissions to the group
read permissions to other users

will be as follows.

chmod u=rwx,g=rx,o=r myfile

2. Octal permission notation

chmod octal-number filename

The ‘octal-number’ above represents the permissions numerically. Each of user, group and other gets one octal digit, formed by treating r, w and x as the bits 4, 2 and 1.

Considering the same example as in the symbolic notation,

u : rwx : 111 : 7

g : r-x : 101 : 5

o : r-- : 100 : 4

chmod 754 myfile
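The conversion between the two notations is plain arithmetic, which the following Python sketch spells out. It is an illustration only and handles just the who=permissions form used in this post, not the + and - operators that chmod also accepts. The function name is made up.

```python
def symbolic_to_octal(spec: str) -> str:
    """Convert a spec such as 'u=rwx,g=rx,o=r' to its octal form."""
    bits = {"r": 4, "w": 2, "x": 1}
    digits = {"u": 0, "g": 0, "o": 0}
    for clause in spec.split(","):
        who, perms = clause.split("=")
        value = sum(bits[p] for p in perms)
        # 'a' means user, group and others all at once
        for target in ("ugo" if who == "a" else who):
            digits[target] = value
    return f"{digits['u']}{digits['g']}{digits['o']}"


print(symbolic_to_octal("u=rwx,g=rx,o=r"))  # prints 754
```

In a Python script, the same permissions could be applied with os.chmod("myfile", 0o754).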

‘chmod’ — For Linux/Unix users was originally published in The DevOps Journey on Medium, where people are continuing the conversation by highlighting and responding to this story.

Linux Booting Process

Fathima Dilhasha — Sun, 09 Dec 2018 05:49:06 GMT

When you press the power button and wait for a few seconds, you get prompted for user login. Ever wondered how this happens?

Today’s plan is to uncover the magic behind the process of getting this login prompt after power on, i.e. the booting process. I’ll discuss this in the context of a Linux based system; the boot process of other operating systems might vary slightly.

my login screen ;)

The Linux booting process can be summed up in six main steps as follows.

steps in booting process

1. BIOS (Basic Input Output System)

The main purpose of the BIOS is to perform system integrity checks, find the boot loader program and pass control to that program. BIOS is an OS independent, special piece of firmware which runs from ROM (Read Only Memory).
The system integrity check performed by BIOS is called POST (Power On Self Test). This is a very brief test on the CPU, memory and storage devices to verify that the system is in a bootable state.

Then, it will look for a bootable device, checking devices such as the disk drive, SD card reader, CD/DVD ROM and hard drive in the configured boot priority order.

My boot priority order

Once a bootable device is located, the BIOS will search for the Master Boot Record, load it and hand over control to it.

2. MBR (Master Boot Record)

The master boot record is located in the first sector of the bootable disk (in my case it’s ‘/dev/sda’; these device files change based on the controllers used).

The master boot record is 512 bytes in size and contains three main portions: the primary boot loader (446 bytes), the partition table (64 bytes) and the MBR validation check (2 bytes).

The primary boot loader portion contains the first stage of GRUB (Grand Unified Boot loader), while the MBR validation check is a fixed signature (0x55 0xAA) that marks the sector as a valid boot record.
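To illustrate the three portions, here is a small Python sketch that splits a raw MBR sector. It assumes the classic layout of a 512-byte sector: 446 bytes of boot code, a 64-byte partition table with four 16-byte entries, and a 2-byte signature of 0x55 0xAA. The function name and the returned dictionary keys are made up for this example.

```python
import struct


def parse_mbr(sector: bytes) -> dict:
    """Split a 512-byte MBR sector into its three portions."""
    if len(sector) != 512:
        raise ValueError("an MBR sector is exactly 512 bytes")
    boot_code = sector[:446]            # primary boot loader
    partition_table = sector[446:510]   # four 16-byte entries
    # Bytes 0x55 0xAA read as a little-endian 16-bit value give 0xAA55.
    (signature,) = struct.unpack("<H", sector[510:512])
    partitions = []
    for i in range(4):
        entry = partition_table[i * 16:(i + 1) * 16]
        partitions.append({
            "bootable": entry[0] == 0x80,
            "type": entry[4],
            "start_lba": struct.unpack("<I", entry[8:12])[0],
            "sectors": struct.unpack("<I", entry[12:16])[0],
        })
    return {
        "boot_code_size": len(boot_code),
        "valid": signature == 0xAA55,
        "partitions": partitions,
    }
```

On a real machine, the sector could be read (as root) with open("/dev/sda", "rb").read(512) and passed to this function.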

3. GRUB

GRUB is responsible for starting the operating system and has knowledge of the file system. Older Linux systems used another boot loader named LILO(Linux Loader).

The GRUB configuration file (/boot/grub/grub.conf in GRUB legacy, /boot/grub/grub.cfg in GRUB 2) contains information about the kernel and the initrd (Initial RAM Disk) image. The initrd contains the modules/drivers needed to mount the actual root file system.

$ ls -F /boot/   # lists Linux kernel images, the initrd image and other necessary information

Contents of boot directory

If multiple kernel images are present, GRUB shows a splash screen and prompts to select. If not selected, GRUB loads the default kernel according to configuration.

4. KERNEL

The kernel first uses the initrd as a temporary root file system. Once the required drivers are loaded, it unmounts the temporary file system, mounts the real root file system as specified in the GRUB configuration and starts the ‘/sbin/init’ process.

$ ps -ef | grep init   # shows the ‘/sbin/init’ process

Kernel initialization process with process id 1

Kernel is responsible for getting the hardware running.

You can view the messages related to the Linux kernel using the kernel ring buffer. The kernel ring buffer is a fixed-size data structure; when it is full, the oldest messages are discarded to make room for the latest ones. On some distributions, this content is also stored in ‘/var/log/dmesg’.

$ dmesg   # prints the content of the kernel ring buffer

Contents of kernel ring buffer
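The behaviour of a ring buffer is easy to see with a bounded deque in Python. This is only an analogy for how old messages get discarded; the size of three and the messages are made up, and the kernel's buffer is far larger.

```python
from collections import deque

# A ring buffer with room for three messages.
ring = deque(maxlen=3)
for message in ["boot", "cpu ok", "memory ok", "disk ok"]:
    ring.append(message)  # once full, the oldest entry is dropped

print(list(ring))  # prints ['cpu ok', 'memory ok', 'disk ok']
```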

$ uname -mrs   # shows the name, release and machine architecture of the running kernel

$ dpkg --list | grep linux-image   # lists all installed kernel images

5. INIT

In older systems, the init process looks at the ‘/etc/inittab’ file to determine the run level. Following are the available run levels.

0 — halt
1 — single user mode
2 — multi user mode without NFS
3 — Full multi user mode
4 — unused
5 — X11 (run level 3 + display manager)
6 — reboot

Now, ‘systemd’ has replaced the older init system that used ‘inittab’, and systemd has targets which are roughly equivalent to run levels.

Systemd targets

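When inittab is used, the default run level is set in its ‘initdefault’ entry. With Upstart (used by older Ubuntu releases), ‘/etc/init/rc-sysinit.conf’ contains the default run level. With systemd, the default target is the one that ‘default.target’ points to, which you can check with ‘systemctl get-default’.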

6. RUN LEVELS

Depending on the run level (target), init starts executing the programs from one of the directories below.

$ ls /lib/systemd/system — lists available targets

Run level targets here are symlinks

Run level 0 — /etc/rc.d/rc0.d/ or /etc/rc0.d
Run level 1— /etc/rc.d/rc1.d/ or /etc/rc1.d
Run level 2— /etc/rc.d/rc2.d/ or /etc/rc2.d
Run level 3— /etc/rc.d/rc3.d/ or /etc/rc3.d
Run level 4— /etc/rc.d/rc4.d/ or /etc/rc4.d
Run level 5— /etc/rc.d/rc5.d/ or /etc/rc5.d
Run level 6— /etc/rc.d/rc6.d/ or /etc/rc6.d

If you check inside these directories, they contain programs whose names start with ‘S’ or ‘K’. S stands for startup scripts and K stands for kill scripts. The number that follows in the program name denotes the order in which the programs are run.
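The naming convention can be sketched in Python as follows. It assumes the traditional SysV behaviour where the kill scripts of a run level are executed before its startup scripts, each group in the order of its number. The script names in the example are hypothetical.

```python
import re


def execution_order(names):
    """Return kill scripts first, then start scripts, each sorted by number."""
    pattern = re.compile(r"^([SK])(\d+)(.+)$")
    parsed = []
    for name in names:
        match = pattern.match(name)
        if match:  # ignore other files such as README
            kind, number, service = match.groups()
            parsed.append((kind, int(number), service, name))
    # 'K' sorts before 'S', so kill scripts come first
    parsed.sort(key=lambda item: (item[0], item[1], item[2]))
    return [item[3] for item in parsed]


print(execution_order(["S20ssh", "K10apache2", "S10network", "README", "K05nginx"]))
# prints ['K05nginx', 'K10apache2', 'S10network', 'S20ssh']
```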

Based on the target/run level, the OS will be loaded and you will be prompted for login. :)

References

https://www.youtube.com/watch?v=RgLMBXg5b9I
https://www.liquidweb.com/kb/linux-runlevels-explained/
https://www.youtube.com/watch?v=ZtVpz5VWjAs
https://www.youtube.com/watch?v=RdbyPwo4W2E

Linux Booting Process was originally published in The DevOps Journey on Medium, where people are continuing the conversation by highlighting and responding to this story.

Hello docker

Fathima Dilhasha — Wed, 05 Dec 2018 17:00:34 GMT

Docker is a software container platform. Developers use Docker to eliminate machine specific issues.

You can refer to https://docs.docker.com/engine/installation/ to get Docker installed on your machine.

I am using Ubuntu, and I will discuss how you can build separate images for WSO2 ESB and ActiveMQ.

A Dockerfile provides instructions for Docker to build a Docker image.
For each instruction in the Dockerfile, a new image layer is built if that instruction results in a change to the image.

You can create your own images and push them to Docker Hub so that you can share them with others. Docker Hub is a public registry maintained by Docker, Inc. that contains images you can download and use to build containers.

How to create a docker image for WSO2 ESB.

Prerequisites:

Download zip file of wso2esb-5.0.0
Create a file named “Dockerfile” in the same path

Dockerfile:

Following is the content of the Dockerfile to create an ESB image.

FROM ubuntu:14.04

MAINTAINER dilhasha dilhasha@wso2.com

# Install Java.
RUN \
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \
apt-get update && \
apt-get install -y software-properties-common && \
add-apt-repository -y ppa:webupd8team/java && \
apt-get update && \
apt-get install -y oracle-java8-installer && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer

RUN \
apt-get update && \
apt-get install zip -y

COPY wso2esb-5.0.0.zip /opt

WORKDIR "/opt"

RUN unzip /opt/wso2esb-5.0.0.zip

ENV JAVA_HOME /usr/lib/jvm/java-8-oracle

RUN chmod +x /opt/wso2esb-5.0.0/bin/wso2server.sh

EXPOSE 9443

CMD ["/opt/wso2esb-5.0.0/bin/wso2server.sh"]

In the above Dockerfile, we install Java, then copy and unzip the ESB distribution, and start the ESB. Let’s see what each of the keywords in the Dockerfile means.

FROM : The base image from which we start building our custom image

RUN : Runs the specified commands while building the image, by default as the root user, using sh -c “your-given-command”

COPY : Copies a file from the host machine into the image

WORKDIR : Sets the location from which subsequent commands are run

EXPOSE : Declares the port on which the container listens (here, the one that serves the ESB’s management console). The port is made reachable from the host machine with the -p option of docker run

CMD : Sets the command that runs when the container starts. This is usually the long-running process in the container. Here, we are running the wso2server.sh script.

Now we can build an image from this Dockerfile using the below command. <image-name> can be anything you prefer. The command looks for a file named “Dockerfile” in the current directory and executes its instructions.

docker build -t <image-name> .

After the above command exits, you will see a success message as below.

Successfully built

Now, you can start the container using the below command.

sudo docker run -p <host-port>:<container-port> -t <image-name>

For the ESB example, the container port to publish is “9443”, and you can map it to any port you prefer on your host machine.

When you run the above command, you’ll notice that the ESB gets started and you can access the management console at “http://localhost:<host-port>”.

How to create a docker image for Apache activemq.

Prerequisites:

Download zip file of apache-activemq-5.14.4
Create a file named “Dockerfile” in the same path

Following is the Dockerfile used to build an Activemq image.

FROM ubuntu:14.04

MAINTAINER dilhasha dilhasha@wso2.com

# Install Java.
RUN \
echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | debconf-set-selections && \
apt-get update && \
apt-get install -y software-properties-common && \
add-apt-repository -y ppa:webupd8team/java && \
apt-get update && \
apt-get install -y oracle-java8-installer && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /var/cache/oracle-jdk8-installer

RUN \
apt-get update && \
apt-get install zip -y

COPY apache-activemq-5.14.4.zip /opt

WORKDIR "/opt"

RUN unzip /opt/apache-activemq-5.14.4.zip

ENV JAVA_HOME /usr/lib/jvm/java-8-oracle

RUN chmod +x /opt/apache-activemq-5.14.4/bin/activemq

EXPOSE 8161

CMD ["/opt/apache-activemq-5.14.4/bin/activemq","console"]

The only new thing in this Dockerfile is the below line.

CMD ["/opt/apache-activemq-5.14.4/bin/activemq","console"]

This CMD ["command", "parameter"] format helps us to provide an additional parameter to our script.

You can use the same commands as in the previous section and see that the activemq instance gets started in console mode.

Thanks :)

Hello docker was originally published in The DevOps Journey on Medium, where people are continuing the conversation by highlighting and responding to this story.

Observability & monitoring — Part 01

Fathima Dilhasha — Wed, 05 Dec 2018 16:40:46 GMT

Observability is a property of a system which indicates whether the internal states of the system can be determined from its external outputs. Monitoring, on the other hand, is an activity we perform to identify possible issues, estimate capacities, etc.

If there is no monitoring in a system, we cannot even be sure whether the service is working. So, it is very important to have a thoughtfully designed monitoring infrastructure. Following is a model developed by Google engineers for developing and running distributed systems, based on Maslow’s hierarchy of needs.

Source : https://landing.google.com/sre/book/index.html

Without monitoring, you have no way to tell whether the service is even working; absent a thoughtfully designed monitoring infrastructure, you’re flying blind.

It is important to review which characteristics of the system need to be observed and which monitoring system will be used for the observation.

There are three pillars of observability, i.e. metrics, tracing and logging. The term monitoring is mostly used to mean metrics monitoring.

In my experience, trying to build a monitoring system which will identify all possible failures is an impossible task. Rather, we should focus on a good monitoring system which can identify a failure when it happens and helps in post mortem analysis. We should also be able to detect severe anomalies early and avoid such failures.

A monitoring system should address two questions: what’s broken, and why?

In summary, Observability is a property of a system and Monitoring is an activity we perform on a system.

While Observability covers a larger scope, monitoring mostly refers to metrics monitoring.

I will discuss metrics monitoring in my next post. Stay tuned :)

Update :

Next post at https://medium.com/the-devops-journey/observability-monitoring-part-02-d4d81b67c09a

Observability & monitoring — Part 01 was originally published in The DevOps Journey on Medium, where people are continuing the conversation by highlighting and responding to this story.