
Production grade Kubernetes Monitoring using Prometheus

VAIBHAV THAKUR
Jan 22, 2019 · 6 min read

Monitoring is a crucial aspect of any Ops pipeline, and for a technology like Kubernetes, which is all the rage right now, a robust monitoring setup can bolster your confidence to migrate production workloads from VMs to containers.

Today we will deploy a production-grade, Prometheus-based monitoring system in less than 5 minutes.

The following setup works perfectly in an independent, single-cluster environment and is also a good primer on getting started with Prometheus. However, if you have multiple k8s clusters running and would like to monitor them under a single pane of glass, you should check this out.

Pre-requisites:

  1. Running Kubernetes cluster with at least 6 cores and 8 GB of available memory. I will be using a 6 node GKE cluster for this tutorial.
  2. Working knowledge of Kubernetes Deployments and Services.

Setup:

  1. Prometheus server with persistent volume. This will be our metric storage (TSDB).
  2. Alertmanager server which will trigger alerts to Slack/Hipchat and/or Pagerduty/Victorops etc.
  3. Kube-state-metrics server to expose container and pod metrics other than those exposed by cadvisor on the nodes.
  4. Grafana server to create dashboards based on prometheus data.

Note: All the manifests being used are present in this GitHub repo. I recommend cloning it before you start.
PS: Leave a star if you like it.

Monitoring Setup Overview

Deploying Alertmanager

Before deploying, please update “<your_slack_hook>”, “<your_victorops_hook>”, and “<YOUR_API_KEY>”. If you use a notification channel other than these, please follow this documentation and update the config.
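For reference, these placeholders live in the Alertmanager config map. A minimal sketch of what that config might look like (the receiver names, Slack channel and routing key below are illustrative placeholders, not necessarily what the repo's manifest uses):

global:
  resolve_timeout: 5m
  slack_api_url: "<your_slack_hook>"

route:
  receiver: slack-notifications   # default receiver; add routes per severity as needed

receivers:
  - name: slack-notifications
    slack_configs:
      - channel: "#alerts"        # illustrative channel name
        send_resolved: true
  - name: victorops-notifications
    victorops_configs:
      - api_key: "<YOUR_API_KEY>"
        api_url: "<your_victorops_hook>"
        routing_key: "ops"        # illustrative routing key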

kubectl apply -f k8s/monitoring/alertmanager/

This will create the following:

  1. A monitoring namespace.
  2. Config-map to be used by alertmanager to manage channels for alerting.
  3. Alertmanager deployment with 1 replica running.
  4. Service with Google Internal Loadbalancer IP which can be accessed from the VPC (using VPN).
root$ kubectl get pods -l app=alertmanager
NAME                            READY     STATUS    RESTARTS   AGE
alertmanager-42s7s25467-b2vqb   1/1       Running   0          2m

root$ kubectl get svc -l name=alertmanager
NAME           TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
alertmanager   LoadBalancer   10.12.8.110   10.0.0.6      9093:32634/TCP   2m

root$ kubectl get configmap
NAME                     DATA      AGE
alertmanager             1         2m

In your browser, navigate to http://<Alertmanager-Svc-Ext-Ip>:9093 and you should see the alertmanager console.

Alertmanager Status Page

Deploying Prometheus

Before deploying, please create an EBS volume (AWS) or a pd-ssd disk (GCP) and name it prometheus-volume (this is important because the PVC will look for a volume with this name).
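On GKE, the persistent volume manifest ties the claim to that disk roughly as sketched below; this is a minimal sketch assuming a GCE persistent disk (the size and fsType are illustrative, and on AWS you would use an awsElasticBlockStore volume source instead):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-volume
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  gcePersistentDisk:
    pdName: prometheus-volume   # must match the name of the disk created above
    fsType: ext4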

kubectl apply -f k8s/monitoring/prometheus/

This will create the following:

  1. Service account, cluster-role and cluster-role-binding needed for prometheus.
  2. Prometheus config map, which details the scrape configs and the alertmanager endpoint. Note that we can use the alertmanager service name directly instead of its IP (see the config sketch after this list). If you want to scrape metrics from a specific pod or service, you must apply the Prometheus scrape annotations to it. For example:
...
spec:
  replicas: 1
  template:
    metadata:
      annotations:
        prometheus.io/path: <path_to_scrape>
        prometheus.io/port: "80"
        prometheus.io/scrape: "true"
      labels:
        app: myapp
    spec:
...

3. Prometheus config map for the alerting rules. Some basic alerts are already configured in it (such as high CPU and memory usage for containers and nodes). Feel free to add more rules according to your use case; a sample rule is sketched after this list.

4. Storage class, persistent volume and persistent volume claim for the prometheus server data directory. This ensures data persistence in case the pod restarts.

5. Prometheus deployment with 1 replica running.

6. Service with Google Internal Loadbalancer IP which can be accessed from the VPC (using VPN).
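As mentioned in point 2, Prometheus can reach Alertmanager through its in-cluster service name. A minimal sketch of that fragment of prometheus.yml (the rules path is illustrative and the scrape_configs are omitted):

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"

rule_files:
  - /etc/prometheus/prometheus.rules   # wherever the rules config map is mounted

And for point 3, an illustrative alerting rule in the Prometheus rules-file format. This is not copied from the repo; it uses a kube-state-metrics metric, so it only fires once that component (deployed in the next step) is being scraped:

groups:
  - name: example-rules
    rules:
      - alert: PodRestartingTooOften
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"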

root$ kubectl get pods -l app=prometheus-server
NAME                                     READY     STATUS    RESTARTS   AGE
prometheus-deployment-69d6cfb5b7-l7xjj   1/1       Running   0          2m

root$ kubectl get svc -l name=prometheus
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
prometheus-service   LoadBalancer   10.12.8.124   10.0.0.7      8080:32731/TCP   2m

root$ kubectl get configmap
NAME                     DATA      AGE
alertmanager             1         5m
prometheus-rules         1         2m
prometheus-server-conf   1         2m

In your browser, navigate to http://<Prometheus-Svc-Ext-Ip>:8080 and you should see the Prometheus console. Note that under the Status -> Targets section all the scraped endpoints are visible, and under the Alerts section all the configured alerts can be seen.

Prometheus Targets Status
Prometheus Graph section depicting all metrics

Deploying Kube-State-Metrics

kubectl apply -f k8s/monitoring/kube-state-metrics/

This will create the following:

  1. Service account, cluster-role and cluster-role-binding needed for kube-state-metrics.
  2. Kube-state-metrics deployment with 1 replica running.
  3. In-cluster service which will be scraped by Prometheus for metrics (note the annotation attached to it; a sketch follows the output below).
root$ kubectl get pods -l k8s-app=kube-state-metrics
NAME                                  READY     STATUS    RESTARTS   AGE
kube-state-metrics-255m1wq876-fk2q6   2/2       Running   0          2m

root$ kubectl get svc -l k8s-app=kube-state-metrics
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
kube-state-metrics   ClusterIP   10.12.8.130   <none>        8080/TCP,8081/TCP   2m
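A sketch of what that annotated service might look like (the port names and label values are illustrative; the important part is the prometheus.io/scrape annotation, which the scrape config in the Prometheus config map keys off):

apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  ports:
    - name: http-metrics
      port: 8080
    - name: telemetry
      port: 8081
  selector:
    k8s-app: kube-state-metrics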

Deploying Grafana

By now we have deployed the core of our monitoring system (metric scraping and storage); it is time to put it all together and create dashboards.

kubectl apply -f k8s/monitoring/grafana

This will create the following:

  1. Grafana deployment with 1 replica running.
  2. Service with Google Internal Loadbalancer IP, which can be accessed from the VPC (using VPN).
root$ kubectl get pods
NAME                                     READY     STATUS    RESTARTS   AGE
grafana-7x23qlkj3n-vb3er                 1/1       Running   0          2m
kube-state-metrics-255m1wq876-fk2q6      2/2       Running   0          5m
prometheus-deployment-69d6cfb5b7-l7xjj   1/1       Running   0          5m
alertmanager-42s7s25467-b2vqb            1/1       Running   0          2m

root$ kubectl get svc
NAME                 TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)             AGE
grafana              LoadBalancer   10.12.8.132   10.0.0.8      3000:32262/TCP      2m
kube-state-metrics   ClusterIP      10.12.8.130   <none>        8080/TCP,8081/TCP   5m
prometheus-service   LoadBalancer   10.12.8.124   10.0.0.7      8080:30698/TCP      5m
alertmanager         LoadBalancer   10.12.8.110   10.0.0.6      9093:32634/TCP      5m

All you need to do now is add the Prometheus server as a data source in Grafana and start creating dashboards. Use the following config:

Name: DS_Prometheus

Type: Prometheus

URL: http://prometheus-service:8080

Note: We are using the Prometheus service name in the URL because both the Grafana and Prometheus servers are deployed in the same cluster. If the Grafana server is outside the cluster, you should use the Prometheus service’s external IP in the URL.
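If you would rather not click through the UI, Grafana (5.0 and later) can also pick up the data source from a provisioning file; a minimal sketch, assuming it is mounted under /etc/grafana/provisioning/datasources/ (this is an alternative to the manual steps above, not something the repo necessarily ships):

apiVersion: 1
datasources:
  - name: DS_Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-service:8080
    isDefault: true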

Adding Prometheus as a data source in Grafana.
Kubernetes Cluster Monitoring Dashboard

All the dashboards can be found here. You can import the json files directly and you are all set.

Note:

  1. No need to add separate dashboards whenever you deploy a new service. All the dashboards are generic and templatized.
  2. Prometheus offers hot reloads, so if you need to update the config or rules file, just update the config map and make an HTTP POST request to the Prometheus reload endpoint (on Prometheus 2.x this endpoint is only active if the server is started with the --web.enable-lifecycle flag). For example:
curl -XPOST http://<Prometheus-Svc-Ext-Ip>:8080/-/reload

In the Prometheus logs the reload can be seen as:

level=info ts=2019-01-17T03:37:50.433940468Z caller=main.go:624 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-01-17T03:37:50.439047381Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-17T03:37:50.439987243Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-17T03:37:50.440631225Z caller=kubernetes.go:187 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2019-01-17T03:37:50.444566424Z caller=main.go:650 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml

3. The Alertmanager config can be reloaded with a similar API call.

curl -XPOST http://<Alertmanager-Svc-Ext-Ip>:9093/-/reload

I hope this helps you get insights into your Kubernetes cluster and effectively monitor workloads. Feel free to reach out if you have any questions. If this post helped you, please 👏👏

Don’t forget to check out my other posts:

  1. Continuous Delivery pipelines for Kubernetes using Spinnaker
  2. HA Elasticsearch over Kubernetes
  3. Scaling MongoDB on Kubernetes
  4. AWS ECS and Gitlab-CI