Monitoring Automation with Datadog + OpsGenie

Addhe Warman · Inside Bukalapak · Dec 11, 2020

In the DevOps lifecycle, monitoring is one of the major phases. It makes sure that all the applications in your production environments keep running without failure. At Bukalapak, we have around 10,400 monitor alerts in our environment. If any failure occurs, a notification is sent to on-call personnel so engineers can tackle it as soon as possible. In this article, I will share how we integrate our monitoring and alerting tool, Datadog, with our incident management tool, OpsGenie.

(Why)

I will start from the early stages of our services migration to the cloud in March 2019. At that time, we were still running our monitoring and alerting system with Prometheus. As we grew, we realized that we needed to improve our monitoring system by implementing Infrastructure as Code. We needed a monitoring system that covers the complete journey of monitoring and alerting through code and pipeline automation: onboarding, monitoring agent deployment, delivering metrics, managing dashboards, integration with an on-call incident management system, configuring alert priorities, alert escalation policies, and monitor/alert deployment. Okay, let's break them down and tackle them one at a time.

# Onboarding

Thank god our security team is also adopting Infrastructure as Code and runs its own pipeline and custom scripts to onboard and offboard people in this organization. Kudos to the security team in Bukalapak.

# Monitoring Agent Deployment

Our infrastructure ecosystem consists of Kubernetes, VMs, and bare metal. Here is what we've done so far:

  • Kubernetes: We built a pipeline to deploy the monitoring agent to our k8s clusters as a DaemonSet (a minimal sketch follows this list)
  • VM: We bake the monitoring agent into the VM image for our DC, and we run the agent deployment during instance provisioning in GCP
  • Bare metal: We include the monitoring agent as part of our Ansible provisioning for our bare-metal servers
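
To give a feel for the Kubernetes part, here is a minimal sketch of an agent DaemonSet deployment step using the official `kubernetes` Python client. The namespace, image tag, and secret name are illustrative assumptions, not our actual pipeline code:

```python
# Hypothetical sketch: deploy the Datadog agent as a DaemonSet with the
# official `kubernetes` Python client. Namespace, image tag, and secret
# name are placeholders, not our real manifest.
from kubernetes import client, config

def deploy_agent_daemonset(namespace="monitoring"):
    config.load_kube_config()  # or load_incluster_config() inside a pipeline job

    agent = client.V1Container(
        name="datadog-agent",
        image="gcr.io/datadoghq/agent:7",  # assumed image tag
        env=[client.V1EnvVar(
            name="DD_API_KEY",
            value_from=client.V1EnvVarSource(
                secret_key_ref=client.V1SecretKeySelector(
                    name="datadog-secret", key="api-key")))],
    )
    daemonset = client.V1DaemonSet(
        metadata=client.V1ObjectMeta(name="datadog-agent"),
        spec=client.V1DaemonSetSpec(
            selector=client.V1LabelSelector(match_labels={"app": "datadog-agent"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "datadog-agent"}),
                spec=client.V1PodSpec(containers=[agent]),
            ),
        ),
    )
    client.AppsV1Api().create_namespaced_daemon_set(namespace, daemonset)
```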

# Delivering Metrics

  • Without metrics, we cannot create a dashboard. Metrics are delivered by the agent's pre-installed modules, by custom scripts we've written, or via custom StatsD, where the application code sends important service metrics to our monitoring platform; these are counted as "Custom Metrics." A short example follows this list.
  • Metrics have types, and understanding the type of a metric makes our life easier when creating dashboards. Please take a look at the reference URLs.
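
As an illustration of the custom StatsD path, the sketch below sends a counter and a histogram through the local Datadog agent (DogStatsD) using the `datadog` Python library; the metric names and tags are made up for the example:

```python
# Illustrative sketch of sending custom metrics through DogStatsD with the
# `datadog` Python library. Metric names and tags are invented for the example.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)  # the local Datadog agent

def record_checkout(duration_seconds, success):
    # Counter: how many checkouts happened, split by outcome
    statsd.increment("checkout.count",
                     tags=[f"status:{'ok' if success else 'error'}"])
    # Histogram: latency distribution for the checkout flow
    statsd.histogram("checkout.duration_seconds", duration_seconds,
                     tags=["service:checkout"])
```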

# Managing Dashboards

  • Although we can create dashboards through automation and pipelines, we decided not to do so during this rapid migration and deployment phase. As an alternative, we provide dashboard templates. The monitoring team created dashboard templates to support different monitoring levels: apps, databases, queues, etc.
  • We provide automation to back up our dashboards; if a dashboard is accidentally removed or deleted, we can recover it from our backup in GCS (Google Cloud Storage). A rough sketch of the backup job follows this list.
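
The backup job boils down to exporting every dashboard definition from the Datadog API and storing it as JSON in GCS. Here is a rough sketch, assuming the `datadog` and `google-cloud-storage` Python libraries; the keys, bucket name, and path layout are placeholders:

```python
# Rough sketch of a dashboard backup job: export every dashboard definition
# from Datadog and store it as JSON in a GCS bucket. Keys, bucket name, and
# path layout are placeholders.
import json
from datadog import initialize, api
from google.cloud import storage

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

def backup_dashboards(bucket_name="monitoring-dashboard-backup"):
    bucket = storage.Client().bucket(bucket_name)
    for board in api.Dashboard.get_all()["dashboards"]:
        definition = api.Dashboard.get(board["id"])  # full JSON definition
        blob = bucket.blob(f"datadog/dashboards/{board['id']}.json")
        blob.upload_from_string(json.dumps(definition, indent=2),
                                content_type="application/json")
```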

# Integration to On-Call Management System

  • Up to this point, we have metrics and dashboards in Datadog. Now, let's start playing around with our on-call management system. It's essential to define our monitor alert priorities, alert routing based on priority, the on-call schedule structure, and the escalation structure.
  • To establish a connection between our monitoring platform and the on-call management system, we need to exchange API keys: the API key from OpsGenie goes into Datadog, and the API key from Datadog goes into OpsGenie. This process requires full access to the integration settings pages on both sides.
  • We created a pipeline to configure each squad (team) in our on-call management system. We chose OpsGenie as our on-call management system because it provides a rich API that we can use to automate how we manage squads (teams): team members, escalations, and schedules. A simplified example of the squad-creation call follows this list.
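
For illustration, a squad-creation step against the OpsGenie Team API could look like the sketch below. The team name, member list, and API key are placeholders, and this is a simplification of what our pipeline actually does:

```python
# Simplified sketch of squad (team) creation via the OpsGenie Team API.
# Team name, members, and the API key are placeholders.
import requests

OPSGENIE_TEAMS_API = "https://api.opsgenie.com/v2/teams"

def create_squad(api_key, name, members):
    payload = {
        "name": name,
        "description": f"Squad {name}, managed by the monitoring pipeline",
        "members": [{"user": {"username": m}, "role": "user"} for m in members],
    }
    resp = requests.post(OPSGENIE_TEAMS_API,
                         headers={"Authorization": f"GenieKey {api_key}",
                                  "Content-Type": "application/json"},
                         json=payload,
                         timeout=30)
    resp.raise_for_status()
    return resp.json()

# create_squad("<OPSGENIE_API_KEY>", "checkout-squad",
#              ["alice@example.com", "bob@example.com"])
```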

# Configuring Alert Priorities

Every company is unique and has its own way and standard of determining P1 through P5 alerts. Some companies only use P1 and P2. Here is what we define as P1 through P5 in Bukalapak:

  1. P1: Bukalapak-wide incident
  2. P2: Critical service down / unacceptable performance
  3. P3: Non-critical service down
  4. P4: Non-SLO alerts
  5. P5: Non-production issue

# Alert Escalation Policies

A monitor alert escalation policy is a company policy for handling escalation based on the organization chart (org chart). There are two things that we need to define:

  1. Who will be the next person
  2. How long until it goes to the next person

In Bukalapak, we escalate our alerts based on priority: P1 and P2 alerts are escalated up to the VP level, while P3 through P5 alerts only go up to the Head of Engineering level. To give more context, here is what we've implemented so far for P1 and P2 (a sketch against the OpsGenie Escalation API follows the list).

  • Primary will be paged immediately after an alert is fired. Expected to ack in 5 minutes.
  • Secondary will be paged 5 minutes after an alert is fired. Expected to ack in 5 minutes.
  • All team members will be paged 10 minutes after an alert is fired. Expected to ack in 5 minutes.
  • EM and Head will be paged 15 minutes after an alert is fired. Expected to ack in 5 minutes.
  • VP will be paged 20 minutes after an alert is fired.
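
Expressed against the OpsGenie Escalation API, the P1/P2 chain above could look roughly like this sketch. The schedule, team, and user names are placeholders, and the primary page at minute zero is handled by the alert routing itself rather than by the escalation rules:

```python
# Sketch of the P1/P2 escalation chain using the OpsGenie Escalation API.
# Schedule, team, and user names are placeholders for illustration only.
import requests

def create_escalation(api_key, team_name):
    rules = [
        # 5 min: page the secondary (next participant in the on-call schedule)
        {"condition": "if-not-acked", "notifyType": "next",
         "delay": {"timeAmount": 5},
         "recipient": {"type": "schedule", "name": f"{team_name}_schedule"}},
        # 10 min: page everyone in the squad
        {"condition": "if-not-acked", "notifyType": "all",
         "delay": {"timeAmount": 10},
         "recipient": {"type": "team", "name": team_name}},
        # 15 min: page the EM / Head of Engineering
        {"condition": "if-not-acked", "notifyType": "default",
         "delay": {"timeAmount": 15},
         "recipient": {"type": "user", "username": "head@example.com"}},
        # 20 min: page the VP
        {"condition": "if-not-acked", "notifyType": "default",
         "delay": {"timeAmount": 20},
         "recipient": {"type": "user", "username": "vp@example.com"}},
    ]
    resp = requests.post(
        "https://api.opsgenie.com/v2/escalations",
        headers={"Authorization": f"GenieKey {api_key}"},
        json={"name": f"{team_name}_p1_p2_escalation",
              "ownerTeam": {"name": team_name},
              "rules": rules},
        timeout=30)
    resp.raise_for_status()
    return resp.json()
```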

# Monitor / Alerts Deployment

We have more than 50 product engineering squads across more than 15 tribes. We know that if all monitor/alert deployment is managed by the infrastructure monitoring team alone, it will not scale, and the team will most likely become a blocker. So we created a pipeline to deploy monitors/alerts and provided a guide on how to submit an MR/PR, configure monitors/alerts, write a sample query for a monitor, and set priority tags for each alert (a simplified monitor definition is sketched below). This documentation is part of our operational runbook and is well maintained by the monitoring team.
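
To make that concrete, a single monitor definition pushed by the pipeline could look like the simplified sketch below; the query, squad name, thresholds, and the @opsgenie handle are placeholders rather than our real configuration:

```python
# Simplified sketch of a monitor definition pushed via the Datadog API.
# Query, squad name, thresholds, and the @opsgenie handle are placeholders.
from datadog import initialize, api

initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

def deploy_monitor(squad, priority):
    return api.Monitor.create(
        type="metric alert",
        query="avg(last_5m):avg:system.cpu.user{service:checkout} by {host} > 90",
        name=f"[{priority}] High CPU on checkout hosts",
        message=("CPU usage above 90% for 5 minutes.\n"
                 f"Routed to on-call via @opsgenie-{squad}"),
        tags=[f"squad:{squad}", f"priority:{priority}", "managed-by:pipeline"],
        options={"thresholds": {"critical": 90}, "notify_no_data": False},
    )

# deploy_monitor("checkout-squad", "P2")
```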

That's the high-level picture; each item above has its own challenges in the details.

(How)

Thank God we have a reliable team that understands what it is doing and needs little supervision. They've done great things for this organization. Okay, let's summarize.

First, we listed everything in a note, populated the items as Jira stories, broke them down, and distributed them to the team.

  1. Automate your onboarding/offboarding process for the monitoring platform
  2. Create automation to deploy your monitoring agent
  3. Create templates for your dashboards
  4. Automate your team creation through a custom script and run it as a pipeline
  5. Establish the integration between the monitoring platform and on-call management
  6. Create a standard for alert prioritization
  7. Create a standard for escalation based on alert prioritization
  8. Automate your monitor/alert deployment via the pipeline

During the process, we released changes bit by bit, tracked progress every Friday as part of the weekly meeting, gathered feedback here and there, took notes, and converted them into Jira stories. The situation may be different for each organization, but shipping bit by bit makes it easier to adopt the changes and get fast feedback.

(What)

This is what we have achieved so far:

  • We have our pipeline to deploy alert monitors (including a synthetics monitor deployment pipeline).
  • We have our pipeline to deploy our on-call schedules into OpsGenie.
  • We have ~10K alert monitors deployed through our pipeline.
  • We have ~4K monitoring dashboards.
  • We have had ~13K successful pipeline runs.
  • Our average pipeline execution time is ~15 minutes.

[Image: Our on-call schedule, maintained through our automation pipeline]
[Image: Our OpsGenie setup with alert monitor priorities configured]
[Image: Datadog synthetics monitoring]
[Image: Datadog monitors (sample)]
[Image: Our pipeline to deploy alert monitors in Datadog]
[Image: One of our Datadog dashboards]

There is always room for improvement; nothing is perfect. We're still doing our best to meet our expectations, and at the end of the day there is no magic in automation. Someone needs to maintain the code and the automation (e.g., the pipelines) and keep them standardized and up to date.



My nickname "Awan" is taken from my name, [A]ddhe [Wa]rma[n], and means "cloud." I work with large-scale cloud environments on GCP and AWS.