Gotta monitor’em all! — Part 1

An adventure about monitoring with Zabbix

Bruno Padilha
May 8, 2017

As some of you know, I’m currently a SysAdmin at an insurance company in Brazil. My latest projects were a little DevOps-ish and cloud-oriented, including implementing Docker for some basic infrastructure tools, SaltStack for configuration management, and GitLab to version config files so we could track every change to LAN/SAN switches, simple scripts, reverse proxy config files and many other assets. But when you work in a traditional IT environment, things move much more slowly and not all of that work gets fully used… but let’s move on.

When I started in my current job, there was no monitoring tool. The infrastructure/operations team was small, all servers ran Microsoft Windows and no one really cared about monitoring. So my first idea was to implement Nagios with some basic checks, like CPU and RAM utilization, disk space, etc.

Nagios was the monitoring tool we relied on for some years to keep the team aware of problems and alarms, until a new person, who is also a great friend of mine, joined the team and we discovered Zabbix. He was responsible for managing Zabbix and keeping all the servers monitored, among other things.

From the day he left the company (late 2015) until early 2017, Zabbix was more or less abandoned. Every server had the agent installed, but the monitoring templates and configurations were too basic for what I wanted to achieve. After reading The Phoenix Project, I decided to start changing the way the team worked, and monitoring was the first thing to change. And that’s what this series is about: monitoring.

The first steps

The very first step was to update Zabbix Server to 3.0.x. I’m not going to explain how to do that; you can find plenty of tutorials on the internet, and the Zabbix documentation was a must throughout the whole project.

Holy ****, what a mess!

The second step was to check what was actually being monitored and why we were receiving so many alerts. I wrote some questions in a paper notebook:

  1. What services and systems are currently monitored?
  2. What is being monitored? Only hardware? OS status/health? Third-party services (systems)?
  3. What services are not being monitored? E.g. Exchange, web servers (IIS/Apache), systems for other departments, etc.
  4. What to monitor?

After answering these questions I was able to create a plan and start the project. To keep track of what I was doing and how much I had done, I created a new card on our Trello board.

Screenshot taken on 7th May 2017 (this list is huge!)

The plan was basically:

  1. Create a Host Group for every application/system;
  2. Update the Zabbix Agent through SaltStack (for Linux servers) or manually for Windows servers (shame on the Windows team that did not finish the SCCM implementation). I will explain later on why the update was necessary;
  3. Verify the applications or systems running on the server;
  4. Create a Template and add all the Items and Triggers for that Host Group;
  5. Create a Screen (for some services I had to create a custom Graph) and an Action.
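
Step 1 of the plan can also be scripted instead of clicked through the frontend. The following is only a minimal sketch, not what I actually ran: it builds JSON-RPC requests for the Zabbix API's `hostgroup.create` method, one per application. The URL, the helper names (`build_request`, `host_group_requests`, `send`) and the `auth` token value are assumptions for illustration, and passing the token in the `auth` field matches the Zabbix 3.0-era API (newer releases handle authentication differently).

```python
import json
import urllib.request

# Hypothetical endpoint; replace with your own Zabbix frontend URL.
ZABBIX_URL = "http://zabbix.example.com/api_jsonrpc.php"


def build_request(method, params, auth=None, request_id=1):
    """Build a JSON-RPC 2.0 request body in the shape the Zabbix API expects."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "auth": auth,  # session token from user.login; None before logging in
        "id": request_id,
    }


def host_group_requests(applications, auth):
    """Return one hostgroup.create request per application/system."""
    return [
        build_request("hostgroup.create", {"name": name}, auth, request_id=i)
        for i, name in enumerate(applications, start=1)
    ]


def send(request, url=ZABBIX_URL):
    """POST a request to the API and return the decoded JSON response."""
    data = json.dumps(request).encode("utf-8")
    http_request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json-rpc"}
    )
    with urllib.request.urlopen(http_request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical run would log in first with `send(build_request("user.login", {"user": "...", "password": "..."}))`, take the token from the response's `result` field, and then call `send` on each request returned by `host_group_requests(["Exchange", "IIS", "Apache"], token)`.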

With the plan in mind and a lot of energy to put into it, I began the work…

This post is part of a series called Gotta monitor'em all! To know when the next part is up, hit the Follow button, and if you liked the post, hit the little heart! ❤

Bruno Padilha

DevOps Engineer @ Leroy Merlin Brasil. Nerd, headbanger and gamer sometimes. I write in Portuguese and English.