What do you do when you have CronJobs running in your Kubernetes cluster and want to know when a job fails? Do you manually check the execution status? Painful. Or do you perhaps rely on roundabout Prometheus queries, adding unnecessary overhead? Not ideal… But worry not! Instead, let me suggest a way to immediately receive notifications when jobs fail to execute, using two nifty tools:
- cmaster11/Overseer — an open-source monitoring tool.
- Notify17 — a notification app that lets you receive notifications on Android/iOS and web.
Brief tech excursion: Kubernetes events
The underlying trick we will use is watching the stream of Kubernetes events. (A list of basic events can be found in the Kubernetes source code.)
Try running the following command in your cluster:
kubectl get events --all-namespaces
Most likely, you will see some interesting events happening. In my stream, I see a job that failed to create a pod. Womp womp.
50s Normal Pulling Pod pulling image "alpine"
23s Normal Pulled Pod Successfully pulled image "alpine"
23s Normal Created Pod Created container
23s Normal Started Pod Started container
2m39s Normal SuccessfulCreate Job Created pod: test-74rz4
22s Warning BackoffLimitExceeded Job Job has reached the specified backoff limit
You might notice that one of the events is BackoffLimitExceeded. This event is generated whenever a Job fails and there are no more retries available. This is the event we're going to watch with Overseer.
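If you want to single out these failures by hand, kubectl can filter the event stream server-side. As a rough sketch (BackoffLimitExceeded is emitted as a Warning-type event):

```shell
# List only Warning events across all namespaces
kubectl get events --all-namespaces --field-selector type=Warning

# Or keep watching the stream, printing new events as they arrive
kubectl get events --all-namespaces --watch
```

This is essentially what Overseer will automate for us, so we don't have to keep a terminal open.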
Overseer can easily be run in Kubernetes using the provided example. More specifically, we will use the following files:
- 000-namespace.yaml: the Overseer Kubernetes namespace.
- redis.yaml: the database where the alerts/found events will be stored.
- 001-service-account-k8s-event-watcher.yaml: a service account that lets Overseer watch Kubernetes events.
- overseer-k8s-event-watcher.yaml: the Overseer worker that will watch for new Kubernetes events.
- overseer-bridge-webhook-n17.yaml: the notification system to inform us about found events.
To start, we’ll set up the core of Overseer with the following commands:
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/000-namespace.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/redis.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/001-service-account-k8s-event-watcher.yaml
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-k8s-event-watcher.yaml
You can monitor the process with:
watch kubectl -n overseer get pod
(If the watch command isn't available on your system, kubectl -n overseer get pod -w achieves the same.)
When all pods are up and running, let’s proceed with the notifier!
To set up the notifier:
- Create a Notify17 account (it's free!).
- Next, create a notification template from the dashboard by pressing the import button and pasting the following configuration:
Once you’ve imported the template, save it by clicking the Save button.
The last step is to set up Overseer's webhook bridge. Copy the file https://github.com/cmaster11/overseer/blob/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-bridge-webhook-n17.yaml to a local directory, replace REPLACE_TEMPLATE_API_KEY with your notification template API key, and then apply the file with kubectl apply -f FILE_PATH.
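The download-and-substitute step can also be scripted. This is just a sketch, where the N17_API_KEY variable is assumed to hold your template's API key:

```shell
# Fetch the bridge manifest locally (pinned to the same commit as above)
curl -sLo overseer-bridge-webhook-n17.yaml \
  https://raw.githubusercontent.com/cmaster11/overseer/3f8ee2bbc1e5452d292e14c8b3e78960385b7ac9/example-kubernetes/overseer-bridge-webhook-n17.yaml

# Substitute the placeholder with your real API key
sed -i "s/REPLACE_TEMPLATE_API_KEY/${N17_API_KEY}/" overseer-bridge-webhook-n17.yaml

# Apply the patched manifest
kubectl apply -f overseer-bridge-webhook-n17.yaml
```

Note that on macOS (BSD sed), the in-place flag needs an explicit backup suffix argument: sed -i '' "s/…/…/" FILE.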
And we’re done!
To test the whole system, you can try to apply the failing job example file:
kubectl apply -f https://raw.githubusercontent.com/cmaster11/overseer/master/example-kubernetes/example-failing-job/job-fail.yaml
The job will fail and in a few seconds Overseer should generate an alert and send it through Notify17!
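If you'd rather not pull the example from GitHub, a failing Job of this kind can be sketched locally too. The manifest below is my own minimal version (names and values are assumptions, not the exact contents of job-fail.yaml):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test-failing-job
spec:
  # After 1 retry, Kubernetes emits the BackoffLimitExceeded event
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fail
          image: alpine
          # Exit with a non-zero code so the Job always fails
          command: ["sh", "-c", "exit 1"]
```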
P.S. If something doesn't work, remember that kubectl get pod and kubectl logs POD_NAME are your friends.
To clean up Overseer, just delete its namespace with:
kubectl delete ns overseer