Proactive Monitoring via SlackBot

Dotandaya
Plarium-engineering
4 min read · Jul 18, 2022

As the infrastructure team at Plarium, we serve several teams: data engineers, data scientists, and data analysts. In the past year, all of these teams began working with our interactive platform, which we described in our previous article.

Process automation became an essential factor in the ongoing work of a growing number of teams, as did automated monitoring that lets team members track their processes and receive alerts when they fail.

As distributed teams, we do a significant fraction of our communication in Slack. Slack bots are useful for sending notifications to team members and for making day-to-day work more efficient by automating workflow tasks and information sharing. We therefore decided to investigate this option to provide proactive monitoring of process alerts and act on them accordingly. In addition, users can get information about their processes, check the statuses and logs of recent runs, and trigger them.

A bot is a type of Slack app designed to interact with users via conversation. For that, we used Bolt — “a JavaScript framework to build Slack apps in a flash with the latest platform features”. Our app is deployed on K8S and:

- defines a custom receiver that handles and parses incoming requests from Slack and passes them to the app
- listens for and sends messages
- listens for and responds to actions and events
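A core job of any custom receiver is verifying that incoming requests really come from Slack before parsing them. Our app does this in JavaScript via Bolt; purely as an illustration of the underlying check, here is a sketch of Slack's `v0` request-signing scheme in Python using only the standard library (the secret and headers shown are placeholders):

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, max_age_s: int = 60 * 5) -> bool:
    """Check the X-Slack-Signature header against the raw request body.

    Slack signs "v0:{timestamp}:{raw body}" with HMAC-SHA256 using the
    app's signing secret; the header carries "v0=" + the hex digest.
    """
    # Reject stale requests to limit replay attacks.
    if abs(time.time() - int(timestamp)) > max_age_s:
        return False
    base_string = f"v0:{timestamp}:{body}".encode()
    digest = hmac.new(signing_secret.encode(), base_string,
                      hashlib.sha256).hexdigest()
    return hmac.compare_digest(f"v0={digest}", signature)
```

Bolt's built-in receivers perform this check for you; a custom receiver only needs to reproduce it before forwarding the parsed payload to the app.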

Our processes run on Jupyter notebooks scheduled by Google Composer (managed Airflow). In the default_args of each DAG, we defined an on_failure_callback that sends information about the failed DAG to our Slack app. We also defined slash commands that return information about the latest DAG runs and their logs, and that can trigger a DAG.
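The callback itself is ordinary Python in the DAG file: Airflow invokes it with the task's context dict when a task fails. A minimal sketch of the idea, where `post_to_slack` is a hypothetical helper standing in for the call to our Slack app:

```python
def post_to_slack(channel: str, text: str) -> None:
    """Hypothetical helper: forwards `text` to our Slack app.

    In the real setup this would be a chat.postMessage call or an
    incoming webhook; stubbed here so the sketch is self-contained.
    """
    print(f"[{channel}] {text}")


def build_failure_message(context: dict) -> str:
    """Turn the Airflow failure-callback context into a Slack alert line."""
    ti = context["task_instance"]
    return (f":red_circle: DAG *{ti.dag_id}* failed on task *{ti.task_id}* "
            f"(execution date: {context['execution_date']})")


def on_failure_callback(context: dict) -> None:
    # Airflow passes the task context; we forward a summary to Slack.
    post_to_slack(channel="#data-alerts", text=build_failure_message(context))


default_args = {
    "owner": "infra",
    "on_failure_callback": on_failure_callback,  # runs for every failed task
}
```

Because the callback sits in default_args, every task in the DAG inherits it, so a single definition covers the whole pipeline.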

Our Slack bot takes monitoring one step further: team members can manage their processes from anywhere and keep the team updated on their status.

This is what it looks like:

The alert includes the process and task names of the failed DAG and opens a thread that contains a button for rerunning the process and a follow-up message that updates its status. Inside the message, the user can choose a task to view its logs, and check the last 10 run statuses of the process.
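Such an interactive alert is a Slack Block Kit payload: a section block with the failure details plus an actions block whose button carries an action_id the app listens for. A sketch of how the payload could be assembled (the `rerun_dag` action id and function name are illustrative, not our exact implementation):

```python
def build_alert_blocks(dag_id: str, task_id: str) -> list:
    """Block Kit blocks for a failure alert with a rerun button."""
    return [
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f":warning: *{dag_id}* failed on task `{task_id}`",
            },
        },
        {
            "type": "actions",
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": "Rerun"},
                    "action_id": "rerun_dag",  # the app's action listener key
                    "value": dag_id,           # tells the handler what to trigger
                },
            ],
        },
    ]
```

When the button is clicked, Slack posts the action payload back to the app, which can then call the Airflow API to trigger the DAG and reply in the thread.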

After the process owner finishes handling the failure, they can close the thread, which updates the group about it.

The slash commands we defined:

/air-dags — returns a dropdown with our DAGs in Airflow.

Clicking one of them returns information about the chosen DAG and allows triggering it:

/air-dag-tasks dev-develop-survey-provider-full-int — returns a dropdown with the tasks of the DAG's last run.

Clicking one of the tasks returns its logs:

/air-dag-runs dev-develop-survey-provider-full-int — returns a dropdown with the statuses of the DAG's last 10 runs.

Clicking one of the runs returns information about that specific run:

/air-dags-running dev-develop-survey-provider-full-int — returns a dropdown with the DAG's current runs.

Clicking one of the runs returns information about the chosen current run:

/air-run dev-develop-survey-provider-full-int — triggers the DAG.
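All of the dropdown-returning commands above follow the same pattern: the handler acknowledges the slash command, fetches the relevant list from Airflow, and replies with a Block Kit static_select whose selection fires a follow-up action. A sketch of the dropdown payload for a command like /air-dags (the `dag_selected` action id and function name are illustrative):

```python
def build_dag_dropdown(dag_ids: list) -> list:
    """Block Kit blocks with a static_select listing the available DAGs."""
    return [
        {
            "type": "section",
            "text": {"type": "mrkdwn", "text": "Choose a DAG:"},
            "accessory": {
                "type": "static_select",
                "action_id": "dag_selected",  # the app listens for this action
                "options": [
                    {
                        "text": {"type": "plain_text", "text": dag_id},
                        "value": dag_id,
                    }
                    for dag_id in dag_ids
                ],
            },
        }
    ]
```

One practical note: Slack caps a static_select at 100 options, so a very long DAG list would need filtering or an external_select instead.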
