Unified Monitoring of ETL Performance with BumbleBee

Published in

Walmart Global Tech Blog

5 min read · Aug 4, 2021

In the Big Data realm, the performance of an ETL process is important, but monitoring that performance is equally important.

Here, monitoring is not just about how well or how fast an ETL process runs. We also consider attributes such as the integrity of a given workflow against its parent and child workflows, the code quality of that workflow, and more.

There is a set of tools for measuring these attributes and displaying them on a dashboard. Each tool calculates specific metrics associated with specific attributes, so it is challenging to display all of a workflow's metrics in a single dashboard. Yet a single unified dashboard has clear advantages: it gives a holistic view of the overall performance of the entire workflow.

We, at Walmart, have various tools and frameworks to measure metrics such as code quality, load testing, and unit testing. Each framework calculates its own metrics, which are stored separately.

Because the data sources for these metrics differ, it is difficult to get a consolidated view of all metrics in one place. As a result, there are multiple dashboards for different metrics, and one has to know about every dashboard in order to navigate to the right one.

To address the above problem, we have come up with an aggregation service named “BumbleBee” that is intended to accumulate various metrics.

Why BumbleBee?

The idea is similar to what a bumblebee does. A bumblebee goes from flower to flower gathering nectar and brings it back to a single store that others can also consume. Hence the tool is named BumbleBee.

BumbleBee collects metrics such as code quality, YARN utilisation, and load/performance testing results from their respective sources and persists them into a single database (the nectar), which in turn lets us display all relevant metrics on a single dashboard at the team level.

BumbleBee Eco-system:

What is BumbleBee?

BumbleBee is a Scala service that aggregates performance metrics from different sources into a single database, which can then serve as the source of truth for a dashboard. It can be deployed on any edge node that can connect to the Airflow DB, and with appropriate configuration it can also talk to the YARN APIs. It uses components such as the ServiceRunner container, which takes care of polling at a defined interval, and webhooks, through which Sonar notifies it when an analysis completes.

BumbleBee started by retrieving Airflow applications and the YARN resources associated with each of them, if any. Now it reads all application details from both Airflow and YARN (even YARN applications not associated with any Airflow application) by polling the two at the same time.

The BumbleBee service has been further extended to track integration testing metrics, since integration testing (for ETL workloads) runs on Airflow. We store task-level information so that we get a granular view of each integration testing workflow and its resource consumption on the YARN side. Along with this, it can track performance/load testing metrics from Automaton (a load/performance testing tool) and code quality metrics from SonarQube, to show trends in API performance and code quality standards respectively.

How does polling work in BumbleBee?

BumbleBee has a very simple flow for continuously tracking and retrieving these metrics. It polls the respective services at a configurable interval, using configuration such as project details and the respective YARN cluster, retrieves the metrics, and stores them in a database with a suitable schema. All these metrics are stored in different tables under the same database. Any BI tool can connect to the database and visualize the aggregated metrics through written queries.
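The polling flow described above can be sketched roughly as follows. This is a minimal illustration in Python rather than BumbleBee's actual Scala code: the `Collector` type, the `metrics` table, and the function names are all hypothetical. For brevity it writes every source into one table, whereas the real service keeps separate tables per metric type.

```python
import sqlite3
import time
from typing import Callable, Iterable, List, Tuple

# One metric row: (source, metric name, value). Hypothetical shape.
Row = Tuple[str, str, float]
Collector = Callable[[], Iterable[Row]]


def init_store(conn: sqlite3.Connection) -> None:
    """Create the simplified, single-table metrics store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics "
        "(source TEXT, name TEXT, value REAL, collected_at REAL)"
    )


def poll_once(conn: sqlite3.Connection, collectors: List[Collector], now: float) -> int:
    """Run every collector once and persist whatever it returns."""
    stored = 0
    for collect in collectors:
        try:
            rows = list(collect())
        except Exception:
            # A failing source is skipped so the other sources still get polled.
            continue
        conn.executemany(
            "INSERT INTO metrics VALUES (?, ?, ?, ?)",
            [(source, name, value, now) for source, name, value in rows],
        )
        stored += len(rows)
    conn.commit()
    return stored


def run_poller(
    conn: sqlite3.Connection,
    collectors: List[Collector],
    interval_seconds: float,
    max_cycles: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> int:
    """Poll at a fixed, configurable interval for max_cycles rounds."""
    total = 0
    for cycle in range(max_cycles):
        total += poll_once(conn, collectors, clock())
        if cycle < max_cycles - 1:
            sleep(interval_seconds)
    return total
```

A long-running service would loop indefinitely instead of stopping after `max_cycles`; the bound is here only to keep the sketch easy to run. A BI tool could then query the `metrics` table directly, for example by grouping on `source` and `name`.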

Design Diagram:

A deeper look at how BumbleBee works

For Airflow tasks, BumbleBee checks for tasks created or completed since the last execution and persists their details into the database. It then attempts to retrieve the YARN application ID, if available, from the Airflow task logs.
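As a rough illustration of this step, the sketch below filters tasks by a last-run watermark and pulls a YARN application ID out of log text. It assumes the standard `application_<cluster timestamp>_<sequence>` ID format; the `TaskRecord` type and both function names are hypothetical, and how BumbleBee actually reads Airflow logs is not shown in this post.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# YARN application IDs look like application_<cluster timestamp>_<sequence number>.
YARN_APP_ID = re.compile(r"application_\d+_\d+")


@dataclass
class TaskRecord:
    """Hypothetical view of an Airflow task: its ID, last change time, and log text."""
    task_id: str
    updated_at: datetime
    log_text: str


def changed_since(tasks: List[TaskRecord], last_run: datetime) -> List[TaskRecord]:
    """Keep only tasks created or completed after the previous execution."""
    return [task for task in tasks if task.updated_at > last_run]


def extract_yarn_app_id(log_text: str) -> Optional[str]:
    """Return the first YARN application ID found in the log, or None if absent."""
    match = YARN_APP_ID.search(log_text)
    return match.group(0) if match else None
```

Returning `None` when no ID is found matches the "if available" wording above: not every Airflow task launches a YARN application.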

Once the YARN application ID is retrieved, BumbleBee fetches the resources the application used on the YARN cluster, such as memory, cores, and execution time. Once we have the Airflow-YARN applications, we need some mechanism to identify the integration testing workflows among them. BumbleBee provides for this in its configuration: you can pass valid workflow prefixes, and workflows matching them are taken into consideration even if they have no associated YARN application ID. With this, it stores each parent integration testing workflow and its tasks.
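The prefix rule can be expressed in a few lines. This is only a sketch of the decision described above; the function name, its parameters, and the example prefix in the usage note are hypothetical and not taken from BumbleBee's configuration.

```python
from typing import Optional, Sequence


def should_track(
    workflow_id: str,
    yarn_app_id: Optional[str],
    integration_prefixes: Sequence[str],
) -> bool:
    """Decide whether a workflow is stored.

    Workflows with a YARN application ID are always tracked. Workflows without
    one are tracked only if their ID starts with a configured prefix.
    """
    if yarn_app_id is not None:
        return True
    # str.startswith accepts a tuple and returns False for an empty tuple.
    return workflow_id.startswith(tuple(integration_prefixes))
```

For example, with a configured prefix of `it_`, a workflow named `it_orders_pipeline` would be tracked even without a YARN application ID, while `adhoc_cleanup` would not.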

Automaton performs load testing for API services and posts the results via email and a Slack channel. BumbleBee reads load testing updates from that Slack channel and then attempts to fetch the load/performance testing metrics from the Automaton API. It stores metrics at the API and status-code level, which can also be aggregated to show project-level performance.
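To make the roll-up from API and status-code level to project level concrete, here is a small sketch. The `ApiMetric` fields are assumptions, since the post does not list which metrics Automaton returns, and treating 5xx responses as errors is likewise an illustrative choice.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ApiMetric:
    """Hypothetical row stored per API and status code."""
    project: str
    api: str
    status_code: int
    requests: int
    mean_latency_ms: float


def project_summary(rows: Iterable[ApiMetric]) -> Dict[str, Dict[str, float]]:
    """Aggregate API/status-code rows into one summary per project."""
    totals = defaultdict(lambda: {"requests": 0, "errors": 0, "latency_sum": 0.0})
    for row in rows:
        bucket = totals[row.project]
        bucket["requests"] += row.requests
        if row.status_code >= 500:  # assumption: only 5xx responses count as errors
            bucket["errors"] += row.requests
        # Weight each mean by its request count so the project mean stays correct.
        bucket["latency_sum"] += row.mean_latency_ms * row.requests

    summary = {}
    for project, bucket in totals.items():
        requests = bucket["requests"]
        summary[project] = {
            "requests": requests,
            "error_rate": bucket["errors"] / requests if requests else 0.0,
            "mean_latency_ms": bucket["latency_sum"] / requests if requests else 0.0,
        }
    return summary
```

Keeping the stored rows at the finest grain and aggregating at query time means the same data can serve both a per-API view and a per-project view.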

BumbleBee uses webhooks and executors to retrieve code quality metrics from Sonar. It exposes a webhook that can be configured for any Sonar project. Sonar calls this webhook when the code analysis of a given project finishes. On each webhook call, BumbleBee fetches the code quality metrics from Sonar and stores them in the database.
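The webhook flow might look something like the sketch below. It assumes the webhook body is JSON carrying a `status` field and a `project.key` field, which should be checked against the SonarQube version in use. The `fetch_metrics` and `store` callables are hypothetical stand-ins for the Sonar API call and the database write.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical hooks: one fetches metrics for a project key, one persists them.
FetchMetrics = Callable[[str], Dict[str, float]]
StoreMetrics = Callable[[str, Dict[str, float]], None]


def handle_sonar_webhook(
    body: bytes,
    fetch_metrics: FetchMetrics,
    store: StoreMetrics,
) -> Optional[str]:
    """Process one webhook call; return the project key if metrics were stored."""
    payload = json.loads(body)
    # Assumed payload shape: {"status": "SUCCESS", "project": {"key": "..."}}
    if payload.get("status") != "SUCCESS":
        return None
    project_key = payload.get("project", {}).get("key")
    if not project_key:
        return None
    store(project_key, fetch_metrics(project_key))
    return project_key
```

Because the webhook only signals that an analysis finished, the handler still has to call back into Sonar for the actual numbers. This matches the push-then-fetch flow described above and avoids polling Sonar on a timer.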

With BumbleBee in the picture, we are planning to move out of silos and bring every important metric for service and ETL performance onto a single dashboard. Just as the nectar collected by a bumblebee (the insect) is sugar-rich, the nectar (metrics) collected by our BumbleBee is data-rich. These metrics are useful not only for visualization and dashboards but also for identifying candidates for performance improvement.


Written by Jalpesh Borad