Engineering key performance indicators as an operational improvement to create an analytical basis for decision making
KPIs rule the world. There is a saying, “what gets measured gets done.” Measurement is a basic part of management. With good indicators, you can easily see whether your team is achieving a goal. You can determine whether your work is making an impact, manage your processes, and track problems as they appear, all of which boosts overall efficiency. Everyone is different, but in most cases, adding measurement in any form increases motivation and performance, because it keeps people focused on progress toward an intended result.
For a technology company like GumGum, it’s very important to measure engineering KPIs (or EKPIs): key performance indicators used in the engineering industry to monitor the health and efficiency of projects. They give us the information we need to see whether all our services are working properly, and monitoring them allows us to respond quickly to any issue. We maintain many different systems. Some of them are crucial: when they fail, we lose money. We need to keep these systems working all the time and keep monitoring their health.
It’s not easy to monitor dozens of different applications. Each one has its own set of KPIs we want to measure. Of course, there is no holy grail for that. It’s hard to configure a single system that monitors everything we need, which is why we use several tools. Each team is responsible for its services and dependencies. For instance, the AdExchange team should monitor the ad delivery flow to be sure that it’s uninterrupted. To do so, we should check the following:
- AWS CloudWatch to monitor EC2 instances
- Prometheus to monitor Geoserver cache
- PagerDuty to monitor all the incidents and notify the team
- ScyllaDB and more
On the other hand, GumGum’s Web Engineering team should monitor its Web API, which is shared by several web applications, as well as some smaller standalone apps.
We use different software for that, for instance:
The above example shows how many different tools we need to measure engineering KPIs. Each measurement system looks different, has a different UI, and may have a separate authentication method. Monitoring everything can be challenging, especially for someone outside the team who is not familiar with all the systems. It’s also really hard to get a bigger overview or compare data from different sources.
I had a dream solution
We decided to do something about that. The goal of the Engineering KPIs project was to develop a standardized platform for reporting and visualizing key engineering metrics. We wanted to create a place where we can access KPIs from all the systems in one, unified way. It can bring value for many people in the organization, not only engineers but also management.
Such a tool can increase the visibility of engineering metrics to the rest of the business. We don’t need to sign in to multiple systems and be familiar with them to check the data we need. Having everything in one place allows us to easily compare data and paves the way for implementing quarterly/prioritized project metrics.
The idea behind this project was to make it automated and allow other teams to contribute. Each team is responsible for its indicators and should be able to manage them, including defining alarms and adding new metrics.
We decided to use Python as the project programming language and execute the code on AWS Lambda. Our Lambda is scheduled to run every hour. The concept is shown in the following figure:
In the first step, AWS Lambda executes Python code. It’s responsible for getting all metrics data and outputting results to the S3 bucket. Each metric is different; our Python-based solution connects to many external systems (like the CloudWatch API, Snowflake, etc.) to get the data we need. Then, it converts the responses to the agreed structure so they can be easily serialized to JSON files. Finally, it uploads the resulting files to the S3 bucket.
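The convert-and-upload part of this step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the record layout (`metric`, `timestamp`, `value`), the key scheme `<prefix>/<metric>/<timestamp>.json`, the one-JSON-document-per-line body, and the function names `to_record`, `build_s3_object` and `publish` are all assumptions made for the example. The uploader is passed in as a callable so the sketch runs without AWS credentials.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Iterable


def to_record(metric_name: str, timestamp: datetime, value: float) -> dict:
    """Convert one measurement to a flat, JSON-serializable dict (assumed layout)."""
    return {
        "metric": metric_name,
        "timestamp": timestamp.astimezone(timezone.utc).isoformat(),
        "value": value,
    }


def build_s3_object(
    prefix: str, metric_name: str, run_time: datetime, records: Iterable[dict]
) -> tuple[str, bytes]:
    """Return the (key, body) pair for one metric and one scheduled run."""
    # run_time should be timezone-aware; it is normalized to UTC for the key.
    stamp = run_time.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    key = f"{prefix}/{metric_name}/{stamp}.json"
    # One JSON document per line (an assumed file layout).
    body = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return key, body.encode("utf-8")


def publish(
    upload: Callable[[str, bytes], None],
    prefix: str,
    metric_name: str,
    run_time: datetime,
    records: Iterable[dict],
) -> str:
    """Serialize the records and hand them to the injected uploader."""
    key, body = build_s3_object(prefix, metric_name, run_time, records)
    upload(key, body)
    return key
```

Inside a Lambda, `upload` could be a thin wrapper around boto3's S3 client, for example a function that calls `put_object(Bucket=..., Key=key, Body=body)`. Keeping the serialization separate from the upload makes the conversion logic easy to test locally.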
Pypelayer takes care of setting up the next step. It automates the process of creating data ingestion pipelines from S3 to Snowflake. It creates notifications on the S3 bucket, Snowflake stages, Snowpipes and tables based on the JSON file structure. With just one command, everything is set up and data is loaded into Snowflake. Each time a new file is uploaded, it triggers an event to add new data to the Snowflake database — no additional work is required.
The last step is to visualize the metrics. GumGum as a company uses Looker for that, so we decided to do the same. Looker meets all of our requirements. After setting up a connection to our EKPIs Snowflake database, we can easily add new views and prepare dashboards. Looker also allows us to define alarms.
Deeper look into the code
Our Python script iterates over a list of metrics and tries to get data for each of them for a certain period. Data can be gathered from different systems, so we need different connectors to reach them; e.g., the Snowflake connector fetches measurement data from the Snowflake warehouse. In the following figure, you can see some of the connectors and metrics implemented in the first phase of the project.
The concept is quite simple. As you can see, we can add more connectors and metrics in the future.
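The loop over metrics and connectors can be sketched like this. It is a simplified illustration under stated assumptions: `Metric`, `fetch`, `StaticConnector` and `collect` are hypothetical names, not the project's real classes, and isolating the failure of a single metric from the rest is our reading of "tries to get data for each of them" rather than confirmed behavior.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Metric:
    """Hypothetical stand-in for the project's metric definition."""
    name: str
    connector: str  # key of the connector that knows how to fetch this metric


class Connector(ABC):
    """Hypothetical base class: one subclass per external system."""

    @abstractmethod
    def fetch(self, metric: Metric, start: datetime, end: datetime) -> list[dict]:
        ...


class StaticConnector(Connector):
    """Toy connector returning a fixed value; a real one would call an API or database."""

    def __init__(self, value: float) -> None:
        self.value = value

    def fetch(self, metric: Metric, start: datetime, end: datetime) -> list[dict]:
        return [{"metric": metric.name, "timestamp": end.isoformat(), "value": self.value}]


def collect(metrics, connectors, start, end):
    """Fetch every metric for the window [start, end); record failures per metric."""
    results, failures = {}, {}
    for metric in metrics:
        try:
            results[metric.name] = connectors[metric.connector].fetch(metric, start, end)
        except Exception as exc:  # one failing metric should not stop the others
            failures[metric.name] = repr(exc)
    return results, failures


if __name__ == "__main__":
    end = datetime(2021, 3, 1, 12, 0)
    start = end - timedelta(hours=1)  # the hourly schedule suggests a one-hour window
    print(collect([Metric("uptime", "static")], {"static": StaticConnector(1.0)}, start, end))
```

With this shape, supporting a new system means registering one more `Connector` subclass, and supporting a new indicator means appending one more `Metric` to the list.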
Contribution by adding new metrics
Adding a new metric for existing connectors should be easy. How can we do that? We just need to define a new MetricDto object. This object contains all the information about the metric:
When this is done, the new metric will be collected at the next scheduled run, and its data will be uploaded to the S3 bucket. As mentioned before, the next step is to run the Pypelayer datasource command:
pypelayer datasource new --backload --s3_path=s3://bucketname/data-engineering/pingdom_uptime
After running this command, Pypelayer will set up a connection between the S3 bucket and Snowflake. It will also create a table and load data. After the process is done, the data is ready and available in our Snowflake database. With a few clicks in Looker, we can create fancy visualizations.
Contribution by writing new connectors
Developing a new connector can be a little more challenging, but it should be possible for engineers with basic programming skills. To do that, we need to create a new class extending the abstract Connector class and implement its method for fetching metric data. You can see an example here:
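from datetime import datetime

from connector.connector import Connector
from metric.metric_dto import MetricDto
from metric.metric_template_dto import MetricTemplateDto

# NOTE: the class name and method name below are placeholders chosen for illustration.
class ExampleConnector(Connector):
    def get_metric_data(
        self, metric: MetricDto, start_time: datetime, end_time: datetime
    ) -> list[MetricTemplateDto]:
        # get data from external system
        # convert data and return result
        raise NotImplementedError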
This method is executed during the EKPIs Lambda execution. From there, we usually connect to the system that provides the metric, e.g., an API or a database. Then we get the data and map it to MetricTemplateDto objects, which the method returns as its result.
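The mapping step might look roughly like the sketch below. The fields of MetricTemplateDto are not shown in this post, so `MetricPoint` is a hypothetical stand-in, and the raw response shape (rows with an epoch-seconds `time` and a `value`) is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class MetricPoint:
    """Hypothetical stand-in for MetricTemplateDto; the real fields may differ."""
    metric_name: str
    timestamp: datetime
    value: float


def map_response(
    metric_name: str, rows: list[dict], start_time: datetime, end_time: datetime
) -> list[MetricPoint]:
    """Map raw rows to typed points, keeping only those inside [start_time, end_time)."""
    points = []
    for row in rows:
        # Assumed raw format: epoch seconds under "time", a numeric "value".
        timestamp = datetime.fromtimestamp(row["time"], tz=timezone.utc)
        if start_time <= timestamp < end_time:
            points.append(MetricPoint(metric_name, timestamp, float(row["value"])))
    return points
```

Filtering on a half-open window keeps two consecutive hourly runs from reporting the same data point twice.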
Summary and final look
After preparing everything, we can definitely say: it was worth it. We have built the EKPIs project, which gathers all of GumGum’s engineering key performance indicators in one place. There were no problems with getting metrics, since each modern system offers an API for extracting its data. Our tool is automated, easy to maintain, and easy to contribute to. We can extend it with new metrics we might need in the future.
We have created a board in Looker with a separate dashboard for each team. Each dashboard contains several useful EKPIs for its squad. Teams can manage their dashboards and set alarms. They can also add new shared metrics or connectors that all other teams can benefit from.
Developing a project from scratch to a working product is always a rewarding experience. I had the opportunity to work with a variety of technologies, including AWS Lambda, Prometheus, Scylla, and REST/GraphQL APIs, to collect and store data. It’s an interesting project providing valuable insights to engineering teams.
Rafał Ścipień, Senior Software Engineer at GumGum
As part of the Data Engineering team, I had the pleasure of developing Pypelayer and working on a connector for Airflow. Getting KPIs like the number of task retries or execution time helps greatly to keep all the DAGs up and running. The benefits of having such high-level insights are immeasurable when you are managing a system like Airflow.
Filip Hanzel, Software Engineer at GumGum
References and resources
Header image: https://unsplash.com/photos/qwtCeJ5cLYs
Drawing tool: https://excalidraw.com