Engineering Quality Dashboard Initiative — Part 1
Today, software engineering teams use many tools for product development and delivery. Each tool captures different information that is vital to understanding the holistic status of our product quality. Each team puts a lot of effort, in its own silo, into getting feedback on product quality, but that information remains hidden or scattered across different systems.
To solve this problem, we are working on a Quality Dashboard initiative. It is a collaborative effort involving engineers from all the tribes, product managers and designers, customer support, internal analytics, and product marketing teams. The work is still in progress and might take another two months to take good shape.
Opportunity Statement
There is a need to create a dashboard for each product team where one can visualise the quality of the team's deliverables at a glance. The dashboard should show the current state of the key quality indicators and the trends or progress made in each quality area.
Purpose
- Deliver consistent experience to our customers
- Strengthen confidence for all stakeholders
- Start quality improvement discussions
- Prepare teams to address future complexities
- Improve efficiency
This information would be used by the product teams and the leadership group to identify and prioritise quality issues.
Holistic View of Product Quality
Product quality depends on all the aspects of our end-to-end software development process. We can’t look at one system or part of the process to measure it.
Our Engineering Leadership Team (ELT) helped us by prioritising six quality areas.
- Functional Capability — How well is the team meeting (or exceeding) their end-user expectations?
- Reliability — How failure-free are the services developed by the team?
- Performance — How well are the services owned by the team running in production?
- Usability — How user-friendly are the features released by the team?
- Security — How secure are the solutions built by the team?
- Maintainability — How well designed are the services developed by the team?
Challenges
Our first challenge was to identify metrics that could help us measure these quality areas. Initially, we researched and came up with an exhaustive list. However, we realised that not all of those metrics are easily measurable.
To start with, we decided to visualise metrics using data captured in our existing toolset. We had to learn how to collect data from the different sources in an automated way. We picked Grafana for our dashboard, as most of us were familiar with it.
We had a wide variety of metrics. Most of them were lagging metrics, based on data continuously published by the deployed services in production. We also had some leading metrics based on survey results, which won't change that often. We still need to work on the visual representation of these metrics.
Each team owns multiple services in different lifecycle modes: active, maintenance, and minor_changes. Our SRE team did this classification of services. We also have services deployed in various regions globally. We decided to start by measuring data from active services (rather than dealing with legacy code) across all regions.
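To make this concrete, here is a minimal sketch of that filtering step, assuming the SRE classification is exported as a JSON catalogue with hypothetical fields such as `name`, `team`, `lifecycle`, and `region` (the real catalogue format may differ):

```python
# Sketch: keep only the services we measure for v1.
# The catalogue path and field names are illustrative assumptions.
import json

ACTIVE_MODES = {"active"}  # "maintenance" and "minor_changes" are out of scope for now


def load_catalogue(path: str) -> list[dict]:
    with open(path) as f:
        return json.load(f)


def measurable_services(catalogue: list[dict]) -> list[dict]:
    """Return active services across all regions."""
    return [svc for svc in catalogue if svc.get("lifecycle") in ACTIVE_MODES]


if __name__ == "__main__":
    services = measurable_services(load_catalogue("service_catalogue.json"))
    teams = sorted({svc["team"] for svc in services})
    print(f"{len(services)} active services across {len(teams)} teams")
```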
After that, we had to come up with rules to aggregate the service-level data into a team-level rating. We also have common code repositories shared by a few apps, so we had to come up with rules to segregate that data so it could be displayed on the individual team dashboards.
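As an illustration of how such rules could look, the sketch below averages per-service scores into a team-level rating per quality area and attributes data from shared repositories to teams through a path-to-team ownership map. The team names, the ownership map, and the simple averaging rule are all assumptions, not our final aggregation logic:

```python
# Illustrative aggregation rules; the averaging and the ownership map
# are placeholders rather than the rules we finally settled on.
from collections import defaultdict
from statistics import mean

# Hypothetical mapping for shared repositories: path prefix -> owning team.
SHARED_REPO_OWNERS = {
    "apps/checkout/": "payments",
    "apps/search/": "discovery",
}


def team_for_path(repo_path: str, default_team: str) -> str:
    """Segregate data from shared repos by matching a path prefix."""
    for prefix, team in SHARED_REPO_OWNERS.items():
        if repo_path.startswith(prefix):
            return team
    return default_team


def team_ratings(service_scores: list[dict]) -> dict:
    """Aggregate rows like {"team": "payments", "area": "reliability", "score": 0.93}
    into {team: {area: rating}} by simple averaging."""
    buckets = defaultdict(list)
    for row in service_scores:
        buckets[(row["team"], row["area"])].append(row["score"])
    ratings = defaultdict(dict)
    for (team, area), scores in buckets.items():
        ratings[team][area] = round(mean(scores), 2)
    return dict(ratings)
```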
Key Quality Metrics
Below are the indicators we shortlisted:
- Reliability — SLOs, number of crashes, production incidents, and alerts
- Performance — Slow render and frozen frames for the Mobile Apps, page speed index for the Web App, and response time, error rate, and throughput for the backend (a query sketch follows this list)
- Functionality & Usability — This was a bit of a grey area for us, and we are still working out how to measure it. We want to use Google's HEART framework for usability in the future. To start with, we want to measure UI and API test coverage against user flows, the test results trend, customer problem reports (CPRs) raised by customer support, and the task success rate from the Analytics system
- Security — Security practice assessment, security issues
- Maintainability — Code complexity rating and unit test coverage
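For the backend indicators under Reliability and Performance, the queries could look something like the PromQL below, kept as strings for our collectors to run. The metric names `http_requests_total` and `http_request_duration_seconds_bucket` are assumptions for illustration; our services may expose different metrics and labels:

```python
# Illustrative PromQL for the backend indicators above (error rate,
# response time, throughput). Metric names are placeholders.
BACKEND_QUERIES = {
    "error_rate": (
        'sum(rate(http_requests_total{status=~"5.."}[5m])) '
        "/ sum(rate(http_requests_total[5m]))"
    ),
    "response_time_p95": (
        "histogram_quantile(0.95, "
        "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    ),
    "throughput_rps": "sum(rate(http_requests_total[5m]))",
}
```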
Data sources
We are pulling data from the sources below to represent in the Grafana dashboard; a sketch of one such collector follows the list.
- Prometheus
- Opsgenie
- Code Climate
- Jira
- Firebase
- Sentry
- New Relic
- Amplitude
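As a sketch of what an automated collector against one of these sources might look like, the snippet below runs an instant query against the Prometheus HTTP API and returns a single value. The endpoint URL and the example query are placeholders; collectors for Opsgenie, Jira, Amplitude, and the others would be similar small scripts against their respective REST APIs:

```python
# Sketch of automated collection from one source: query the Prometheus
# HTTP API and return a single scalar. PROM_URL and the query are
# placeholders, not our real configuration.
import requests

PROM_URL = "http://prometheus.internal:9090"  # hypothetical endpoint


def instant_query(promql: str) -> float | None:
    """Run an instant query and return the first sample value, if any."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    if not results:
        return None
    return float(results[0]["value"][1])


if __name__ == "__main__":
    throughput = instant_query("sum(rate(http_requests_total[5m]))")
    print("backend throughput (req/s):", throughput)
```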
Trial run
We did a trial run by collecting data manually for one team and putting it on a Confluence page. It gave us clarity on the types of tasks needed to turn our dashboard dream into reality. It also helped us convey our idea to our stakeholders.
Creating a roadmap
We listed out all the tasks needed to have a fully functioning Quality Dashboard and categorised them into six groups.
We realised that this is a humongous initiative. Our initial roadmap showed us that it might take a year to set up the Quality Dashboard for all engineering teams. This made us revisit our priorities. Since we couldn't filter metrics based on impact, we sorted them by effort, starting with the ones that need lower effort, and picked the areas where we, as engineers, have greater control to change things. We came up with quality priorities P1 to P9 and parallelised the tasks among five teams based on the quality and team priorities. Thanks to all the quality engineers who are driving those tasks along with their team members.
We are currently focusing on automation of data collection and defining quality standards for the v1 release of the Quality Dashboard.
Will keep you posted on how we go from here … Wish us all the best!