Pillars of developer productivity in the Measure portfolio — part 1
On a warm day in June 2019, I first entered the Hootsuite Bucharest office to start my journey in developer productivity engineering on the newly founded team responsible for the Measure portfolio, which is headquartered here and handles the Analytics product. Moving quickly from senior developer to team lead and then to manager, I became responsible both for the team’s vision and strategy and for the execution of the projects we decided to work on. Recently, the team received a broader scope, driving a company-wide project that impacts most development teams (concerning API docs, covered in part 2).
With more than 5 years of experience and projects under the team’s belt, we have split this article into two parts for brevity and structured it into sections that we consider distinct pillars. Together, these pillars form a sturdy foundation both for the team’s initiatives and for the increased productivity of the developers we work with (which is the team’s overall goal). Each section presents the motivation behind its topic and then delves into the actual work that went into implementing it.
As a high-level overview, these pillars are:
- Tooling monitoring — Objective tracking of progress and value is not feasible without monitoring and metrics for analysis.
- Optimized CI/CD system — Improve the developer experience by paving a smooth, transparent path for code to move through the different stages and environments.
- Development environment — Boot up a new development environment that enables developers to rapidly and independently test code changes without impacting shared or pre-existing environments.
- Tooling — Custom tools and experimentation are necessary to address specific domain and process requirements.
- Developer feedback — Developer productivity requires qualitative data and collaboration to complement the objective metrics.
Even though this might seem like a lot for a single team to handle, we started with the limited scope of a single portfolio of around 25 developers (who nonetheless handle 300+ services) and collaborated with ops developers on the more infrastructure-centric topics. The realization of this vision was made possible by the dedication of our exceptional team members, the unwavering support of our Director, who assembled the team, and our shared passion for the technical intricacies of our work.
Through metrics, some quick wins along the way, and positive feedback from the developers, we built trust with both the dev leadership and the developers. This trust allowed us to experiment and embark on longer projects that delivered full value only after a few quarters of hard work.
Tooling monitoring
Is there a single metric that can be used to measure and compare developers or teams of developers? This is the million-dollar question that, unfortunately, I don’t think has a universal answer (because context matters!). However, you can achieve a holistic view of your engineering efforts by combining objective/quantitative (system metrics) and more subjective/qualitative (developer feedback) approaches. This approach was first documented in the SPACE framework, and we fully agree that the entire context needs to be gathered from both types of information. This pillar will focus on the former, while the last pillar will focus on the latter.
For a successful system metrics initiative, you must be very upfront with your developers about the data you collect, how you collect it, and, most importantly, how you will use it. Our developer productivity team focuses on high-level aggregated data, turning to specific details from the systems only to dig further into an issue. As such, we’re not in the business of tracking individual performance: we look at overall system performance to pinpoint bottlenecks and follow up over time to see whether initiatives have made the expected impact.
CI metrics
When asked what can be improved to increase productivity, almost all developers mention getting things through the CI system, merging, and deploying promptly. As such, having a system that can aggregate and break down information historically with as much context as possible was mandatory for the future success of CI-related initiatives.
Because the data provided out of the box by the CI system was not enough, we built our own plugin to gather as much information as possible about both pipelines and their constituent jobs (e.g., duration, status, parameters, trigger information, link to run details) and push it into an event logging solution. We’re using Sumologic as the destination for this data, both for the simplicity with which we can send the information and for its excellent dashboards and alerting features.
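To make this concrete, here is a minimal Go sketch of the kind of event such a plugin might emit. The struct fields and the collector URL are illustrative assumptions, not our plugin’s actual schema; the only hard fact it relies on is that Sumologic hosted collectors accept plain HTTP POSTs, which keeps the sending side very simple.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// PipelineEvent mirrors the kind of context a CI plugin can attach to every
// pipeline and job run. The field names here are illustrative, not our exact schema.
type PipelineEvent struct {
	Pipeline   string            `json:"pipeline"`
	Job        string            `json:"job"`
	Status     string            `json:"status"`
	DurationMS int64             `json:"duration_ms"`
	Trigger    string            `json:"trigger"`
	Parameters map[string]string `json:"parameters"`
	RunURL     string            `json:"run_url"`
	Timestamp  time.Time         `json:"timestamp"`
}

// pushEvent sends one event to a Sumologic hosted-collector HTTP source.
// The collector URL is unique per source; the value used in main is a placeholder.
func pushEvent(collectorURL string, e PipelineEvent) error {
	body, err := json.Marshal(e)
	if err != nil {
		return err
	}
	resp, err := http.Post(collectorURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("sumologic returned %s", resp.Status)
	}
	return nil
}

func main() {
	event := PipelineEvent{
		Pipeline:   "web-ui-pr-checks",
		Job:        "unit-tests",
		Status:     "success",
		DurationMS: 182000,
		Trigger:    "pull_request",
		Parameters: map[string]string{"branch": "feature/example"},
		RunURL:     "https://ci.example.com/runs/12345",
		Timestamp:  time.Now(),
	}
	// Placeholder endpoint: a real HTTP source URL is generated per collector.
	if err := pushEvent("https://collectors.sumologic.example/receiver/v1/http/XXXX", event); err != nil {
		fmt.Println("failed to push event:", err)
	}
}
```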
The system’s data can be used in two major ways: first, in the aggregate, to check the overall “health” of the pipelines (duration and failure percentage); second, to dive deeper and investigate specific issues. Having this data available also impacts our roadmap, as it can guide us to opportunities that we can investigate further and serve as success indicators after initiatives are completed.
For the former (aggregated data), we introduced internal, team-specific SLOs covering the major developer workflows (checks running on PRs, merges, and deploys). On top of these, we implemented alerting that warns us when a flow is close to consuming its error budget, so we can take a closer look at what’s causing the increase (and fix it if possible!). Alongside this, in our weekly team meeting we look at the trends over the last couple of weeks, as some increases creep in slowly and are trickier to pinpoint before they trigger an alert.
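Our SLOs and alerts live in Sumologic dashboards, but the arithmetic behind an error-budget alert is simple enough to sketch. The numbers, the 99% target, and the 80% alerting threshold below are made up for illustration, assuming a failure-rate SLO:

```go
package main

import "fmt"

// errorBudgetConsumed returns the fraction of the error budget used in a window,
// given an SLO target (e.g. a 0.99 success rate), total runs, and failed runs.
// A value of 1.0 means the budget is fully spent.
func errorBudgetConsumed(target float64, total, failed int) float64 {
	if total == 0 {
		return 0
	}
	allowedFailures := float64(total) * (1 - target)
	if allowedFailures == 0 {
		return 1
	}
	return float64(failed) / allowedFailures
}

func main() {
	// Illustrative numbers: 400 PR-check runs this week, 7 failures, 99% target.
	consumed := errorBudgetConsumed(0.99, 400, 7)
	fmt.Printf("error budget consumed: %.0f%%\n", consumed*100) // 175%: over budget
	if consumed > 0.8 {
		fmt.Println("alert: PR-check SLO is close to (or over) its error budget")
	}
}
```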
In the example above, our SLO dashboard (showing data for one week) told us we needed to look deeper into the PR checks for the web-ui (frontend) repository, while the other flows were healthy enough (maybe even to the point of tightening their thresholds). We found that one PR, which was getting lots of commits (and, thus, check runs), was rewriting a core part of the app, which meant a more extensive suite of tests was running on it (more on this in the second pillar) and pushing it above the “normal” threshold. So this was a known exception with nothing to fix on our side: these types of PRs shouldn’t be that common on average, with more focused work covered by fewer tests being the norm!
An example of the latter (diving deeper into a topic) is this graph tracking the evolution of a job’s duration over time, in this case a Python unit test job. We can see that the duration usually sits between 4 and 5 minutes (depending on system overhead and load), except for two runs, which we quickly attributed to scheduling issues in the CI system that caused the jobs to time out.
Flaky test metrics
Another topic that ranks high on developers’ list of frustrations and productivity drains is test flakiness: unreliable tests that don’t produce the same result each time they’re executed. Under the pressure of tight deadlines, the “usual” developer behavior is to simply rerun the pipeline containing such a test, especially if the test covers a part of the code not touched by the PR. This passes the responsibility further down the line until the flakiness turns into more persistent failures that require a proper fix (or temporarily excluding the test from the executed suite until there is time to fix it, adding technical debt). During its “lifetime,” such a flaky test can thus consume time from multiple developers!
To address this issue more proactively, we started working on tracking test executions at one of the company-wide hackathons so that we could identify and investigate flaky tests. The idea proved successful, and, similarly to our CI data, we used Sumologic to store the test execution information and build dashboards on top of it.
Even if defining a “generic” test data type was easy, actually parsing the test results from the various test executors we use across languages (Python and Go for the backend, TypeScript for the frontend) and test types (unit, integration, E2E) was more challenging. Even though all of them can generate a report in the JUnit XML format, they use it in quite different ways (both in structure and in how results are reported), so explicit parsing had to be written for each to map its output onto the standard data type. However, we encapsulated this logic in a CI plugin that can be used everywhere without exposing these implementation details.
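As a rough illustration of that normalization step, here is a Go sketch that flattens a JUnit XML report into a generic result type. The TestResult fields are assumptions, and real reports need per-runner tweaks (some emit a bare <testsuite> root, some hide the file path in classname), which is exactly the logic the plugin hides:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// junitReport models the common shape of JUnit XML output. In practice every
// runner fills it in slightly differently; reports with a bare <testsuite> root
// (no <testsuites> wrapper) would need separate handling.
type junitReport struct {
	Suites []struct {
		Name  string `xml:"name,attr"`
		Cases []struct {
			Name      string  `xml:"name,attr"`
			ClassName string  `xml:"classname,attr"`
			Time      float64 `xml:"time,attr"`
			Failure   *struct {
				Message string `xml:"message,attr"`
			} `xml:"failure"`
			Skipped *struct{} `xml:"skipped"`
		} `xml:"testcase"`
	} `xml:"testsuite"`
}

// TestResult is a stand-in for the “generic” record pushed to Sumologic.
type TestResult struct {
	Suite, Name, Status string
	DurationSec         float64
}

func parseJUnit(path string) ([]TestResult, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var report junitReport
	if err := xml.Unmarshal(data, &report); err != nil {
		return nil, err
	}
	var results []TestResult
	for _, s := range report.Suites {
		for _, c := range s.Cases {
			status := "passed"
			if c.Failure != nil {
				status = "failed"
			} else if c.Skipped != nil {
				status = "skipped"
			}
			results = append(results, TestResult{Suite: s.Name, Name: c.Name, Status: status, DurationSec: c.Time})
		}
	}
	return results, nil
}

func main() {
	results, err := parseJUnit("report.xml")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, r := range results {
		fmt.Printf("%s/%s: %s (%.2fs)\n", r.Suite, r.Name, r.Status, r.DurationSec)
	}
}
```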
Similar to the CI data, we can start with aggregated data that can be filtered to determine the main offenders in terms of failure percentage and then drill into the actual execution trends and history to determine the cause of the flakiness.
For the time being, we haven’t fully handed over the responsibility for monitoring this to the developers. For most E2E and live tests (which prove more prone to flakiness due to their more dynamic nature), there’s one failure reason that the productivity team itself is responsible for: missing live test data, as a 0 in the Analytics context can mean either no data or an issue in the data-gathering process (we will discuss this topic more in part 2). We usually want to rule this out before passing a failure on to the developers for further investigation.
Process metrics
Using GitHub and/or Jira data might be a more sensitive topic, as developers’ first thought might be that it could be used to track their individual performance. However, when aggregated and anonymized, it can serve as a high-level indicator of how process metrics evolve.
One such process we worked on is “merge early.” A rule in place for the Measure teams since before I joined was that PRs had to stay in staging for at least a day, to be tested with more data and by more people, before being merged into production. Of course, this process increases the time spent on an issue and causes more context switching, as people move on to other tasks while the tests happen.
We worked on both process and tooling to decrease the PR end-to-end time while maintaining similar levels of reliability. This approach allowed some PRs to bypass the whole staging wait and be merged earlier. This long-term initiative started in 2021, with changes made along the way and further ideas on the backlog.
Above is the graph showing the number of PRs merged into production within 12 hours of the last commit being pushed to the branch. The initiative has led to more than a 50% increase in PRs merged early! The graph also shows the power of aggregating this information at the repo level, as we can easily track the overall progression and the impact of the individual initiatives without going into the context of each PR.
And if you’re curious, here is what we did to achieve the increase:
- Put a labeling system in place for these PRs so that we can track them and their motivation over time
- Discussed with developers and defined the situations in which a merge-early label should be used
- Added automation that detects some of these situations, so developers know out of the box that a PR is a strong candidate for merging early (see the sketch after this list)
- Kept discussing with developers how tasks can be approached to benefit more from the already-defined situations that warrant merging early
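To give an idea of what that automation can look like, here is a hedged Go sketch that labels a PR as a merge-early candidate when every changed file matches a “low-risk” path. The prefixes, label name, and repo values are hypothetical, and our real criteria are richer than a prefix check; the GitHub REST calls (list PR files, add labels) are standard.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// lowRiskPrefixes is a stand-in for the real rules agreed on with developers;
// the actual criteria are more nuanced than path prefixes.
var lowRiskPrefixes = []string{"docs/", "dashboards/", "tests/"}

// githubJSON performs one authenticated GitHub REST call and decodes the response.
func githubJSON(method, url, token string, body, out any) error {
	var buf bytes.Buffer
	if body != nil {
		if err := json.NewEncoder(&buf).Encode(body); err != nil {
			return err
		}
	}
	req, err := http.NewRequest(method, url, &buf)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Accept", "application/vnd.github+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("github returned %s for %s", resp.Status, url)
	}
	if out != nil {
		return json.NewDecoder(resp.Body).Decode(out)
	}
	return nil
}

func main() {
	token := os.Getenv("GITHUB_TOKEN")
	owner, repo, pr := "example-org", "example-repo", 123 // hypothetical values

	// 1. Fetch the files touched by the PR (pagination ignored for brevity).
	var files []struct {
		Filename string `json:"filename"`
	}
	filesURL := fmt.Sprintf("https://api.github.com/repos/%s/%s/pulls/%d/files", owner, repo, pr)
	if err := githubJSON("GET", filesURL, token, nil, &files); err != nil {
		panic(err)
	}

	// 2. If every file matches a low-risk prefix, flag the PR as a merge-early candidate.
	for _, f := range files {
		ok := false
		for _, p := range lowRiskPrefixes {
			if strings.HasPrefix(f.Filename, p) {
				ok = true
				break
			}
		}
		if !ok {
			return // at least one riskier file: leave the PR alone
		}
	}
	labelURL := fmt.Sprintf("https://api.github.com/repos/%s/%s/issues/%d/labels", owner, repo, pr)
	if err := githubJSON("POST", labelURL, token, map[string][]string{"labels": {"merge-early-candidate"}}, nil); err != nil {
		panic(err)
	}
}
```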
Optimized CI/CD system
With proper monitoring in place, you can detect opportunities and track the progress of targeted initiatives that improve the developer experience with the SDLC and CI/CD systems. These initiatives usually focus on a specific job or pipeline and look for a “refactoring” that provides the same functionality with added benefits: shorter duration, easier use, or better reliability.
In some cases, the changes are more transparent to developers; in others, they need to learn and incorporate the changes into their daily routines to fully benefit. This is where having an excellent bi-directional working relationship comes into play, as driving initiatives with more considerable changes is easier if the developers are convinced that the benefits are worth it. This trust develops when you have a track record of past improvements and have demonstrated a willingness to listen to and address some of the self-identified issues that cause toil — using both objective and subjective approaches.
The power of dependency graphs
One cornerstone of many possible optimizations is the ability to define dependencies between the various components programmatically. You can then determine the effects of changing one component by computing which dependents are affected and, more importantly, which parts shouldn’t be affected, minimizing the amount of work required!
We applied this to our backend and frontend repositories using Bazel and Nx, respectively, with additional support from the --findRelatedTests flag in Jest. Even though they operate at different levels (Bazel and Jest work at the file level, while Nx works at the module/project level), we can implement similar approaches for both building and testing.
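For the backend side, the shape of the approach looks roughly like the Go sketch below: diff the branch against main, map the changed files to Bazel labels, and ask `bazel query` which test targets transitively depend on them. The file-to-label mapping and the query are simplified assumptions, not our actual pipeline code.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// changedFiles lists the files touched on the branch relative to main.
func changedFiles() ([]string, error) {
	out, err := exec.Command("git", "diff", "--name-only", "origin/main...HEAD").Output()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

// affectedTests asks Bazel which test targets (transitively) depend on the
// changed files, so only those suites need to run on the PR.
func affectedTests(files []string) ([]string, error) {
	if len(files) == 0 {
		return nil, nil
	}
	// Naive mapping: assume each file is a source-file label in the package of
	// its directory. A real setup resolves labels properly and filters out
	// deleted files or files that live outside Bazel packages.
	labels := make([]string, len(files))
	for i, f := range files {
		dir, name := ".", f
		if idx := strings.LastIndex(f, "/"); idx >= 0 {
			dir, name = f[:idx], f[idx+1:]
		}
		labels[i] = fmt.Sprintf("//%s:%s", dir, name)
	}
	// tests(rdeps(...)) keeps only the test rules among the reverse dependencies.
	query := fmt.Sprintf("tests(rdeps(//..., set(%s)))", strings.Join(labels, " "))
	out, err := exec.Command("bazel", "query", query).Output()
	if err != nil {
		return nil, err
	}
	return strings.Fields(string(out)), nil
}

func main() {
	files, err := changedFiles()
	if err != nil {
		panic(err)
	}
	tests, err := affectedTests(files)
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(tests, "\n")) // feed this list to `bazel test`
}
```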
These also brought the most significant time gains of all our CI initiatives. For example, we reduced most Go test runs from over 10 minutes to under 5 minutes, and most TypeScript unit test runs from 9–10 minutes to under 3 minutes. The more specific and “isolated” a change is, the less time the tests take! This was achieved without impacting overall reliability or coverage, maintaining trust in the automated testing process.
Another well-loved feature available to backend developers is what-services-to-deploy:
Similar to detecting the affected tests, we define each service by the binaries and config files it needs and then determine, starting from the branch diff, which services are affected and need to be deployed. This detection makes developers’ lives much easier, as they don’t have to remember all the dependencies or deploy every service for each change.
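A stripped-down version of the idea, with hypothetical service names and paths standing in for the (mostly generated) mapping, could look like this:

```go
package main

import (
	"fmt"
	"os/exec"
	"sort"
	"strings"
)

// serviceInputs maps each deployable service to the binaries and config files
// it is built from. The entries below are purely illustrative; in a real setup
// most of this mapping is generated from the dependency graph, with a few
// manual additions kept in config files.
var serviceInputs = map[string][]string{
	"reports-api":    {"cmd/reports-api/", "internal/reports/", "configs/reports-api/"},
	"metrics-worker": {"cmd/metrics-worker/", "internal/metrics/", "configs/metrics-worker/"},
}

// servicesToDeploy returns the services whose inputs intersect the branch diff.
func servicesToDeploy(changed []string) []string {
	affected := map[string]bool{}
	for _, file := range changed {
		for svc, prefixes := range serviceInputs {
			for _, p := range prefixes {
				if strings.HasPrefix(file, p) {
					affected[svc] = true
				}
			}
		}
	}
	var out []string
	for svc := range affected {
		out = append(out, svc)
	}
	sort.Strings(out)
	return out
}

func main() {
	diff, err := exec.Command("git", "diff", "--name-only", "origin/main...HEAD").Output()
	if err != nil {
		panic(err)
	}
	fmt.Println(servicesToDeploy(strings.Fields(string(diff))))
}
```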
One particular topic of note here is accuracy. Generating the graph from code files usually gets you 95% of the way there, but some dependencies still need to be defined manually by a developer (usually in config files that are read programmatically). Requiring this step introduces the possibility of accidentally missing some dependencies. To counteract this, we take a more cautious approach: we define some broader links (e.g., folders of configs rather than individual files) and run cron jobs that execute the complete suites to catch any lingering issues that might creep in (having flaky test monitoring also helps with this).
Caching everywhere!
A large part of the gains from using dependency graphs comes from caching artifacts from previous executions, allowing us to reuse already-built modules without rebuilding them when they’re not part of the affected dependencies. This requires setting up systems that can persistently store artifacts and serve requests for them, but the maintenance costs are fully covered by the benefits they bring.
This use case applies to building the Go binaries needed for services. The Bazel cache server handles requests and, based on SHA digests of the different modules, avoids rebuilding previously cached libraries if they have not changed. For a single service and its health_check binary, the build time without any cache is 66 seconds, while with caching it takes 10 seconds, an 85% decrease!
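The underlying principle (not Bazel’s actual remote-cache protocol, which keys on action digests) is content-addressed caching: hash everything that feeds a build step and skip the step whenever an artifact is already stored under that digest. A toy Go version of that key computation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"sort"
)

// cacheKey hashes a module's source files (a real build tool also mixes in the
// toolchain, flags, and the keys of its dependencies) into a single digest.
// If nothing that feeds the module changed, the digest is identical and the
// previously built artifact can be reused instead of rebuilding.
func cacheKey(files []string) (string, error) {
	sort.Strings(files) // make the key independent of input order
	h := sha256.New()
	for _, path := range files {
		f, err := os.Open(path)
		if err != nil {
			return "", err
		}
		if _, err := io.Copy(h, f); err != nil {
			f.Close()
			return "", err
		}
		f.Close()
		fmt.Fprintf(h, "|%s", path) // include the path so renames change the key
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	key, err := cacheKey([]string{"go.mod", "main.go"})
	if err != nil {
		panic(err)
	}
	fmt.Println("cache key:", key) // look this up in the cache before building
}
```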
Another area where caching proves valuable is Docker images. We first take advantage of it when building the images, using a client-server architecture with BuildKit and persistent storage, which lets previously built layers be cached and reused on new builds. To make this even more efficient, we also structure the Dockerfiles so that the elements that change most often are added as late as possible, maximizing cache hits on the initial layers.
In terms of runtime, we try to include all the needed dependencies and tools in the images themselves instead of installing them on each run, so that we also benefit from the Docker cache on the worker nodes (we moved from an initial docker-in-docker approach to running natively on k8s to benefit from this fully). However, this approach raises the question of balancing more persistent nodes with fuller caches against a more dynamic composition of nodes, primarily in terms of cost. We’re still experimenting to find the best configuration.
Pipeline philosophy
One last part of this topic is how we organize the developers’ pipelines, as we own the more complex ones. For these, we follow a few rules:
- Pipelines should be idempotent — running the same pipeline with the same inputs should produce the same outputs repeatedly.
- One crucial aspect of idempotency is isolation. A good example is using a single pod with a test runner and all the needed database containers so that multiple parallel runs don’t affect each other in any way.
- Pipelines with side effects follow a similar pattern: they run all the necessary checks in parallel and generate temporary resources (e.g., Docker images); then they create the side effect, the point of no return for the pipeline (e.g., merging into master or deploying a service); finally, they trigger an idempotent post-generation pipeline that performs additional checks or generates other resources linked to the main one (e.g., retagging images with the merge SHA).
- Ideally, pipelines should have a single, well-defined purpose and should not deviate from it.
- Parallelism is always beneficial when it comes to overall pipeline and job structure (e.g., running tests with multiple workers).
End of part 1
The first two pillars are closely linked and are presented together to show how the data gathered through monitoring feeds the initiatives that improve the developer experience.
The second part of this article will cover the remaining three pillars: the development environment, tooling, and developer feedback, which complement the systems data.

