Load Balancing the Demands on Data Engineering
In Data Engineering, we regularly receive requests from multiple teams, with varying levels of priority, urgency, and importance. The business side of Noom is growing, and with that comes an increasing number of these requests. We are also in the process of modernizing our current infrastructure and growing it from “startup mode” into something more scalable and maintainable. While platform tasks are just as important as the business requests, they are often less urgent.
So the question becomes: how do you design a team that can simultaneously balance moving the platform forward with the demands of a rapidly growing business, while also keeping your team members happy, healthy, and productive?
I’m Joel Ossher, the Technical Lead Manager on Noom’s Data Engineering Team, and this is a question we have been struggling with for a while. Recognizing that this is not a challenge unique to us, we thought we’d share our journey with others who might be going through a similar experience.
Our Journey Thus Far
The mission of the Data Engineering team at Noom is to build a world-class data ecosystem that:
- Smoothly and reliably scales as the number of users and products that interact with our current data ecosystem increases and Noom continues to grow.
- Unlocks leaps in Data Science and Analytics Engineering velocity when working on large-scale data computation and ad-hoc exploration.
- Significantly improves Noom’s ability to understand, explore and draw insights from our data across all teams.
- Enables Noom to be an AI/ML-enabled organization where the ease of use and rich capabilities provided by our ecosystem make the decision to apply machine intelligence techniques purely based on the problems themselves.
- Supports Noom’s legal requirements and user trust principles around security and privacy as we launch in additional regions.
In addition to the platform work captured by our mission statement, we also take on ad-hoc projects of importance to our business that do not have other resourcing. As a result, our team is responsible for a number of buckets of unrelated work, each with its own flavor and stakeholders.
Initially, the Data Engineering team was structured much like you’d expect, with everyone under one umbrella, and on one unified team. We had sprints of work which included tackling our growing technical debt and platform improvements, along with business requests from Product and other teams. We experimented heavily with the way this work was distributed and prioritized but found that we would inevitably focus our attention on the more urgent business requests, and made little progress on the platform side. While the business tasks were interesting and impactful for the company and users, the lack of progress on the platform side was frustrating. Because the platform continued to be difficult to maintain and upgrade, we were essentially paying a tax on everything we were trying to do, including the work that fell in the business bucket. Everything was taking longer than we felt it should.
We pivoted and tried a more systematic approach to how we distributed the work. We thought that if we adjusted the number of business projects and platform projects we were actively working on, and the percentage of time we dedicated to each, it would help balance the workload. Invariably, either the initial estimates on the business projects were inaccurate, or urgent and important items were added to the scope, which threw off our estimates. With a limited number of hours in the day, some tasks had to move to the bottom of the queue. The platform work was the least urgent and, again, the easiest to trim away. We saw that pattern repeat sprint after sprint, and it resulted in chronically minimal progress on the platform.
The next approach we tried was to identify platform components that were related to incoming business projects, with the secondary goal of fixing or improving the platform at the same time. Unfortunately, even though the idea seemed sound, it did not result in meaningful progress on the platform. We made a little headway, but because it took us much longer to complete the business work, it was difficult to satisfy all the competing priorities at once.
Needless to say, it was time to rethink how we distributed and prioritized our work, and the structure we used to organize our team.
The Experiment: Flexible Work Pods
After a three-month test run, we came together as a team in February 2022 and decided to try a completely new approach. As an experiment, we created flexible pods of focused work areas. Essentially, we split our team of 10 into three pods: Platform, Business, and Machine Learning. Each pod would focus on the work that fell into its respective bucket. To ensure that our team members felt a level of autonomy in their day-to-day work, after estimating how many people would be needed in each, we self-selected into the three pods according to preference. As the Technical Lead Manager, I’d float as needed to allow for differences in pod workloads. As with many things at Noom, we knew this was just an experiment and we could iterate as we saw fit.
This is not an uncommon approach to dividing up work, but we were keen to safeguard against some of our team’s initial concerns and create a uniquely balanced and successful solution that would work for us. Those concerns included:
- Balancing autonomy with consistency. It was important to us to ensure that as time went on, people still felt happy doing the work that they were doing, and that they maintained autonomy and control. We decided to open the pods at the end of every quarter to allow for someone to switch if necessary. We hope our team stays in their pods for longer than a quarter, so there isn’t time wasted on everybody ramping up and learning new work every few months. But because this is a new structure for us, we want to allow our team to have more control over their own experience, and not feel locked into an uncomfortable situation in the long term.
- Balancing flexibility with stability. We knew that the pods would have to stay true to their own buckets, even if the workload swayed periodically toward one pod or another. Otherwise, if people shifted pods on the fly, or if work were reassigned to whichever pod had the least to do at the moment, we would be repeating our old patterns, and we would see the same result: everything would end up in the Business pod and we’d be back to square one. The downside to this level of stability is that if we were off in our original estimation of how many people should be on each pod, we would have locked ourselves in for a quarter. This is something we will keep a close eye on and will have to address should it become an issue.
- Balancing small and large group communication. Breaking up a bigger team into smaller pods has its advantages and disadvantages. Siloing work makes sharing knowledge more difficult. This is especially significant for us and will require focused knowledge-sharing efforts, as it is important for the Data Engineering team to present a unified knowledge base to the rest of Noom. (Also, the flexibility of pod-switching encourages good documentation practices for others who will be joining the pod later.) On the other hand, smaller teams allow for more participation and better interaction, which is a huge benefit for some of our engineers. Because there are more groups to communicate with, we will have to strike a balance between team meetings, pod meetings, and daily standups. Zoom fatigue is real, and something we are keen to prevent!
- Balancing responsiveness and focus. It is important to us to remain responsive to the needs of the quickly growing company, so we are keeping the focus of these pods fluid and transient, and will spin up new ones as needed. For instance, portions of our current workload may get absorbed into new teams that are being built outside Data Engineering (like Noom’s new Analytics Engineering team), so we will dissolve and create new pods as needed. Although the Data Engineering team is one stable unit, our pods aren’t meant to be permanent, rigid fixtures.
- Managing the onboarding of new employees. Dividing work in this way raises the question: which pod do you assign to new employees joining the team? Our answer is the Business pod, because it will give a new employee the broadest overview of Noom engineering and will help foster empathy and understanding around the challenges of the teams we support.
We’re sure there will be unforeseen challenges that crop up along the way, but that’s the nature of being open to experimentation, and we will iterate as needed.
Did Our Experiment Work?
Since this is a new experiment for us, and we are still in the first few fiscal quarters, we’ll be monitoring our progress closely. So far, our experiment seems to be working to protect the time and happiness of our team members, although it’s clear we need to expand the team to handle the workload.
Some questions we are continuing to ask along the way include:
- Are we effectively balancing making progress on platform technical debt while responding to work from the product and business teams?
- Do our team members still enjoy the work that they’re doing?
- Is team knowledge becoming siloed or are we maintaining a shared knowledge base?
- Is our team-wide support rotation still effective?
- Are we balancing our small and large group interaction effectively?
- Did we get the allocation of pod team members right? Is one pod overworked compared to others?
- Are the pods we have in place an accurate representation of our current buckets of work?
Measuring success through metrics is tricky in this particular case. Our goal with surfacing more of the platform work is to ultimately reduce the time it takes to work on the codebase. So we could, in theory, compare the time spent on tasks before and after the pods were introduced. But that comparison has its flaws: tasks vary widely in complexity, so the result would not be very reliable. We considered measuring the adoption of technical improvements (for instance, the percentage of transformations running in Spark), but that would only reveal how well we’ve implemented the solution we chose to implement, not whether it was the right thing to implement. Our other ideas all led to the same conclusion: it would be a lot of effort for inconclusive and unreliable results.
We decided that self-reported engineer happiness was our best measure of success in this experiment. If our experiment is working, and we’re actually able to reduce the technical debt and modernize the platform, engineers will experience a tangible change in the amount of time and the ease with which they can complete their own work. They’ll feel less frustrated and their lives will be better. (And the Data Engineers will feel good about the work they do too.)
It’s important to note that the ideas, development, and implementation of our new structure have been a completely joint effort across all the members of our team, and I’m very grateful for all their creativity and contributions. It truly has been a collaborative endeavor, and I appreciate everyone on the team working together to help solve our challenges.
We’ll be writing a follow-up post as our experiment continues, and we will share what we learn along the way. It’s an awesome time to be a part of this team, and we’re excited about the future.