Origins of Data Role Conflicts

Matthew Seal
Noteable
Jun 29, 2021 · 7 min read

As organizations grow, so too does the specialization of the teams addressing specific aspects of the organization’s needs. The same holds true for data organizations. A new data science team might be established to generate novel insights, or the data engineering team might split off a group for managing business-critical data streams. Within these teams you may also have many different roles, such as data engineer, data modeler, business analyst, or data scientist. This specialization can differ significantly from company to company, but the expansion of these groups always comes with the same growing pains.

While all organizational growth can lead to disagreements between groups, with data teams there’s also a more fundamental shift in problem-solving that is challenging to identify. A team that once worked well together can start to splinter over how to solve a problem, and it’s easy for a gulf to form between the various teams. Too often the assumption is that the distance between teams is caused by bad hires, an assumption that can derail resolutions, when in actuality it’s the result of each team’s specialization. The approaches these developers take to do their jobs can vary dramatically, and what feels like an obvious path to solving a data problem for one individual may be counterintuitive to another team.

Let’s look at a hypothetical scenario: you have three data teams collaborating to bring some intelligence to your website. One team, a newly established data science team, has developed a model for the purchase path that suggests closely related products to upsell in the cart. A data engineer assigned to the project is responsible for collecting raw website data and preparing it for use in model training. And finally, an application engineer is assigned to integrate the model into the website and deploy changes to the production site.

Data specialists often combine their talents to tackle new problems

The model scores well during development, indicating it’s ready for prime time, but runs into issues during A/B testing. Users are complaining that the suggestions are completely wrong, and complaints arrive immediately after each code release for the website. It seems simple enough to resolve, but the team is a month into trying to fix the issue. There are heated conversations with lots of uncertainty, and as a project leader you’re not sure what to make of the problem.

So, what went wrong? First, it was determined that the model needed more recent user details during training to retain its accuracy. The data scientist made some quick changes to add a few more features, but the update required two extra data streams that weren’t previously being consumed. The data engineers scrambled to build a streamlined input for the new sources that could be available on the project’s non-daily build cadence. Some late nights were put in to get the project back on track.

But these changes made the model significantly larger. The data application team asked that the model be put into a separate application to isolate its memory and CPU usage. To correct for that, the data scientist tried to reduce the model size with a new approach using different inputs, but the new version took four additional days to generate. Since the data engineers couldn’t schedule more time to build the improvements given their other project priorities, they asked to revisit once the data science team was certain about the data it needed. Added to the mix, the application team raised concerns about the model’s response times and suggested that model results be cached, which the data scientist opposed because it would impact the quality of responses.
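The caching proposal at the center of that last disagreement can be sketched in a few lines. Everything here is hypothetical since the article doesn’t describe the actual system (`model_predict`, `TTLCache`, and `suggest` are illustrative names), but it makes the trade-off concrete: within the time-to-live window, users get a cached, possibly stale suggestion instead of fresh model output.

```python
import time


def model_predict(cart_items):
    """Hypothetical stand-in for the real recommendation model.

    Here it just returns the cart items sorted, as a placeholder
    for "closely related products".
    """
    return sorted(cart_items)


class TTLCache:
    """Cache model results for a fixed window to cut response time,
    at the cost of serving slightly stale suggestions."""

    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self._store = {}  # frozenset(cart) -> (timestamp, result)

    def suggest(self, cart_items):
        key = frozenset(cart_items)
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fast path: reuse a recent (stale) answer
        result = model_predict(cart_items)  # slow path: run the model
        self._store[key] = (now, result)
        return result
```

The data scientist’s objection maps directly onto the fast path: any cart served from `_store` reflects the model as it was up to `ttl_seconds` ago, so tightening response times this way trades away response freshness.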

The project is at a standstill. Each proposal is now being shot down by the other two teams because they can’t afford to spend time on another failed approach. The team manager and the director have now been in meetings with you all week to work this out. Every individual is pulling their weight trying to fix the technical problems, but trust is thin. So what’s the root problem and why are the proposed changes not helping?

The root cause for disagreements between teams on late projects can be confusing

The second half of this question is the part to focus on. The original problem was likely a fairly benign issue that, with the right fix, would have been quickly resolved. But the wrong approach set off a chain of decisions that didn’t account for all of the data domain constraints at once. These chained proposals aren’t solving the entire problem because each contributor is over-applying patterns from their own specialization to the others’, which in turn magnifies the failure. Their colleagues then repeat the same mistake by tackling their corner of the problem without the larger picture in mind. Finally, everyone is frustrated and angry at each other for still having problems weeks after launch.

The project at hand is cross-domain and involves three data constraints, which is why multiple disciplines were involved in the first place.

  1. There’s a learned model that needs enough features and training data to make good suggestions consistently.
  2. There’s a continually repeating data stream, with strict timing constraints, that keeps the model accurate over time.
  3. Finally, there’s an application that must host the model within response-time and resource caps.

Over-optimizing any one of these becomes a game of whack-a-mole, trading gains in one element for costs in another. So why the confusion?

Take a look at the way each developer solves their individual constraints. The data scientist solves her model problems by introducing new data and adding more complexity to the model to account for edge cases, accepting losses to other attributes; shorter build times and lower resource usage don’t make the model more accurate. The data engineer plans out specific, well-specified data feeds in advance with minimal development time to meet production needs, given the demands on his team; changes to the pipeline over time are to be minimized or re-planned entirely. And finally, the data application engineer focuses on repeatable, well-isolated units of work with minimal complexity per unit to ensure high uptime of the services; anything that adds edge cases correlates with increased service outages.

These are not poor trade-offs within each domain; they’re what matters most to each specialization’s day-to-day success.

In the example, the set of data concerns was likely planned out so that the input data was well structured, built in a timely fashion, and fit within application constraints. But if those are thrown out of balance by a change that focuses on only one of the constraints, it can take several iterations to fully correct. The application had limited response times and strict memory boundaries, the data feeds needed to be strictly specified, and both had to support a learning model that would iterate through approaches over time. It’s hard to build programs that meet all of these demands, and the developers themselves were likely unaware of some of the constraints as they proposed solutions.

What was needed earlier in the process was clearer communication of the product boundaries, and for problems to be assigned to the data discipline best suited to solving them. Organizing around the problem you’re trying to solve and defining the goal to be achieved can help establish the right focus. If you find yourself constantly arguing about how another team’s solution can’t be used, consider that the overall group is not approaching problems from the same position or with the same constraints in mind. Try to communicate often and clearly, and take the time to better scope the issues rather than continually disagreeing with proposals. And finally, try to focus on next steps and leave earlier mistakes in the past. A little empathy and understanding will go a long way toward repairing the lost trust between frustrated colleagues.

Applying the right team’s focus with clear constraints can get you back on track

If you want to talk about collaboration between various data roles, do reach out to me on LinkedIn or Twitter. I’m always happy to talk with folks in the space. We’re also hiring if you have an interest in working on Notebook solutions for our upcoming beta launch.

Matthew Seal is a co-founder and CTO of Noteable, a startup building upon his prior industry-leading work at Netflix.