Getting machine learning off the ground requires many skills and capabilities. Some of these skills are related, some are not. For example, knowledge of math and knowing when to use which machine learning (ML) algorithm share many commonalities, but they are largely unrelated to building infrastructure. That’s why people with ML skills tend to also be competent in math, but not in building infra. Skills cluster, and it’s useful to give these clusters names.
Cluster one: Data scientist
The first cluster of skills revolves around developing machine learning models. The bread and butter of a machine learning product. That is, creating something that can accurately infer unseen variables. For example, predicting which product a user is most likely searching for in an online shop.
Cluster two: Data engineer
The data engineer has a strong computer science background. This individual is concerned with building the software and infrastructure that integrates the machine learning model with the existing IT landscape. Data engineers also try to strike the right balance between the different system quality attributes that are relevant.
Cluster three: Site reliability engineer
The technical reliability of an operational machine learning model need to be monitored and improved. That’s what site reliability engineers and operations engineers are skilled at. They often also build and automate the infrastructure that hosts the machine learning product.
Cluster four: Analytics translator
The analytics translator, who sometimes assumes the role of product owner, is able to bridge the gap between the operational domain expertise and the technical expertise of data scientists, data engineers and site reliability engineers. This person has an overview of the entire value stream, business process or customer journey. Without an analytics translator, chances are you’ll solve the wrong problem.
There are multiple ways of composing teams and dividing responsibilities among them. The most straight forward composition is to group people with the same cluster of skills in the same team.
Expertise oriented teams do not work
When creating teams of people with similar skill sets, the tasks that each team does naturally follows. The data science team, usually called the data lab, validates ideas and builds prototypes. The product team consists of software and data engineers. They build a software product that contains a ML model. The data engineering team builds data pipelines, and the operations team deploys and maintains everything.
This is quite problematic, as this results in split ownership and many hand-overs, each with severe loss of knowledge. Lead time also drastically increases, because each hand-over requires another team to refine and plan their parts. On top of all this, it creates tight coupling between teams, which results in many single points of failure.
Feature teams do not scale
In traditional software engineering, we’ve seen this problem once before. We used to have a front-end team, a back-end team and a database administrators team. Any change to the system required all teams to coordinate. With the lean and agile movements we shifted towards feature teams. Each team is now end-to-end responsible for a feature, and has people from all disciplines.
There are some issues with applying this approach to machine learning teams though. In addition to the front-end engineer, back-end engineer, user experience expert, designer and site reliability engineer of a typical team, you now also need a data scientist, data engineer and an analytics translator. You might even need multiple people with the same role in a team. This leads to communication issues within the team, as each person added to a team adds more communication overhead than the previous one. The inherent complexity of ML prevents us from recruiting full-stack engineers — a measure to reduce the team size.
This might be slowing the team down, but at least not as much as the expertise oriented approach. If your organization does not yet do ML, start with a feature team.
When you’re further along, and are using ML in multiple teams, you’ll start to see patterns. There are some tasks that get repeated by each team. For example, every team requires monitoring for their ML models, and a tool to schedule batch jobs. With many teams doing the same mundane tasks over and over, it’s becoming economical to handle these cross cutting concerns centrally.
Centralized teams should handle cross cutting concerns
When moving some part of the work from feature teams to a centralized team, we should be thoughtful not to introduce the downsides we’ve seen with expertise oriented teams. We need some principles to protect ourselves.
Principle one: expose products, not tasks
When a centralized team takes over a tasks from the feature team, it becomes a bottleneck. People don’t scale, but products do. So instead, expose a product or platform that enables teams to do their tasks themselves, in a more efficient and effective way.
Principle two: opt-in
When feature teams are required to use the product of a centralized team, the incentive to create a good products is reduced. A feature team might also need something slightly different from the standardized product, which leaves them stuck waiting for the centralized team to improve their product. So give teams the freedom to choose whether or not to use the centralized team’s product.
Principle three: self-service
To prevent hand-overs, make the centralized team’s product fully self-service. Provide rich documentation, and tools or API’s to automate everything that previously required communication.
Principle four: require no communication
That brings us to the last principle: require no communication. Don’t get me wrong, by all means be approachable, helpful, and gather feedback to make a product that fits the feature teams’ needs. But if you find them communicating with you in order to understand or use your product, you’re probably not fully applying the other principles yet. Do everything you can to prevent the necessity of communication between the feature teams and the centralized team.
The services cloud providers offer adhere to these principles very strictly. They expose a product, that you can use. There are no hand-overs, and there’s no need to communicate with the team that builds it.
In short: start with feature teams, extract cross cutting concerns to centralized teams only when they become evident, and think of centralized teams as internal cloud providers.
About the author
Ruurtjan Pul (Twitter: @ruurtjan) is a data engineer at BigData Republic, a machine learning consultancy company in the Netherlands. We hire the best of the best in BigData Science and BigData Engineering. Feel free to contact us at firstname.lastname@example.org.
My thinking on this subject has been heavily influenced by Henrik Kniberg’s take on the Spotify engineering culture, Nicole Forsgren’s book Accelerate, and Zhamak Dehghani’s article on distributed data meshes.