11 lessons learned managing a Data Platform team within a data mesh

Souhaib Guitouni
BlaBlaCar
13 min read · Oct 10, 2023

Introduction

BlaBlaCar’s Data team moved to a Data Mesh organization in Feb 2022. We are now 7 teams:

  • 6 domain-oriented, multidisciplinary squads. Each fully in charge of its business domain.
  • 1 transverse platform team, with only Data Engineers.

We created the data platform team to provide common infrastructure and tooling for data practitioners across all the data squads. By data practitioners, we mean Data Engineers, Analytics Engineers, Data Analysts, ML Engineers and Data Scientists. We also set up transverse chapters, which gather people with the same technical skillset from different teams to share knowledge and jointly lead their field of expertise.

In this article, I’ll go through the lessons we learned over the last couple of years implementing this pattern. The focus will be on the challenges this kind of team brings, rather than on management in general.

Generic products

When working on a specific use case with a certain team in the mesh, we always have genericity in mind. In other words, we check whether what we’re building can be reused, with little to no effort, for other use cases and by other teams. We think of the platform team as a startup within the company providing common tooling for data practitioners. This way, the platform team owns agnostic software, while domain teams own its usage.

Obviously, building generic software comes with a larger initial cost. As an example, it’s more complex to build a script that can copy any backend table to your warehouse than one that knows how to copy only one specific table. The generic version must be able to translate any possible field type in the input schema into an output warehouse schema. Going specific is simpler: schemas can be hard-coded for both the input and the output, so there is no need to implement complex schema translation logic.
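To make that trade-off concrete, here is a minimal sketch of the kind of schema translation a generic copy script has to carry. It assumes a hypothetical PostgreSQL backend and a BigQuery warehouse, and the mapping is illustrative rather than our actual code:

```python
# Minimal sketch of generic schema translation (illustrative, not our actual code).
# A hypothetical mapping from backend (PostgreSQL) column types to BigQuery types.
PG_TO_BQ = {
    "integer": "INT64",
    "bigint": "INT64",
    "numeric": "NUMERIC",
    "text": "STRING",
    "varchar": "STRING",
    "boolean": "BOOL",
    "timestamp with time zone": "TIMESTAMP",
    "date": "DATE",
    "jsonb": "JSON",
}

def translate_schema(backend_columns: list[tuple[str, str]]) -> list[dict]:
    """Turn a list of (column_name, backend_type) into a warehouse schema definition."""
    schema = []
    for name, backend_type in backend_columns:
        bq_type = PG_TO_BQ.get(backend_type.lower())
        if bq_type is None:
            # Unknown types must be handled explicitly: failing loudly beats
            # silently loading corrupted data into the warehouse.
            raise ValueError(f"No mapping for backend type '{backend_type}' (column '{name}')")
        schema.append({"name": name, "type": bq_type})
    return schema

# Works for any table, not just a hard-coded one:
print(translate_schema([("user_id", "bigint"), ("created_at", "timestamp with time zone")]))
```

A one-table script can skip all of this and hard-code both schemas, which is exactly why the generic version costs more upfront.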

From our experience, genericity turns out to be the right investment for major products as these home-built tools can now be reused with no effort.

Concrete case

We had to build our in-house feature-store (a data store for machine learning). It provides data for data scientists to build training datasets, but also serves their live models in production.

What didn’t work

In the first version, we built a specific feature-store. Its source code knew about the data models (like trip offers, bookings and users): it contained heterogeneous input pipelines, along with specific code to store the data in the database and to serve it.

Maintainers of the feature-store had to understand the business logic of each use case and the history behind it to be able to do maintenance. Adding new data sometimes meant adapting the code architecture of the whole app.

Users would create Jira tickets for us each time they wanted to get new data into the feature-store. And we had to make major changes each time.

What did work

While the first version opened the gate for production usage of machine learning, it was hard to scale once it started getting successful, because of the complexity of maintenance.

We decided to move to a setup where the platform team builds a generic feature-store that has no knowledge of the data. Adding new data pipelines doesn’t require much domain-specific knowledge: maintainers provide the software solution, and users use it themselves. They add configuration and simple code following a predefined pattern to feed their use cases.

This gave autonomy to users while reducing the need for maintainers to know each and every domain use case. Usage is mostly configuration and is done by users themselves.
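To give an idea of what that predefined pattern can look like in practice, here is an illustrative sketch in Python. The FeatureView class and its fields are hypothetical, not our actual feature-store interface; the point is that users declare what to ingest, while the generic platform code handles how it is stored and served.

```python
# Illustrative sketch of a configuration-driven feature pipeline declaration.
# The FeatureView class and its fields are hypothetical, not the real interface.
from dataclasses import dataclass

@dataclass
class FeatureView:
    name: str            # logical name of the feature group
    source_table: str    # warehouse table the features are read from
    entity_key: str      # join key used to look features up at serving time
    features: list[str]  # columns exposed to training and online serving
    refresh_cron: str    # how often the platform refreshes the online store

# A domain squad adds its use case with configuration only, no platform change:
booking_fraud_features = FeatureView(
    name="booking_fraud",
    source_table="fraud.booking_signals",
    entity_key="booking_id",
    features=["n_payment_retries", "account_age_days", "trip_price"],
    refresh_cron="0 * * * *",
)
```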

Cope with too many stakeholders

On the one hand, data mesh provides autonomy to domain-oriented squads. They have little to no coupling and strong ownership of their domains. The data platform team, on the other hand, is coupled with all of the teams. Stakeholders range from the nodes in the mesh, to the technical chapters, to the other Engineering teams. The latter naturally use the Platform team as a gateway to the Data department, especially for low-level technical matters like security, infrastructure and compliance. We found that it’s easier for them to address us than to address all the data nodes one by one:

  • We are able to provide them with a quick overview of the whole data stack.
  • We can work with them to better formulate their requests to data practitioners: this is the case when someone managing the low-level network wishes to address Data Scientists. Communication can be hard between the two, as their skill sets are far apart.
  • We are able to take some shortcuts: sometimes, the solution can be built by the platform once for everybody rather than by each team individually.

Having so many stakeholders brings its share of challenges. There’s a need to gather information, find the right opportunities for impact, drive adoption across so many people and adapt communication depending on the audience.

Concrete case

In our case, we have 6 Data teams, 4 chapters, and 5 Infrastructure teams as stakeholders.

We put a lot of effort into strengthening the links with them. This includes regular checkpoints, feedback requests, demos and co-building of our roadmap. This is led by the data platform product owner, who collects the requests and feedback and works on the discovery phase to identify the right spots for impact.

We also set up a meeting called “stack governance” to publicly discuss the major decisions about the stack. Chapters play a major role in leading design decisions as well, and we have representatives from the platform team in several of them.

Build the right team

The scope is quite large, from supporting data science projects to analytics and infrastructure.

We needed a team with a wide range of expertise, to be able to communicate with all of the previously mentioned stakeholders and support our different missions.

Our development team members are all Data Engineers and share the same responsibilities. Still, they each have a unique background, with previous experience in Data Science, Analytics Engineering, Marketing, Infrastructure, Backend or Frontend Engineering. This setup allows us to cover a wide technical scope while being able to share expertise internally when needed.

Concrete case

When the team first started, each person was comfortable with only a certain part of the stack.

It was difficult to fix things and answer requests when the “expert” was off. One tool we adopted is the Skills Matrix. Its purpose is to share knowledge within the team. It allows each member to self-evaluate. Then we take time each week where an expert helps us go through common operations for a certain part of the stack. “Non-experts” perform the operations and ask clarifying questions. Then we evaluate again.

We sometimes deliberately assign topics to the people who are least comfortable with them. That can sound counter-intuitive, but we see it as a long-term investment in building the right team. Experts support that person along the way to make sure we deliver the right increment at the right pace.

Pick your battles

We had a recipe for an overwhelming situation: too many stakeholders and a wide scope.

We use two key processes to avoid falling under the weight of requests:

  • Being selective when building the roadmap: each quarter, we choose to work on projects related to a few priority parts of the stack, dropping the others.
  • Run and on-call rotation: each week, someone from the team is assigned to intervene if a part of the stack breaks and to provide support for users. This provides continuity of service for everything that already exists.

Unless you choose your battles, it will be too much to handle, especially with rising adoption bringing even more need for support.

Concrete case

The data squad responsible for buses wanted to ingest data from an API (of an external provider). There was no external solution that fit our needs and budget, so we chose to build something internally. We could either:

- Go generic from the start, having the data platform team build it.

- Have the domain squad be autonomous without our intervention.

We collectively chose to have it built and maintained locally by the domain squad as the need wasn’t there for other squads. That cleared room for the platform team to focus on other topics.

Align with key stakeholders

Looking at the same problem from another perspective, when the platform team starts many projects at once, it creates pressure and instability for others. Instead of spending time on their domain-specific topics, the mesh teams would spend too much time absorbing the changes we bring to the infrastructure.

Even with the best intentions, starting too many changes at once proved to have a negative impact on the organization. These situations can generate resistance to change. The reason is simple: the people you want to impact are not available and do not understand why you’re doing it. This is why we plan ahead to check that they can absorb the change, and we make sure they’re part of the project.

Concrete case

What didn’t work

One of our first projects was to rewrite the tracking pipeline for frontend apps, because it had a very complex stack that we were unable to maintain properly. We handled it at first as a platform team project. The new version created new tables that the analytics engineers of all domain squads needed to switch to.

Analytics Engineers had other priorities and did not have the context about why we were making the change. They had no guarantee the new data was correct and no room to work on it.

The situation was hard, and it took us too long to wrap up the project. Communication and negotiation for time allocation were hard as well.

What did work

One example of a large change we’re currently conducting is the move to dbt. The need came from Analytics Engineers, who wanted to change the way they work, and the platform team was there to support the move through infrastructure.

We feel a completely different dynamic. Stakeholders are not just accepting the change, they are leading it.

Flexibility over hard rules

Our platform team allows the organization to avoid reinventing the wheel by providing common tools and patterns. Still, in some cases, domain-oriented squads need to go fast and alone to seize business opportunities, meet a certain deadline or just experiment on their own. And we are totally fine with that.

The data platform is perceived not as a team that decides how things must be done, but rather as a helper. We seek adoption rather than obligation.

Data Mesh has a bias towards flexibility. Having a transverse team shouldn’t change that, but rather empower it. Rules and patterns rise naturally through the tooling provided. This is especially true for first attempts: in many cases, you do not want to go generic at first. You wait until a pattern starts to appear, then invest in making it generic.

To create cohesion in the stack, another pattern we put in place was swapping engineers between the platform team and the domain squads. The idea is that we encourage internal mobility, which, among other benefits, harmonizes practices. People joining the platform team bring domain context; people leaving it for squads become ambassadors of the common platform.

Concrete case

We are currently reorganizing the warehouse. As a platform team, we did not take the position of deciders, but rather helpers. Analytics Engineers and Data Analysts are deciding for themselves how they wish to organize the GCP projects hosting their warehouses. This is mainly led through their chapter.

The platform team provides help in provisioning projects and building the rights management system that fits the analytics teams’ wishes. We only intervene if there’s a serious complexity or security threat linked to a certain choice. Squad members should feel supported and have the flexibility they need to perform their job, rather than face infrastructure requirements and hard deadlines.
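As an illustration of the kind of help this can mean in practice, here is a sketch of granting a group read access on a BigQuery dataset with the google-cloud-bigquery client. The project, dataset and group names are made up, and this is a simplified example rather than our actual provisioning code:

```python
# Sketch of automating dataset-level rights with the google-cloud-bigquery client.
# Project, dataset and group names below are illustrative.
from google.cloud import bigquery

def grant_group_read(client: bigquery.Client, dataset_id: str, group_email: str) -> None:
    """Give a Google group read access to a BigQuery dataset."""
    dataset = client.get_dataset(dataset_id)  # e.g. "my-project.marketing_marts"
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="groupByEmail",
            entity_id=group_email,
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])

# grant_group_read(bigquery.Client(), "my-project.marketing_marts", "data-analysts@company.com")
```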

Innovate around real use cases, not trends

We do not adopt tools because they are trendy. We do it because there’s a business need.

To do so, we usually work hand in hand with a data squad around a real use case. We build something and then scale it to the other teams. We found this pattern to be very effective. It shows through example and measurable results what the new tool or methodology brings. This makes adoption much easier.

Co-building also creates common goals and shared success, which proved to be the best option for working together.

Concrete case

In Q4 2022, we wanted to improve the way we do machine learning by adopting an MLOps culture. That change wouldn’t have been possible with only the platform team working on it.

We knew that the domain squad responsible for fraud detection was looking to improve the lifecycle of its machine learning models, so we joined forces with them.

This gave birth to a work group, where we discussed organizational changes and experimented with tools and patterns. The project led to a significant improvement in the quality of ML in production and in time to production, and cut the cost of implementing a training pipeline by a factor of 4. We then scaled this pattern to other use cases with other teams.

Don’t take on all the roles

At some point, we struggled with the amount and kind of requests we were getting. For many people, the first reflex in case of issues was to contact the data platform team. Although that was flattering, because everyone believed we could help with literally anything, it made the job of the person on Run very difficult. This was especially true in the early days of the data mesh reorganization. Data practitioners came to us with questions out of our scope. This included schema-related matters, access to tools managed by the IT team, issues with the infrastructure owned by the SREs, etc.

Our first reflex was to answer as best we could. And the more useful we were, the more out-of-context requests we got. The solution was to stop assuming roles that aren’t ours and to redirect to the right team. That helped build better dynamics within each domain squad and better links between the mesh nodes and their corresponding backend teams.

Concrete case

To resolve the issue of out-of-context requests, we had a few conversations, among them:

- Within the team. This led us to identify the issue, but also to decide collectively to stop assuming those roles.

- With the engineering managers of the domain squads, so that they can encourage the data experts to ask questions within their own multidisciplinary team before asking the data platform team.

- With the chapter where Data Engineers gather, to expose them to the issue and get their buy-in to provide the first level of support within their groups instead of heading to us.

The number of out-of-context requests dropped drastically as the new organization became clearer to everyone, inner-squad dynamics were created, and the responsibility of the data platform team became more clear. Still, this is always something to maintain, especially for newcomers.

“Build vs. Buy” to reduce your footprint

As mentioned previously, the data platform team builds and maintains multiple home-made products. This doesn’t prevent us from using open-source solutions or 3rd-party products.

The rule of thumb is that we should build it ourselves only if:

  • We really need it.
  • There’s no suitable external alternative, or the cost gap isn’t reasonable.

We always weigh the build vs. buy options on each architectural decision we make. And we revisit our decisions periodically to challenge our stack.

This is very important to keep in-house only what truly needs to be maintained in-house. Otherwise the data platform team would have too much to maintain and less time to focus on what really brings value.

Concrete case

An example of a product we chose to keep building internally is the frontend tracking pipeline. To challenge it, we contacted multiple external service providers, included people from product and marketing, and discussed our options as a group. We eventually saw little interest from end users in the features provided by external solutions, and a huge cost gap. So we chose to keep the existing solution, as it fit our needs.

An opposite case is the ingestion of marketing data (getting data from external providers like Google Ads and Facebook Ads into the warehouse). We chose to keep Rivery for those use cases, as it was more reliable and more cost-efficient than anything we could build ourselves.

We challenge existing choices periodically. Not all at once, but every 6 months we open the debate on some of them.

Include the mesh nodes in the tech watch

We are not driven by trends, but we keep our eyes wide open about what happens elsewhere.

Many channels exist, such as personal tech watches, talks with solution providers and discussions in the French MDN (Modern Data Network). We also exchange knowledge with multiple peers from other companies, letting them know what we do and learning from their experiences. If you’re interested, don’t hesitate to reach out to me on LinkedIn by the way!

We always keep an open mind about change, even for our most precious home-built software.

One very important lesson we learned is that the tech watches we lead should include end users, like Data Scientists or Data Analysts.

Concrete case

We recently decided to challenge our Airflow setup (we use Composer) against other solutions on the market, including deploying Airflow in our own Kubernetes cluster and maintaining it ourselves.

We set up a group from multiple teams, with varied expertise and different experiences using Airflow.

Apart from making others feel included, we absolutely need their expertise to build a collective decision, as it impacts everyone.

Bring meaning

As with any infrastructure team, you’ll only get feedback when things go wrong. Who ever sent their infra team a “Hey! I’m starting to work this morning and everything is smooth. Good job!” thank-you note? Infrastructure teams are heroes in the shadows, and their mission is to make reliability taken for granted.

If you happen to manage such a team, it is very important to celebrate successes, make the link with the squads’ projects and communicate about it, because behind every high-quality business there’s high-quality infrastructure. It’s great for the team’s morale, but it also shows potential stakeholders where you might help them.

Concrete case

Starting with the names of the projects in our roadmap, we make sure to make clear the business impact of projects that might seem purely technical. Something like “support [some squad] in building a reverse ETL to calculate the probability of [a certain business case]”. That name gives meaning to the technical project.

Like with any team, communication is key to sharing and celebrating achievements, and we do that at each significant milestone.

Conclusion

Two years ago, we implemented the data mesh paradigm at BlaBlaCar. Looking back, having a data platform team within our data mesh allowed us to drive purely technical projects that would have been impossible without it. It improved the productivity of our domain squads and, consequently, of our business.

Dedicating resources to such a transverse team fosters sharing, collaboration and governance.

Keeping that balance requires continuous effort: influencing without authority, picking the right battles, handling a wide scope and building shared knowledge within the team.
