The building blocks of successful Data Teams

Niels Claeys
Published in datamindedbe
7 min read · May 3, 2024

During my 7 years of experience in the data field, I have worked in multiple data teams, some of which were successful and others not. In this blog post I describe 5 building blocks that are crucial for the success of a data team.

Image generated by Midjourney with the prompt: a data team celebrating success

Assign data ownership

Data ownership begins with assigning clear responsibilities to dedicated teams for building and maintaining different components of your data landscape, such as pipelines, applications, and dashboards. I’ve often encountered situations where no one was accountable for a failing component, leading to a blame game among teams more focused on self-preservation than solving the actual problem.

Attempting to rectify the issue independently can be time-consuming, as it often requires understanding how each component interacts, locating the code, and learning the product’s deployment process. To streamline this process, up-to-date documentation is invaluable, although it’s unfortunately rare to find extensive documentation. In addition to documenting your work, standardizing practices can also greatly assist in these scenarios.

A second aspect of ownership is developing a product mindset within your teams. This approach focuses on how people will use your product and emphasizes simplicity. It often requires thinking beyond the initial use-case and considering potential future uses for the data.

Several years ago, I was part of a project where we didn’t apply this product thinking. We developed a custom flow for each company requesting our data. Consequently, we ended up with six APIs that all used the same input data but exposed it with minor differences, such as data format and exposed properties. The key takeaway was the need for better foresight regarding the long-term impact of this approach. Over time, this setup became difficult to maintain, as every data change required modifications in six different repositories.

An important aspect of defining your data product is specifying service level objectives (SLOs) and service level agreements (SLAs). These can cover data update frequencies or product uptime guarantees. For potential consumers of your data, the uptime SLA might be a determining factor in their decision to use or not use your product.
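As an illustration, here is a minimal sketch of how a team could verify a freshness SLO as part of a pipeline run. The threshold and helper name are assumptions for the example, not part of any specific product:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLO: the dataset must never be more than 6 hours stale.
FRESHNESS_SLO = timedelta(hours=6)


def check_freshness(last_updated: datetime) -> None:
    """Fail loudly when the dataset violates its freshness SLO."""
    staleness = datetime.now(timezone.utc) - last_updated
    if staleness > FRESHNESS_SLO:
        raise RuntimeError(
            f"Freshness SLO violated: data is {staleness} old "
            f"(allowed: {FRESHNESS_SLO})"
        )


# In a real pipeline, `last_updated` would come from your table metadata.
check_freshness(datetime.now(timezone.utc) - timedelta(hours=2))  # passes
```

Running a check like this on every pipeline execution turns the SLO from a promise in a document into something that is continuously enforced.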

Focus on business outcomes

Every data engineer has their preferred technologies. Personally, I like writing code in Golang and deploying applications on Kubernetes. However, what sets a great data engineer apart is the ability to choose the best solution for a use-case, rather than their personal preference. In my work, this often means writing data transformations in Python instead of Scala, as our customers are more familiar with it.

https://imgs.xkcd.com/comics/automation.png

As an engineer, it is important to remember that the business doesn’t care about your technical implementation. They’re more concerned with the output of your work: whether the data is correct or whether a new feature simplifies their job. They won’t care whether you used Snowflake instead of Postgres to store data, unless it makes a tangible difference to them, such as faster query speeds.

I’ve also learned that, when given the choice between rewriting a flow for maintainability or just adding code, business stakeholders usually favor the quickest option. Maintainability isn’t as relatable for them since it mainly impacts technical work. So, it’s up to you to make this call and convince them if necessary.

I recall a project where three software engineers wanted to migrate our database from Postgres to Cassandra because of performance issues. This switch would have required weeks of work to deploy Cassandra on-premise and migrate all our data. However, after a short investigation into optimizing the slow queries, we found that just adding two indices solved all our problems, and that solution has lasted for years.
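For context, this is roughly what such an investigation can look like. A minimal sketch in Python against Postgres; the table, column, and connection details are illustrative assumptions:

```python
import psycopg2

# Illustrative connection; adjust to your environment.
conn = psycopg2.connect("dbname=shop user=app")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    # Step 1: inspect the query plan. A "Seq Scan" on a large table
    # is a strong hint that an index is missing.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42")
    print("\n".join(row[0] for row in cur.fetchall()))

    # Step 2: add a targeted index without blocking writes, which can be
    # orders of magnitude cheaper than migrating to a new database.
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer "
        "ON orders (customer_id)"
    )
```

An hour spent reading query plans can save weeks of migration work.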

Your primary focus should always be delivering value to the business, as they are your customers and pay for your work. This should take precedence over personal development, such as learning a new technology or programming language.

Use software best practices

Many best practices in data engineering align with those in software engineering. As software engineering has a longer history, we can adopt proven practices rather than trying to invent new ones. While writing code is crucial, it’s just one component of delivering a new feature. Many other practices simplify development and keep code maintainable. Here is a non-exhaustive list of these additional aspects:

  • versioning your code
  • writing unit and integration tests (see the sketch below)
  • setting up a CI/CD pipeline for your project
  • including monitoring and tracing
  • writing documentation
  • managing your infrastructure as code

https://imgs.xkcd.com/comics/data_pipeline.png
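To make the testing bullet concrete, here is a minimal sketch of a unit test for a small data transformation, using pandas and pytest. The function and column names are illustrative, not taken from a specific project:

```python
import pandas as pd


def deduplicate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the most recent row per order_id."""
    return (
        df.sort_values("updated_at")
          .drop_duplicates(subset="order_id", keep="last")
          .reset_index(drop=True)
    )


def test_deduplicate_orders_keeps_latest_row():
    df = pd.DataFrame(
        {
            "order_id": [1, 1, 2],
            "status": ["pending", "shipped", "pending"],
            "updated_at": pd.to_datetime(
                ["2024-01-01", "2024-01-02", "2024-01-01"]
            ),
        }
    )
    result = deduplicate_orders(df)
    assert len(result) == 2
    assert result.loc[result["order_id"] == 1, "status"].item() == "shipped"
```

Tests like this are cheap to write and catch regressions whenever the transformation logic changes.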

For me, a good data engineer knows all these aspects and incorporates them when implementing a new use-case. I have written a previous blog post on why I believe data engineers should be more like software engineers, which you can read here. A person who skips these practices, only writing code and deploying it ad hoc to an execution environment, is a cowboy, not an engineer.

I know that following this approach might require more time in the short term, but the extra investment easily pays for itself in the long run. When quality issues or bugs surface, or when version upgrades need to be performed, these practices act as guardrails and speed up your work.

Deliver a self-service data platform

When working in larger data organizations (>15 FTE), a data platform team typically creates building blocks, so-called paved roads, for the different data teams to use. These generic building blocks are designed to be reusable across various projects. Here is a non-exhaustive list of popular building blocks:

  • Workflow scheduler (Airflow, Dagster, …)
  • Compute environment (Kubernetes, AWS Batch, Lambda)
  • Data storage (S3, AWS RDS, Snowflake)
  • Data processing engines (Spark, dbt, Polars)
  • Data access control (PBAC, RBAC)
  • Project templates (cookiecutter)
  • CI/CD pipeline (GitHub Actions, AWS CodeBuild)
  • Monitoring and logging capabilities (ELK stack, Prometheus-Grafana-Alertmanager stack)

Image created by my colleague Jonny Daenen

It’s essential for these building blocks to be self-serviceable to prevent the data platform team from becoming a bottleneck. The goal is to enable feature teams to work autonomously, without being dependent on the platform team’s availability. By automating these capabilities you will be able to support more use-cases with the same number of people. Reducing friction ensures that most use-cases utilize these streamlined paths, allowing for swift delivery of new use-cases.

The building blocks don’t have to support every feature needed by other data teams but should adhere to the 80/20 rule. By covering 80% of use-cases with common building blocks, the platform team enables the data teams to quickly deliver most standard use-cases. If a data team needs to deviate from the standard process, they can, but they’ll have to reimplement some of the common building blocks for their use-case.
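To give a feel for what such a paved road looks like in practice, here is a minimal sketch of a feature team scheduling a daily job on a shared Airflow instance. The DAG and task names are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_raw_events():
    # The team's actual ingestion logic lives in its own package;
    # the platform team only provides the scheduler and compute.
    print("ingesting raw events")


with DAG(
    dag_id="ingest_raw_events",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # `schedule` replaces `schedule_interval` in Airflow 2.4+
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest", python_callable=ingest_raw_events)
```

Because the team only writes the DAG and its business logic, it never has to wait on the platform team to deploy or operate the scheduler itself.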

Additional benefits of having streamlined paths in your organization include:

  • Most projects share a similar structure and are therefore easy to maintain and adapt by everyone in your team.
  • Introducing new building blocks is easy, as they can rely on the common structure.

Create a company-wide data strategy

This last aspect is not really a trait of the data team itself, but rather of the organization as a whole. For a data team to thrive and elevate the data maturity of the organization, everyone must recognize and act on its importance. This begins with management acknowledging its significance. In addition to management buy-in, it’s crucial to hire skilled individuals at all levels of the data team. Ultimately, everyone in the organization should understand the data strategy and how it benefits the company.

A few years ago, I experienced the effects of a misalignment between a data team and the business at a company I worked for. The data team developed many new data products that streamlined business operations. However, the business struggled to adopt these products, evidenced by:

  • Challenges in collaborative discussions to understand business needs
  • Extended validation periods for new features due to low business priority

The result was that these data products were rarely used, leading to frustration within the data team. The solution was to initiate the implementation of a use-case only when a clear business stakeholder was identified. This person served as a single point of contact for the team and promoted the new way of working among their colleagues.

Conclusion

In this blog post, I’ve identified five key characteristics that I believe are essential for a data team’s success:

  • Ownership of data
  • A focus on business outcomes
  • Adoption of software engineering best practices
  • Provision of a self-service data platform
  • Recognition that data strategy is a company-wide concern

I hope these points can help enhance your data team’s success.

If you think there are other characteristics to consider, or if you disagree with any of my points, please share your thoughts in the comments section.

Niels Claeys
Data (platform) engineer @Data Minded with a fondness for distributed systems. Loves: AWS, K8s, Spark, DuckDB