How a Data Pipeline Playbook Helps to Succeed With a Digital Transformation

Lotte Ansgaard Thomsen
Published in The Startup
7 min read · Nov 9, 2020

In recent years, many companies have seen a need to adapt to the digital transformation to fulfil future business cases. Companies like Facebook and Google have unlocked great potential by analyzing customer data obtained from their platforms. The next round of success in the industry may come when IoT data is available and analyzed. A great question is therefore: how do we succeed with this next step? This blog post describes the biggest challenges I have heard from several companies, and how we try to solve them at Grundfos:

Around two years ago I switched from a research career investigating one of the world’s largest datasets at CERN to work in a private company. The idea was clear: the data analysis experience I had gathered at CERN could be used to solve other problems with other datasets, using the same analytical and AI methods. A little naïve, I made the presentation: “How can Big Data experiences from CERN be used in the real world?” Theoretically, the answer I gave was OK: I described how to improve AI and algorithms as more data is collected. However, over the last few years I have attended several data summits and workshops. These meetings made me realize that I had probably underestimated two aspects of my time at CERN, which made my approach to the above question far from optimal for the audience:

The maturity of CERN processes versus the maturity of digital projects in the industry:

CERN has been collecting and analyzing data for over 50 years and is therefore best compared with the most mature part of most companies. CERN has developed standards for data questions like: How is a large dataset stored optimally and made available to thousands of global researchers? How to trigger and decide what data to store? How is code shared with reference data? How to validate the quality of the data and the end-to-end data pipeline? How to determine whether an error is due to production code or device problems? How to label data via simulations? How to analyze data? How to maintain AI in production? How to evaluate and show results? And many more.

The first challenge in digital transformation projects: Best practices and standards are not yet put in place in established processes.

The diversity among employees at CERN versus digital projects in the industry:

CERN is both a research facility and an educational institution. Every year, around 200–300 university students spend two to four months at the CERN summer school, and many of these students also write major university assignments at CERN. Most CERN researchers-to-be therefore already know the processes, tools, theories, and culture of CERN by the end of their education. In addition, almost all employees working at CERN have a PhD from CERN, creating an almost too uniform background. The advantage of this is that all employees share a common language and a common understanding of the task.

This is very different from the industry. Relevant degree programs in data science and data engineering are only now being created and developed at universities. Technical tasks are therefore solved by people with very different backgrounds, cultures and languages: mathematicians, physicists, software engineers, computer scientists and more. A common language and understanding within such a heterogeneous group of technical employees is needed for them to create solutions and communicate their roles and tasks consistently. Communication is especially important since many stakeholders and domain experts in the industry have non-technical backgrounds: their professional expertise lies outside data work. Clear communication about technical tasks and profiles is needed for stakeholders to prioritize correctly and reach their goals, goals that also lack standards for how they should be measured.

The second challenge in digital transformation projects: There is no clear language and understanding of which profile is needed for which task to reach a goal.

So how do we handle the above challenges? At Grundfos we have implemented two very promising approaches:

Best practices and standards are not yet put in place in established processes

It is not feasible for industry stakeholders to spend 50 years developing best practices as CERN did. The process for the industry needs to be quicker and less costly. The good news is that many different industries can work together, as their business cases and customers do not overlap, so there are no competition problems. Grundfos is entering into collaborations with other companies, sharing experiences and knowledge to hopefully shorten the path to good practices, processes and measurements of success. This will also make it easier for newcomers to understand the processes in a new company, reducing the time spent training people.

  • One place of collaboration is with software companies like Microsoft. Microsoft uses feedback from Grundfos to build the software tools of tomorrow. At Grundfos, we benefit from this since the tools then fit the problems we face.
  • Another place of collaboration is the open-source community, the Airplane alliance. The Airplane alliance creates open-source material while building industry standards together in different settings, such as co-creation workshops. Thereby learning can easily be transferred between companies, and the open-source material eases the burden on the individual company to build the standards of the digital transformation.

There is no clear language and understanding of which profile is needed to solve specific tasks to obtain the desired goal

The first step towards understanding is establishing a common language among a diverse group of people. A common language allows issues to be discussed and understood. To establish one at Grundfos, we created the Data Pipeline Playbook. This is the first in a series of playbooks whose goal is to describe the tasks involved in a data project and the responsibilities of each participating role. The idea is that if technical people can describe and agree on tasks and responsibilities for data projects, then stakeholders can prioritize which profiles and tasks are needed in the different phases of a project to make it successful.

  • Creation of the Data Pipeline Playbook is an iterative process, and the first version was created by a group of 10 data engineers. The playbook was used to start a broader Data Pipeline Community in Grundfos with more than 70 participants. The Data Pipeline Playbook has been shared in both an offline and an online version, inside and outside Grundfos. Thereby more people can contribute to implementing, discussing and improving the subjects and language in the booklet.
  • The Data Pipeline Playbook describes the responsibilities and tasks of a so-called Data Pipeline Owner (data engineer) and important definitions for this. Some examples are:
  • A common data pipeline definition is given, which in short is: “An arbitrarily complex chain of processes that manipulate data, where the output data of one process becomes the input of the next.” It is explained how a data pipeline differs from other similar concepts and what the best practices are.
  • The main responsibility of a Data Pipeline Owner is to ensure an atomized and traceable data pipeline. Data must flow from the data source through analytical models and into UIs/APIs, while also being stored in centrally maintained data storage.
    A checklist describes what must be done in the different phases of a project for each of the main topics: Data Ingestion, Data Synchronization, Data Pipeline Architecture, Data Governance, Performance Optimization, Production Orchestration and Automation. Furthermore, a list is given of the most used tools for Data Pipeline Owners.
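As a minimal illustration of the pipeline definition above, a chain of processes can be sketched as functions where the output of one step becomes the input of the next, with simple logging for traceability. This is only a sketch under my own assumptions; the function names and data are invented for illustration and are not Grundfos tooling:

```python
# Sketch of the data pipeline definition: a chain of processes where
# the output of one step becomes the input of the next.
# All names and data here are illustrative, not Grundfos's actual tooling.

def ingest(raw_records):
    # Data Ingestion: parse raw sensor readings into numeric values
    return [float(r) for r in raw_records if r.strip()]

def clean(values):
    # Data quality step: drop physically implausible readings
    return [v for v in values if 0.0 <= v <= 100.0]

def aggregate(values):
    # Analytical step: reduce the readings to a summary statistic
    return sum(values) / len(values) if values else None

def run_pipeline(raw_records, steps):
    # Each step's output is the next step's input; log every stage
    # so the pipeline stays traceable end to end
    data = raw_records
    for step in steps:
        data = step(data)
        print(f"after {step.__name__}: {data}")
    return data

result = run_pipeline(["21.5", "19.0", "150.0", ""], [ingest, clean, aggregate])
# result is 20.25: the empty record is dropped at ingestion,
# the out-of-range 150.0 is dropped at cleaning, and the
# remaining two readings are averaged
```

In a real project each step would of course be a separate, monitored process (as covered by the checklist topics above) rather than an in-memory function call, but the chaining principle is the same.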

Please contact me if you want a copy of the full Data Pipeline Playbook from Grundfos (lthomsen@grundfos.com).

By Lotte Ansgaard Thomsen
Lead Big Data Engineer, Grundfos

About me:

Lotte Ansgaard Thomsen is a Lead Big Data Engineer at Grundfos. Lotte has a background as a researcher at Yale University, investigating and analyzing data from CERN. The experience gained by working on one of the world’s largest datasets has given her deep knowledge of what is important for successful IoT/AI projects. For the last two years, Lotte has been working at Grundfos, using her experience to create guidelines for data pipelines, data quality and data architecture.

Lotte believes that with collaboration, proper data handling and robust algorithms, the world can be improved with these new capabilities.
