How a Data Foundation in the industry can be built on the discovery of the Higgs particle at CERN
I recently wrote the tech blog post "How a Data Pipeline Playbook Helps to Succeed With a Digital Transformation": https://medium.com/swlh/how-a-data-pipeline-playbook-helps-to-succeed-with-a-digital-transformation-e75facddd68c. A key point in that post was that the digital transformation of industry is a new adventure that needs new best practices.
A frequent question about that post is whether industry can adopt the best practices developed at CERN. In this blog, I will try to answer that question. More specifically, I will zoom in on the problems of IoT streaming data: can the experiences from CERN be used to create a data foundation for IoT streaming data in industry?
First, let us look at who CERN and Grundfos are. Grundfos is not unique among companies undergoing a digital transformation, but I picked Grundfos because it has been my point of reference for the last two years:
- CERN is a research organization, and its main asset is the Large Hadron Collider (LHC): a circular particle collider that collides particles at four detector sites. I was part of the ATLAS experiment, whose detector was designed and funded in 1995 and started recording data 15 years later. The detector was built to record data relevant to confirming or rejecting the existence of the Higgs boson.
- The hunt for the Higgs particle is often characterized as "finding a needle in a haystack", except that the needle disappears almost instantaneously: the goal is to find the imprint the needle left in the haystack before it disappeared. The Higgs boson's lifetime is too short to detect it directly, so its presence can only be inferred by observing its decay products.
- Grundfos is the largest manufacturer of pumps in the world, and it has kept its place as a market leader for many years by maintaining a constant focus on the future. Today that means a heavy emphasis on digital transformation. On top of selling pumps, Grundfos plans to create digital solutions via agile principles. For example: 1) self-adjusting pumps that ensure a comfortable indoor climate in buildings with reduced resource expenditure; 2) predictions for preemptive maintenance of pipes in water transportation, allowing for a more stable solution in which water spills are avoided.
- Today, most companies embrace the agile way of working: a simple product is made available to the customer, and via customer feedback the initial solution is improved to pave the way for more advanced solutions. The starting point of a service is often to expose data to the customer and alert the customer to problems; advanced solutions would ultimately center around forecasts.
CERN and Grundfos are clearly in two very different situations.
- CERN spent about 15 years designing and building the experiment in full awareness of the ultimate business case: the discovery of the Higgs boson. The focus of every technical decision was how it would impact the final result of the Higgs boson searches.
- In industry, the first wave of sensors is currently being marketed alongside the first phase of digital service development. The new digital solutions are to be continuously improved through customer feedback. The specific technical end product is not clear, and the challenge is not only to obtain the right data, but also to define the problem the data should help to solve.
The full approach to designing an IoT data foundation therefore cannot be transferred directly from CERN to industry. Two aspects, however, can:
A clear communication pipeline is needed for creating a strong data foundation
There needs to be a common understanding between the employees working on the device, the technical data people building the data foundation, and the business people facing the customer. An important aspect to transfer from the CERN approach is the constant focus on the business result. In the agile landscape, customer feedback is essential to form an idea of the long-term business case. There is also a need for strong communication between those responsible for the device and the technical data people, who need a basic understanding of what data the device sends in order to interpret it. The technical data people must include both data scientists, who use the data, and data engineers, who build the foundation; the data foundation cannot be dismissed as a technically detached decision made by data engineers alone. Any weak link in the communication chain from data recording to business case implementation weakens the data foundation that is there to support future services.
The topics considered in creating a data foundation
The methods considered at CERN can serve as inspiration for solutions in industry. In depth, these methods make for some pretty technical topics, but I hope even an overview of the solutions can inspire answers to the corresponding industry problems:
- Storage bandwidth: Even CERN has limited bandwidth available to store data, and every experiment is anxious not to discard data it needs to succeed. The research groups therefore constantly analyze the impact of different bandwidth allocations on the expected results of their Higgs boson searches. Similarly, in industry, storing and transferring data costs time and money. Storing all recorded data invariably drives up the cost and time spent analyzing and maintaining it, but discarding the wrong data severely limits the possible solutions.
- Triggers: All detectors at CERN use triggers: online systems that select which data are of potential interest for further analysis. Only data passing the triggers are recorded on permanent storage. A trigger is often split into a hardware-based part and a software-based part, where the software-based part only considers the events that passed the hardware-based trigger criteria. Developing triggers is a constant, non-trivial task at CERN, and their complexity is ever-increasing. A trigger is the result of repeated analyses of how it will impact the business case: the search for the Higgs boson. In industry the problem is the same: some parts of a solution can be applied on the edge, some need the data in the cloud, and some should have data available and stored offline to improve the solution. Ultimately, all of these factors influence the final business cost and possibilities.
- Data storage and computations: The LHC at CERN has its own computing resources for data storage, distribution, and analysis: the Worldwide LHC Computing Grid. The data is stored in file types developed for CERN. The raw data lands in a storage layer whose files have the flexibility to vary, for example, the number of particles, but a fixed overall structure. The next layer applies a common calibration and skimming to make the data accessible to more users without every user repeating the same steps. Finally, a flatter file type is often created, which a group of analysts can share to easily visualize and interpret the final results. Data are versioned via tags covering the install base, data owners, experimental setup, and code versions; the mapping of these tags can be found in the data catalogue. The technical implementation of data storage at CERN is proprietary, but the design philosophy can be transplanted to industry, such as the idea of having different storage layers for different purposes: the first, raw layer needs the flexibility to allow varying setups, modes, and data outcomes, while the last layer should facilitate visualization of the end result.
- Data quality: At CERN, the study of data quality ensures a solid data foundation and provides a shortcut to quickly troubleshoot problems in the detector. Data quality validation is divided into an online part and an offline part. The online part checks, both via algorithms and manned shifts, that data are recorded as expected by a functional detector. The offline part involves a deeper study of the data in storage. In both cases, the method is to validate the recorded data against reference data, where the data properties examined are those identified as relevant to securing the Higgs boson searches. There should be no data drift, as drift leads to deviations in the decisions made to achieve the desired Higgs boson result.
- Always investigate the needs: New ideas that are not relevant at CERN should of course also be considered. At CERN, there was no strong need for reasoning over metadata, whereas this is extremely important in Google's search machinery. At Grundfos, this idea from Google has been applied with great success in some digital services: knowledge graphs store the metadata, which is then mapped to the streaming data. This makes it possible to reason easily with metadata, to store data in a relational database despite its more advanced structure, and to onboard new customers easily despite different setups in different locations.
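The storage bandwidth trade-off above can be made concrete with a back-of-the-envelope calculation of what a device fleet writes per month under different sampling policies. The fleet size, message size, and price per gigabyte below are invented example numbers, not Grundfos or CERN figures:

```python
# Sketch: monthly storage volume and cost for a fleet of streaming devices.
# All numbers (device count, message size, price per GB) are hypothetical.

def monthly_storage_gb(devices, messages_per_sec, bytes_per_message):
    """Total data volume written by the fleet in a 30-day month, in GB."""
    seconds_per_month = 30 * 24 * 3600
    total_bytes = devices * messages_per_sec * bytes_per_message * seconds_per_month
    return total_bytes / 1e9

def monthly_cost(devices, messages_per_sec, bytes_per_message, usd_per_gb=0.02):
    """Storage cost per month at a flat (assumed) price per GB."""
    return monthly_storage_gb(devices, messages_per_sec, bytes_per_message) * usd_per_gb

full = monthly_cost(100_000, 1.0, 200)         # store every reading
downsampled = monthly_cost(100_000, 0.1, 200)  # keep 1 in 10 readings
```

Running numbers like these for each candidate policy is the industry analogue of CERN's repeated analyses of bandwidth allocations: it shows exactly what a "store everything" default costs before the data has proven its value.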
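The two-stage trigger idea can be sketched in a few lines: a cheap, coarse cut runs on every event, and a more expensive selection runs only on the survivors. The event fields and thresholds below are invented for illustration and stand in for whatever fast and slow features an industrial edge device can compute:

```python
# Sketch of a two-stage trigger: only events passing both stages are stored.
# "energy" plays the role of a cheap hardware-level measurement and
# "n_tracks" a more detailed software-level feature (both names invented).

def hardware_trigger(event, energy_threshold=50.0):
    """Fast, coarse cut: cheap enough to evaluate on every event."""
    return event["energy"] > energy_threshold

def software_trigger(event, min_tracks=3):
    """Slower, more detailed selection, run only on surviving events."""
    return event["n_tracks"] >= min_tracks

def run_triggers(events):
    """Events reaching permanent storage must pass both stages in order."""
    after_hw = [e for e in events if hardware_trigger(e)]
    return [e for e in after_hw if software_trigger(e)]

events = [
    {"energy": 80.0, "n_tracks": 5},
    {"energy": 20.0, "n_tracks": 9},   # rejected by the hardware stage
    {"energy": 95.0, "n_tracks": 1},   # rejected by the software stage
]
stored = run_triggers(events)  # only the first event survives
```

The thresholds are where the business analysis lives: tightening them saves bandwidth but may discard exactly the data a future service needs.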
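The layered-storage philosophy (raw, then calibrated/skimmed, then flat) can be sketched as three small transformations. The record fields and the calibration factor are invented for the example; a real pipeline would run on files, not lists:

```python
# Illustrative three-layer storage design: raw -> calibrated/skimmed -> flat.
# Field names and the calibration scale are hypothetical.

def land_raw(readings):
    """Raw layer: flexible records, variable number of channels per reading."""
    return [{"device": r["device"], "channels": r["channels"]} for r in readings]

def calibrate_and_skim(raw, scale=1.02):
    """Shared second layer: apply one common calibration and drop empty
    readings, so every analyst does not repeat the same steps."""
    out = []
    for rec in raw:
        if rec["channels"]:
            out.append({"device": rec["device"],
                        "channels": [c * scale for c in rec["channels"]]})
    return out

def flatten(calibrated):
    """Final layer: flat rows that are easy to visualize and share."""
    return [{"device": rec["device"],
             "mean": sum(rec["channels"]) / len(rec["channels"])}
            for rec in calibrated]

readings = [{"device": "pump-1", "channels": [1.0, 2.0]},
            {"device": "pump-2", "channels": []}]
flat = flatten(calibrate_and_skim(land_raw(readings)))
```

The point is the separation of concerns: only the raw layer has to tolerate varying setups, and only the flat layer has to be convenient for analysts.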
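The offline part of data quality validation, comparing recorded data against reference data, can be sketched as a simple drift check on a monitored property. The sensor values and the shift/spread thresholds below are arbitrary examples, not a recommendation:

```python
# Sketch of offline data-quality validation: flag drift when a newly
# recorded window deviates too far from a reference window.
# Thresholds and sample values are invented for illustration.
import statistics

def drift_check(reference, recorded, max_mean_shift=0.1, max_std_ratio=1.5):
    """Return True if the recorded window is consistent with the reference:
    its mean has not shifted too far and its spread has not grown too much."""
    mean_shift = abs(statistics.mean(recorded) - statistics.mean(reference))
    std_ratio = statistics.stdev(recorded) / statistics.stdev(reference)
    return mean_shift <= max_mean_shift and std_ratio <= max_std_ratio

reference = [10.0, 10.1, 9.9, 10.0, 10.2]   # known-good readings
good = [10.05, 9.95, 10.1, 10.0, 9.9]        # passes the check
drifted = [11.0, 11.2, 10.9, 11.1, 11.3]     # mean has shifted: fails
```

A production check would monitor many properties and full distributions, but the principle is the same as at CERN: the properties you validate are the ones your business result depends on.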
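The knowledge-graph idea, storing metadata as a graph and mapping it onto streaming data, can be sketched with subject-predicate-object triples. The device names, relations, and site layout below are all invented for the example:

```python
# Toy sketch of metadata as a knowledge graph: (subject, predicate, object)
# triples, used to enrich streaming readings. All names are hypothetical.

triples = [
    ("pump-1", "installed_in", "building-A"),
    ("pump-2", "installed_in", "building-B"),
    ("building-A", "located_in", "Copenhagen"),
    ("building-B", "located_in", "Copenhagen"),
]

def objects_of(subject, predicate):
    """All objects related to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def city_of(device):
    """Reason over two hops in the graph: device -> building -> city."""
    for building in objects_of(device, "installed_in"):
        for city in objects_of(building, "located_in"):
            return city
    return None

# Enrich a streaming reading with graph metadata; the same code works
# regardless of how a particular customer's site is laid out in the graph.
reading = {"device": "pump-1", "flow": 3.2}
enriched = {**reading, "city": city_of(reading["device"])}
```

Because customer-specific structure lives in the graph rather than in the table schema, onboarding a new customer with a different setup means adding triples, not changing the pipeline.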
The overall conclusion is that since digital transformation is a new discipline in industry, it is beneficial to explore best practices from the world of research. The digital challenges facing industry are set in a framework that is very different from the academic domain. Shortcuts to solutions are, however, available by carefully considering methods used in research fields with a strong streaming-data focus, such as astronomy and particle physics. If nothing else, it is worth considering the motivation and implementation of data practices in research projects, and how those practices brought results.
After all, although the World Wide Web was developed at CERN, it has proven very useful far outside of CERN.
By Lotte Ansgaard Thomsen
Lead Big Data Engineer, Grundfos
Lotte Ansgaard Thomsen is a Lead Big Data Engineer at Grundfos. She has a background as a researcher at Yale University, investigating and analyzing data from CERN. The experience gained by working on one of the world's largest datasets has given her deep knowledge of what is important for successful IoT/AI projects. For the last two years, Lotte has been working at Grundfos, using her experience to create guidelines for data pipelines, data quality, and data architecture.
Lotte believes that with collaboration, proper data handling, and robust algorithms, the world can be improved with these new capabilities.