P&ID to iTwin: Accelerating Digitalization using Machine Learning

Co-written by Karl Alexandre Jahjah and Justin Dehorty

Published in iTwin.js · 5 min read · Aug 18, 2022

Creating a Digital Twin can be a daunting task. Even with the help of excellent tools such as Classifiers, a real-world object’s sheer size and complexity can often represent a significant roadblock to successfully creating a digital twin.

Consider the following scenario:

Imagine that it’s your first day on the job at the Nuclear Regulatory Commission. As your first task, you are given the huge responsibility of digitalizing the contents of 40 years’ worth of hand-drawn and scanned Process and Instrumentation Diagrams (P&IDs) as a part of an undertaking to create digital twins for various nuclear power plants.

You are not an expert with P&IDs, so your supervisor gets you started on a plant that has “only” a few dozen P&ID sheets. You manage to locate a legend sheet of all possible symbols to use in deciphering the drawings, and you quickly realize why no one has done this task before you: there are dozens of other plants with documents spanning decades, and the symbols used are far from standardized, changing with the location and the era. You drink the last sip of your fifth coffee and realize you will probably retire before you are able to complete the massive task at hand.

Perhaps you could accelerate the labeling workflow with a software-based solution that parses the P&ID documents and labels all of the symbols for you. Due to the critical role of nuclear plants, you would have to review the program’s labels to double-check for accuracy, but at least this would save you vast amounts of time and reduce your workload from tedious labeling to somewhat less tedious reviewing. However, many diagrams were made decades ago with old technology, and many are even hand-drawn. Due to the variance between symbols, even a highly optimized algorithm based on handcrafted heuristics would struggle in terms of accuracy.

Enter Machine Learning.

Thanks to recent advances in Machine Learning (ML), such a workflow is finally possible. The iTwin Platform’s new P&ID Labeling Workflow leverages a state-of-the-art Machine Learning architecture to identify symbols quickly and reliably in the diagrams.

You start by creating a project and importing the data that you have for a specific asset. The solution supports .pdfs (single page or multipage, vector or raster) as well as common image formats (.jpg, .png, .bmp, etc.). You import a 35-page .pdf and launch the processing before heading for lunch with some colleagues. You come back an hour later to look at your results; the ML model found over 9000 components on your P&IDs.

But you know ML models are not perfect. They are trained to recognize symbols by being shown hundreds of examples of different components, taken from a wide variety of data sources. Even after much training, ML models are still only “human” and can make mistakes. Because of that, they can benefit from the help of an actual human to validate their predictions. So, you open the first sheet and start reviewing the results.

The results are listed by component category: ball valves, centrifugal pumps, flow transmitters, and so on, with all rare or unusual components lumped together in a Generic Equipment category. You open one of these categories and start reviewing the individual elements. Each element is located on the P&ID sheet with a bounding box that you can quickly zoom to when querying an object, and most components also have a user label or tag. This is because the ML pipeline combines object detection with optical character recognition (OCR): the OCR detects the text on the sheet, and a third model associates each piece of text with the symbol it describes and extracts unique identifiers.
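To make the association step concrete, here is a minimal sketch of how text spans could be attached to detected symbols. It is not the platform’s method: the article says a trained model does this, whereas the sketch swaps in a naive nearest-center heuristic. The `Box`, `Symbol`, and `TextSpan` structures, the `associate` function, and the `max_distance` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical structure)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


@dataclass
class Symbol:
    category: str  # e.g. "ball valve", as produced by object detection
    box: Box


@dataclass
class TextSpan:
    text: str  # e.g. "BV-1021", as produced by OCR
    box: Box


def associate(symbols, spans, max_distance=150.0):
    """Attach each OCR span to the nearest symbol within max_distance pixels.

    This is a stand-in heuristic, not the trained association model.
    Spans that are too far from every symbol are dropped.
    """
    labels = [[] for _ in symbols]
    for span in spans:
        sx, sy = span.box.center
        best_index, best_dist = None, max_distance
        for i, symbol in enumerate(symbols):
            cx, cy = symbol.box.center
            dist = hypot(sx - cx, sy - cy)
            if dist <= best_dist:
                best_index, best_dist = i, dist
        if best_index is not None:
            labels[best_index].append(span.text)
    return list(zip(symbols, labels))


# Made-up detections for a single sheet.
symbols = [
    Symbol("ball valve", Box(100, 100, 140, 140)),
    Symbol("centrifugal pump", Box(600, 400, 680, 480)),
]
spans = [
    TextSpan("BV-1021", Box(95, 60, 150, 80)),
    TextSpan("P-204", Box(610, 490, 670, 510)),
    TextSpan("NOTE 3", Box(1500, 50, 1560, 70)),  # far from any symbol
]

for symbol, texts in associate(symbols, spans):
    print(symbol.category, texts)
```

A distance threshold like this would break down on dense sheets where tags sit closer to a neighboring symbol than to their own, which is presumably one reason a learned model is used instead.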

The first sheet you are reviewing is handwritten, and given the quality of the image, you are pleasantly surprised that your first few elements are correctly classified, so you approve them. Later, you spot some cases where the OCR has confused letters and numbers (8 instead of B) or where the algorithm has included a generic tag for the make of a valve in the unique identifier for the equipment. You correct the user labels and approve the reviewed components. Had you known the expected tag format for specific components, you could have supplied it to the algorithm as regular expressions, and it could have made most of these substitutions for you automatically.
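As a rough illustration of how a known tag format can drive such corrections, the sketch below coerces an OCR reading into an assumed shape: a fixed number of letters, a dash, and a fixed number of digits. The format, the confusion table, and the `normalize_tag` function are assumptions made for this example; the article does not describe how the platform applies user-supplied regular expressions.

```python
import re

# Characters OCR commonly confuses (illustrative, not exhaustive).
TO_LETTER = {"8": "B", "0": "O", "1": "I", "5": "S"}
TO_DIGIT = {letter: digit for digit, letter in TO_LETTER.items()}


def normalize_tag(raw, letters=2, digits=4):
    """Coerce an OCR reading into a hypothetical '<letters>-<digits>' tag format.

    Returns the corrected tag, or None if the reading cannot be made to fit.
    """
    # Drop whitespace and separators so only the tag characters remain.
    compact = re.sub(r"[\s\-_.]", "", raw.upper())
    if len(compact) != letters + digits:
        return None
    # Where a letter is expected, map look-alike digits to letters, and vice versa.
    prefix = "".join(TO_LETTER.get(c, c) for c in compact[:letters])
    number = "".join(TO_DIGIT.get(c, c) for c in compact[letters:])
    candidate = f"{prefix}-{number}"
    # Validate the result against the expected format.
    pattern = rf"[A-Z]{{{letters}}}-\d{{{digits}}}"
    return candidate if re.fullmatch(pattern, candidate) else None


for reading in ["8V-1O21", "BV 1021", "BV-10Z1", "GATE"]:
    print(repr(reading), "->", normalize_tag(reading))
```

Readings that cannot be reconciled with the format return `None`, so they can be left for manual review rather than silently rewritten.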

You continue the process, approving correctly predicted elements, deleting false detections, adjusting the bounding box when necessary, and changing the class or the user label when a mistake occurs. For classes that don’t have unique identifiers, such as pipe reducers, you quickly highlight all elements and review any false detections individually before approving the rest.

Within an hour, you have reviewed all 9000 predictions of the model. To complete the review process, you only need to add a few elements that were missed by the model, mainly because they are unique or rare vessels without many examples for the model to learn from. You save the results of your reviewed P&ID sheets and export them as a .json file, and you excitedly start thinking of all the ways you can use this data to aggregate the information in the P&ID with other data sources such as asset inventory databases.
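One such aggregation might look like the sketch below, which reconciles reviewed components against an asset inventory by tag. The export schema shown here (`category`, `tag`, `sheet`, `status`) is invented for illustration, since the article does not describe the actual .json layout, and the inventory is a plain dictionary standing in for a real database.

```python
import json

# Hypothetical export: the real schema is not described in the article.
exported = json.loads("""
[
  {"category": "ball valve", "tag": "BV-1021", "sheet": 1, "status": "approved"},
  {"category": "centrifugal pump", "tag": "P-204", "sheet": 1, "status": "approved"},
  {"category": "pipe reducer", "tag": null, "sheet": 2, "status": "approved"}
]
""")

# Hypothetical asset inventory keyed by equipment tag.
inventory = {
    "BV-1021": {"manufacturer": "Acme", "installed": 1987},
    "FT-310": {"manufacturer": "Acme", "installed": 1992},
}


def reconcile(components, inventory):
    """Join tagged components with inventory records and report mismatches."""
    # Components without a unique identifier (e.g. pipe reducers) cannot be joined.
    tagged = {c["tag"]: c for c in components if c.get("tag")}
    matched = {
        tag: {**tagged[tag], **inventory[tag]}
        for tag in tagged.keys() & inventory.keys()
    }
    only_on_drawing = sorted(tagged.keys() - inventory.keys())
    only_in_inventory = sorted(inventory.keys() - tagged.keys())
    return matched, only_on_drawing, only_in_inventory


matched, only_on_drawing, only_in_inventory = reconcile(exported, inventory)
print("matched:", sorted(matched))
print("on drawing only:", only_on_drawing)
print("in inventory only:", only_in_inventory)
```

The two mismatch lists are often the more interesting output: a tag that appears only on the drawing or only in the inventory points to either a labeling error or a gap in the records.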

When you leave the office that night, you are well on your way to creating your first nuclear plant digital twin, and you can crack open a can of beer and relax, knowing a daunting task has become much more manageable thanks to artificial intelligence. That night, the ML model won’t get any sleep; it will be busy learning even more by looking at the reviews you have done and incorporating them into its ever-growing knowledge base.
