Synthesizing Data for Supervised Learning of Physical Systems

Stephen Farrington
3 min read · Aug 16, 2022


Preparing chemical analytical standards. Source: Shutterstock

Forbes, The Wall Street Journal, and others have all reported that synthetic data will transform and dominate the field of artificial intelligence. While synthesizing AI training data is becoming fairly standard in applications such as fraud protection, autonomous driving, and image recognition, doing so where the data must accurately reflect attributes of a physically determined system is still an emerging practice. When the developed AI will be using sensor data as input, there’s even more complexity, since the training set must also accurately reflect the imperfect and slightly fickle personalities of the sensors involved.

In systems like these… data synthesis is a necessity for developing useful AI.

In such systems, collecting enough real-world measurement data to coax robust, accurate performance out of data-hungry machine learning algorithms is often impractical or unaffordable. In one recent project, for example, Transcend Engineering started from a base set of real laboratory instrument responses to 63 carefully prepared laboratory samples, from which we synthesized 60,000 HPLC chromatograms representing exhaustive combinations of compounds and concentrations, while faithfully representing other sources of variation in instrument response, too. Synthesizing these data saved $10M in laboratory analytical costs, not including sample preparation.
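The general idea can be sketched in a few lines. This is not Transcend's actual pipeline; the peak shapes, compound parameters, and noise terms below are all hypothetical stand-ins, but they illustrate how a small set of measured basis responses can be expanded into an exhaustive grid of synthetic chromatograms:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
t = np.linspace(0, 10, 500)  # retention-time axis (minutes)

def peak(t, center, width, height):
    """Gaussian elution peak: a simple stand-in for a real peak shape."""
    return height * np.exp(-0.5 * ((t - center) / width) ** 2)

# Basis responses for three hypothetical compounds:
# (retention time, peak width, height per unit concentration)
basis = [(2.1, 0.15, 1.0), (4.8, 0.20, 0.8), (7.3, 0.25, 1.2)]

def synthesize_chromatogram(concentrations):
    """Sum concentration-scaled basis peaks, then add drift and detector noise."""
    signal = sum(peak(t, c, w, h * conc)
                 for (c, w, h), conc in zip(basis, concentrations))
    drift = 0.05 * np.sin(2 * np.pi * t / 10)   # slow baseline wander
    noise = rng.normal(0, 0.01, t.size)         # detector noise
    return signal + drift + noise

# Exhaustive grid over concentration levels for each compound
levels = [0.0, 0.5, 1.0, 2.0]
dataset = [synthesize_chromatogram((a, b, c))
           for a in levels for b in levels for c in levels]
print(len(dataset))  # 64 chromatograms from a 4 x 4 x 4 grid
```

Scaling the same pattern to more compounds, more concentration levels, and richer instrument-variation models is what takes a few dozen real measurements to tens of thousands of synthetic ones.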

For another project, using process-based models validated against real-world data, we simulated tens of thousands of scenarios that represented myriad soil types, numerous irrigation schedules, and seasonally and diurnally variable root water uptake rates. It’s hard to fathom the size, complexity, and cost of a field campaign that would generate the range and diversity of data we produced in this synthesized ensemble. The finishing touch will be simulating how existing sensors, with known response characteristics, report the soil moisture data.

In systems like these, particularly those involving soils, hydrology, or chemistry, data synthesis is a necessity for developing useful AI. To summarize, data synthesis for physical data science must include:

  1. Use of validated physics-based models (or empirically obtained basis functions) to represent the fundamental physical processes involved,
  2. Probability distributions of model inputs that are faithful to distributions observed in the real world,
  3. Deterministic and probabilistic additions of noise and uncertainty where chaotic processes, heterogeneous environments, or sensors are involved.
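The three ingredients above can be sketched together in a toy example. Everything here is illustrative: the exponential drying curve stands in for a validated process model, and the parameter distributions and sensor error terms are hypothetical, not values from any real field campaign:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# 1. A physics-based model (a toy exponential soil-drying curve,
#    standing in for a real validated process model).
def physics_model(theta_sat, decay_rate, t):
    return theta_sat * np.exp(-decay_rate * t)

# 2. Input distributions faithful to real-world observations
#    (hypothetical parameters for illustration).
theta_sat = rng.normal(0.42, 0.03, size=1000)     # saturated water content
decay_rate = rng.lognormal(-2.0, 0.4, size=1000)  # drying rate, right-skewed

# 3. Noise and uncertainty injected where sensors are involved.
t = np.linspace(0, 30, 60)  # days

def sensor_reading(true_signal):
    bias = rng.normal(0, 0.005)                   # per-sensor offset
    noise = rng.normal(0, 0.01, true_signal.size) # reading-to-reading jitter
    return true_signal + bias + noise

ensemble = np.stack([sensor_reading(physics_model(ts, dr, t))
                     for ts, dr in zip(theta_sat, decay_rate)])
print(ensemble.shape)  # (1000, 60): 1000 scenarios, 60 time steps each
```

The resulting ensemble carries both the physics and the messiness a trained model will face in deployment, which is the whole point of the recipe.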

For best quality, you should also design your objective functions so that the accuracy of your trained ML model is greatest where it matters most to the decision it will support, and so that any remaining inaccuracy is concentrated where it matters least to that decision.
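One simple way to express this idea is a loss function that weights errors by their proximity to a decision threshold. The threshold value and weighting scheme below are illustrative assumptions, not a prescribed recipe:

```python
import numpy as np

def decision_weighted_mse(y_true, y_pred, threshold=0.30, sharpness=20.0):
    """Mean squared error weighted so that errors near a decision threshold
    (e.g. a hypothetical irrigation trigger level) count more than errors
    far from it."""
    weight = 1.0 + np.exp(-sharpness * np.abs(y_true - threshold))
    return np.mean(weight * (y_true - y_pred) ** 2)

y_true = np.array([0.10, 0.29, 0.31, 0.50])
y_pred = np.array([0.12, 0.27, 0.33, 0.52])

# All four predictions are off by the same 0.02, but the two samples
# near the 0.30 threshold dominate the loss.
print(decision_weighted_mse(y_true, y_pred))
```

Training against a loss like this steers the model's scarce capacity toward the operating region where the downstream decision actually hinges.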

My team began its transition into AI three years ago, leveraging prior experience with physics-based modeling, characterization of sensor performance, and quantification of noise and uncertainty to create vast synthetic data sets for advanced ML applied to physical systems. We’ve developed cloud-scalable data-processing, analysis, and synthesis pipelines that evaluate, characterize, and amplify real-world data. By respecting observed input distributions, known physics, and authentic representations of real-world messiness and imperfection, these tools produce physical realism and full-range variability in the data we use to train robust, advanced machine intelligence. These are the tools needed to drive physical data science.

From soil breathing to vehicle suspensions, and from chemical analysis to irrigation efficiency, our diverse physical applications all share this feature. It’s exciting stuff, and we enjoy doing it.


Stephen Farrington

Entrepreneur, engineer, scientist, technologist, deep generalist.