Data Driven Statistical Models vs Process Driven Physical Models
In data science, it is becoming increasingly common to employ data driven models where process based physical models may not fully describe the processes in operational situations. In some cases, a hybrid of physical and statistical models may be required to solve certain problems.
Models are useful tools to understand the behaviours and processes in the real world, and to make inferences about the future. In science, there are essentially two modelling approaches: 1) data driven models; and 2) process based models.
Data Driven Models
Data driven models build relationships between input and output data using statistical or machine learning techniques, without worrying too much about the underlying processes. Process driven models, on the other hand, are based on well established mathematical or physical laws.
A linear regression model is an example of a data driven model: it builds a relationship between a dependent variable and a set of independent variables. This regression model can then be used to understand the relationship between the variables, and in some cases to make predictions.
Machine translation systems are another example of a data driven approach. Google Translate, for instance, does not encode explicit knowledge of any language. It is built using deep neural networks based on statistics and probabilities rather than on the grammar rules of the language.
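As a minimal sketch of the idea, the snippet below fits a straight line to a handful of points by ordinary least squares, using only the Python standard library. The data values and the function names `fit_line` and `predict` are made up for illustration. A real analysis would normally use noisy data and a statistics library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(x, slope, intercept):
    """Use the fitted relationship to predict y for a new x."""
    return slope * x + intercept


# Hypothetical observations that follow y = 2x + 1 exactly (no noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(5.0, slope, intercept))
```

Note that the model recovers the slope and intercept from the data alone. Nothing in the code describes why the variables are related.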
Process Driven Models
Physical models are driven by certain processes. These processes can usually be described by a set of mathematical equations. For example, the Navier-Stokes (N-S) equations govern the motion of fluids and can be seen as Newton’s second law of motion for fluids.
Simplified versions of the N-S equations may have analytical solutions, but the full equations are usually solved using numerical methods so that they can be applied to practical problems such as weather forecasting.
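The N-S equations are too involved for a short example, so the sketch below uses a much simpler stand-in: Newton's second law for a falling body with linear drag, dv/dt = g - k*v. This equation has a known analytical solution, which makes it possible to compare it with a numerical solution obtained by the forward Euler method. The drag coefficient, time step and function names are assumptions chosen for illustration.

```python
import math


def euler_velocity(g, k, dt, steps):
    """Integrate dv/dt = g - k*v from v(0) = 0 with the forward Euler method."""
    v = 0.0
    for _ in range(steps):
        v += dt * (g - k * v)
    return v


def exact_velocity(g, k, t):
    """Analytical solution of dv/dt = g - k*v with v(0) = 0."""
    return (g / k) * (1.0 - math.exp(-k * t))


g = 9.81   # gravitational acceleration in m/s^2
k = 0.5    # assumed linear drag coefficient in 1/s
dt = 0.001
steps = 2000          # integrates up to t = steps * dt = 2 s

v_numerical = euler_velocity(g, k, dt, steps)
v_analytical = exact_velocity(g, k, steps * dt)
print(v_numerical, v_analytical)
```

The two values agree closely for this small time step. No observations are involved at all: the result follows from the governing equation and the initial condition.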
Pros and Cons
Data driven models have the advantage of built-in error terms. A large amount of data is used to estimate the parameters that fit the model to the input and output data, so errors can be quantified and confidence levels estimated. Process driven models, on the other hand, are built on the underlying physics. Here, errors may be introduced through uncertain initial or boundary conditions, and real-world observations are used to evaluate the model.
Can we combine?
The availability of cheap computing power and enormous amounts of data has enabled us to combine data driven and process driven models to address real-world problems. Although this is already happening in several disciplines (e.g., inversion modelling and environmental forecasting), it is not exploited enough in other sectors.
For example, a complex process may be physically modelled, but solving this model numerically may pose practical challenges. There may be costs or delays in obtaining initial or boundary conditions. Computing power may be cheap, but the time to solution may still be long. In situations like these, data driven models can fill the gap left by the process driven models.
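One possible way to combine the two approaches, shown here only as a hedged sketch, is to let a process driven model provide a baseline prediction and fit a data driven correction to the residuals between observations and that baseline. The function `physical_model`, the observations and the linear form of the correction are all hypothetical. Other hybrid schemes, such as statistical surrogates for expensive simulations, are equally valid.

```python
from statistics import mean


def physical_model(x):
    """Hypothetical process based prediction, standing in for an idealised physical law."""
    return 2.0 * x


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def make_hybrid(xs, observations):
    """Return a predictor: physical baseline plus a fitted linear residual correction."""
    residuals = [obs - physical_model(x) for x, obs in zip(xs, observations)]
    slope, intercept = fit_line(xs, residuals)

    def hybrid(x):
        return physical_model(x) + slope * x + intercept

    return hybrid


# Made-up observations that follow 2.5x + 1, so the physical model alone is biased.
xs = [0.0, 1.0, 2.0, 3.0]
observations = [1.0, 3.5, 6.0, 8.5]

hybrid = make_hybrid(xs, observations)
print(physical_model(4.0), hybrid(4.0))
```

In this toy case the physical model captures most of the behaviour, and the statistical part learns only what the physics leaves unexplained.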
Asset maintenance, renewables forecasting and international cargo transportation are areas where there is huge potential for this hybrid modelling approach.