Engineering Truisms
Three truisms that are actually mathematical facts and bear consideration:
- A chain is only as strong as its weakest link & size dependence in engineering design
- Keep things simple & Occam’s razor
- There Ain’t No Such Thing as a Free Lunch, with a more detailed discussion of optimization, search, bias, and variance (the longest part of the article)
A chain is only as strong as its weakest link
The chain is only as strong as its weakest link, for if that fails the chain fails and the object that it has been holding up falls to the ground
Thomas Reid (1788)
This truism is often quoted in terms of organizational dynamics (a team is only as strong as its weakest member) and evidence (your argument is only as strong as its weakest conjecture). While it has its roots in philosophy and literature, it also has a direct and almost trivial physical meaning. Trivial, that is, until you start considering engineering and geoscience.
In 1926, Pierce published a paper on tensile tests for cotton yarns in which he wrote that “it is a truism, of which the mathematical implications are of no little interest, that the strength of a chain is that of its weakest link”. But Pierce didn’t do numerical modelling, nor did he work in rock mechanics or mechanical engineering. Engineers go to great lengths to reduce stress concentrations and material heterogeneity. A sharp corner can rip apart a steel beam. A manufacturing defect in a metal part can cause an airplane to fail. Engineering practitioners know that heterogeneity hurts.
I’m a geomechanical engineer, so I’ll illustrate this using the failure of rock in laboratory tests. If you core a rock specimen and test it in a load frame, failure will always initiate at flaws in the rock’s microstructure. That alone is interesting, but it has implications for the strength of rocks when we scale these tests to the subsurface (say, in designing a tunnel or a wellbore). As you increase the size of the core specimen (say from 25 mm to 75 mm), the average rock strength decreases (Brace, 1961). Add a couple of discontinuities and you decrease the strength even more (Hoek, 1983). Put in different materials and it gets even worse (Tang, 1997).
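This size effect can be sketched with a toy weakest-link simulation (the distribution, parameters, and names below are illustrative, not taken from any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(42)

# Each "link" (or grain-scale flaw) gets a random strength drawn from a
# Weibull distribution, a standard choice in weakest-link theory.
# A chain (or specimen) of n links fails at its weakest link, so its
# strength is the minimum over the n draws.
n_sims = 10_000
shape = 5.0  # Weibull modulus: lower values mean more scatter in flaw strengths

def mean_chain_strength(n_links: int) -> float:
    links = rng.weibull(shape, size=(n_sims, n_links))
    return links.min(axis=1).mean()

strengths = {n: mean_chain_strength(n) for n in (1, 10, 100)}
# Mean strength drops steadily as the specimen gets larger: the size effect.
```

The bigger the specimen, the more flaws it samples, and the weaker its weakest flaw tends to be.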
It is a truism, of which the mathematical implications are of no little interest, that the strength of a chain is that of its weakest link
Pierce (1926)
When it comes to stochastic processes and geomechanics, I think this truism is worth careful consideration. Rare occurrences (flaws) actually drive the large-scale behavior of a yarn or a rock. The tails of a distribution can control the behavior of an entire system when it’s put under enough stress. This is especially true when dealing with failure, but we see the same phenomenon in avalanches, market collapses, and any chaotic system where the boundary conditions make a big difference. So I agree with Pierce: the mathematical implications are of no little interest. Here are a couple of suggestions for applying this in practice:
- Pay attention to your tails. In statistics we can measure the tendency of distributions to display heavy tails using kurtosis. You should probably calculate it.
- Recognize chaotic systems and pay careful attention to boundaries. As systems approach failure, heterogeneity and boundary conditions dominate. Recognize the risks from these two driving factors to failure.
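The kurtosis check above is a one-liner with SciPy. A minimal sketch, assuming NumPy and SciPy are installed (the two distributions are just illustrative):

```python
import numpy as np
from scipy.stats import kurtosis  # Fisher (excess) kurtosis by default

rng = np.random.default_rng(0)

# Excess kurtosis is ~0 for a normal distribution; heavy-tailed
# distributions score higher (the Laplace distribution sits at 3).
normal_sample = rng.normal(size=100_000)
heavy_sample = rng.laplace(size=100_000)

k_normal = kurtosis(normal_sample)  # close to 0
k_heavy = kurtosis(heavy_sample)    # close to 3
```

A large positive excess kurtosis is a hint that the tails, not the bulk, may control your system near failure.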
Keep things simple!
Entities should not be multiplied unnecessarily
William of Ockham (maybe… ~1300)
This truism has many forms, from the Keep It Simple Stupid (KISS) principle, to Einstein’s possibly misquoted phrase “Everything should be made as simple as possible, but not simpler”. As humans we tend to overcomplicate things and see patterns where none may really exist. So we employ this truism to force our minds back into a simpler explanation or simpler course of action. This is often harder than it seems (at least for me) and truisms like this always bear repeating in my mind when trying to solve a problem.
Imagine my surprise when I’m studying statistics and discover that not only has this been codified in analytical circles, but that there is mathematical proof for employing Occam’s razor. The absolute best reference I can recommend for this is MacKay’s Information Theory, Inference, and Learning Algorithms. I could probably spend the next decade studying this book and still wouldn’t fully get it. But in Chapter 28, he goes over model comparison and provides a very tidy explanation for Occam’s Razor.
If several explanations are compatible with a set of observations, Occam’s razor advises us to buy the simplest. This principle is often advocated for one of two reasons: the first is aesthetic (‘A theory with mathematical beauty is more likely to be correct than an ugly one that fits some experimental data’ (Paul Dirac)); the second reason is the past empirical success of Occam’s razor. However there is a different justification for Occam’s razor, namely: Coherent inference (as embodied by Bayesian probability) automatically embodies Occam’s razor, quantitatively
David J.C. MacKay (2003)
It turns out that Occam’s Razor is naturally embodied in Bayesian Inference. I’d highly recommend John Kruschke’s Doing Bayesian Data Analysis for a nice introduction to this subject (with R / BUGS code). A discussion of Occam’s Razor in machine learning is also provided by Rasmussen and Ghahramani (2001) — two heavyweights in the machine learning and statistics field.
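To see the automatic Occam’s razor at work, compare a simple and a flexible model of a coin by their evidence (marginal likelihood). The arithmetic below is exact, but the scenario is my own illustration, not code from MacKay’s chapter:

```python
from math import comb

# Data: 5 heads in 10 flips.
n, h = 10, 5

# Simple model M0: the coin is fair (no free parameters).
# Evidence = binomial likelihood at p = 0.5.
ev_simple = comb(n, h) * 0.5**n  # 252/1024, about 0.246

# Flexible model M1: unknown bias p with a uniform prior on [0, 1].
# Evidence = integral of C(n,h) p^h (1-p)^(n-h) dp
#          = C(n,h) * B(h+1, n-h+1) = 1 / (n + 1).
ev_flexible = 1 / (n + 1)  # 1/11, about 0.091

# The flexible model spreads its predictive probability over many
# datasets it never sees, so the simple model wins on unsurprising data.
```

No complexity penalty was added by hand; the flexible model pays for its extra parameter automatically through the prior integral.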
So how does this affect you as a decision maker, data scientist, or engineer? Well — keep things as simple as possible. Don’t use twenty-parameter models when four parameters will do. Constantly strive to eliminate complication from the decision-making process. When building models, aim for carefully informed parsimony (a preference for the simplest adequate explanation). This is much easier to write than it is to apply in practice, but here are a couple of suggestions:
- Always look for correlated variables. If two predictors have a decent correlation (say |r| > 0.6), consider dropping one of them, because it is often explained well enough by the other.
- Consider Bayesian inference. It’s a brave new world out there, and quantifying uncertainty and automatically imposing parsimony is well worth the entry fee.
- Count your predictive variables. Know how many parameters you are setting in a model. Count them, and compare them against the number of observations you have to train that model. If you have fewer than 10 observations per variable… tread lightly.
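The first and third suggestions can be wired into a quick screening script. This is a sketch with made-up data; the 0.6 cutoff is the rule of thumb above, not a universal constant:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy design matrix: x2 is (noisily) a copy of x1, x3 is independent.
n_obs = 200
x1 = rng.normal(size=n_obs)
x2 = x1 + 0.1 * rng.normal(size=n_obs)   # nearly redundant with x1
x3 = rng.normal(size=n_obs)
X = np.column_stack([x1, x2, x3])

# Flag pairs of predictors with |r| above the chosen threshold.
corr = np.corrcoef(X, rowvar=False)
threshold = 0.6
redundant = [
    (i, j)
    for i in range(corr.shape[1])
    for j in range(i + 1, corr.shape[1])
    if abs(corr[i, j]) > threshold
]

# Rough parsimony check: observations per predictor.
obs_per_var = n_obs / X.shape[1]
```

Here only the (x1, x2) pair gets flagged, and with roughly 67 observations per predictor the 10-to-1 rule of thumb is comfortably satisfied.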
There Ain’t No Such Thing as a Free Lunch
I … came upon a bar-room full of bad Salon pictures, in which men with hats on the backs of their heads were wolfing food from a counter. It was the institution of the “free lunch” I had struck. You paid for a drink and got as much as you wanted to eat. For something less than a rupee a day a man can feed himself sumptuously in San Francisco, even though he be a bankrupt. Remember this if ever you are stranded in these parts.
Rudyard Kipling
This truism reminds us that nothing is free in life or statistics, and to be wary of things that seem too good to be true. There is always a cost to everything, whether it’s the bias-variance tradeoff in data science, a craving for beer after salty snacks, or the opportunity cost of attending that free presentation.
This truism has two important corollaries for data science. First, there is always a bias-variance tradeoff. Second, there is no perfect optimization or search algorithm. Each of these topics is worth a separate article, but they are worth an overview here after defining bias and variance, search and optimization, and cost functions.
The bias-variance tradeoff is incredibly important in statistics and machine learning (statistics at scale). I could dive into a detailed explanation, but David Dalpiaz has written an entire chapter, with code snippets and simulations, that goes over this. Check out his Bias Variance Tradeoff with Code chapter from his Statistical Learning book, which is currently under development in the wild. To summarize, bias is the inaccuracy of your prediction: the net error of a model, or the expected value of an estimator minus the actual population value:

Bias(f̂(x)) = E[f̂(x)] − f(x)
The variance is the amount of spread in that prediction over repeated simulations, conditioned on a single observation point:

Var(f̂(x)) = E[(f̂(x) − E[f̂(x)])²]
You can see the bias-variance trade-off as you move from a simple to a more complex model in Dalpiaz’s figure below: the bias is shown in blue, the variance in orange, and the error in black.
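The trade-off is easy to reproduce in a few lines. The sketch below (my own illustrative data and model choices, not Dalpiaz’s code) repeatedly refits a simple and a flexible polynomial to noisy draws from the same truth, then estimates the bias and variance of each at a single point:

```python
import numpy as np

rng = np.random.default_rng(7)

def true_f(x):
    return x**2

x = np.linspace(0, 1, 20)
x0 = 0.9          # single point to condition on
n_sims = 500
sigma = 0.3       # noise level

preds = {1: [], 5: []}   # polynomial degree -> predictions at x0
for _ in range(n_sims):
    y = true_f(x) + sigma * rng.normal(size=x.size)
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coeffs, x0))

# bias = E[f_hat(x0)] - f(x0); variance = spread of f_hat(x0) over sims
bias = {d: np.mean(p) - true_f(x0) for d, p in preds.items()}
variance = {d: np.var(p) for d, p in preds.items()}
# The simple model (degree 1) is biased but stable; the flexible
# model (degree 5) is nearly unbiased but has a larger variance.
```

Sliding the degree from 1 to 5 trades bias for variance; the noise budget has to go somewhere.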
Search and optimization are arguably the most important functions in both engineering analysis and machine learning. The distinction is subtle: we search for feasibility (to find values that work) and optimize for optimality (to find the best values under a given set of constraints). Optimization methods are at the heart of most numerical analyses, whether it be geophysical inversion, solving partial differential equations, or picking the best pizzas for an office lunch. There is consequently a wealth of knowledge on different optimization methods, such as gradient descent or Nelder-Mead.
The point of every optimization method is to find the best global values of a particular function, and to do that we need to define a cost function: a numerical metric that evaluates the performance of a model. By global I mean that we don’t want to get stuck in a local optimum; we want to find the ‘bottom of the real valley’, as it were. We use cost functions every day without thinking of them in terms of optimization. In regression, common choices include the correlation coefficient, the mean absolute error, and the root mean square error.
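The choice of cost function is not cosmetic; it changes which model is ‘best’. A minimal sketch with made-up data: fitting a single constant to the same observations, squared error is minimized by the mean while absolute error is minimized by the median, and one outlier pulls the two answers far apart:

```python
import numpy as np

# One outlier (250) among otherwise similar observations.
y = np.array([10.0, 11.0, 9.0, 10.5, 250.0])

candidates = np.linspace(0, 260, 2601)  # constant models to try (step 0.1)

# Cost function 1: mean squared error -> minimized by the mean.
mse = [np.mean((y - c) ** 2) for c in candidates]
best_mse = candidates[np.argmin(mse)]

# Cost function 2: mean absolute error -> minimized by the median.
mae = [np.mean(np.abs(y - c)) for c in candidates]
best_mae = candidates[np.argmin(mae)]
```

Same data, same family of models, two different ‘optimal’ answers: the cost function is part of the problem definition.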
Back to the ‘there ain’t no such thing as a free lunch’ truism. David Wolpert and his colleagues did a nice job of mathematically fleshing out the no free lunch theorem for optimization and machine learning. They showed that the performance of your optimization or search algorithm depends directly on the cost function you choose (Wolpert, 1996; Wolpert & Macready, 1997). You might ask why this is important: engineers and data scientists spend a lot of time optimizing and solving complex equations. The no free lunch theorem shows us that indeed, nothing is free. For one problem and cost function, random search might beat the most sophisticated gradient descent algorithm we can derive. For another, random search might be horrendous and thousands of times slower. So, perhaps to no one’s surprise, detailed knowledge of our problem is essential to solving it effectively.
In short, there are no ‘free lunches’ for effective optimization; any algorithm performs only as well as the knowledge concerning the cost function put into the cost algorithm. For this reason (and to emphasize the parallel with similar supervised learning results), we have dubbed our central result a ‘no free lunch’ theorem
Wolpert and Macready (1996)
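A deliberately contrived sketch makes the point concrete. Below, the same two search strategies are ranked on two different cost functions over the integers 0 to 99, and the ranking flips; this is my own toy example, not code from Wolpert and Macready’s papers:

```python
def hill_climb(f, start=50, max_steps=100):
    """Greedy local search: move to a better neighbor until stuck."""
    x = start
    for _ in range(max_steps):
        neighbors = [n for n in (x - 1, x + 1) if 0 <= n <= 99]
        best = min(neighbors, key=f)
        if f(best) >= f(x):
            return f(x)  # local minimum reached
        x = best
    return f(x)

def scan_search(f, budget=30):
    """Naive scan: evaluate the first `budget` points in order."""
    return min(f(x) for x in range(budget))

# Smooth cost function: one basin, so greedy descent walks right in.
smooth = lambda x: abs(x - 70)

# Rugged cost function: local minima everywhere plus one hidden needle.
rugged = lambda x: 0 if x == 7 else 10 + (x % 5)

hill_smooth = hill_climb(smooth)   # reaches the global minimum: 0
scan_smooth = scan_search(smooth)  # never gets near x = 70: best is 41

hill_rugged = hill_climb(rugged)   # stuck at once: x = 50 is a local min (10)
scan_rugged = scan_search(rugged)  # stumbles onto the needle at x = 7: 0
```

Hill climbing wins on the smooth landscape and loses on the rugged one; averaged over both, neither strategy is ‘better’, which is the no free lunch theorem in miniature.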
References
Brace, W. F. (1961, January). Dependence of fracture strength of rocks on grain size. In The 4th US Symposium on Rock Mechanics (USRMS). American Rock Mechanics Association.
Hoek, E. (1983). Strength of jointed rock masses. Geotechnique, 33(3), 187–223.
Kipling, R. (1899) American Notes. Henry Altemus, Philadelphia.
MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge University Press.
Pierce, F. T. (1926). Tensile tests for cotton yarns, V. The weakest link, theorems on the strength of long and composite specimens. J. Tex. Inst., 17, 355–368.
Rasmussen, C. E., & Ghahramani, Z. (2001). Occam’s razor. In Advances in Neural Information Processing Systems (pp. 294–300).
Reid, T. (1788) Essays on the Active Powers of Man. J. Bartlett.
Tang, C. (1997). Numerical simulation of progressive rock failure and associated seismicity. International Journal of Rock Mechanics and Mining Sciences, 34(2), 249–261.
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1), 67–82.