Darwin, Galton, correlation, and regression

In this blog post, I trace the historical roots of correlation and regression back to Darwin and Galton. I also use a subset of Galton’s original data on the stature of almost 1,000 adult sons and daughters to explain and visualize the modern-day equivalents of Galton’s pre-modern forms of ‘correlation’ and ‘regression’ analysis.

Darwin

Charles Darwin put us on the right track regarding how species in the animal kingdom evolve and adapt over multiple generations. Natural selection was the key term. In contrast, he was at a total loss regarding how his favorite finches passed on their genes (in our modern-day parlance) from one generation to the next. Of course, he saw with his own eyes how the beak of a male finch was most often similar in size to the beak of its ‘father-finch,’ but he had no clue as to how this happened. Or rather, he had his very personal clue: his infamous theory of Pangenesis (1868), in which so-called ‘gemmules’ (i.e., a black box containing the information to construct the next generation) were passed on from one generation to the next by sexual or asexual transmission. Much can be said about this theory, but the most important thing is that it was utterly wrong, as Mendelian genetics would later show. In any event, Darwin had little to offer on the human condition. Also, the short-term inheritability from one generation to the next was not his interest; he was more into the long game of evolution over centuries.

Galton

Francis Galton, in contrast, who admired Darwin and was very influenced by his work (and who also was his half cousin!), was very into humans and how both physiological and psychological characteristics seemed to be passed on from parents to their offspring. (Actually, he started examining such inheritance for sweet peas around 1875, but that’s another story.)

Galton was also very much into measurement; he measured just about everything that came his way. On this, he was influenced by the undisputed champion of measurement of the human condition in the 19th century: Adolphe Quetelet. Quetelet, who trained as an astronomer, had in his math courses picked up the so-called normal or Gaussian distribution. Equipped with this, he went on to measure all thinkable (and unthinkable) kinds of human characteristics. As it happens, or rather happened, Quetelet had a big thing for the mean — as in the average — of the normal distribution, whereas Galton was much keener on the tails of this distribution. (But that’s also another story.)

Galton was also, and not surprisingly, a ‘numbers guy’ when it came to analyzing the measurements he took, and now we are closing in on why he became one of the most important forefathers of modern-day statistical analysis. One of his many data collection efforts (as we would call them today) involved registering the stature of almost 1,000 adult sons and daughters and the stature of these ‘children’s’ parents. These data are available here, and they were most probably collected sometime in the first half of the 1880s.

Galton, who obviously did not have access to a calculator or a statistics program, went on to plot the height of the ‘children’ on the y-axis (after first multiplying each female height by 1.08) and the height of the ‘mid-parent’ on the x-axis (i.e., the average of the father’s height and the mother’s height, the latter likewise multiplied by 1.08). In doing so, he constructed one of the first versions of what we nowadays call a scatterplot.
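To make the adjustment concrete, here is a minimal Python sketch of the rescaling, assuming Galton’s factor of 1.08 for female heights. The function names and the example heights are my own illustrations; they are not taken from Galton’s records.

```python
# Sketch of Galton's height adjustment. The heights below are made up
# for illustration; they are NOT values from Galton's data.

FEMALE_FACTOR = 1.08  # Galton's scaling factor for female heights


def adjusted_height(height, sex):
    """Return the height on the 'male-equivalent' scale (inches)."""
    return height * FEMALE_FACTOR if sex == "female" else height


def mid_parent(father, mother):
    """Average of the father's height and the rescaled mother's height."""
    return (father + adjusted_height(mother, "female")) / 2


# A hypothetical family: father 70 in, mother 64 in, daughter 64 in.
print(round(adjusted_height(64.0, "female"), 2))  # daughter, rescaled
print(round(mid_parent(70.0, 64.0), 2))           # mid-parent height
```

With these hypothetical numbers, the daughter’s 64 inches becomes 69.12 on the rescaled axis, and the mid-parent height is 69.56.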

Galton’s famous scatterplot

Today, given access to statistics programs, we would probably go about it in a simpler way than Galton did. Perhaps we would proceed as in Figure 1, where I have extracted only Galton’s ‘sons’ and their fathers from the original data (n = 481) and made the scatterplot. (The scatterplot for Galton’s ‘daughters’ looks very similar.)

Figure 1 Scatterplot of son’s height and father’s height. Galton’s original data (n = 481).

Galton is the pioneer of not only one but two fundamental techniques in our modern-day statistical analysis: correlation and regression. Let’s imagine what he thought the very first time he looked at his blackboard (?) version of Figure 1: “Well, it appears as if most of the sons are either in the upper right or in the lower left quadrant of the figure. What if I plot a straight vertical line for the average of the fathers’ height (like Quetelet would have done?) and a horizontal straight line for the average of the sons’ height?” Figure 2 takes care of that.

Figure 2 Scatterplot of son’s height and father’s height, with mean values. Galton’s original data (n = 481).

Now, Galton might have continued to reflect: “Yes, the pattern looks clearer: Tall fathers most often give birth to sons (figuratively, not literally; my emphasis) who become tall; shorter fathers most often give birth to sons who become shorter. I think I am on the trail of a statistical association.”
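The quadrant reasoning can be sketched in a few lines of Python. The father–son pairs below are made up for illustration (they are not Galton’s data), and `quadrant_counts` is a hypothetical helper name of mine. The idea is simply to count how many points fall in the ‘agreeing’ quadrants (both above or both below their respective means) versus the ‘disagreeing’ ones.

```python
from statistics import mean

# Made-up (father, son) heights in inches, NOT Galton's data.
pairs = [(65.0, 66.5), (66.0, 65.0), (67.5, 68.0), (68.0, 70.0),
         (69.0, 70.5), (70.0, 67.5), (71.5, 70.0), (72.0, 72.5)]


def quadrant_counts(pairs):
    """Count points in the agreeing vs. disagreeing quadrants.

    A point 'agrees' when father and son are both above or both below
    their respective means (upper right or lower left quadrant).
    """
    mean_x = mean(x for x, _ in pairs)
    mean_y = mean(y for _, y in pairs)
    agree = disagree = 0
    for x, y in pairs:
        product = (x - mean_x) * (y - mean_y)
        if product > 0:
            agree += 1
        elif product < 0:
            disagree += 1
    return agree, disagree


agree, disagree = quadrant_counts(pairs)
print(f"agreeing quadrants: {agree}, disagreeing quadrants: {disagree}")
```

When the agreeing quadrants clearly dominate, the association is positive. The products of deviations computed inside the loop are the same building blocks that later ended up in the formula for the correlation coefficient.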

Correlation and regression

Galton had, as mentioned, touched upon the concept of correlation before he came to ‘analyze’ the dataset we are looking at now. But this was the first time in history that a very-early-beta-version of regression analysis of data on humans saw the light of day. The paper was published in 1886. In terms of regression, Galton determined by eye (!) where to place the typical trendline (later to be called the regression line) expressing the general association between the x-variable (father’s height) and the y-variable (son’s height). With a statistics program, it is child’s play today to superimpose this trendline or regression line on the plot, as in Figure 3.

Figure 3 Scatterplot of son’s height and father’s height, with mean values and regression line. Galton’s original data (n = 481).

The slope of the regression line is 0.45 according to my statistics program. That is, a one-unit difference in fathers’ height goes together with an average difference of 0.45 units in sons’ height. Galton had only a rudimentary grasp of the size of this slope, since he lacked the mathematical skills to calculate it formally. In any event, he understood that a positive slope, as in an upward-sloping line, indicated some degree of inheritability. (BTW, he called the slope r, which later became the symbol for the correlation coefficient.)
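For readers who wonder what the statistics program does behind the scenes, here is a minimal sketch of the ordinary least squares calculation for one x-variable. The five father–son pairs are made up for illustration (they are not Galton’s data, so the slope will not be 0.45), and `least_squares_line` is my own helper name.

```python
from statistics import mean

# Made-up heights in inches, NOT Galton's data.
fathers = [64.0, 66.0, 68.0, 70.0, 72.0]
sons = [66.0, 68.0, 67.0, 70.0, 69.0]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of products of deviations from the means (numerator)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x from its mean (denominator)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(fathers, sons)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

For this toy data the slope is 0.40. Note that the regression line passes through the crossing point of the two mean lines, which is why those lines in Figure 2 are such a natural stepping stone to Figure 3.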

Formalization and regression to the mean

Based on his ‘scatterplot’ and his ‘regression’ line, Galton paved the way for modern-day attempts to calculate the strength of an association between two continuous variables: x and y. Karl Pearson later came up with the formula for the correlation coefficient (which is 0.39 for the data in Figure 1) and formalized the mathematical calculation of the regression slope. Following Pearson, Udny Yule, who was Pearson’s student and assistant, generalized the bivariate regression model into multiple regression.

Finally, it should be mentioned that Galton also saved the day by solving a problem in Darwin’s theory of evolution. The problem was this: Why doesn’t the variation (what we today would call the variance or the standard deviation) in a species increase from generation to generation? The answer, Galton surmised, was regression to the mean. That is, very tall parents tend to give birth to children who become shorter than themselves, and very short parents tend to give birth to children who become taller than themselves. In this way, the outliers of one generation (to use another modern statistical term) give birth to children who are closer to the average height of the next generation.
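Both ideas, Pearson’s correlation coefficient and regression to the mean, can be sketched in a few lines of Python. The data are made up for illustration (not Galton’s, so r will not be 0.39), and the function names are my own. The sketch uses the textbook relationship that the regression slope equals r multiplied by the ratio of the standard deviations of y and x.

```python
from math import sqrt
from statistics import mean

# Made-up heights in inches, NOT Galton's data.
fathers = [64.0, 66.0, 68.0, 70.0, 72.0]
sons = [66.0, 68.0, 67.0, 70.0, 69.0]


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def pearson_r(xs, ys):
    """Pearson's correlation coefficient."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sqrt(sum_sq(xs) * sum_sq(ys))


def predict_son(father, xs, ys):
    """Predicted son's height, using slope = r * (sd_y / sd_x)."""
    slope = pearson_r(xs, ys) * sqrt(sum_sq(ys) / sum_sq(xs))
    return mean(ys) + slope * (father - mean(xs))


r = pearson_r(fathers, sons)
tall_father = 72.0
predicted = predict_son(tall_father, fathers, sons)
print(f"r = {r:.2f}")
print(f"father is {tall_father - mean(fathers):.1f} in above the fathers' mean")
print(f"predicted son is {predicted - mean(sons):.1f} in above the sons' mean")
```

In this toy example, the father is 4 inches above the fathers’ mean, whereas his predicted son is only 1.6 inches above the sons’ mean. Measured in standard deviations, the predicted deviation of the son is r times the deviation of the father, so whenever r is below 1 in absolute value, the prediction lies closer to the mean. That is regression to the mean in a nutshell.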

Takeaways

Darwin laid the theoretical foundation for evolution in the animal kingdom and sparked Galton’s later interest in human inheritance. Galton, in turn, laid the groundwork for present-day correlation and regression analysis through systematic measurement, data collection, and the development of pre-modern versions of these statistical techniques. Some of the background for this blog post is here.

Please find my recent blog posts on data analysis and data visualization in the field of sports here and here.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you can find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
