The world of statistics, mathematics and machine learning is chock full of inspirational tales and bright minds. Some innovators are well recognised, others less so. Some made breakthroughs working in academia, others whilst working in industry. This post is about William Sealy Gosset, someone you’ve probably never heard of, but whose key contribution has had a profound impact on frequentist statistics and statistical inference as we understand it today. Much like the majority of brilliant minds in natural sciences, he’s an unsung hero.
Anyone who knows me knows that when I fancy a pint, Guinness is one of my go-to choices. Guinness is technically a stout beer, but that’s not how I view it. To me, Guinness is, well, Guinness. It’s sort of like Dr. Pepper in that sense — quite unlike anything else. That’s why the recipe for Guinness is the company’s greatest asset. To this day, it’s such a tightly guarded secret that the exact specifications are only known by a handful of people.
Making a product as universally loved as Guinness is no small feat. The core product needs to be exceptional, but that’s only part of the equation. Consistency is key. A Guinness poured today in Ireland should taste no different than one poured tomorrow in Helsinki. Amazingly, Guinness has not only reached this level of consistency, it has sustained it for decades, if not hundreds of years.
In order to be consistent, you need to be able to measure the quality of your product, and make sure that quality doesn’t suffer when your business grows.
That’s where our titular hero comes in.
William Sealy Gosset, pictured above when he was 32, was born in Canterbury, England to Agnes Sealy Vidal and Colonel Frederic Gosset. Gosset attended Winchester College before enrolling at Oxford to study chemistry and mathematics. At the age of 23, he graduated from Oxford, and moved to Dublin to start work at Guinness — or Arthur Guinness & Son, as it was known back then.
Gosset’s primary focus at Guinness was on the agricultural part of the business. One of the areas he worked on was improving the yields of what is arguably the single most important ingredient in Guinness: barley.
Barley, like any other crop, comes in many varieties. Some varieties yield more than others: one variety might produce twice the output compared to another variety, even though they’ve both been cultivated on the same unit area of land and under similar conditions.
This leads to a knotty problem: how can we determine, to a reasonable degree of scientific certainty, if one variety of barley yields more than another?
Nowadays, what we would do is collect two representative samples, run a statistical hypothesis test, and call it a day. But back in the early 1900s, statistics wasn’t like it is today. Hypothesis testing did exist, but not in a principled fashion for very small samples. Cultivating many varieties of barley, each on large patches of land, to get samples large enough for the tests that existed back then was the only real solution.
For Gosset, this was unacceptable. Besides being time consuming, growing vast amounts of crop purely for hypothesis testing is incredibly wasteful. I can’t speak to the mind of the man, but I suspect the 1900s equivalent of “there must be a better way!” was uttered more than once.
Drawing on his knowledge of statistics, Gosset set about solving the problem. Specifically, he formalised the continuous probability distribution that arises when estimating the mean of a normally distributed population from a small sample, where the standard deviation must itself be estimated from the data. On the surface, this formalisation may not seem like a big deal, but those familiar with probability theory will recognise its significance: put briefly, given two samples, we can calculate how (un)likely the observed difference between their means would be if the underlying population means were in fact equal, even when the sample sizes are small.
Gosset was on to something.
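To make the idea concrete, here is a minimal Python sketch of the kind of comparison described above. It is an illustration rather than a reconstruction of Gosset’s own calculations: the yield figures are made up, the function names are my own, and I’ve assumed the classic pooled-variance (equal variance) form of the two-sample test. The p-value comes from numerically integrating the t-distribution’s density, so nothing beyond the standard library is needed.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=20000):
    """P(|T| >= |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def pooled_t_test(a, b):
    """Two-sample t-test assuming both populations share the same variance."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t = (mean(a) - mean(b)) / std_err
    return t, df, two_sided_p_value(t, df)


# Hypothetical yields (arbitrary units) from five small plots per variety.
variety_a = [4.1, 3.9, 4.4, 4.0, 4.3]
variety_b = [3.6, 3.8, 3.5, 3.9, 3.7]

t, df, p = pooled_t_test(variety_a, variety_b)
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")
```

A small p-value says that a gap this large between the sample means would be surprising if the two varieties truly yielded the same on average. In everyday work you would more likely reach for a library routine such as `scipy.stats.ttest_ind`, but the hand-rolled version shows how little machinery is involved.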
Being a true academic, Gosset wanted to share his formalisation with other scholars. Unfortunately, some years back, another Guinness employee had inadvertently disclosed trade secrets in a published paper, leaving the company disinclined to allow further publishing. Nevertheless, after Gosset pleaded his case, Guinness granted him a dispensation, on condition that he publish under a pseudonym.
In March 1908, William Sealy Gosset did just that. His paper, appropriately entitled The probable error of a mean, was published in Biometrika.
The distribution he formalised is what we now refer to as the t-distribution, and the pseudonym he used was Student.
Today, Student’s t-distribution and its associated hypothesis test, Student’s t-test, are ubiquitous in the field of frequentist statistics. They form the basis for frequentist A/B and multivariate tests, both for paired and unpaired samples. They are taught to, and recognised by, students all around the world. They’ve fundamentally changed how we assess the performance of website layouts, the effectiveness of medical procedures, and, of course, the yields of barley.
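The paired case mentioned above is worth a quick sketch, because it boils down to a one-sample t-test on the differences within each pair. The Python below is only an illustration under my own assumptions: the function name is hypothetical and the numbers are invented, standing in for two varieties grown side by side on the same five fields.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """t statistic and degrees of freedom for paired measurements."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # Work on the within-pair differences rather than the raw values.
    diffs = [y - x for x, y in zip(first, second)]
    n = len(diffs)
    std_err = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / std_err, n - 1


# Hypothetical yields: each position is the same field under two varieties.
field_variety_b = [3.6, 3.8, 3.5, 3.9, 3.7]
field_variety_a = [4.1, 3.9, 4.4, 4.0, 4.3]

t, df = paired_t_statistic(field_variety_b, field_variety_a)
print(f"t = {t:.3f}, df = {df}")
```

Pairing removes the field-to-field variation from the comparison, which is why the degrees of freedom here are the number of pairs minus one rather than the total number of observations minus two.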
In 1935, at the age of 59, Gosset moved to London to take charge of a new Guinness brewery. As he had done for decades, he continued to apply the scientific method to making the perfect pint. Tragically, only two years later, he died of a heart attack.
The person behind the pen name Student is rarely recognised in the annals of history. Gosset was one of those few individuals who, when faced with an unsolved problem, persist until they find a solution. That takes not only talent, knowledge, and perseverance, but also willingness to learn. That’s why William Sealy Gosset deserves to be mentioned as one of the true greats of statistics. History doesn’t disclose exactly why Gosset chose Student as his pen name, but I personally like to think he chose it to give us all a subtle reminder of what’s important.