What is a distribution and why do I care?
(Hold on tight, this is just some of the groundwork and housekeeping we have to do before we get to the fun stuff…)
A distribution is just a count of how often each thing happened or exists in a specific context. An example of a sports distribution would be the number of points scored by the Golden State Warriors in every regular season game for 2015–2016:
I could list all 82 games, but that would be a lot of information to quickly internalize. It’s not very effective or efficient. So we turn to the magic of data visualization, which will help us understand the data more quickly. Let’s start with the first five games:
From this simple plot, we can identify several things as factual:
- GSW scored 111 points; 112 points (twice); 119 points; and 134 points in their first five games
- GSW scored under 120 points (or 125 or 130) more often than over 120 points
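If you want to check that tally yourself, here is a minimal Python sketch that counts the five scores listed above and prints a rough text version of the plot. The variable names are my own, and the order of the games within the list is an assumption (it does not affect the counts).

```python
from collections import Counter

# Points from the first five games, as listed above.
# The order of the games is assumed; only the values matter here.
first_five = [111, 112, 112, 119, 134]

# Count how many times each score occurred.
counts = Counter(first_five)

# Print one "#" per game at each score, a crude text histogram.
for points in sorted(counts):
    print(f"{points}: {'#' * counts[points]}")

# Compare how many games fell under and over 120 points.
under_120 = sum(score < 120 for score in first_five)
over_120 = sum(score > 120 for score in first_five)
print(f"under 120: {under_120} games, over 120: {over_120} game(s)")
```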
Here’s what we can deduce from this information:
- Next game, GSW will probably score somewhere between 110 and 135
- Though it is more probable they will score between 110–120 than 120–135
- There is a better (though not necessarily good) chance (or likelihood) that they will score 112 again, compared to 111, 119, or 200 points — simply because it has happened more times historically
But that’s not all. We can also infer:
- There’s still a possibility that GSW can score fewer than 110 points or more than 134 points, including zero points and 200 points. Both are very unlikely, but definitely not impossible. In 1990, GSW scored 162 points in a game!
In their next game of the season, they scored 119 points. Our data and beliefs held up. \o/
However, in their 7th game, they only scored 103 points. So while our predictions are reasonable, they’re not always reliable. This is a concept we will return to time and again in this series.
But it’s not the end of the world. We don’t have to throw away our original beliefs. We merely have to adjust our thinking a bit: it’s now more likely in our eyes that GSW will score fewer than 110 points, because we’ve observed it. Here’s what that new distribution looks like:
This is the most basic use case of a distribution. The more data we have, the better our guesses get. We can continuously adapt our inferences and deductions to better understand how likely something is, anywhere from very likely to very unlikely. The more information we have, the more confident we can be in our predictions.
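The updating described above can be sketched in a few lines of Python. This is only an illustration using the seven scores mentioned so far; the helper names `relative_frequencies` and `share_below` are hypothetical, not from any library. Note that a share of zero only means "not observed yet", not "impossible".

```python
from collections import Counter

def relative_frequencies(scores):
    """Share of games at each score: count divided by number of games."""
    counts = Counter(scores)
    total = len(scores)
    return {points: counts[points] / total for points in sorted(counts)}

def share_below(scores, threshold):
    """Fraction of observed games with fewer points than the threshold."""
    return sum(score < threshold for score in scores) / len(scores)

# The first five games, then the 6th (119) and 7th (103) from the text.
first_five = [111, 112, 112, 119, 134]
first_seven = first_five + [119, 103]

# Before the 7th game we had never seen a score under 110.
print("share under 110 after 5 games:", share_below(first_five, 110))
# After it, our observed share is no longer zero.
print("share under 110 after 7 games:", share_below(first_seven, 110))
print(relative_frequencies(first_seven))
```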
Let’s look at the distribution of team points of the entire 2015–2016 season:
Now with all 82 games in a distribution, we see that most of the team points are centered around 112–114 points. We were able to fairly accurately predict how GSW scores over the season just by knowing the scores of 5 games.
But it’s not that accurate. If I had to bet real money on the outcome of their game, betting they’d score 112 is not a horrible bet. BUT that’s not a great bet either. It is probably the best gamble, but it is still not a good gamble.
And here’s a good place to talk about a normal distribution.
GSW’s average is actually ~115 (114.89) points per game. Knowing this, which is more likely in their next game: scoring 114 points (a one-point difference) or 140 points (a 25-point difference)?
Given our observations, 114 is a better bet.
What about GSW scoring 112 or 103 points? Of course, 112 is a better bet. If they score, on average, 115 points, it’s more likely they’ll score closer to 115 than 80 or 160 in their next game. Let’s update the distribution with another line:
Let the blue line running from left to right represent not just occurrence but also probability. The higher the line goes, the more probable it is — because it has actually occurred more often.
The probability is highest around 115 points and decreases as we move away from that center in either direction. GSW scoring around 140 seems very unlikely, but scoring around 110 is very likely. Scoring 50 seems less likely than either.
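The "closer to the average is the better bet" reasoning can be written as a tiny Python sketch. It assumes the curve has a single peak at the average and falls off symmetrically on both sides, which is a simplification; `rank_by_closeness` is a name I made up for illustration.

```python
# Season average quoted in the text.
SEASON_AVERAGE = 114.89

def rank_by_closeness(candidates, average=SEASON_AVERAGE):
    """Order candidate scores from most to least plausible.

    Assumption: plausibility depends only on distance from the average
    (a single-peaked, symmetric curve), so closer means more likely.
    """
    return sorted(candidates, key=lambda points: abs(points - average))

# The three scores compared in the paragraph above.
print(rank_by_closeness([140, 50, 110]))  # [110, 140, 50]
```

Under that assumption, 110 comes first, then 140, then 50, matching the ordering described above.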
If we played more and more games, this curve would smooth out and look like the shape below. We’d realize that more games end with scores close to the average, and that games with extreme scores are less likely to happen (though they do still happen). This natural phenomenon occurs a lot in… nature.
Here’s the infamous “bell curve”, showing what a normal distribution of people’s heights might look like:
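To get a feel for that smoothing, here is a rough simulation sketch in Python. It is not real game data: it assumes scores are drawn from a normal distribution centered near the season average, and the spread of 12 points is a hypothetical number I picked purely for illustration.

```python
import random
from collections import Counter

ASSUMED_MEAN = 115    # roughly the season average quoted in the text
ASSUMED_SPREAD = 12   # hypothetical standard deviation, NOT from the source

def simulate_scores(n_games, seed=0):
    """Draw n_games made-up scores from the assumed normal distribution."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [round(rng.gauss(ASSUMED_MEAN, ASSUMED_SPREAD)) for _ in range(n_games)]

def text_histogram(scores, bin_width=5):
    """Print a crude histogram, grouping scores into bins of bin_width points."""
    bins = Counter((score // bin_width) * bin_width for score in scores)
    scale = max(bins.values()) / 40  # the longest bar is 40 characters
    for start in sorted(bins):
        bar = "#" * max(1, round(bins[start] / scale))
        print(f"{start:>3}-{start + bin_width - 1:<3} {bar}")

# One season's worth of games looks lumpy; many more games look like a bell.
for n_games in (82, 10_000):
    print(f"{n_games} simulated games")
    text_histogram(simulate_scores(n_games))
```

With 82 simulated games the bars tend to be uneven, much like the real season plot. With 10,000 the outline should be visibly closer to a bell shape, with the tallest bars near the assumed average.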
Think about it like this:
- There are more people around average height than people who are much shorter or much taller. There are fewer and fewer people of increasingly taller and shorter stature.
- An ice cream shop sells, on average, 75 servings of ice cream in an hour. Most of the time, they sell around 70–80 in an hour. Some hours, they only sell 5 or 6; in others they sell 125, but both cases rarely happen.
- My average bowling score is 187. Sometimes I score 80 or less; sometimes I score over 220; but those games happen less often than games where I score around my average. Scores at either extreme are less likely than scores around my average, but they do happen.
Of course, the chart in reality never looks as perfectly rounded as the example above. But the notion that there is more of something near the average than at either extreme is a powerful idea. We can then take this powerful idea and use it as an approximation (we’ll get into this later). Once you have some data, and you believe that this approximation describes your data reasonably well, you can make pretty good guesses at what is to come in the future.
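As a taste of what using the approximation might look like, here is a hedged Python sketch that fits a normal curve to the seven scores mentioned earlier and asks how much probability it puts on a range of scores. Seven games is a tiny sample and the normality is assumed, not verified, so treat the numbers it prints as illustration rather than a forecast. `chance_between` is a helper name I made up.

```python
from statistics import NormalDist

# The seven scores quoted earlier in the text. This is a very small sample,
# so the fitted curve is only a rough illustration.
observed = [111, 112, 112, 119, 134, 119, 103]

# Assumption: the scores are roughly normally distributed.
# from_samples estimates the mean and standard deviation from the data.
fitted = NormalDist.from_samples(observed)

def chance_between(dist, low, high):
    """Probability the fitted curve assigns to scores between low and high."""
    return dist.cdf(high) - dist.cdf(low)

print(f"fitted mean: {fitted.mean:.1f}, fitted spread: {fitted.stdev:.1f}")
print(f"110 to 120: {chance_between(fitted, 110, 120):.2f}")
print(f"130 to 140: {chance_between(fitted, 130, 140):.2f}")
```

The range near the average gets noticeably more probability than the range far above it, which is the same intuition as before, just expressed with a curve instead of raw counts.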
Caveat: not everything is normally distributed (like above). There are other kinds of distributions, but for now let’s focus on the normal distribution.
But I’m not telling you anything that you don’t already instinctively know…
Here’s Salman Khan on the Central Limit Theorem if you want something a little more detailed: