Reinventing P-values (Part 1)

Why and how you could have invented p-values.

Farhan Ahmad
Analytics Vidhya
17 min read · Mar 29, 2020


P-values are widely used in published research, but few people, experts included, correctly understand what a p-value is. Experimental results are often misinterpreted because of this lack of understanding. Worse, overreliance on p-values has been exploited to fudge experiments and overreport how surprising results are, a practice known as p-hacking.

In this post, we will see that p-values are not that difficult to wrap your head around; it is the lack of good explanations that makes them seem so. In fact, we will reinvent p-values using simple concepts like frequency and chance.

To understand p-values, we need to understand the concept of distributions and to understand distributions, we need to understand histograms. Understanding histograms though requires little more than basic counting, taught in elementary school. So let us start with histograms.

Note: If you are already familiar with histograms of different types: frequency, frequency density, probability and probability density, feel free to read just the sections Introducing P-values and P-values Revisited.

Counts and Histograms

Histograms are used to keep counts of some quantity. For example, suppose you went around a neighborhood, counting the number of bedrooms in each house. Initially, you would not know anything about the number of bedrooms. Suppose the first house you went to had 2 bedrooms; you noted down [2]. The next house might have 4 bedrooms, and now your entry would read [2,4]. Since most houses have only a small number of bedrooms, by the time you had surveyed all the houses, you would have a long list with lots of small numbers in it: [2,4,3,1,2,3,3,7,4…..]. This list could be very long, and it does not tell you much about the number of rooms.

The easiest way to summarize this list would be to count the occurrences of the numbers in the list. Suppose you did that and found that the number 1 appeared 38 times, 2 appeared 566 times and so on, you could record this summary in a table like this:
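This tallying can be sketched in a few lines of Python. The short bedroom list below is made up, standing in for the full survey:

```python
from collections import Counter

# A short, made-up survey list standing in for the full neighborhood data
bedrooms = [2, 4, 3, 1, 2, 3, 3, 7, 4, 2, 3, 5, 2, 3, 4]

# Tally how many houses have each bedroom count
summary = Counter(bedrooms)

for rooms in sorted(summary):
    print(f"{rooms} bedrooms: {summary[rooms]} houses")
```

Plotting this summary as a bar chart, with bedroom counts on the x-axis and tallies on the y-axis, gives the histogram.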

A Summary of Number of Rooms

This summary, when plotted becomes a histogram.

A Typical Histogram of Number of Rooms

The quantity, in this case, is the number of bedrooms in a house, which is plotted on the x-axis. The y-axis represents the number of houses with that many bedrooms, which is our count.

The first useful bit of information the histogram provides at a glance is the span of the number of rooms. In this case, each house has between 1 and 7 rooms. How do we know? Because the counts at 1 and 7 rooms are tiny but nonzero, while the number of houses with fewer than 1 bedroom is zero, and so is the count of houses with more than 7 bedrooms.

The second bit of information is the counts themselves. Roughly 2,000 houses have 3 bedrooms, while just 400 houses have 5 bedrooms. Statisticians like to call these counts frequencies, which makes sense since the y-axis tells you how frequently you encounter a 1-bedroom, 2-bedroom or k-bedroom house. From the histogram, 3-bedroom houses are some 5x (2000/400) more frequent than 5-bedroom houses.

The third bit of information is the sum of counts which is equal to the total number of houses surveyed.

The sum of the frequencies is the total number of houses.

Chance and Probability

Since 3-bedroom houses are most frequent, if one checks out a house, selected arbitrarily, it is more likely to be a 3-bedroom house than to be, say, a 2-bedroom house. The chance that an arbitrarily selected house turns out to have 3 bedrooms can be found out by taking the ratio of the frequency of 3-bedroom houses to the sum of all frequencies and turns out to be 2032/(38+566+2032+1531+353+61+14) = 2032/4595, which is roughly 0.442. The chances of an arbitrary house being a 5-bedroom can be computed as 353/4595 = 0.077. These ratios are called probabilities.
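The same arithmetic, sketched in Python using the frequencies quoted above:

```python
# Frequencies from the survey: number of houses with 1..7 bedrooms
frequencies = {1: 38, 2: 566, 3: 2032, 4: 1531, 5: 353, 6: 61, 7: 14}

total = sum(frequencies.values())  # 4595 houses in all

# Probability of each bedroom count = its frequency / sum of all frequencies
probabilities = {rooms: freq / total for rooms, freq in frequencies.items()}

print(round(probabilities[3], 3))  # 0.442
print(round(probabilities[5], 3))  # 0.077
```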

The table below contains all the probabilities corresponding to the frequencies computed earlier. Note that the largest frequency corresponds to the largest probability, the second-largest frequency corresponds to the second-largest probability, and so on. This is bound to happen because the denominator, the sum of frequencies, is the same in all the ratios, so the numerator, the frequency, decides how large a ratio will be.

The probabilities sum up to 1. Can you see why? This is a nice property we will use later.

We can now plot a new histogram with the probabilities on the y-axis, while the x-axis remains unchanged.

Why do we need probabilities? Why don’t we just use the frequencies? Because probabilities let us compare chances. If we surveyed another neighborhood and it had 1200 3-bedroom houses, how can we determine in which of the two neighborhoods 3-bedroom houses are more common? If the other neighborhood had 5000 houses, we could compare the probability of a 3-bedroom house in the new neighborhood (=1200/5000) to the probability of a 3-bedroom house in this neighborhood (=2032/4595).

No Histogram Survives Contact with the Real World

Okay, let us think about the real world. Unless you have an army of people surveying houses, or your only life goal is to plot that one histogram of 4595 houses, or someone has already collected the data, surveying all the houses to build the first histogram, with frequencies on the y-axis, is a tall order. You can still build the second histogram, with probabilities on the y-axis. You can do so with estimation. Here’s how estimation works:

  • Based on your time and budget you decide how many houses you can survey. Let’s assume a small number like 130.
  • You randomly select, maybe on a map, those 130 houses. This is called sampling in statistics.
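The second step, random selection, can be sketched with Python's standard library. The house ids here are hypothetical labels, 0 through 4594:

```python
import random

# Hypothetical ids for the 4595 houses in the neighborhood
all_houses = range(4595)

random.seed(42)  # fixed seed so the same sample is drawn every run
sample = random.sample(all_houses, k=130)  # 130 distinct houses at random

print(len(sample))  # 130
```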

The reason we want to be random is to remove any patterns in our data. For example, if you just picked the 130 houses nearest to you, and you lived in a housing society with mostly 2-bedroom apartments, you are likely to incorrectly report that 2-bedroom apartments are the most frequently encountered ones in the entire neighborhood. There are many ways of random sampling, but we do not want to go into those details here.

Random sampling ensures that a sample is representative of the data from which it is being sampled.

Estimation using sampling will not provide you with the actual frequencies. If your task requires those actual frequencies, there is no option but to collect all the data. However, tasks requiring actual frequencies are rare and usually done by organizations or institutions that have the wherewithal to do so, for example, the population censuses conducted by governments all over the world. Luckily, many tasks require only probabilities, which can be estimated well from samples, since a good sample preserves the proportions of the actual frequencies.

Although a sample does not contain the same frequencies as the entire data, a good sample will have every frequency scaled down by approximately the same factor, so the proportion of frequencies remains approximately the same.

Building Histograms by Sampling

Suppose you plot the histogram of the number of bedrooms based on a random sample of 130 houses. Here’s how it might look:

Histogram of Bedrooms for a Sample of 130 Houses

Notice that the sample size, 130, is roughly 35x (=4595/130) smaller than the actual number of houses, and so the frequencies are all scaled down by roughly the same ratio. For example, the number of 3-bedroom houses is around 60, which is about 34x (=2032/60) smaller, very close to the factor of 35 we expected. Let us now plot the histogram of probabilities computed from the sample. For comparison, we plot it right next to the original one.

Left: Original histogram of probabilities. Right: Histogram of Probabilities Based on a Sample of Size 130

The probabilities look pretty similar and it seems that we can use the new histogram in place of the full-blown one, but there is a problem! Our new histogram does not have 7-bedroom houses in it. We know that there were just 14 of them and in a 35X smaller sample it is possible that none of them got included, despite random selection. In the next section, we see how to deal with missing bedrooms!

Dealing with Missing Bedrooms(Data)!

So how do we deal with missing data, for example, the 7-bedroom apartments in the histogram we constructed from a small sample? In the real world, we would never have access to the larger data telling us that 7-bedroom houses also exist. How can we ‘correct’ a histogram of probabilities built from sampled data? Before we talk about it, note that although in this case, the missing data was at the extreme right, that need not always be the case. If the frequency of 5-bedroom houses was tiny, we might have ended up with missing data for 5-bedroom houses too.

We are not going to dive into the details of how to fix the missing data, but in very simple terms, it can be filled in with a little bit of cheating. We reduce the probability of the existing data by a small amount and allocate it to 0-bedroom and 7-bedroom houses. Here’s what a typical ‘corrected’ histogram might look like:

Left: Histogram built from a sample of 130 houses. Right: Same histogram after ‘correction’ for 0 and 7 bedroom data. Note that the y-axis scales are different.

Note that the probabilities of the existing data have been reduced, somewhat unevenly, and the stolen probability has been allocated to the new data. The nitty-gritty of the reassignment won’t be discussed here, but suffice it to say that the assignment is based on trends. So 0-bedroom houses should receive a lower probability than 1-bedroom houses, and 7-bedroom houses should receive a lower probability than 6-bedroom ones, because we know that the probabilities become smaller toward both ends.

If you are wondering whether it is a good idea to deliberately distort the histogram by introducing unseen data (0- and 7-bedroom houses), think about the case where someone asked you about the probability of finding a house with 7 bedrooms, and you could not answer them because your sample did not have 7-bedroom houses. It could be that there are actually no 7-bedroom houses in the neighborhood, but it’s also possible that such houses exist and your sample simply missed them.

But why stop at 0 and 7 bedrooms? There are no houses with fewer than 0 bedrooms, but what about 8 or more bedrooms? Remember, we only have access to the sample, and we wouldn’t know whether the whole data would have those kinds of houses or not. We can smear out the probabilities some more to include larger bedroom counts, but the new probabilities have to be kept even smaller. So we might assign a probability of 0.0001 to a 10-bedroom house, which is so small that we can replace it with 0 if we want, or we can keep going as far as we like.

No matter how we ‘correct’ a histogram, the probabilities must sum up to 1.
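One crude way to do such a ‘correction’ is to hand a small amount of probability to the unseen bedroom counts and then renormalize so everything still sums to 1. The sample probabilities below are made up for illustration:

```python
# Made-up probabilities estimated from a sample that saw only 1-6 bedrooms
probs = {1: 0.008, 2: 0.123, 3: 0.446, 4: 0.331, 5: 0.077, 6: 0.015}

# 'Steal' a little mass for the unseen 0- and 7-bedroom houses, following
# the trend: each gets less than its nearest observed neighbor.
probs[0] = 0.002  # less than the 1-bedroom probability
probs[7] = 0.004  # less than the 6-bedroom probability

# Renormalize so the probabilities sum to 1 again
total = sum(probs.values())
probs = {rooms: p / total for rooms, p in probs.items()}

print(round(sum(probs.values()), 6))  # 1.0
```

Real smoothing methods are more principled than this hand assignment, but the bookkeeping, adding mass and renormalizing, is the same.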

Introducing P-Values

With the knowledge that you now have about histograms, if someone you trusted told you that they discovered a ten-bedroom house in the neighborhood, you would be able to look up your histogram and tell how probable that was. If the histogram says that the probability is 0.00002, you would be pretty surprised. This is what a p-value is all about. It tells you how surprising an observation is, based on a histogram of observations and their probabilities.

Dealing with Fractional Bedrooms(Continuous Data)

Oh well, what does that even mean? A house cannot have 1.15 rooms. True, but many other quantities do. For example, the area of a house can be any positive number. Let us see how we can build a histogram for such data. We will be more realistic now. So assume we gathered data on the area of 130 randomly selected houses out of the 4595 houses in a neighborhood. The data might look like: [620, 650, 670, 850, 910, 920, 940, 970, 990, 1010, 1040, …,3740, 3760, 3950, 4260, 4410, 4600]. The numbers represent the living area reported in square feet.

For simplicity, let us say the data has already been sorted. So the smallest house is 620sqft while the biggest one is 4600sqft, and there are many intermediate values. As is evident, most numbers are unique. How can you summarize these numbers with a histogram if each number seems to occur only once? The naive approach of recording the count of each number falls flat. Using this approach, the summary will likely be as long as the list of numbers, and when plotted, the histogram will look like the figure below, with tons of missing data.

Histogram of Living Area Across 130 Houses (Truncated after 1000sqft for space)

This is pretty useless. But we can get something useful by binning the quantity (living area). To do so, we bucket the living area into bins of 500sqft. So the first bin is for houses with an area between 620sqft and 1120sqft, the second bin is for houses from 1120sqft up to 1620sqft, and so on. Now we can go through the list and, for each number (living area), find the right bin and increment the bin’s count, resulting in the summary below. Note that the Upto Area column is the ceiling for each bin, and houses in a bin will strictly have an area less than this ceiling. So a house with a 1120sqft living area will be counted in the second bin (1120sqft-1620sqft), because the first bin is meant for houses that have an area up to but excluding 1120sqft.

Summary Of House Counts by Living Area (binned into bins of size 500sqft.)
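The binning loop itself is short. Here is a sketch using a made-up subset of the area data (the full list of 130 values is not reproduced here):

```python
# A made-up subset of the sorted living areas (sqft); the real survey has 130
areas = [620, 650, 670, 850, 910, 920, 940, 970, 990, 1010, 1040,
         1150, 1300, 1600, 1700, 2100, 2600, 3100, 3740, 4260, 4600]

BIN_WIDTH = 500
START = 620  # left edge of the first bin

# Each bin covers [lo, lo + 500): a house exactly on the ceiling goes up a bin
counts = {}
for area in areas:
    bin_index = (area - START) // BIN_WIDTH
    counts[bin_index] = counts.get(bin_index, 0) + 1

for i in sorted(counts):
    lo = START + i * BIN_WIDTH
    print(f"{lo}-{lo + BIN_WIDTH}sqft: {counts[i]} houses")
```

Integer floor division places each area in its bin, which matches the up-to-but-excluding ceiling rule described above.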

It is straightforward to plot the histogram now:

Histogram of Living Area of 130 houses, Built Using 500sqft Bins

However, there’s a subtle change in the sort of questions this histogram can answer. With the long list of 130 numbers, if someone asked you “What is the frequency of houses with a living area of 620 sqft?”, you would scan the entire list, count how many times 620 appeared (just once in our case) and report the count. But once you summarize the data using bins, you can no longer answer that question. The question you can answer now is “What is the frequency of houses with a living area between 620sqft and 1120sqft?”.

During summarization using binning, the exact frequency of each value of the living area has been thrown away, but this is what has enabled us to summarize the data in the first place.

You might wonder why we cannot retrieve the exact frequency of houses with an area of 620sqft. from the histogram. After all, you can go to the point on the x-axis that corresponds to 620sqft. and then read the corresponding value on the y-axis, which is 13. It seems totally doable until you try to find the frequency for some other value of the living area falling in the same bin, for example, 635sqft. Because we have a binned histogram, and this value falls in the same bin, the frequency you read off the y-axis will again be 13.

Although the bin width is 500, an unlimited number of possible areas fall in this bin. For example, houses with areas 620.1sqft, 620.523sqft and 652.9sqft all fall in the same bin, and if you were to query their counts you would get 13 every time. But we know this is not correct. From the summary table, we know that the total number of houses in the bin was just 13. That is because not all values of the living area were present in the data. For example, there was just one house with a 620sqft living area, but no houses with a living area of more than 620sqft and less than 650sqft. This is why we cannot retrieve or report individual frequencies. How do we fix this? That’s what we do in the next section.

Frequency Density to The Rescue

One way to fix the problem is to divide the frequency of each bar, that is, its height, by its width. This converts the frequency into what is known as frequency density. So, in our case, the frequency of the first bin, 13, when divided by the bin width, 500, becomes 0.026, the frequency density. The second bin, similarly, will have a frequency density of 21/500 or 0.042.

Frequency(left) vs Frequency Density(right) of Living Area of 130 Homes, Binned into 500sqft Bins

Dividing the height of a histogram bar(frequency) by the width of the bar(bin width) produces frequency density.

Dividing the height of a histogram bar(frequency) by the sum of heights(frequencies) of all the bars produces probability.

With frequency density values on the y-axis, calculating a frequency requires multiplying the density by the bin width. For example, to calculate the frequency of houses with a living area between 620sqft and 1120sqft, which corresponds to the first bin in the histogram, simply multiply its frequency density (0.026) by its bin width (500) to get 13.

We can also estimate the frequency for a smaller range, within a bin. For example, the estimated frequency of houses with living area between 700sqft and 900sqft (bin width = 900–700 =200) will be 0.026*200 or 5.20, which can be rounded down to 5. This estimate is pretty bad as the actual number is just 1 (850sqft). We could get better estimates if the bins were narrower, but we cannot make the bins too thin either otherwise we lose the benefit of summarization.

We can even estimate frequencies across bin boundaries. For example, the estimated frequency of houses with a living area between 1100sqft and 1200sqft can be calculated as (1120–1100)*0.026 + (1200–1120)*0.042 = 3.88 ≈ 3.9, rounded up to 4.
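Both density-based estimates can be checked in a few lines:

```python
BIN_WIDTH = 500

# Frequency densities of the first two bins (frequency / bin width)
density_bin1 = 13 / BIN_WIDTH  # 0.026, for 620-1120sqft
density_bin2 = 21 / BIN_WIDTH  # 0.042, for 1120-1620sqft

# Estimated frequency for 700-900sqft, entirely inside the first bin
within = density_bin1 * (900 - 700)
print(round(within, 2))  # 5.2

# Estimated frequency for 1100-1200sqft, straddling the boundary at 1120
across = (1120 - 1100) * density_bin1 + (1200 - 1120) * density_bin2
print(round(across, 2))  # 3.88
```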

What about the frequency of an exact living area, like 759.25sqft? We cannot compute that because the bin-width is 0.

For a frequency density histogram, the sum of the frequency-density and bin-width products gives the total count.

Onto Probability Densities

We have seen why probabilities are useful and how to compute them from frequencies. With binned data, however, a probability belongs to a whole bin, so we run into the same per-value problem we had with frequencies. The solution, once again, is to compute densities, similar to how we computed frequency densities. To compute probability densities, the probability of each bin is divided by its width. Let’s see how this works.

Consider the first bin spanning 620sqft to 1120sqft. Its frequency is 13. The sum of all frequencies is 130 (total houses). So the probability of the first bin is exactly 0.10. For the second bin, the frequency is 21 so the probability is 21/130=0.16. The center histogram below shows these probabilities. To calculate the probability density of each bin, we divide by the bin width, 500. So 0.10/500 = 0.0002 and 0.16/500 = 0.00032 and so on. This is depicted in the third histogram, on the right.

Frequency(L), Probability(C) and Probability Density(R) Histograms for 130 Houses (Bin Width = 500)

Using a probability density histogram, one can compute the probability, the frequency or the frequency density. For example, the probability density of the bin from 620sqft to 1120sqft is 0.0002, so the probability of the bin can be computed by reversing the computation we did earlier, that is, multiplying by the bin width (500), which gives us 0.10.

To compute the frequency, we can multiply by the sum of all frequencies, 130, which gives us 13. For finding the frequency density, we can divide again by the bin-width to get 0.026.

Notice how we multiplied by the bin width to go from probability density to probability, then multiplied by the sum of frequencies to get the frequency, and divided again by the bin width to get the frequency density. The first and third steps are just inverses of each other, so we can go from probability densities to frequency densities simply by multiplying by the sum of frequencies. For example, the probability density 0.0002, when multiplied by 130, gives us 0.026, the frequency density.
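All of these conversions, starting from the probability density of the first bin:

```python
BIN_WIDTH = 500
TOTAL_HOUSES = 130

prob_density = 0.0002  # probability density of the 620-1120sqft bin

probability = prob_density * BIN_WIDTH   # * bin width   -> 0.1
frequency = probability * TOTAL_HOUSES   # * total count -> 13.0
freq_density = frequency / BIN_WIDTH     # / bin width   -> 0.026

# Shortcut: probability density * total count = frequency density
print(round(prob_density * TOTAL_HOUSES, 3) == round(freq_density, 3))  # True
```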

This makes sense too, because for binned data it is probability densities and frequency densities that we are really interested in.

Conversion Among Different Types of Histograms

For a probability density histogram, the sum of the probability-density and bin-width products gives the total probability, which is 1.

P-values Revisited

Let us come back to P-values, only this time we will use a probability density histogram. Let us consider our running example of the histogram of the living area of 130 houses. Here’s the (probability density) histogram:

Note that this histogram is not ‘corrected’ yet, so the probability density suddenly becomes zero to the right of the last bin (4120sqft-4620sqft) and to the left of the first bin (620sqft-1120sqft). Bins in the center have higher densities and thus higher probabilities, while those toward the ends have lower probabilities.

Let us say someone reports having found a house with 4200sqft. To test how surprising this is, you would want to determine the probability of houses with 4200sqft living area. You cannot do that with a binned histogram. You need a range for the area, not just an exact value.

The trick, in this case, is to find out not just how surprising this value is but how surprising this value or a larger value is. So we want to determine the probability of a house having a living area of 4200sqft or more. Since the histogram covers houses up to 4620sqft only, we estimate the probability of houses with an area between 4200sqft and 4620sqft (width = 420). Both these values lie in the last bin (4120sqft-4620sqft), so computing the probability is straightforward. The width is 420sqft and the probability density is roughly 0.00005 (3/130/500). So the estimated probability is about 0.019, or in percentage terms 1.9%, pretty small. Since the probability is so small, the observation (a 4200sqft house) is pretty surprising.

On the other hand, if someone reported a house with a 630sqft living area, and we wanted to quantify how surprising it was to find such a house, we would instead be finding not just how surprising this value is but how surprising this value or a smaller value is. The smallest house in our data is 620sqft, so we would estimate the probability of houses with living areas in the range 620sqft to 630sqft (width = 10). Again both values fall in the same bin (620sqft-1120sqft), so the probability can be estimated from the probability density of the bin, 0.0002, times the width: 0.0002*10 = 0.002, or 0.2%. This is tiny, and thus the observation (a 630sqft house) is very surprising again.

The probabilities we calculated in both these cases are p-values. Note that depending on towards which end of the histogram we were, we had to include either all values larger than the observation (right end) or all values smaller than the observation (left end). These two cases are generally combined into one by calling them “more extreme values”.
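The two p-value calculations can be written down directly, using the bin counts from the sample (13 houses in the first bin, 3 in the last):

```python
TOTAL_HOUSES = 130
BIN_WIDTH = 500

density_first = 13 / TOTAL_HOUSES / BIN_WIDTH  # bin 620-1120sqft
density_last = 3 / TOTAL_HOUSES / BIN_WIDTH    # bin 4120-4620sqft

# p-value for 'a 4200sqft house, or larger': area under the density
# from 4200 up to the histogram's right edge at 4620
p_large = (4620 - 4200) * density_last
print(round(p_large, 3))  # 0.019

# p-value for 'a 630sqft house, or smaller': area under the density
# from the left edge at 620 up to 630
p_small = (630 - 620) * density_first
print(round(p_small, 3))  # 0.002
```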

Finally, if we wanted to be more realistic we would smear out the histogram a little bit to ‘correct’ for values unseen in a sample. This can be done by introducing more bins to the right and left of the two ends of the histogram and assigning those new bins gradually tinier probability densities while ensuring that the probability of all bins combined remains 1.

The smaller a p-value is the more surprising an observation is, given a histogram.

Now that we understand p-values, in the next part, we will try to understand what probability density distributions are, how to compute p-values from these distributions and look at why p-values are misunderstood and misused and how to interpret them correctly.

