Statistics & Probability — Inference for proportions

Inference for categorical variables

Omar Elgabry
OmarElgabry's Blog
Feb 24, 2019

Proportions — source

This series of articles is inspired by the Statistics with R Specialization from Duke University. The full series of articles can be found here.

This article covers inference for categorical variables, where the parameter of interest is a proportion, as opposed to the mean that we’ve been talking about.

One categorical variable

  • Two levels (values): binary (1 or 0) → proportion of success (1)
  • More than two levels: low, medium, high → proportion of each level

Two categorical variables

  • Two levels: Canada (yes or no; opinion on drugs) Vs USA
  • More than two levels: Status (low, medium, high) Vs Education (junior, senior)

CLT for Proportions

Revisit: Sample Vs Sampling Distribution

The distribution of the observations within each sample is called the sample distribution, while the distribution of the sample statistics across samples is called the sampling distribution.

Each sample has its own sample proportion p^, called “p hat”. The mean of the sample proportions in the sampling distribution is approximately the true population proportion (p), so we expect any given p^ to be close to p.

Central Limit Theorem (CLT)

It’s the same as we discussed for the means. The difference is that we observe the distribution of the sample proportion, which is nearly normal, centered at the population proportion (p), with standard error (SE) = sqrt(p x (1-p) / n).

The conditions are also the same, except that the skewness condition is replaced by the success-failure condition: there should be at least 10 successes (n x p >= 10) and at least 10 failures (n x (1-p) >= 10).

As before, in practice we usually have only one sample, so p^ is the proportion in that sample.
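To make the CLT statement concrete, here is a minimal simulation sketch. The values p = 0.9 and n = 200, the number of simulated samples, and the function name are assumptions chosen for illustration. It draws many samples, computes p^ for each, and compares the mean and spread of those p^ values with p and the SE formula.

```python
import math
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    proportions = []
    for _ in range(num_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

p, n = 0.9, 200  # assumed population proportion and sample size
p_hats = simulate_sample_proportions(p, n, num_samples=2000)

theoretical_se = math.sqrt(p * (1 - p) / n)
simulated_mean = statistics.mean(p_hats)
simulated_sd = statistics.stdev(p_hats)

# Success-failure condition: at least 10 expected successes and 10 expected failures
conditions_hold = n * p >= 10 and n * (1 - p) >= 10

print(f"mean of p^ = {simulated_mean:.4f} (population p = {p})")
print(f"sd of p^   = {simulated_sd:.4f} (SE from the formula = {theoretical_se:.4f})")
print(f"success-failure condition holds: {conditions_hold}")
```

The simulated mean should land close to p and the simulated standard deviation close to the SE, which is what the CLT for proportions predicts when the conditions hold.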

(~) 90% of all plants are flowering plants. If you randomly sample 200 plants, what’s the probability that at least 95% of the plants in the sample will be flowering plants?

Given that:
p = 0.9
n = 200
SE = sqrt(0.9 x 0.1 / 200) = 0.0212
Calculate P(p^ > 0.95).

We can use equations in the CLT if the conditions hold.

  • Independence: Assume the plants are randomly sampled, and the sample size (200) is < 10% of all plants.
  • Skewed: We have 200 x 0.9 (success) = 180, and 200 x 0.1 (failure) = 20, so both are ≥ 10.

We can plot using Normal or Binomial Distribution. If normal, then x-axis will have the proportion centered at 0.9, while in binomial distribution, it will have their respective count of flowering plants.

Z score = (0.95 - 0.9) / 0.0212 = 2.36
The area above that Z score is P(p^ > 0.95) ~= 0.0091
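The calculation above can be sketched in a few lines of Python using only the standard library. The helper name `normal_upper_tail` is my own. Because the SE is not rounded here, the Z score comes out near 2.357 and the probability is roughly 0.009, in line with the hand calculation. The exact binomial sum is included as an alternative, since the same question can be asked in terms of counts (at least 190 of 200 plants).

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p, n = 0.9, 200
threshold = 0.95

se = math.sqrt(p * (1 - p) / n)
z = (threshold - p) / se
normal_prob = normal_upper_tail(z)

# Exact binomial alternative: at least 95% of 200 plants means at least 190 plants
min_count = 190
binomial_prob = sum(
    math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(min_count, n + 1)
)

print(f"SE = {se:.4f}, Z = {z:.2f}")
print(f"normal approximation: P(p^ > 0.95) ~= {normal_prob:.4f}")
print(f"exact binomial:       P(X >= 190)  ~= {binomial_prob:.4f}")
```

The two answers will not match exactly, because the binomial distribution is discrete and slightly skewed while the normal curve is an approximation.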

Confidence Interval

(~) What’s the 95% confidence interval for the proportion of Americans who have a good idea about statistics?

Given that:
good idea = 571
wrong idea = 99
n = 670
p^ = 571 / 670 ~= 0.85

The parameter of interest (p): the percentage of “all” Americans who have a good idea. The point estimate (p^): the percentage of “sampled” Americans who have a good idea.

First. To get the confidence interval, we check if CI conditions still hold.

  • Independence: Assumed to be random, and n < 10% of all Americans. So, observations are independent.
  • Skewed: 571 (number of success) and 99 (number of failure), are both ≥ 10. So, distribution is nearly normal.

Then. Calculate CI. We know that:

CI = point estimate +/- ME = p^ +/- (z* x SE),

where SE = sqrt(p(1-p) / n)

For calculating SE with CI, if p (population) is unknown, use p^.

So. With z* = 1.96 for 95% confidence, CI = 0.85 +/- 1.96 x sqrt(0.85 x 0.15 / 670) = 0.85 +/- 0.027 = (0.823, 0.877)

We are 95% confident that 82.3% to 87.7% of all Americans have a good idea about statistics. It also means that about 95% of random samples of 670 Americans will yield a CI that contains the true proportion.
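As a rough sketch, the interval can be computed directly from the counts. The function name and its `confidence` parameter are my own choices. Note that this version keeps p^ unrounded (about 0.852), so the interval comes out near (0.825, 0.879) rather than the hand-calculated one that uses 0.85.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation CI for a single proportion, using p^ in the SE."""
    if min(successes, n - successes) < 10:
        raise ValueError("success-failure condition not met")
    p_hat = successes / n
    # z* is the cutoff that leaves (1 - confidence) / 2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z_star * se
    return p_hat - margin, p_hat + margin

low, high = proportion_confidence_interval(successes=571, n=670)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```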

Hypothesis Testing: One variable

(~) Research found that 60% of 1,983 randomly sampled Americans believe in evolution. Do the data provide convincing evidence that more than 50% of Americans believe in evolution?

As usual, the steps of hypothesis testing

— 1. Set the hypothesis

  • H0: p = null value = 0.5
  • HA: p > 0.5

— 2. Point estimate

The point estimate (p^) is the sample proportion, which equals 0.6.

— 3. Check conditions

  • Independence: Assume the sample is random, and the sample size (1,983) is < 10% of all Americans.
  • Skewed: Using the null value p = 0.5, the expected successes and failures are both 1983 x 0.5 = 991.5, so each is ≥ 10.

— 4. Draw sampling distribution, and calculate the p-value

The test statistic (Z) = (p^ - p) / SE
= (0.6 - 0.5) / sqrt(0.5 x 0.5 / 1983)
~= 8.91
The p-value = P(p^ > 0.6 | p = 0.5) = P(Z > 8.91) ~= almost 0

There is almost 0% chance of obtaining a random sample of 1,983 Americans where 60% or more believe in evolution, if in fact 50% of Americans believe in evolution. Since the p-value is far below 5%, we reject H0.
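Here is a minimal sketch of the same test in Python. The function name is hypothetical, and the SE uses the null value p = 0.5, as in the steps above. The upper tail is computed with `math.erfc` because the probability is so small that subtracting a CDF value from 1 would round to zero.

```python
import math

def one_proportion_z_test(p_hat, n, p_null):
    """One-sided test of H0: p = p_null against HA: p > p_null."""
    se = math.sqrt(p_null * (1 - p_null) / n)    # SE under the null hypothesis
    z = (p_hat - p_null) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)
    return z, p_value

z, p_value = one_proportion_z_test(p_hat=0.6, n=1983, p_null=0.5)
print(f"Z = {z:.2f}, p-value = {p_value:.2e}")
```

With an unrounded SE the statistic comes out near 8.9, and the p-value is far smaller than any usual significance level.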

CI: Two variables

We will work with two categorical variables, each has two levels (success or failure), and we are going to compare their proportion of successes.

We will calculate the CI for the difference between the unknown two population proportions using data from our sample.

(~) In Canada, 59 out of 83 said Yes to banning guns, a success rate of 71%. In the USA, 257 out of 1028 said Yes to banning guns, a success rate of 25%.

Given that: The success rate of
canada = 0.71, ncanada = 83
usa = 0.25, nusa = 1028

The parameter of interest is the difference between the proportion (success, who said Yes) of Canada and the proportion of the USA (pCanada - pUSA).

The point estimate is the difference between the proportions (success) of sampled people from Canada and the USA (p^Canada - p^USA).

First. To get the confidence interval, we check if CI conditions still hold.

  • Independence:
  • — Within group: Same as before, a random sample, and size < 10%
  • — Between groups: The two groups must be independent.
  • Skewed: Same as before, each sample should meet the success-failure condition.

Then. Calculate CI.

CI = point estimate +/- ME = (p^1 - p^2) +/- (z* x SE),

where SE = sqrt((p1(1-p1) / n1) + (p2(1-p2) / n2))

Again. Use p^ in SE if p is not available.

So. CI = (0.71 - 0.25) +/- 1.96 x 0.0516 = 0.46 +/- 0.10 = (0.36, 0.56), where p1 is for Canada, and p2 is for the USA.

We are 95% confident that the proportion of Canadians who said Yes is 36 to 56 percentage points higher than the proportion of Americans.

Swapping the USA and Canada in the calculation only flips the sign of the interval, giving (-0.56, -0.36), and the conclusion is the same: we are 95% confident that the proportion of Americans is 36 to 56 percentage points lower than the proportion of Canadians.
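The two-sample interval can be sketched as follows. The function name is my own, and the observed p^ values are used in the SE since the population proportions are unknown. Calling it with the groups swapped shows that the interval simply changes sign.

```python
import math
from statistics import NormalDist

def two_proportion_confidence_interval(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation CI for p1 - p2, using the observed proportions in the SE."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z_star * se, diff + z_star * se

# Canada minus USA
low, high = two_proportion_confidence_interval(59, 83, 257, 1028)
# USA minus Canada: same interval with the sign flipped
flipped_low, flipped_high = two_proportion_confidence_interval(257, 1028, 59, 83)

print(f"Canada - USA: ({low:.2f}, {high:.2f})")
print(f"USA - Canada: ({flipped_low:.2f}, {flipped_high:.2f})")
```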

Hypothesis Testing: Two variables

Here, we’ll also work with two variables, each has two levels (success and failure).

(Q) Parents (mothers and fathers) were surveyed and asked whether their child has been bullied.

Do the data provide convincing evidence that the proportion of fathers and the proportion of mothers reporting bullying are not equal?

The steps of hypothesis testing

— 1. Set the hypothesis

  • H0: pF = pM
  • HA: pF != pM

— 2. Point estimate

The point estimate (p^F - p^M) is the difference between the sample proportions of fathers (34 / 90 ~= 0.38) and mothers (61 / 122 = 0.5), which equals -0.12.

— 3. Check conditions

  • Independence: Sampled mothers and fathers are independent, and both groups are also independent.
  • Skewed: …

When calculating the CI for the difference in proportions, we use the observed sample proportions p^1 and p^2. In general, hypothesis tests use the value of p expected under H0, while CIs use the observed p^.

But here H0 only says that p1 = p2, without stating a value, so we don’t have an expected value for p1 and p2.

So, how do we check the success-failure condition (and calculate the SE)? We come up with a best guess for the common proportion, called the pooled proportion, “p-hat pool”.

p1 = p2 = p^(pool) = (# of successes fathers + # of successes mothers) / (nF + nM)
= (34 + 61) / (90 + 122)
~= 0.45

So, the expected number of successes and failures for each group:

Mothers: successes = 122 * 0.45 = 54.9
failures = 122 * (1-0.45) = 67.1
Fathers: successes = 90 * 0.45 = 40.5
failures = 90 * (1-0.45) = 49.5
All four are at least 10, so the success-failure condition is met.

— 4. Draw sampling distribution, and calculate the p-value

To calculate the SE, we use p^(pool) we calculated:

SE = sqrt((p(1-p) / n1) + (p(1-p) / n2))
= sqrt((0.45 * (1-0.45) / 90) + (0.45 * (1-0.45) / 122))
= 0.0691
The test statistic (Z) = ((p^F - p^M) - null value) / SE
= (-0.12 - 0) / 0.0691
~= -1.74
The p-value = P(Z > 1.74 OR Z < -1.74) = 0.08

So, the p-value is high (above 5%), and we fail to reject the null hypothesis: the data do not provide convincing evidence that the two proportions differ.
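The whole test, from pooling to p-value, can be sketched like this. The function name is hypothetical, group 1 is taken to be the fathers (34 of 90) and group 2 the mothers (61 of 122). Because nothing is rounded along the way, Z comes out near -1.77 rather than -1.74, and the p-value is still about 0.08, so the conclusion does not change.

```python
import math

def pooled_two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided test of H0: p1 = p2 using the pooled proportion."""
    p_pool = (x1 + x2) / (n1 + n2)

    # Success-failure condition, checked with the pooled proportion
    for n in (n1, n2):
        if n * p_pool < 10 or n * (1 - p_pool) < 10:
            raise ValueError("success-failure condition not met")

    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * P(Z > |z|)
    return p_pool, z, p_value

p_pool, z, p_value = pooled_two_proportion_z_test(34, 90, 61, 122)
print(f"pooled p^ = {p_pool:.3f}, Z = {z:.2f}, p-value = {p_value:.3f}")
```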

Hypothesis Testing: One variable, multiple Levels

This is when we have a categorical variable with multiple levels.

(~) We are given the race distribution of the people in a country who apply for jobs, and the race distribution of the people who are selected for these jobs.

Do the data provide convincing evidence that there is discrimination?

— 1. Set the hypothesis

  • H0: People were selected randomly, and the observed counts follow the same race distribution.
  • HA: People weren’t selected randomly, and the observed counts don’t follow the same race distribution.

If we can show that the expected counts (based on the above percentages) vary greatly from the observed counts, this provides convincing evidence for the alternative hypothesis.

This is called “a goodness of fit”; how well the observed data fit the expected distribution.

— 2. Point estimate

The point estimate is the distribution of races among people who were selected for a job.

— 3. Check conditions

  • Independence: Observations are independent, and if sampling without replacement, n has to be < 10% of the population. In addition, each observation has to belong to only one of the levels (i.e. races).
  • Sample size: Each level must have at least 5 expected observations.

— 4. Calculate Chi-Square statistic and the p-value

Out of 2500 people, we would expect:

white = 0.8029 * 2500 ~= 2007
black = 0.1206 * 2500 ~= 302
...

For sure, the observed data might be slightly (or greatly) different from that number. So, we calculate the variation between the expected and the observed.

When dealing with counts and investigating how far the observed counts are from the expected counts, we use this new test statistic called the chi-square statistic.

X^2 = SUM((O - E)^2 / E)  // observed - expected for each level

Like the F statistic, it is right-skewed and never negative.

The chi-square distribution has one parameter: degrees of freedom (df). It determines the shape, center, and spread.

df = k (levels) - 1

As df increases, the distribution becomes less right-skewed (more normal), the center moves to the right, and the curve flattens.
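A quick way to see this is through the standard summary values of a chi-square distribution: its mean is df, its standard deviation is sqrt(2 x df), and its skewness is sqrt(8 / df). The small sketch below (the function name is my own) prints them for a few df values.

```python
import math

def chi_square_shape(df):
    """Mean, standard deviation and skewness of a chi-square distribution with df degrees of freedom."""
    return {
        "mean": df,
        "sd": math.sqrt(2 * df),
        "skewness": math.sqrt(8 / df),
    }

for df in (2, 4, 8, 50):
    shape = chi_square_shape(df)
    print(f"df={df:>2}: mean={shape['mean']}, sd={shape['sd']:.2f}, skewness={shape['skewness']:.2f}")
```

The mean and spread grow with df while the skewness shrinks toward 0, which matches the description above.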

So. To calculate the chi-square statistic and p-value

df = 5 - 1 = 4
X^2 (test statistic) = 22.63
p-value = 0.0002

Since p-value is very small, we reject H0. So, the data provide convincing evidence that the observed distribution of the counts of races does not follow the (expected) distribution in the population.
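Below is a minimal sketch of the goodness of fit calculation using only the standard library. Both function names are my own. The observed counts from the study are not listed in the text, so the statistic is demonstrated on hypothetical toy counts that are not the article's data. The p-value is computed for the reported statistic (22.63, df = 4) with a closed form that only works for even df.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all levels."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_upper_tail_even_df(x, df):
    """P(X^2 > x) for an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which only holds when df is even.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

# Hypothetical toy counts (NOT the article's data): 100 observations, 5 equally likely levels
toy_observed = [18, 22, 20, 25, 15]
toy_expected = [20, 20, 20, 20, 20]
toy_statistic = chi_square_statistic(toy_observed, toy_expected)
print(f"toy X^2 = {toy_statistic:.2f}")

# p-value for the statistic reported in the article
p_value = chi_square_upper_tail_even_df(22.63, df=4)
print(f"p-value for X^2 = 22.63, df = 4: {p_value:.5f}")
```

The p-value comes out around 0.00015, which rounds to the 0.0002 quoted above.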

Hypothesis Testing: Two variables, multiple Levels

This is when we have two categorical variables, at least one of which has more than 2 levels.

(~) A study looks at obesity and relationship status. We have two variables: weight (obese or not) and relationship status.

Do the data provide convincing evidence that there is a relationship between weight (obese) and relationship status?

— 1. Set the hypothesis

  • H0: Weight and relationship are independent. Obesity rates don’t vary by relationship status.
  • HA: Weight and relationship are dependent. Obesity rates do vary by relationship status.

If we can show that the expected counts (based on the formula below) vary greatly from the observed counts, this provides convincing evidence for the alternative hypothesis.

This is called a test of “independence”, since we evaluate the relationship between two categorical variables. It’s different from “goodness of fit”, which evaluates how well the observed counts fit an expected distribution.

— 2. Point estimate

The point estimate is the relationship between being obese and the relationship status.

— 3. Check conditions.

Same as before.

— 4. Calculate Chi-Square statistic and the p-value

In the previous section, we were given the true percentages, and we just needed to calculate the counts.

Here, we don’t have the true percentages, so we calculate the expected counts as follows:

  1. Calculate the overall rate of being obese = 331 / 1293 = 0.256
  2. Under the assumption that the two variables are independent (H0), how many obese people would we expect among those who are dating, cohabiting, and married?
dating     = 440 * 0.256 ~= 112.6
cohabiting = 429 * 0.256 ~= 109.8
married    = 424 * 0.256 ~= 108.5

Do the same for non-obese people. The formula is

Expected count = (row total * col total) / table total

So. To calculate the chi-square statistic and p-value.

The test statistic X² is the same as before. It’s calculated for each cell in the table above and summed. The df, however, equals (rows-1) x (cols-1).

df = (2 - 1) x (3 - 1) = 1 x 2 = 2
X^2 (test statistic) = 31.68
p-value ~= very small value

So, these data provide convincing evidence that relationship status and obesity are associated.
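The expected counts for every cell can be sketched from the row and column totals alone. The function name is my own, and the non-obese row total is derived as 1293 - 331. The observed cell counts are not listed in the text, so the statistic itself is not recomputed here. The p-value uses the reported X^2 = 31.68 together with the fact that, for df = 2, the chi-square upper tail is exp(-x / 2).

```python
import math

def expected_counts(row_totals, col_totals):
    """Expected count per cell under independence: row total * col total / table total."""
    table_total = sum(row_totals)
    return [[r * c / table_total for c in col_totals] for r in row_totals]

row_totals = [331, 1293 - 331]  # obese, not obese
col_totals = [440, 429, 424]    # dating, cohabiting, married
expected = expected_counts(row_totals, col_totals)

for label, row in zip(("obese", "not obese"), expected):
    print(label, [round(cell, 1) for cell in row])

df = (len(row_totals) - 1) * (len(col_totals) - 1)
# For df = 2, P(X^2 > x) has the closed form exp(-x / 2)
p_value = math.exp(-31.68 / 2)
print(f"df = {df}, p-value = {p_value:.2e}")
```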

Does this test imply causation? No.

Can we conclude that living with someone is making some people obese, and marrying someone is making people even more obese? No.

Remember that this is an observational study, so what we’re seeing might be the effect of a confounder such as age. It is possible that there is a causal relationship, but the type of analysis that we conducted here is not sufficient to deduce one.

Chi-square test: Recap

Two types of chi-square tests.

  • Chi-square test of goodness of fit. Where we compared the distribution of one categorical variable with more than two levels to a hypothesized distribution.
  • Chi-square test of independence. Where we evaluated the relationship between two categorical variables, at least one of which has more than two levels.

Thank you for reading! If you enjoyed it, please clap 👏 for it.
