Statistical Inferences

Shalini Shree, Nik Bear Brown

Published in AI Skunks

Mar 13, 2023

INTRODUCTION

LET’S BEGIN WITH AN EXAMPLE!

Do you like pizza?

If yes, ever wondered which kind of pizza people prefer?

No? Then let's try to figure it out.

Imagine you’re a pizza chef who wants to determine what toppings are most popular among your customers. You have a menu with 20 different toppings, and you want to know which ones are the most popular so that you can optimize your pizza-making process and inventory management.

However, surveying every single customer who orders pizza from your restaurant would be time-consuming and impractical. Instead, you decide to use sampling to estimate the popularity of each topping.

To do this, you randomly select a sample of 100 pizza orders from the past month. You record the toppings that were requested on each of these pizzas and count how many times each topping was ordered. This gives you a sample distribution of the toppings.

Based on this sample, you find that the most popular toppings are pepperoni, mushrooms, and sausage. You can use this information to make informed decisions about your inventory management, such as stocking up on more of these popular toppings and potentially reducing the inventory of less popular toppings.

However, it’s important to keep in mind that the sample of 100 pizza orders may not be an exact representation of all the pizza orders your restaurant receives. The sample may be biased in some way, such as only including orders from certain times of the day or certain days of the week. To account for this uncertainty, you can calculate a margin of error and a confidence interval to estimate the range of values in which the true popularity of each topping is likely to fall.
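The margin-of-error idea can be sketched in Python using the normal approximation for a proportion. The 42-of-100 pepperoni count below is invented for illustration:

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical: 42 of the 100 sampled orders included pepperoni.
low, high = proportion_ci(42, 100)
print(f"95% CI for pepperoni share: ({low:.3f}, {high:.3f})")
```

With only 100 orders the interval spans almost 20 percentage points, which is why close rankings between toppings should be treated cautiously.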

Sampling can be a fun and useful tool for making decisions in everyday life, such as optimizing your pizza-making process!

Drawing conclusions about an entire population by applying statistical tools to a sample of that population is referred to as Statistical Inference.

Inferential statistics also allow for the quantification of uncertainty and the assessment of hypotheses about the population based on the sample data.

Advantages of statistical inference

  • Enables making generalizations about a larger population based on a smaller sample of the population.
  • Allows for the quantification of uncertainty, so that the level of confidence in the results can be determined.
  • Helps to identify patterns and relationships in the data, and make predictions about future events.
  • Facilitates the testing of hypotheses and the comparison of multiple groups or variables.

Disadvantages of statistical inference

  • The conclusions drawn from statistical inference are only as reliable as the sample data. If the sample is not representative of the population, the results may be biased.
  • Inferential statistics can be affected by outliers and other anomalies in the data.
  • The choice of statistical methods and models used for inference can impact the results, so it is important to choose appropriate methods for the data and research questions.
  • The results of statistical inference are only an estimate and may not perfectly reflect the true population values.

SIZE OF THE SAMPLE?

The appropriate sample size depends on several factors such as the variability of the population, the desired level of precision, the confidence level, and the research question being addressed. In general, a larger sample size is preferred as it provides a more accurate estimate of the population parameter with higher precision and reduces the sampling error.

However, it’s important to note that the sample size should not be too large, as it can lead to a waste of resources and time. Additionally, the sample should be representative of the population to ensure the generalisability of the results.

TYPES OF SAMPLING TECHNIQUES

Probability sampling methods

  1. Simple Random Sampling: Simple random sampling is the most common sampling technique used in data analysis. In this technique, every individual in the population has an equal chance of being selected. For example, if we are finding the preference of toppings on a pizza, we can randomly select a set of orders from the order history to see the most popular choices of toppings in those orders.

Advantages:

  • It is easy to implement.
  • It is unbiased, as every individual has an equal chance of being selected.


Disadvantages:

  • It may not be representative of the population if the sample size is small.
  • It can be expensive and time-consuming if the population is large.
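A minimal sketch of simple random sampling, using a synthetic order history (the topping list and counts are invented):

```python
import random
from collections import Counter

random.seed(42)  # reproducible sketch

# Hypothetical order history: 5,000 past orders, one topping each for simplicity.
toppings = ["pepperoni", "mushrooms", "sausage", "onions", "olives"]
population = [random.choice(toppings) for _ in range(5000)]

# Simple random sample: every order has an equal chance of being selected.
sample = random.sample(population, 100)
print(Counter(sample).most_common(3))
```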

2. Systematic sampling: Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.

3. Stratified Sampling: Stratified sampling is a sampling technique used when the population can be divided into distinct subgroups or strata. In this technique, individuals are randomly selected from each stratum in proportion to their size. This ensures that the sample is representative of the population.

Advantages:

  • It ensures that the sample is representative of the population.
  • It reduces the sampling error, as individuals are selected from each stratum.

Disadvantages:

  • It can be complex to implement if the population has many strata.
  • It may not be effective if the population cannot be clearly divided into strata.
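Stratified sampling with proportional allocation can be sketched as follows; the weekday/weekend split is a hypothetical example of strata:

```python
import random

random.seed(0)

# Hypothetical strata: weekday vs. weekend orders, sampled in proportion to size.
strata = {
    "weekday": [f"weekday-order-{i}" for i in range(800)],
    "weekend": [f"weekend-order-{i}" for i in range(200)],
}
total = sum(len(orders) for orders in strata.values())
sample_size = 100

stratified_sample = []
for name, orders in strata.items():
    k = round(sample_size * len(orders) / total)  # proportional allocation
    stratified_sample.extend(random.sample(orders, k))

print(len(stratified_sample))  # 80 weekday + 20 weekend
```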

4. Cluster Sampling: Cluster sampling is a technique used when it is difficult or expensive to obtain a sample from the population. In this technique, the population is divided into clusters, and a random sample of clusters is selected. Then, all individuals within the selected clusters are included in the sample.

Advantages:

  • It is cost-effective, as it reduces the time and cost of sampling.
  • It can be effective if the population is geographically dispersed.

Disadvantages:

  • It may not be representative of the population if the clusters are not well-defined.
  • It may introduce bias if the clusters are not homogenous.
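The two-stage nature of cluster sampling (pick whole clusters, then take every unit inside them) can be sketched like this; the delivery zones are invented:

```python
import random

random.seed(1)

# Hypothetical clusters: five delivery zones with 50 orders each.
clusters = {zone: [f"{zone}-order-{i}" for i in range(50)]
            for zone in ["north", "south", "east", "west", "center"]}

chosen_zones = random.sample(list(clusters), 2)  # stage 1: pick whole clusters
cluster_sample = [order for zone in chosen_zones
                  for order in clusters[zone]]   # stage 2: every order inside

print(chosen_zones, len(cluster_sample))
```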

NON-PROBABILITY SAMPLING METHODS

  1. Convenience Sampling: Convenience sampling is a type of non-probability sampling technique in which the sample is chosen based on the ease of access to the individuals in the population. Convenience sampling is easy to use and can be less time-consuming than other sampling techniques. However, convenience sampling is not representative of the population, as individuals who are more accessible may not be representative of the entire population.
  2. Quota Sampling: Quota sampling is a type of non-probability sampling technique in which the sample is chosen based on pre-defined characteristics such as age, gender, or income. The population is divided into subgroups based on these characteristics, and a quota is set for each subgroup. The sample is then chosen from each subgroup until the quota is met. Quota sampling is relatively easy to use and can be less expensive than other sampling techniques. However, quota sampling may not be representative of the entire population, as individuals who are not part of the pre-defined characteristics may be excluded from the sample.
  3. Snowball Sampling: Snowball sampling is a type of non-probability sampling technique in which the initial participants in the sample are chosen based on their knowledge of other individuals in the population who have similar characteristics. The initial participants are asked to identify other individuals in the population who meet the criteria for the study, and those individuals are then asked to participate. Snowball sampling is useful when the population is difficult to access or when it is difficult to identify the entire population. However, snowball sampling may introduce bias into the sample, as individuals who are more connected or more willing to participate may be overrepresented in the sample.
  4. Purposive Sampling: Purposive sampling is a type of non-probability sampling technique in which the sample is chosen based on the researcher’s judgment or purpose of the study. The researcher selects individuals who are likely to provide the necessary information or represent the characteristics of interest in the study. Purposive sampling can be useful when the population is difficult to access or when the research question requires a specific type of participant. However, purposive sampling may introduce bias into the sample, as the researcher’s judgment may be subjective and influenced by personal biases.

WHAT IS BIAS?

Bias refers to any systematic error or deviation from the true value that affects the accuracy and validity of the results. Bias can arise from various sources, such as the sampling method, the measurement instrument, the data collection process, or the data analysis technique used.

Some common types of bias include:

Sampling Bias: Sampling bias occurs when the sample selected for the study is not representative of the entire population or when certain groups are over- or underrepresented in the sample. Sampling bias can lead to inaccurate or misleading results that do not generalize to the population.

Measurement Bias: Measurement bias occurs when the measurement instrument used to collect data systematically under- or overestimates the true value of the variable of interest. Measurement bias can lead to inaccurate or unreliable data that do not reflect the true state of the phenomenon.

Response Bias: Response bias occurs when the participants in the study do not provide accurate or truthful responses to the survey questions or interview prompts. Response bias can arise from social desirability bias, where participants give answers that they believe are socially acceptable or desirable, or from recall bias, where participants have difficulty recalling past events or experiences accurately.

Confounding Bias: Confounding bias occurs when the relationship between two variables is affected by a third variable that is not accounted for in the analysis. Confounding bias can lead to spurious or misleading results that do not reflect the true association between the variables of interest.

Selection Bias: Selection bias occurs when individuals or groups are selected for inclusion in the study based on non-random criteria, such as availability, convenience, or personal preference. Selection bias can lead to a biased sample that does not represent the population of interest.

Types of statistical inferences:

Deductive inference: Also known as analytic or theoretical inference, this involves deducing conclusions based on prior knowledge or assumptions about a population, without the need for collecting sample data.

Inductive inference: Also known as empirical or observational inference, this involves drawing conclusions about a population based on a sample of data collected from that population.

Deductive Inference

Deductive inference is a type of statistical inference that draws conclusions about a population from prior knowledge or assumptions about it, without requiring sample data. Rather than relying on sample data, this kind of inference is founded on logical reasoning and prior knowledge.

There are several techniques used in deductive inference, including:

  • Theoretical modelling: This entails utilizing equations and mathematical models to portray the connections and interactions among members of a population. Based on the underlying assumptions and inputs, the models may be used to generate inferences and predictions about the population.
  • Formal logic: This implies drawing conclusions based on presumptions or past information by applying formal logic and reasoning. Formal logic may be applied to draw conclusions regarding the connections between variables and to generate predictions based on those connections.
  • Bayesian inference: This includes combining sample data and previous information to draw conclusions about a population using the Bayesian probability theory. With the use of fresh evidence, Bayesian inference may be used to revise prior assumptions about a population and produce population-specific predictions.

Inductive inference

Inductive inference involves making inferences about a population based on a sample of data. This kind of inference is predicated on the notion that conclusions about a wider population may be drawn from the characteristics of a sample.

  • Point estimation: This involves using sample data to estimate the value of a population parameter, such as the mean or variance. Point estimation techniques include methods such as the sample mean and maximum likelihood estimation.
  • Confidence intervals: This involves using sample data to construct an interval that is believed to contain the true value of a population parameter with a certain level of confidence. Confidence intervals can be used to estimate the range of values that a population parameter is likely to take.
  • Hypothesis testing: This involves using sample data to test a hypothesis about a population parameter. Hypothesis testing involves making a null hypothesis (e.g., the population’s mean is equal to a certain value) and an alternative hypothesis and then using sample data to determine which hypothesis is more likely to be true.
  • Regression analysis: This involves using sample data to fit a regression model that describes the relationship between one or more independent variables and a dependent variable. Regression analysis can be used to make predictions about the dependent variable based on the values of the independent variables.
  • Bayesian inference: This involves using Bayesian probability theory to make inferences about a population based on prior knowledge and sample data. Bayesian inference can be used to update prior beliefs about a population based on new data, and to make predictions about the population based on that updated information.
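The Bayesian update described above can be sketched with the classic Beta-Binomial model. The prior and the 42-of-100 pepperoni count are assumptions for illustration:

```python
# Beta-Binomial update: with a Beta(a, b) prior on a topping's popularity and
# k successes observed in n orders, the posterior is Beta(a + k, b + n - k).
a, b = 2, 2          # mild prior belief centred on 0.5
k, n = 42, 100       # hypothetical: 42 of 100 sampled orders had pepperoni

post_a, post_b = a + k, b + n - k
posterior_mean = post_a / (post_a + post_b)
print(round(posterior_mean, 3))
```

The posterior mean sits between the prior mean (0.5) and the observed rate (0.42), pulled toward the data as the sample grows.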

Basic Probability

Probability can roughly be described as the chance of an event or sequence of events occurring.

Experiment — an uncertain situation that could have multiple outcomes. A coin toss is an experiment.

Outcome — the result of a single trial. So, if the coin lands heads, the outcome of our coin-toss experiment is “Heads”.

Event — one or more outcomes from an experiment. “Tails” is one of the possible events for this experiment.

  • The probability of an event that is certain to occur is 1.
  • The probability of an event that cannot occur is 0.
  • The probability of the complement of an event is 1 minus the probability of that event.
  • The probability that at least one of two (or more) mutually exclusive events occurs is the sum of their individual probabilities.

Conditional Probability

Conditional probability is the probability of event A occurring, given that event B has occurred.

The conditional probability of A given B is denoted by P(A|B) and is calculated as follows:

P(A|B) = P(A and B) / P(B)

where P(A and B) is the probability of both A and B occurring, and P(B) is the probability of B occurring.
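A quick numeric illustration of the formula, with hypothetical order counts (A = order has pepperoni, B = order has mushrooms):

```python
# Hypothetical counts from 1,000 orders:
total_orders = 1000
with_mushrooms = 300                # event B
with_pepperoni_and_mushrooms = 120  # event A and B

p_b = with_mushrooms / total_orders
p_a_and_b = with_pepperoni_and_mushrooms / total_orders
p_a_given_b = p_a_and_b / p_b  # P(A|B) = P(A and B) / P(B)
print(f"{p_a_given_b:.2f}")
```

So among orders that already include mushrooms, 40% also include pepperoni, even though pepperoni appears on only 12% of all orders jointly with mushrooms.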

Probability Distribution

A probability distribution is a statistical function that describes the likelihood of obtaining all possible values that a random variable can take.

If our random variable has discrete values, the probability distribution is the probability mass function (PMF) for that variable; similarly, if our random variable has continuous values, the distribution is termed a probability density function (PDF).

One of the most common Probability Distribution Functions is the Normal Distribution.

Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a symmetric probability distribution about the mean, indicating that data near the mean occur more frequently than data distant from the mean. The normal distribution will show as a bell curve on a graph.
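The bell curve comes from the Gaussian density formula, which can be written directly as a small function:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal (Gaussian) distribution."""
    coef = 1.0 / (sigma * math.sqrt(2 * math.pi))
    return coef * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density peaks at the mean and falls off symmetrically: the bell curve.
print(round(normal_pdf(0), 4))          # 0.3989, the peak for mu=0, sigma=1
print(normal_pdf(-1) == normal_pdf(1))  # True, by symmetry about the mean
```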

The graph shows that the distribution of the target value is relatively flat, or uniform, over that range of values, which indicates that the PDF is approximately constant over that range.

Z-score

A z-score (also called a standard score) is a statistical measure that indicates how many standard deviations a data point is away from the mean of the distribution. It is calculated by subtracting the mean from the data point and then dividing it by the standard deviation.

z-score = (x − μ) / σ

E.g.: Now suppose we want to find the z-score for a particular pizza topping, pepperoni. We can do this by first calculating the mean and standard deviation of the number of pepperoni slices consumed by each person. Let’s assume that the data shows that, on average, each person consumes 2 slices of pepperoni pizza, with a standard deviation of 0.5 slices.

To find the z-score for a person who ate 3 slices of pepperoni pizza, we can plug the values into the formula as follows:

z-score = (3 − 2) / 0.5 = 2

This means that the person who ate 3 slices of pepperoni pizza is 2 standard deviations above the mean number of pepperoni slices consumed by a person.
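The same calculation as a one-line function, mirroring the worked example above:

```python
def z_score(x, mu, sigma):
    """Standard score: distance from the mean in standard-deviation units."""
    return (x - mu) / sigma

# The worked example above: 3 slices, mean 2, standard deviation 0.5.
print(z_score(3, 2, 0.5))  # 2.0
```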

Inference from Stats

Sample Mean and Population Mean
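A minimal sketch of the distinction, with an invented population of slice counts: the sample mean is a computable estimate of the (usually unknowable) population mean.

```python
import random
from statistics import fmean

random.seed(11)

# Hypothetical population: slices eaten by each of 10,000 customers.
population = [random.randint(1, 6) for _ in range(10_000)]
population_mean = fmean(population)

# A simple random sample gives an estimate of the population mean.
sample = random.sample(population, 200)
sample_mean = fmean(sample)

print(round(population_mean, 2), round(sample_mean, 2))
```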

Central Limit Theorem

The central limit theorem (CLT) states that, given a sufficiently large sample size drawn from a population with finite variance, the mean of all samples from that population will be approximately equal to the population mean. Furthermore, the sample means will follow an approximately normal distribution, with variance roughly equal to the variance of the population divided by the size of each sample.

In the graphs above, the yellow curve is the Gaussian distribution predicted by the Central Limit Theorem. Notice that the rate of convergence of the sample mean to the Gaussian depends on the original parent distribution. Also, the mean of the Gaussian distribution is the same as that of the original parent distribution, while the width of the Gaussian distribution shrinks with sample size as 1/√n.
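The original plots are not reproduced here, but the convergence itself is easy to simulate. The sketch below draws sample means from a deliberately skewed parent distribution (exponential, mean 1) and checks that they behave as the CLT predicts:

```python
import random
from statistics import fmean, stdev

random.seed(7)

# Parent distribution: exponential with mean 1 (skewed, not normal).
n, trials = 100, 2000

# Each entry is the mean of one sample of size n.
sample_means = [fmean(random.expovariate(1.0) for _ in range(n))
                for _ in range(trials)]

# CLT: the sample means cluster around the parent mean (1.0), with
# standard deviation close to sigma / sqrt(n) = 1 / 10 = 0.1.
print(round(fmean(sample_means), 2), round(stdev(sample_means), 2))
```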

Confidence Interval

A Confidence Interval (CI) is a form of estimate derived from observed sample statistics. It gives a range of plausible values for an unknown parameter (for example, the mean), accompanied by a confidence level that the true parameter lies within that range.

z-critical value: 1.6448536269514722. This is obtained from the standard normal distribution using the ppf (percent point function) method at q = 0.95. Note that 1.645 is the critical value for a one-sided 95% interval (equivalently, a two-sided 90% interval); a two-sided 95% interval uses ppf(0.975) ≈ 1.96.

A confidence interval of 95% would mean that if we take many samples and create confidence intervals for each of them, 95% of our samples’ confidence intervals will contain the true population mean.
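The calculation can be sketched with the standard library alone; the 40-day sample below is synthetic. `NormalDist().inv_cdf(0.975)` gives the two-sided 95% critical value (≈1.96), whereas `inv_cdf(0.95)` would give the one-sided value 1.645 quoted above.

```python
import math
import random
from statistics import NormalDist, fmean, stdev

random.seed(3)

# Hypothetical sample: pepperoni orders on 40 days.
sample = [random.gauss(50, 8) for _ in range(40)]

conf = 0.95
z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided: ppf(0.975) ~ 1.96
margin = z_crit * stdev(sample) / math.sqrt(len(sample))
mean = fmean(sample)
print(f"{conf:.0%} CI: ({mean - margin:.2f}, {mean + margin:.2f})")
```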

It is easily visible that 95% of the time the blue lines(the sample mean) overlap with the red line(the true mean), also 5% of the time it is expected to not overlap with the red line(the true mean).

The plot shows that the sample means for each trial tend to cluster around the true population mean, and the confidence intervals become narrower as the number of trials increases. This is consistent with the Central Limit Theorem, which states that as the sample size increases, the distribution of sample means approaches a normal distribution with a mean equal to the true population mean and a standard deviation equal to the standard deviation of the population divided by the square root of the sample size.

Hypothesis Testing

Hypothesis testing, also known as confirmatory data analysis, is a test performed on data observed from a process that is described by a set of random variables. A statistical hypothesis test is a statistical inference procedure.

NULL Hypothesis

The Null Hypothesis (𝐻0) is a generic statement or default stance in inferential statistics that there is no link between two measurable phenomena or no correlation among groups.

Statistical hypothesis tests are based on a statement known as the null hypothesis, which asserts that nothing meaningful is happening between the variables being tested.

Therefore, in our case, the Null Hypothesis would be: The frequency of Pepperoni topping is not different from the frequency of other toppings.

Alternate Hypothesis

The alternate hypothesis (𝐻1) is just an alternative to the null hypothesis.

The Null Hypothesis is assumed to be true, and statistical proof is required to reject it in favor of an Alternative Hypothesis.

Type 1 and Type 2 Error

In statistical hypothesis testing, a type I error is the rejection of a true null hypothesis, while a type II error is the failure to reject a false null hypothesis.

P-Value

The p-value, also known as the probability value, is the likelihood of obtaining test results at least as extreme as the results actually observed, assuming the null hypothesis is true.

Assume we’ve set a significance level of α = 0.05.

This indicates that if the p-value is less than 0.05, we reject our Null hypothesis and accept the Alternative as true.
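The original notebook's pizza-order data is not reproduced here, so the sketch below uses invented numbers; the z-statistic and p-value quoted in the next paragraph come from the notebook's own data.

```python
import math
from statistics import NormalDist

def one_sample_z_test(sample_mean, pop_mean, pop_sd, n):
    """Two-sided one-sample z-test; returns (z statistic, p-value)."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical numbers: 30 days of pepperoni orders averaging 52/day,
# against an overall per-topping mean of 50 with known sd 15.
z, p = one_sample_z_test(52, 50, 15, 30)
print(f"z = {z:.4f}, p = {p:.4f}")
```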

Here, the z-statistic is 0.7767, which indicates that the sample mean of the Pepperoni pizza orders is 0.7767 standard deviations above the population mean. The p-value is 0.43731, which is greater than the typical significance level of 0.05. This means that we fail to reject the null hypothesis and conclude that there is insufficient evidence to suggest that the mean order frequency of Pepperoni pizza is significantly different from the overall mean order frequency of all pizza toppings.

Gosset’s / Student’s t-test

The t-test is a statistical test used to determine whether a numerical data sample differs significantly from the population or whether two samples differ from one another.

Unlike the z-test, the t-test does not require the population standard deviation to be known, which allows us to conduct a hypothesis test on a smaller sample.
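A one-sample t-test can be sketched with the standard library by computing the t statistic and comparing it against a t-table critical value; the ten daily counts below are invented.

```python
import math
from statistics import fmean, stdev

# Hypothetical small sample (n = 10) of daily pepperoni orders; H0: mean = 50.
sample = [48, 53, 51, 49, 55, 47, 52, 50, 54, 46]
mu0 = 50

n = len(sample)
t = (fmean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

# Two-sided critical value for df = 9 at alpha = 0.05, from a t-table.
t_crit = 2.262
print(f"t = {t:.3f}; reject H0: {abs(t) > t_crit}")
```

In practice `scipy.stats.ttest_1samp` would return the exact p-value; the notebook's own result is discussed below.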

Since the p-value is greater than the common significance level of 0.05, we fail to reject the null hypothesis

Chi-Square Test

The Chi-Square test is a statistical hypothesis test that is used to determine whether there is a significant association between two categorical variables. It is used to test the independence of two variables.

The Chi-Square test requires the following assumptions:

  • The data are categorical.
  • The observations are independent.
  • The expected frequency for each category is at least 5.

The Chi-Squared goodness-of-fit test is a statistical test used to determine whether a sample of categorical data follows a specified distribution. In other words, it helps us to determine whether the observed frequencies of a categorical variable are significantly different from the expected frequencies.

The formula for the Chi-Squared test:

χ² = Σ (O − E)² / E

where,

O = each observed value,

E = each expected value.
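The formula translates directly into code. The topping counts below are invented, testing the null hypothesis that all five toppings are equally popular:

```python
# Hypothetical topping counts from 100 orders vs. a uniform expectation.
observed = [30, 22, 18, 16, 14]   # pepperoni, mushrooms, sausage, onions, olives
expected = [20, 20, 20, 20, 20]   # equal popularity under the null hypothesis

# Chi-squared statistic: sum of (O - E)^2 / E over all categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value for df = 4 at alpha = 0.05, from a chi-squared table.
chi2_crit = 9.488
print(f"chi2 = {chi2:.2f}; reject H0: {chi2 > chi2_crit}")
```

`scipy.stats.chisquare` performs the same computation and also returns the p-value; the notebook's own result is discussed next.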

In this case, the chi-squared test statistic is 180.65, and the p-value is 0.472, which is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis and conclude that there is no significant association between the price of a pizza and its frequency of purchase.


MIT License

All code in this notebook is available as open source through the MIT license.

All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/

These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.

Copyright 2023 AI Skunks https://github.com/aiskunks

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
