Why does variance matter? 🤔

Sushil Deore · Analytics Vidhya · Feb 1, 2021 · 5 min read

Variance is important for two main reasons:

  1. Parametric statistical tests are sensitive to variance, so their validity depends on it.
  2. Comparing the variances of samples lets you assess whether the populations they come from differ from each other.

Wait, you are using words we don’t know!!! 😒

Ok. I will start from the beginning. 🙂

We have all heard of “Hypothesis Testing”.

A Hypothesis is an educated guess about something in the world around you.

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results.

You’re basically testing whether your results are valid by figuring out the odds that your results happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and is of little use.

So, we have a null hypothesis, a prediction of no relationship between the variables you are interested in, and an alternative hypothesis, usually your initial hypothesis, which predicts a relationship between the variables.

We use a statistical test in hypothesis testing to determine whether a predictor has a statistically significant relationship with an outcome variable, or to estimate the difference between two or more groups.

Statistical tests work by calculating a test statistic: a number that describes how much the relationship between the variables in your test differs from the null hypothesis of no relationship.

For any combination of sample size and number of predictor variables, a statistical test produces a predicted distribution for the test statistic. This shows the most likely range of values the test statistic will take if your data follow the null hypothesis of the statistical test, as the sketch below illustrates.
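
Here is a minimal sketch of that idea in Python (assuming NumPy and SciPy are installed; the data are made up): simulating the null distribution of the two-sample t statistic by repeatedly drawing both groups from the same population.

```python
# A toy simulation: draw both groups from the SAME population 5,000 times
# and collect the t statistic each time. This traces out the null
# distribution of the test statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

null_stats = [
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).statistic
    for _ in range(5000)
]

# About 95% of the null t statistics fall inside this range; an observed
# statistic far outside it would be unlikely under the null hypothesis.
print(np.percentile(null_stats, [2.5, 97.5]))
```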

You can perform statistical tests on data that have been collected in a statistically valid manner — either through an experiment or through observations made using probability sampling methods.

To determine which statistical test to use, we need to know whether our data meet certain assumptions and what types of variables we are handling.

Assumptions:

  1. Independence of observations i.e. no autocorrelation
  2. Homogeneity of variance: the variance within each group being compared is similar among all groups.
  3. Normality of data: the data follows a normal distribution.

These assumptions apply to parametric statistical tests. If the data fail to meet them, we can apply nonparametric statistical tests instead. A quick way to check the normality and homogeneity assumptions is sketched below.
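
A minimal sketch of those checks (assuming SciPy; the groups are hypothetical): the Shapiro-Wilk test for normality and Levene’s test for homogeneity of variance.

```python
# Hypothetical groups to illustrate two assumption checks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(10, 2, 50)
group_b = rng.normal(12, 2, 50)

# Shapiro-Wilk: a large p-value means normality is plausible.
print(stats.shapiro(group_a))
# Levene: a large p-value means the group variances look similar.
print(stats.levene(group_a, group_b))
```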

Types of variables:

  1. Quantitative variables represent amounts of things. They include continuous and discrete variables.
  2. Categorical variables represent groupings of things. They include ordinal, nominal, and binary variables.

That’s a lot of statistics! From the reasons variance matters, we only wanted to learn about parametric and non-parametric tests 🙄

Ohh, that’s ok, because the information above is required for a basic understanding of these tests. 😎

Parametric tests:

A parametric test is a statistical test that makes certain assumptions about the distribution of the unknown parameter of interest; the test statistic is valid only under these assumptions.

  1. Regression tests: used to test cause-and-effect relationships. They look for the effect of one or more continuous variables on another variable.

  2. Comparison tests: look for differences among group means. They can be used to test the effect of a categorical variable on the mean value.

  3. Correlation tests: check whether two variables are related without assuming a cause-and-effect relationship (one example of each family is sketched after this list).
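
A minimal sketch with one example from each family (assuming SciPy; all data are hypothetical): linear regression, a two-sample t-test, and Pearson correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 40)
y = 2 * x + rng.normal(0, 1, 40)              # y depends on x
group_a = rng.normal(5, 1, 40)
group_b = rng.normal(5.5, 1, 40)

print(stats.linregress(x, y))             # regression: effect of x on y
print(stats.ttest_ind(group_a, group_b))  # comparison: difference in means
print(stats.pearsonr(x, y))               # correlation: are x and y related?
```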

Non-parametric tests:

Nonparametric statistics is the branch of statistics that is not based solely on parametrized families of probability distributions. Nonparametric tests make fewer assumptions about the data, so they can be used when the assumptions above are violated.
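
For example (a minimal sketch, assuming SciPy, with hypothetical skewed data), the Mann-Whitney U test is a common nonparametric alternative to the two-sample t-test and does not assume normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.exponential(1.0, 40)   # skewed, clearly non-normal data
group_b = rng.exponential(1.5, 40)

# Compares the two groups without assuming a normal distribution.
print(stats.mannwhitneyu(group_a, group_b))
```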

Now, we will talk only about variance. 😉

The variance is a measure of variability. It is calculated by taking the average of squared deviations from the mean.

Variance tells you the degree of spread in the data set. The more spread the data, the larger the variance is in relation to the mean.

The standard deviation is derived from variance and tells you, on average, how far each value lies from the mean. It’s the square root of the variance.
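
A minimal sketch of both definitions (assuming NumPy; the numbers are made up): variance as the average squared deviation from the mean, and the standard deviation as its square root.

```python
import numpy as np

data = np.array([46, 69, 32, 60, 52, 41])     # hypothetical values

variance = np.mean((data - data.mean()) ** 2)  # average squared deviation
print(variance)                                # same as np.var(data)
print(np.sqrt(variance))                       # same as np.std(data)
```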

Population variance:

When you have collected data from every member of a population, you can compute the population variance:

σ² = Σ(X − μ)² / N

where σ² = population variance, X = each value, μ = population mean, and N = number of values in the population.

Sample variance:

When you collect data from a sample of a population, the sample variance is used to estimate the population variance:

s² = Σ(X − x̄)² / (n − 1)

where s² = sample variance, X = each value, x̄ = sample mean, and n = number of values in the sample. Dividing by n − 1 rather than n corrects for the fact that the mean is estimated from the same sample.
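
In NumPy, the ddof argument switches between the two formulas (a minimal sketch with made-up numbers): ddof=0 divides by N (population), ddof=1 divides by n − 1 (sample).

```python
import numpy as np

data = np.array([46, 69, 32, 60, 52, 41])   # hypothetical sample

print(np.var(data, ddof=0))   # population formula: sum((X - mean)^2) / N
print(np.var(data, ddof=1))   # sample formula: sum((X - mean)^2) / (n - 1)
```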

So, uneven variances between samples result in biased and skewed test results. That’s why we need homogeneity, i.e. similar variances, when comparing samples.
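
When the variances are unequal, one common workaround is Welch’s t-test, which SciPy exposes via equal_var=False. A minimal sketch with hypothetical groups of very different spread:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
group_a = rng.normal(10, 1, 30)   # small spread
group_b = rng.normal(10, 5, 30)   # large spread: variances are unequal

print(stats.ttest_ind(group_a, group_b))                   # assumes equal variances
print(stats.ttest_ind(group_a, group_b, equal_var=False))  # Welch's t-test
```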

Summary:

In this article, we understood the importance of variance. We also saw the difference between the standard deviation and variance, and touched on population and sample variances. Next time, we will take the help of Python modules to understand these concepts better.

Please feel free to propose improvements 😊

References:

https://www.scribbr.com/statistics/variance

https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/
