Parametric vs. Non-Parametric Test: Which One to Use for Hypothesis Testing?
If you are studying statistics, you will frequently come across two terms — parametric and non-parametric tests. These terms are essential for anyone looking to pursue statistics and data science. However, few learners appreciate their full weight, especially when building a holistic understanding of statistics and its implementation in data science.
Parametric and non-parametric tests are the two main ways of classifying statistical tests. The exciting and complicated aspect of this classification, particularly regarding non-parametric tests, is that there is no single, agreed-upon definition of what constitutes a non-parametric test.
This complicates understanding the differences between the two terms and calls for a more nuanced approach.
One common approach is to use examples of parametric tests and then discuss their non-parametric counterparts. This is one of the best methods for understanding the differences. In this article, we will take this approach to understand the topic at hand.
Before discussing the differences between parametric and non-parametric tests, you must first understand the definition of a parametric test and the properties of such tests that researchers and analysts typically agree upon.
What are Parametric Tests?
Parametric tests are the backbone of statistics and are an inseparable aspect of data science. This is simply because you must know about specific parametric tests to interpret many models, especially the predictive models that employ statistical algorithms such as linear regression and logistic regression.
However, to fully grasp what a parametric test is, you need several statistical concepts at your fingertips. Before proceeding, let's brush up on these concepts.
Also Read: Basic Statistics Concepts for Data Science
1) Population
Population refers to all individuals or subjects of interest that you want to study. Typically, in statistics, you can never fully collect information on the population because-
- Either — the population is too large, causing accessibility issues. For example, suppose you want to know the income of all working Indians. In that case, asking about the income of millions of individuals across the organized and unorganized sectors is practically impossible.
- Or — the volume and velocity of the population data are too high, causing hardware issues (limited memory) that make such data difficult to process. For example, if you want to understand the spending patterns of a major bank's customers, the sheer number of transactions happening at any given moment can run into millions. Analyzing even a month's data can be so computationally expensive that using the whole dataset is impossible.
2) Parameter
To answer any question about a population, you need numerical measures that quantify it. Key quantification measures include the mean, standard deviation, median, minimum, maximum, inter-quartile range, etc. These values that describe the population are known as 'parameters'.
3) Sample
As mentioned earlier, collecting complete data on the population in question can be difficult for various reasons. However, to answer many questions, you still need to understand the population. This is where samples come in handy.
Samples are subsets of a population drawn to represent it; results such as the central limit theorem are what allow conclusions from a sample to be generalized to the population.
4) Central Limit Theorem
To put it roughly, the Central Limit Theorem (CLT) states:
If you draw a sufficiently large sample (theoretically, 'large' means a sample size of more than 30), the mean of that sample will approximate the mean of the population, and the means of repeated samples will center on the population mean.
Another aspect is that the sampling distribution of the sample mean will be approximately normal (Gaussian) even if the population's distribution is not normal.
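To see the CLT in action, here is a minimal Python sketch (assuming NumPy is available; the population values are synthetic and purely illustrative). It draws repeated samples from a deliberately skewed population and shows that the sample means cluster around the population mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal population: exponential (right-skewed)
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print(f"Population mean:      {population.mean():.3f}")
print(f"Mean of sample means: {np.mean(sample_means):.3f}")  # close to the population mean

# A histogram of sample_means would look approximately bell-shaped (normal),
# even though the population itself is heavily skewed.
```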
5) Distribution
Distribution (commonly called data distribution) is a function that states all the possible values in a dataset along with the frequency (count) of each value (or interval, as values can be binned into groups).
The distribution is often represented using graphs like a histogram and a line chart. Different distributions have peculiar shapes and specific properties that help calculate probabilities.
These probabilities typically indicate the likelihood of a value occurring in the data, which can then be extrapolated to form a larger opinion regarding the sample space and the population from which it has been drawn.
6) Types of Distribution
Distributions can be symmetric or asymmetric.
- Symmetrical distributions are those in which the area under the curve to the left of the central point is the same as to the right. In terms of shape, there is no skewness: the right side of the distribution mirrors the left side. Common examples include the Gaussian, Cauchy, Logistic, and Uniform distributions.
- Asymmetric distributions are skewed, either positively or negatively. A common example is the Log-normal distribution.
7) Gaussian Distribution and the 3-Sigma Rule
The CLT causes the mean of a large sample to follow a normal, also known as Gaussian, distribution. This is a symmetric distribution with a bell-shaped curve in which the mean, median, and mode coincide.
Specific distributions have specific properties. One property of the normal distribution is the three-sigma rule regarding the area under the curve (AUC), which states that roughly 68% of values lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three.
This concept is then expanded to calculate the probability of a value occurring in this distribution, which leads to hypothesis tests like the z-test.
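You can verify the three-sigma rule numerically. Here is a short sketch, assuming SciPy is available:

```python
from scipy.stats import norm

# Probability mass of the standard normal within k standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} sigma: {p:.4f}")

# Output: 0.6827, 0.9545, 0.9973 — the familiar 68-95-99.7 rule
```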
8) Hypothesis Testing
Hypothesis testing is an essential aspect of inferential statistics. As the name suggests, it checks whether a hypothesis made about the population is supported by the data.
This is often done by calculating the probability of observing the sample's value, given the variability (standard deviation) in the data. Such tests help validate whether the statistics found through the sample can be extrapolated to form a particular opinion about the population.
9) Statistic
Certain arithmetic values that help define the population are known as parameters. However, as you often use samples, these values are known as statistics when calculated using a sample.
For example, if you know the income of all the Indians and you calculate the mean income from this population data, then this value will be a parameter.
However, when calculated using a sample of this population, the mean is known as a statistic.
To make sure the sample’s mean is truly indicative of the population mean and is not due to random chance, you use the concept of hypothesis testing.
With the crucial concepts laid down, you can now finally answer the question: what is a parametric test? Let's get started.
Parametric Test: Definition
In statistics, a parametric test is a subtype of the hypothesis test. Parametric hypothesis testing is the most common type of testing used to understand the characteristics of a population from a sample.
While the many types of parametric tests differ in their details, a few properties are shared across all of them, and these shared properties are what make them 'parametric tests'. These properties include-
- When using such tests, there needs to be a deep or proper understanding of the population.
- An extension of the above point is that to use such tests, several assumptions regarding the population must be fulfilled (hence a proper understanding of the population is required). A common assumption is that the population should be normally distributed (at least approximately).
- The outputs from such tests cannot be relied upon if the assumptions regarding the population deviate significantly.
- A large sample size is required to run such tests. Theoretically, the sample size should be more than 30 so that the central limit theorem can apply, making the sampling distribution of the mean approximately normal.
- Such tests are more powerful than their non-parametric counterparts for the same sample size.
- These tests are only helpful with continuous/quantitative variables.
- The mean is typically used to measure the central tendency (i.e., the central value of data).
- The output from such tests is easy to interpret; however, it can be challenging to understand their workings.
Now, with an understanding of the properties of parametric tests, let’s now understand what non-parametric tests are all about.
What are Non-Parametric Tests?
Let’s consider a situation.
A problem can be solved using a parametric hypothesis test. However, you cannot fulfill the assumptions necessary to use the test. The unmet assumption might, for example, concern the sample size, and there is nothing much you can do about it.
Does this mean you can’t do any inferential analysis using the data? The answer is NO.
In hypothesis testing, the other type apart from parametric is non-parametric. Typically, every parametric test has a non-parametric cousin that can be used when the assumptions cannot be fulfilled.
Non-parametric tests do not need a lot of assumptions regarding the population and are less stringent when it comes to the sample requirements.
However, they are less powerful than their parametric counterparts.
This means that a non-parametric test is less likely to conclude that two attributes are associated with each other even when they, in fact, are. To compensate for this 'lower power', you need to increase the sample size to obtain the result the parametric counterpart would have provided.
Another peculiar aspect of non-parametric tests is that they can also be used with discrete variables (i.e., categorical variables). This is because non-parametric tests work on the ranks of values instead of the original data.
While it helps solve certain problems, it is often difficult to interpret the results.
To put this in context, a parametric test can tell that the blood sugar of patients using the new variant of a drug (to control diabetes) is 40 mg/dL lower than that of those patients who used the previous version.
This interpretation is useful, as it can help us form an intuitive understanding of what is happening in the population.
On the other hand, because its non-parametric counterpart uses rankings, it will report a figure such as 40 as the difference in the mean ranks of the two groups of patients. This is less intuitive and less helpful in forming a definite opinion regarding the population.
To conclude:
While nonparametric tests have the advantage of providing an alternative when you cannot fulfill the assumptions required to run a parametric test or solve an unconventional problem, they have limitations in terms of capability and interpretability.
Now, to gain a practical understanding, let’s explore different types of parametric and non-parametric tests.
Parametric Tests for Hypothesis Testing
To understand the role of parametric tests in statistics, let’s explore various parametric test types. The parametric test examples discussed ahead all solve one of the following problems-
- Using standard deviation, find the confidence interval regarding the population
- Compare the mean of the sample with a hypothesized value (that refers to the population mean in some cases)
- Compare two quantitative measurements (typically means) taken from the same subjects
- Compare two quantitative measurements (typically means) taken from two or more distinct groups
- Understand the association level between two numerical attributes, i.e., quantitative attributes.
Parametric Hypothesis Testing: Types
- Z-Test
When you need to compare the sample’s mean with a hypothesized value (which often refers to the population mean), a one-sample z-test is used.
The test has two major requirements: the sample size should be more than 30, and the population's standard deviation should be known. A minimal example follows.
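Here is a rough sketch of a one-sample z-test, assuming the statsmodels library is available (the sample data is synthetic and purely illustrative):

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=10, size=40)  # n > 30, as the z-test requires

# H0: the population mean equals 50
z_stat, p_value = ztest(sample, value=50)
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")  # small p suggests rejecting H0
```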
- One Sample t-Test
If either of the requirements mentioned above cannot be met, you can use another type of parametric test, the one-sample t-test.
Here, if the sample size is at least 15 and only the sample's standard deviation is known (rather than the population's), you can use this test. The sample's distribution should also be approximately normal.
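A minimal one-sample t-test sketch using SciPy, with illustrative scores, might look like this:

```python
from scipy.stats import ttest_1samp

# Hypothetical measurements (n = 16, above the rough minimum of 15)
scores = [48, 52, 55, 49, 51, 53, 47, 50, 54, 52, 49, 51, 50, 53, 48, 52]

# H0: the population mean equals 50
t_stat, p_value = ttest_1samp(scores, popmean=50)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```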
- Paired (dependent) t-Test
A paired t-test is used when data from the same subject is collected, typically before and after an event—for example, the weight of a group of 10 sportsmen before and after a diet program.
Here, you can use the paired t-test to compare the means of the before and after measurements. The assumptions include the subjects being independent of one another, the before and after values belonging to the same subjects, and the differences between the pairs being approximately normally distributed.
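Using the sportsmen example above, a paired t-test sketch in SciPy (with hypothetical weights) could look like:

```python
from scipy.stats import ttest_rel

# Weights (kg) of 10 sportsmen before and after a diet program (illustrative values)
before = [82, 90, 77, 85, 88, 79, 92, 84, 80, 87]
after  = [80, 87, 76, 83, 85, 78, 89, 82, 79, 84]

# H0: the mean of the paired differences is zero
t_stat, p_value = ttest_rel(before, after)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```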
- Two Sampled (Independent) t-Test
In situations where there are two separate samples, for example, the house prices in Mumbai v/s house prices in Delhi, and you have to check if the mean of both these samples is statistically significantly different, then a two-sampled t-test can be used.
It assumes that each sample's distribution is roughly normal, the values are continuous, the variance is equal in both samples, and the samples are independent of each other.
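A sketch of the Mumbai vs. Delhi comparison using SciPy, with hypothetical prices:

```python
from scipy.stats import ttest_ind

# Hypothetical house prices (in lakhs) for the two cities
mumbai_prices = [95, 120, 110, 135, 98, 105, 140, 125]
delhi_prices  = [88, 102, 99, 115, 92, 97, 120, 108]

# equal_var=True matches the equal-variance assumption stated above;
# use equal_var=False (Welch's t-test) when variances clearly differ.
t_stat, p_value = ttest_ind(mumbai_prices, delhi_prices, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```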
- One-way Analysis of Variance
An extension of the two-sample t-test is one-way ANOVA, which compares more than two groups. If someone asks whether ANOVA is a parametric test, the answer is a definitive yes.
ANOVA analyses the variance of the groups and requires the population distribution to be normal, variance to be homogeneous, and groups to be independent
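A minimal one-way ANOVA sketch in SciPy, using three illustrative groups:

```python
from scipy.stats import f_oneway

# Hypothetical measurements from three independent groups
group_a = [23, 25, 21, 24, 26]
group_b = [30, 28, 31, 27, 29]
group_c = [22, 24, 23, 25, 21]

# H0: all group means are equal
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```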
- Pearson’s Coefficient of Correlation
To understand the association between two continuous numeric variables, you can use Pearson's correlation coefficient.
It produces an 'r' value, where values close to -1 and 1 indicate a strong negative and positive correlation, respectively.
A value close to 0 indicates no meaningful correlation between the variables. Among its assumptions is that both variables in question should be continuous.
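A short SciPy sketch with hypothetical data, such as study hours vs. exam scores:

```python
from scipy.stats import pearsonr

# Illustrative paired observations
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8]
exam_scores   = [52, 55, 61, 64, 70, 72, 78, 83]

r, p_value = pearsonr(hours_studied, exam_scores)
print(f"r = {r:.3f}, p = {p_value:.4f}")  # r near 1 indicates a strong positive correlation
```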
Non-Parametric Tests for Hypothesis Testing
In the above section, we discussed several parametric tests that can solve different inferential problems in statistics. All those tests, however, are parametric and carry stringent assumptions that you may or may not be able to fulfill. This is where non-parametric tests are helpful. Common types of non-parametric tests include-
- Wilcoxon signed-rank test
It is used as an alternative to the one-sample t-test
- Mann-Whitney U-test / Wilcoxon rank-sum test
They can be used as an alternative to the two-sample t-test
- Kruskal-Wallis test
It is an alternative to the parametric test — one-way ANOVA
- Spearman’s rank correlation
You can use this test as an alternative to Pearson's correlation coefficient. It is especially useful when the data is not continuous but in the form of ranks (ordinal data)
- Signed-rank test
The Wilcoxon signed-rank test, applied to the paired differences, serves as an alternative to the parametric paired t-test
There are non-parametric alternatives to all the parametric tests. So, if you cannot fulfill the assumptions of a parametric test, you can use its non-parametric counterpart, as sketched below.
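SciPy exposes all of these non-parametric counterparts. Here is a minimal sketch, reusing the illustrative data from the earlier examples, showing how each is called:

```python
from scipy.stats import wilcoxon, mannwhitneyu, kruskal, spearmanr

# Illustrative data (same shape as in the parametric examples above)
before = [82, 90, 77, 85, 88, 79, 92, 84, 80, 87]
after  = [80, 87, 76, 83, 85, 78, 89, 82, 79, 84]
group_a = [23, 25, 21, 24, 26]
group_b = [30, 28, 31, 27, 29]
group_c = [22, 24, 23, 25, 21]

# Wilcoxon signed-rank: alternative to the paired (or one-sample) t-test
print(wilcoxon(before, after))

# Mann-Whitney U: alternative to the two-sample t-test
print(mannwhitneyu(group_a, group_b))

# Kruskal-Wallis: alternative to one-way ANOVA
print(kruskal(group_a, group_b, group_c))

# Spearman's rank correlation: alternative to Pearson's r
print(spearmanr(before, after))
```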
Parametric vs. Non-Parametric Test
Having explored parametric and non-parametric tests, it's time to summarize their differences. The following table can help you understand when and where to use parametric tests or their non-parametric counterparts, along with their advantages and disadvantages.

| Aspect | Parametric Tests | Non-Parametric Tests |
| --- | --- | --- |
| Assumptions about the population | Many (e.g., approximately normal distribution) | Few |
| Sample size | Large (typically more than 30) | Can work with smaller samples |
| Variable types | Continuous/quantitative | Continuous, ordinal, or categorical |
| Measure of central tendency | Mean | Median / mean ranks |
| Statistical power | Higher for the same sample size | Lower; needs a larger sample to compensate |
| Interpretability | Outputs are easy to interpret | Rank-based outputs are harder to interpret |
Now that you have a better understanding of the differences between parametric and non-parametric tests, you can use the type of test that suits your needs and provides the best results.
FAQs
- What are parametric and nonparametric test examples?
Almost every inference problem can be addressed with either a parametric or a nonparametric test. When comparing a sample mean with a hypothesized value, one can use a parametric test such as a z-test or a one-sample t-test (if the sample size is less than 30), or a nonparametric test such as the Wilcoxon signed-rank test.
If there is a need to compare the mean of two independent samples, then the parametric two-sample t-test can be used, or the non-parametric Wilcoxon rank-sum test or Mann-Whitney U-test can be used. Similarly, for almost every other parametric test, a non-parametric test can be used if the assumptions for the parametric test are not fulfilled.
- What are the four non-parametric tests?
While there are several non-parametric tests, the four most common ones are the two-sample Kolmogorov-Smirnov test, the Wilcoxon signed rank test, the Mann-Whitney U-test, and Spearman’s rank correlation.
- Is ANOVA a parametric test?
Is ANOVA a parametric test — this is a pretty commonly asked question. ANOVA stands for Analysis of Variance. As the name suggests, it’s a type of hypothesis test that analyses the variance to compare samples/groups.
There are many types of ANOVA tests, such as one-way, two-way, repeated measures, and MANOVA. All these tests are parametric as they require (as part of the assumptions) that the population be normally distributed, variables be independent and random, and sample variance be homogeneous.
- Is the chi-square test a parametric test?
The chi-square test, whose common forms include the goodness-of-fit test and the test of independence, is typically used to test the independence of two categorical variables. You must remember that chi-square is a non-parametric test. There are several reasons for this: the variables are categorical (discrete), the groups being tested can be unequal (whereas parametric tests require the groups to be roughly equal), and the test makes no homoscedasticity assumption. A minimal example follows.
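A minimal chi-square test of independence sketch in SciPy, using a hypothetical contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: e.g., gender vs. product preference
observed = np.array([[30, 10],
                     [20, 40]])

# H0: the two categorical variables are independent
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}, dof = {dof}")
```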
We hope this article has helped you understand what parametric and non-parametric tests are all about, when to use and when not to use them, and their advantages and disadvantages.
Try out the tests mentioned in the article to gain a better understanding. You can leverage languages like R and Python and statistical software like SAS and SPSS. If you have any suggestions or feedback, please get back to us.
Additional Resources to Read:
- Confusion Matrix in Machine Learning: How it helps in solving Classification Issues
- Time Series Analysis and Forecasting for Data Analysis and Prediction
- Anomaly Detection: Definition and Techniques
- Descriptive vs. Inferential Statistics
Published originally on AnalytixLabs Blog.