ANOVA for statistics in Data science

Learnbay.co — Data Science Training in Bangalore

5 min read · Jun 10, 2020

ANOVA is a type of hypothesis test that evaluates experimental results by analyzing the variance across different groups. It is typically used to decide whether an observed difference in a dataset is real or just due to chance.
Analysis of variance (ANOVA) is a statistical method for finding out whether the means of two or more groups are significantly different from each other. It checks the impact of one or more factors by comparing the means of different samples.
When we have two samples/groups, we use a t-test to compare their means. With more than two samples, running repeated pairwise t-tests inflates the chance of a false positive, so we use ANOVA instead.

Why do we use ANOVA testing?
In machine learning, one of the biggest problems is selecting the best features or attributes for training the model. We only want features that are strongly related to the response variable, so that the model is able to predict the actual outcome after training. ANOVA is useful for this when one of the two variables being compared is continuous and the other is categorical.

For example, we set up an experiment with three groups of people: the first group drinks water, the second drinks a sugary juice, and the third drinks coffee or tea. We then test everyone's reaction time and want to know whether there is any difference between the groups.

The null hypothesis states that all three groups have the same mean reaction time. Because we have three groups, we apply ANOVA; with only two groups we could use a t-test. When we run the experiment, the observed group results will not be exactly the same, and the question is whether they differ by more than chance alone would explain.

The total variance of all these scores is made up of two parts:

The variance within the groups: people in the same group still have different reaction times.
The variance between the groups: each group gets a different drink.
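
The split into these two parts can be checked numerically. The sketch below uses made-up reaction times in milliseconds (the numbers and variable names are illustrative assumptions, not data from the experiment) and measures each part as a sum of squared deviations.

```python
# Hypothetical reaction times in milliseconds (illustrative values only).
groups = {
    "water": [310, 295, 320, 305, 300],
    "juice": [330, 345, 325, 340, 335],
    "coffee": [280, 270, 290, 285, 275],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Total variation: every score compared with the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

# Within-group variation: every score compared with its own group mean.
ss_within = sum(
    (x - sum(scores) / len(scores)) ** 2
    for scores in groups.values()
    for x in scores
)

# Between-group variation: every group mean compared with the grand mean,
# weighted by the number of people in the group.
ss_between = sum(
    len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
    for scores in groups.values()
)

print("total:", ss_total)
print("within + between:", ss_within + ss_between)
```

The two printed values agree, which is the sense in which the total variation is "made up of" the within-group and between-group parts.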

Example one:

In the first example there is a lot of variation within each sample/group: some people are faster and some are slower. The groups themselves, however, are quite close to one another, so there is not much variation between the groups. The differences come from the people, not from the type of drink. In this case we fail to reject the null hypothesis, because the type of drink does not appear to affect reaction time.

Example two:

In the second example there is not much variation within the groups, but there is a lot of difference between the groups. Individual differences between people account for little of the variation, while the type of drink accounts for most of it, so here we reject the null hypothesis.

The examples above used the term hypothesis. Before going further, here are the key terms that ANOVA relies on.

Mean:

ANOVA uses two kinds of mean:

The mean of each sample
The grand mean, which is the mean of all the observations combined
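
As a small illustration, the sketch below computes both kinds of mean for hypothetical groups of unequal size (the data and names are assumptions made for the example). With unequal group sizes, the grand mean is generally not the same as the average of the group means.

```python
from statistics import fmean

# Hypothetical samples of different sizes (illustrative values only).
samples = {
    "water": [310, 295, 320, 305],
    "juice": [330, 345, 325, 340, 335, 331],
    "coffee": [280, 270, 290],
}

# Mean of each sample.
group_means = {name: fmean(values) for name, values in samples.items()}

# Grand mean: pool every observation, then average once.
grand_mean = fmean(x for values in samples.values() for x in values)

# Averaging the group means ignores group size, so it can differ.
mean_of_means = fmean(group_means.values())

print(group_means)
print("grand mean:", grand_mean)
print("mean of group means:", mean_of_means)
```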

Hypothesis testing:

Hypothesis testing is a statistical procedure used to evaluate assumptions about population parameters. There are two types of hypothesis in hypothesis testing:

Null hypothesis
Alternative hypothesis

The hypotheses in ANOVA are

H0: μ1 = μ2 = … = μk
H1: The means are not all equal.

where k is the number of independent comparison groups and μi is the mean of group i.

Types of ANOVA

One-way ANOVA:

The one-way ANOVA is used to find out whether there is a statistically significant difference among the means of two or more independent groups.

More specifically, it tests the null hypothesis that all group means are equal.

In one-way ANOVA, µ denotes a group mean and k is the number of groups. If the one-way ANOVA returns a significant result, we reject the null hypothesis in favor of the alternative hypothesis, which means that at least two of the group means are not equal.

Two-way ANOVA:

A two-way ANOVA is used to determine the effect of two nominal predictor features on a continuous outcome feature. It tests the effect of each of the two independent variables on the outcome, and it can also test whether the two variables interact.
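
For a balanced design, the two-way decomposition can be sketched as follows. The notation is an assumption introduced here: A and B are the two categorical factors with a and b levels, AB is their interaction, n is the number of observations per cell, and E is the residual (within-cell) error.

```latex
SS_{T} = SS_{A} + SS_{B} + SS_{AB} + SS_{E}

F_{A}  = \frac{SS_{A} / (a-1)}{SS_{E} / \bigl(ab(n-1)\bigr)}, \qquad
F_{B}  = \frac{SS_{B} / (b-1)}{SS_{E} / \bigl(ab(n-1)\bigr)}, \qquad
F_{AB} = \frac{SS_{AB} / \bigl((a-1)(b-1)\bigr)}{SS_{E} / \bigl(ab(n-1)\bigr)}
```

Each F-value compares the variation explained by one source (a factor or the interaction) with the residual variation.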

F-value for ANOVA:

The F-value of ANOVA helps determine whether the variation between the sample means is significantly larger than the variation within the samples. It is the ratio of the between-group variance to the within-group variance. It is also used to find the p-value, which is the probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true.

The formula for f-value:
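
In its standard one-way form, the ratio can be written as below. The symbols are assumptions introduced here: k is the number of groups, n_i the size of group i, N the total number of observations, \bar{x}_i the mean of group i, and \bar{x} the grand mean.

```latex
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}
  = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}

SS_{\text{between}} = \sum_{i=1}^{k} n_i \, (\bar{x}_i - \bar{x})^2, \qquad
SS_{\text{within}}  = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
```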

Python code for f-value of ANOVA
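
A minimal sketch of the one-way F-value in plain Python follows. The function name `f_value` and the two data sets are assumptions made for illustration; the data are chosen to mimic the two drink examples (wide spread within similar groups, then tight groups that sit far apart).

```python
def f_value(*groups):
    """Return the one-way ANOVA F-value for two or more samples."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: group means vs. the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: observations vs. their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    ms_between = ss_between / (k - 1)        # between-group mean square
    ms_within = ss_within / (n_total - k)    # within-group mean square
    return ms_between / ms_within


# Hypothetical data resembling example one: similar means, wide spread.
f_one = f_value(
    [250, 300, 350, 280, 320],
    [260, 310, 340, 290, 305],
    [255, 295, 345, 285, 315],
)

# Hypothetical data resembling example two: tight groups, distant means.
f_two = f_value(
    [300, 302, 298, 301, 299],
    [340, 342, 338, 341, 339],
    [260, 262, 258, 261, 259],
)

print("example one F:", f_one)
print("example two F:", f_two)
```

The first F-value comes out well below 1 and the second is very large, matching the conclusions drawn for the two examples. If SciPy is installed, `scipy.stats.f_oneway` computes the same statistic and also returns the p-value.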

Conclusion:

ANOVA is a statistical hypothesis test that checks for a significant difference between two or more means. The F-value compares the variation between the group means with the variation within the groups. A larger F-value is stronger evidence of an effect, while an F-value close to 1 suggests there is no effect. When the F-value is large enough (that is, the p-value is below the chosen significance level), we reject the null hypothesis; otherwise we fail to reject it.

Statistics and statistical modeling play a wide role in data science and machine learning, and data scientists use them in their regular work. Given the career opportunities in data science and data analytics, Learnbay provides an industry-accredited data science course in online training mode. The course is well structured and also covers data analysis and visualization tools such as SQL, MongoDB, and Tableau.
