<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Justin Tennenbaum on Medium]]></title>
        <description><![CDATA[Stories by Justin Tennenbaum on Medium]]></description>
        <link>https://medium.com/@jmtennenbaum?source=rss-6d54f7f47f17------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/2*EbrS5AGmts6cxij4G-oF3Q.jpeg</url>
            <title>Stories by Justin Tennenbaum on Medium</title>
            <link>https://medium.com/@jmtennenbaum?source=rss-6d54f7f47f17------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 02:59:25 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jmtennenbaum/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Predict Hospital Deaths with Biomedical Data and Machine Learning]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.datadriveninvestor.com/predict-hospital-deaths-with-biomedical-data-and-machine-learning-f45d010aebae?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/712/0*iv2H50vGimyp-Gg3.jpg" width="712"></a></p><p class="medium-feed-snippet">One of the best qualities of machine learning is that it applies to almost every type of work imaginable. Healthcare especially has been&#x2026;</p><p class="medium-feed-link"><a href="https://medium.datadriveninvestor.com/predict-hospital-deaths-with-biomedical-data-and-machine-learning-f45d010aebae?source=rss-6d54f7f47f17------2">Continue reading on DataDrivenInvestor »</a></p></div>]]></description>
            <link>https://medium.datadriveninvestor.com/predict-hospital-deaths-with-biomedical-data-and-machine-learning-f45d010aebae?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/f45d010aebae</guid>
            <category><![CDATA[health]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 09 Mar 2020 01:44:35 GMT</pubDate>
            <atom:updated>2020-03-10T13:23:26.693Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Using TensorFlow to Analyze Tweets]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@jmtennenbaum/using-tensorflow-to-analyze-tweets-31ac95bddf9d?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/2028/1*CW5oAv3ERMMJTip4fwh5MA.png" width="2028"></a></p><p class="medium-feed-snippet">I recently decided to participate in a Kaggle competition where the goal is to predict whether or not a tweet is referencing a disaster such&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/@jmtennenbaum/using-tensorflow-to-analyze-tweets-31ac95bddf9d?source=rss-6d54f7f47f17------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@jmtennenbaum/using-tensorflow-to-analyze-tweets-31ac95bddf9d?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/31ac95bddf9d</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[tensorflow]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 24 Feb 2020 01:59:49 GMT</pubDate>
            <atom:updated>2020-02-24T01:59:49.956Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Using Pandas to Impute Features for NLP]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@jmtennenbaum/using-pandas-to-impute-features-for-nlp-942df2b77203?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/2012/1*ZDo8VfKTnuIXv-PkIzqtnA.png" width="2012"></a></p><p class="medium-feed-snippet">I was looking for a new project to work on and decided the best place to start would be to look on Kaggle. I checked out their&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/@jmtennenbaum/using-pandas-to-impute-features-for-nlp-942df2b77203?source=rss-6d54f7f47f17------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@jmtennenbaum/using-pandas-to-impute-features-for-nlp-942df2b77203?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/942df2b77203</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 17 Feb 2020 01:31:14 GMT</pubDate>
            <atom:updated>2020-02-17T01:31:14.583Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Reinforcement Learning: Dynamic Programming]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/reinforcement-learning-dynamic-programming-2b89da6ea1b?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/696/1*QlDoJCGO_PGXHpfU1Q3auA.png" width="696"></a></p><p class="medium-feed-snippet">Using Dynamic Programming to find the optimal policy in Grid World</p><p class="medium-feed-link"><a href="https://medium.com/data-science/reinforcement-learning-dynamic-programming-2b89da6ea1b?source=rss-6d54f7f47f17------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/reinforcement-learning-dynamic-programming-2b89da6ea1b?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/2b89da6ea1b</guid>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 10 Feb 2020 02:49:57 GMT</pubDate>
            <atom:updated>2020-02-10T15:27:16.786Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Markov Decision Processes and Grid World]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@jmtennenbaum/markov-decision-processes-and-grid-world-45dee8d85fdb?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/784/0*Vuswt6sNIMcX3H0F.png" width="784"></a></p><p class="medium-feed-snippet">In my previous article I discuss my first attempt at reinforcement learning by using it to tackle the Multi Armed Bandit problem. This&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/@jmtennenbaum/markov-decision-processes-and-grid-world-45dee8d85fdb?source=rss-6d54f7f47f17------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@jmtennenbaum/markov-decision-processes-and-grid-world-45dee8d85fdb?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/45dee8d85fdb</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Sun, 02 Feb 2020 23:38:03 GMT</pubDate>
            <atom:updated>2020-02-02T23:38:03.173Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Starting with Reinforcement Learning: The Multi Armed Bandit Problem]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/@jmtennenbaum/starting-with-reinforcement-learning-the-multi-armed-bandit-problem-f124076775a4?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/1280/0*foDOUxlBXKBaluiT.jpg" width="1280"></a></p><p class="medium-feed-snippet">Going back to the basics of reinforcement learning</p><p class="medium-feed-link"><a href="https://medium.com/@jmtennenbaum/starting-with-reinforcement-learning-the-multi-armed-bandit-problem-f124076775a4?source=rss-6d54f7f47f17------2">Continue reading on Medium »</a></p></div>]]></description>
            <link>https://medium.com/@jmtennenbaum/starting-with-reinforcement-learning-the-multi-armed-bandit-problem-f124076775a4?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/f124076775a4</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[reinforcement-learning]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 27 Jan 2020 03:57:18 GMT</pubDate>
            <atom:updated>2020-01-27T17:48:46.456Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Coloring Photos with a Generative Adversarial Network]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/coloring-photos-with-a-generative-adversarial-network-a435c4403b5d?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/1458/1*N3nirBCA-K6NYDe3k9ZCGw.png" width="1458"></a></p><p class="medium-feed-snippet">Ever since I started learning about data science and machine learning, there has always been one algorithm that continually grabbed my&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/data-science/coloring-photos-with-a-generative-adversarial-network-a435c4403b5d?source=rss-6d54f7f47f17------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/coloring-photos-with-a-generative-adversarial-network-a435c4403b5d?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/a435c4403b5d</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[generative-adversarial]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Sun, 19 Jan 2020 18:42:41 GMT</pubDate>
            <atom:updated>2020-01-20T00:14:13.556Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Neural Networks and Image Recognition (CNN’s)]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/swlh/neural-networks-and-image-recognition-cnns-3bcd1a9f49b5?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/850/0*nUAhG6FgeRuSFwT-.png" width="850"></a></p><p class="medium-feed-snippet">Neural Networks have exploded in popularity the past couple of decades, and because of this we have adapted multiple variations to&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/swlh/neural-networks-and-image-recognition-cnns-3bcd1a9f49b5?source=rss-6d54f7f47f17------2">Continue reading on The Startup »</a></p></div>]]></description>
            <link>https://medium.com/swlh/neural-networks-and-image-recognition-cnns-3bcd1a9f49b5?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/3bcd1a9f49b5</guid>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 04 Nov 2019 17:44:20 GMT</pubDate>
            <atom:updated>2019-11-14T19:52:34.918Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Automated Machine Learning]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/automated-machine-learning-d8568857bda1?source=rss-6d54f7f47f17------2"><img src="https://cdn-images-1.medium.com/max/1294/0*vcltcWFsDxDl-nzy.png" width="1294"></a></p><p class="medium-feed-snippet">Automated Machine Learning(AutoML) is currently one of the explosive subfields within Data Science. It sounds great for those who are not&#x2026;</p><p class="medium-feed-link"><a href="https://medium.com/data-science/automated-machine-learning-d8568857bda1?source=rss-6d54f7f47f17------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/automated-machine-learning-d8568857bda1?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/d8568857bda1</guid>
            <category><![CDATA[machine-learning-ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 21 Oct 2019 13:30:46 GMT</pubDate>
            <atom:updated>2019-11-04T14:19:08.255Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Analysis of Variance and its Variations]]></title>
            <link>https://medium.com/data-science/analysis-of-variance-and-its-variations-6ef3f8fbeb05?source=rss-6d54f7f47f17------2</link>
            <guid isPermaLink="false">https://medium.com/p/6ef3f8fbeb05</guid>
            <category><![CDATA[anova]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Justin Tennenbaum]]></dc:creator>
            <pubDate>Mon, 07 Oct 2019 16:40:03 GMT</pubDate>
            <atom:updated>2019-10-11T00:50:08.599Z</atom:updated>
            <content:encoded><![CDATA[<p>In statistics, when trying to compare samples, our first thought is to perform a Student’s t-test. It compares the means of two samples (or a sample and a population) relative to the standard error of the mean or the pooled standard deviation. While the t-test is a robust and useful test, it is limited to comparing only two groups at a time.</p><p>To compare multiple groups at once, we can turn to ANOVA, or Analysis of Variance. Unlike the t-test, it compares the variance within each sample relative to the variance between the samples. Ronald Fisher introduced the term <strong>variance</strong> and its formal analysis in 1918, and Analysis of Variance became widely known in 1925 after Fisher’s <em>Statistical Methods for Research Workers</em>. The Student’s t-test follows a t-distribution, which resembles the normal distribution in shape but has fatter tails to account for values farther from the mean in small samples.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x9gXH3_V4nJXced4BoHALw.png" /><figcaption>Source: Wikipedia</figcaption></figure><p>ANOVA, however, follows the F-distribution, a right-skewed distribution with a long tail. For only two groups, we can use the F-distribution directly to compare the variances.</p><p>If U1 and U2 are independent random variables following chi-squared distributions with d1 and d2 degrees of freedom respectively, then the random variable X = (U1/d1)/(U2/d2) follows an F-distribution, X~F(d1,d2), with its PDF given by:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TF-GNlpRUyhx6wnkfZdmTg.png" /><figcaption>Source: Wikipedia (B = Beta function)</figcaption></figure><p>ANOVA uses this same distribution; however, the way it calculates its F-value varies with the type of ANOVA test performed. 
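</p><p>As a quick numerical check of this construction (a sketch assuming NumPy and SciPy are available), we can simulate the chi-squared ratio and compare it against SciPy’s F-distribution; the empirical and theoretical medians should agree closely:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
d1, d2 = 4, 10   # illustrative degrees of freedom
n = 200_000

# Build the ratio X = (U1/d1) / (U2/d2) from independent chi-squared draws
u1 = rng.chisquare(d1, size=n)
u2 = rng.chisquare(d2, size=n)
x = (u1 / d1) / (u2 / d2)

# The simulated ratio should follow F(d1, d2): compare medians
print(float(np.median(x)), float(stats.f.median(d1, d2)))
```

<p>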
The simplest form is the one-way ANOVA, which allows us to compare multiple groups by evaluating one independent variable and one dependent variable. In general, ANOVA relies on three main assumptions:</p><p><strong>·</strong> The distribution of the dependent variable should be continuous and approximately normal</p><p><strong>·</strong> Independence of samples</p><p><strong>·</strong> Homogeneity of variances</p><p>The ANOVA then evaluates the ratio of the variance between groups to the variance within groups in order to calculate its F-value. The one-way ANOVA is given by the following table:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L1DZfj3wAoFJ3SRtYnQxlQ.png" /><figcaption>Source: Analytics Buddhu</figcaption></figure><p>Once we calculate our F-ratio, we can compare it to the critical F-value to determine whether we can reject our null hypothesis. For an ANOVA test, the alternative hypothesis is only that <strong>at least one</strong> of the groups differs from the others, so a post hoc test such as Fisher’s Least Significant Difference or Tukey’s HSD (honestly significant difference) is needed. These tests filter through all pairwise combinations to determine which sample groups differ from each other, and they are only performed if the ANOVA returns a statistically significant result.</p><p><strong>Variations of ANOVA</strong></p><p>As noted above, the one-way ANOVA can only account for a single independent and a single dependent variable. There are extensions of the one-way ANOVA that let us circumvent these limitations. The first is the two-way ANOVA. This test still requires exactly one dependent variable, but we are able to include multiple independent variables when analyzing the variance between groups. Since we have multiple variables, two calculations occur: the main effect and the interaction effect. 
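</p><p>These two effects can be computed by hand for a balanced 2×2 design. The sketch below (synthetic data; the factor effect sizes are invented purely for illustration, and NumPy is assumed) partitions the total sum of squares into the two main effects, the interaction, and error:</p>

```python
import numpy as np

# Balanced 2x2 design: factors A and B, r replicates per cell.
# y[i, j, k] = observation k in cell (A level i, B level j).
rng = np.random.default_rng(0)
r = 20
effect_a = np.array([0.0, 1.0])   # made-up main effect of A
effect_b = np.array([0.0, 0.5])   # made-up main effect of B
y = (effect_a[:, None, None] + effect_b[None, :, None]
     + rng.normal(0.0, 1.0, size=(2, 2, r)))

grand = y.mean()
mean_a = y.mean(axis=(1, 2))      # marginal means over levels of A
mean_b = y.mean(axis=(0, 2))      # marginal means over levels of B
mean_cell = y.mean(axis=2)        # cell means

n_per_level = 2 * r               # observations at each level of a factor
ss_a = n_per_level * np.sum((mean_a - grand) ** 2)     # main effect A
ss_b = n_per_level * np.sum((mean_b - grand) ** 2)     # main effect B
ss_ab = r * np.sum((mean_cell - mean_a[:, None]        # interaction
                    - mean_b[None, :] + grand) ** 2)
ss_error = np.sum((y - mean_cell[..., None]) ** 2)     # within cells
ss_total = np.sum((y - grand) ** 2)

# In a balanced design the sums of squares partition exactly
# (up to floating-point error): SS_A + SS_B + SS_AB + SS_error = SS_total
print(ss_a, ss_b, ss_ab, ss_error, ss_total)
```

<p>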
The main effect considers each independent variable separately, while the interaction effect looks at all the variables simultaneously.</p><p>A two-way ANOVA is actually just a type of factorial ANOVA, which means the test contains multiple independent variables (also called factors). Simply put, a two-way ANOVA is a factorial ANOVA with two factors, a three-way ANOVA has three independent variables, a four-way has four, and so on. The most common designs use two or three factors, since anything beyond that can become difficult to interpret. In a one-way ANOVA the variability is compared between and within groups, while a factorial ANOVA compares the levels of each factor with those of the other factors.</p><p>The other limitation of a one-way ANOVA is addressed by performing a MANOVA, which allows us to compare the variance between groups on more than one dependent variable. The MANOVA returns a multivariate F-value, as compared to the ANOVA, which returns a univariate F-value. The multivariate F-value only indicates whether the test is significant; it gives us no information about which particular variable differs between the groups. To learn which of the dependent variables are significant, a one-way ANOVA needs to be performed to acquire the univariate F-value for each variable, followed by post hoc tests.</p><p>A major assumption of all of the above tests is that our samples are independent of each other. This assumption means that we are unable to evaluate groups over time, or to measure multiple results for the same subject. For only two groups this can easily be solved by using a dependent (paired) t-test. The repeated measures ANOVA is an extension of the dependent t-test and allows us to evaluate the same subjects over multiple conditions or time points.</p><p>The repeated measures ANOVA is calculated very similarly to the one-way ANOVA. 
Rather than being split between groups and within groups, the variability is split between conditions/times and within conditions/times, and the within-group term is itself split into two smaller components, so that SSw equals SSsubjects + SSerror. Since we are using the same subjects in each group, we can remove the subject-to-subject variability, giving us a smaller within-groups error.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vy25CfHPBiB8qWLlSQ3cUw.png" /><figcaption>Source: statistics.laerd.com</figcaption></figure><p>This partition ends up increasing our F-statistic, which means a repeated measures ANOVA has increased power in finding statistical differences. However, this only leads to a more powerful test if the decrease in within-group variability outweighs the loss of degrees of freedom.</p><p>I recently did my own project comparing the average temperatures of each year over the course of multiple decades. I originally made the mistake of running a one-way ANOVA, forgetting that my different “groups” were dependent on each other. I couldn’t shake the feeling that I had performed the wrong test and wanted to dig deeper, hence this article. I dove back into my data, re-ran the test in Python using a repeated measures ANOVA, and got the following results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*692TSBBVLsSsYzeIdUciwA.png" /></figure><p>Given my results and an alpha of 0.05, I found a critical F-value of 2.77, and was therefore able to reject my null hypothesis that the temperatures across different decades remain unchanged. It is always important to understand the assumptions and requirements of tests in order to make sure we are performing accurate and correct analyses. 
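</p><p>The SSw = SSsubjects + SSerror partition described above can be verified directly on synthetic data (a sketch, NumPy assumed; the subject and condition effect sizes are invented for illustration). Removing the subject term shrinks the error and inflates the F-statistic relative to a naive one-way ANOVA:</p>

```python
import numpy as np

# Synthetic repeated-measures data: n subjects, each measured in k conditions.
rng = np.random.default_rng(1)
n, k = 12, 4
subject = rng.normal(0.0, 2.0, size=(n, 1))    # stable per-subject offsets
condition = np.array([0.0, 0.3, 0.6, 0.9])     # made-up condition effects
y = subject + condition + rng.normal(0.0, 1.0, size=(n, k))

grand = y.mean()
cond_mean = y.mean(axis=0)      # mean per condition
subj_mean = y.mean(axis=1)      # mean per subject

ss_conditions = n * np.sum((cond_mean - grand) ** 2)
ss_within = np.sum((y - cond_mean) ** 2)         # SSw of a one-way ANOVA
ss_subjects = k * np.sum((subj_mean - grand) ** 2)
ss_error = ss_within - ss_subjects               # SSw = SSsubjects + SSerror

# Repeated-measures F uses the smaller error term ...
df_cond, df_error = k - 1, (n - 1) * (k - 1)
f_rm = (ss_conditions / df_cond) / (ss_error / df_error)
# ... while a (wrong, here) one-way F would use the whole SSw
f_oneway = (ss_conditions / df_cond) / (ss_within / (n * k - k))
print(f_rm, f_oneway)
```

<p>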
As data scientists, it is our job to make sure we are always performing the right test, not just the one that gives us the answers we want.</p><p>Sources:</p><ul><li><a href="https://en.wikipedia.org/wiki/Analysis_of_variance">Analysis of variance - Wikipedia</a></li><li><a href="https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/hypothesis-testing/anova/#targetText=There%20are%20two%20main%20types,double%2Dtesting%20that%20same%20group.">ANOVA Test: Definition, Types, Examples</a></li><li><a href="https://en.wikipedia.org/wiki/F-distribution#targetText=In%20probability%20theory%20and%20statistics,%2C%20e.g.%2C%20F%2Dtest.">F-distribution</a></li><li><a href="https://statistics.laerd.com/statistical-guides/repeated-measures-anova-statistical-guide.php">Repeated Measures ANOVA</a></li></ul><hr><p><a href="https://medium.com/data-science/analysis-of-variance-and-its-variations-6ef3f8fbeb05">Analysis of Variance and its Variations</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>