Experimental Design with G*Power: A Computational Tool for Statistical Power Analyses

Natasha Kacoroski
Oct 15, 2019


Have you ever been out walking and then hit a tree? Well, maybe not literally, but perhaps it felt like that while working on a project. I suppose the more common metaphor is “hit a brick wall,” but I don’t see many brick walls these days… and I’m much more likely to run into trees. Or a rock. Or a person. Especially if I’m out bird-watching. Definitely if I’m out bird-watching.

Anyway, last week I designed an experiment for comparing manual and automated data collection in chickadee nest boxes. And I came to the big question…

How large do I make the sample size?

Cue Beethoven’s 5th. Dun dun da dun… My tree/wall/person had arrived.

At first I was like, “Ha! Think you can stop me, sample size tree/wall/person? I did this in Python using the StatsModels module. It was great!” And I pushed past the tree/wall/person.
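For context, that earlier two-group power analysis in StatsModels looked roughly like the sketch below; the effect size, alpha, and power values are illustrative placeholders, not the numbers from my study.

```python
# A minimal sketch of an a priori power analysis for an independent t-test
# using statsmodels. Effect size, alpha, and power are illustrative placeholders.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.8,            # Cohen's d (large effect, by convention)
    alpha=0.05,                 # significance level
    power=0.8,                  # desired statistical power
    ratio=1.0,                  # equal group sizes
    alternative="two-sided",
)
print(f"Minimum sample size per group: {n_per_group:.1f}")
```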

But the tree/wall/person pushed back.

Before, I was comparing the means of two independent groups. Now, I was using the means of several different dependent variables to compare two independent groups. Instead of an independent t-test, I had a multivariate analysis of variance (MANOVA) on my hands. An F-test. This was a game changer.

I looked at the StatsModels documentation. Found the class of functions for generic F-tests. Whew! Good to go. Just need to…

Wait a minute.

The description for effect size, "the mean divided by the standard deviation," appears to be based on Cohen's d (the difference of the two means divided by the standard deviation). My statistical spidey senses respond with a squinty look of skepticism: I doubt that the effect size used for t-tests will also work well for a MANOVA. Going straight to the source, Statistical Power Analysis for the Behavioral Sciences by Jacob Cohen, confirms my suspicions: the multivariate measure of effect size is different. It is f², the proportion of variance explained divided by 1 minus that proportion (f² = η² / (1 − η²)). And the standard effect ranges are different too. A small effect is 0.01, a medium effect is 0.06, and a large effect is 0.40.
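To make that concrete, converting a proportion of variance explained into f² takes only a couple of lines; the proportion used below is an illustrative placeholder.

```python
# Sketch of the multivariate effect size conversion described by Cohen:
# f^2 = (proportion of variance explained) / (1 - proportion of variance explained)
eta_squared = 0.06                          # illustrative proportion of variance explained
f_squared = eta_squared / (1 - eta_squared)
print(f"f^2 = {f_squared:.3f}")             # ~0.064
```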

Excellent.

Now, given the effect size measure and standards for small, medium, and large effects, how do I conduct a MANOVA power analysis?

Continued searching leads me to G*Power, a tool for statistical power analyses. And it is awesome. Metaphorically speaking, it is the trampoline of my dreams to jump over my tree/wall/person. This app allows me to specify the statistical test (MANOVA: Global effects) and the type of power analysis (a priori). Then I can input the effect size, alpha level, power level, number of groups, and number of response variables. I have a choice of getting specific output parameters or viewing a range of outputs in a graph or table.

It is so amazing that I almost feel guilty using it. As if I'm not doing enough work as a data scientist because I'm just inputting parameters instead of coding a function to compute the output parameters. I looked at the manual to get a better idea of the math behind it, maybe enough to program it in Python, but I didn't see a specific section on the MANOVA global effects test (the manual was last updated in 2017). It is clear that Cohen's work has been improved upon over the years, and I don't feel like I have the knowledge I need to code out a MANOVA power analysis in Python.

Uh-oh. My tree/wall/person becomes Godzilla!

But then I remind myself: that's not my goal. My task is to determine an appropriate sample size for an experimental design. And I'm going to use the best tools at my disposal to do so.

Godzilla deflates back to a tree/wall/person.

I generated power and sample size data for a MANOVA (2 groups, 14 response variables) and for independent t-tests. I'd rather go with a MANOVA, which treats all the dependent variables as a vector, allowing me to actually compare the two data collection methods. My study has a limited sample size, though, so I also considered a series of independent t-tests with a Bonferroni correction to address multiple comparisons. This means I will only be able to speak within the context of each response variable, not in aggregate. From my graph, only an independent t-test detecting large effects is a viable option, and my minimum sample size for each group is 26 observations. Below is the visualization I made for my study after using G*Power to generate the data.
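For a rough Python cross-check of the t-test route, the Bonferroni-adjusted calculation looks something like the sketch below. The large effect size, 80% power, and 14 comparisons are assumptions for illustration, not necessarily the exact settings from my G*Power run.

```python
# Rough sketch of an a priori power analysis for an independent t-test
# with a Bonferroni-corrected alpha. Values are illustrative assumptions,
# not the exact settings from the G*Power analysis.
from statsmodels.stats.power import TTestIndPower

n_comparisons = 14                      # one t-test per response variable
alpha_adjusted = 0.05 / n_comparisons   # Bonferroni-corrected per-test alpha

n_per_group = TTestIndPower().solve_power(
    effect_size=0.8,                    # Cohen's d, large effect
    alpha=alpha_adjusted,
    power=0.8,
    ratio=1.0,
)
print(f"Per-test alpha: {alpha_adjusted:.4f}")
print(f"Minimum sample size per group: {n_per_group:.1f}")
```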

Boing! I jump over my tree/wall/person.

Final thoughts. I think it would be valuable to understand more complicated power analyses and be able to run them in Python. In the meantime, G*Power is a fantastic tool. As a member of the scientific research community, I strongly believe that it is important to have a well-planned experimental design, and a power analysis is a great method to help determine sample size.
