A Problematic Meta-Analysis: A Critique of Greengross et al., 2020

Oct 5, 2020

Gender and humor is a contentious topic, and rightfully so. With women having been marginalized for centuries, comedy is an art form that has only continued that marginalization. As such, there is reason for concern with the publication of a study by Gil Greengross et al. in the Journal of Research in Personality, entitled “Sex differences in humor production ability: A meta-analysis.” This meta-analysis claims that men have an advantage in ‘humor production ability,’ or as the authors interpret it, that men are funnier than women on average. I believe this meta-analysis suffers from numerous methodological, conceptual, and empirical flaws that prohibit any of the conclusions that have been reported in the media (see, e.g., the Psychology Today article published by the lead author advertising the study), or even in the study itself.

I will also note that I will not be discussing the merits of the authors’ leading hypothesis, that the difference is due to men exerting more effort for evolutionary mating reasons. While I believe this hypothesis is flawed for several reasons (relating both to the studies cited and to its theoretical foundations), this article is focused on the methodology of the meta-analysis itself.

Firstly, I would like to critique the authors’ conceptualization of ‘humor production ability.’ This implies that humor is something that can be objectively assessed and measured, which is simply not the case. An individual’s ‘humor production ability,’ as determined by the studies in the meta-analysis, is simply a rating, by various judges, of a joke they produced. All this reflects is the opinion of those judges at that point in time, in a given context. The authors give no indication that research on the psychology of humor accepts or recognizes a uniform construct of ‘humor production ability,’ which brings their results into question with regard to interpretability and generalizability.

Secondly, the prior issue ties directly into the authors’ own concern that “… the procedures used to produce the overall HPA scores for each study vary, as there is no consensus on the best way to calculate a humor score. As discussed earlier, researchers employ various protocols to calculate a humor score, which are partially dependent on the tasks employed, and whether one or more stimuli are used to generate humorous responses.” This is concerning, as it raises the question of whether these measures can truly be compared. It is well established that for a meta-analysis to be properly interpreted, it must be based on a large number of higher-quality studies with comparable measures (Lee, 2019). The measures themselves vary heavily in content: judge ratings of humor (Howrigan & MacDonald, 2008), whether the subject merely attempted to be humorous rather than their ability (Saroglou & Jaspard, 2001), an individual’s highest-rated caption rather than their average caption funniness rating (Ziv, 1983), or even self-report on a scale separate from humor production (Lehman et al., 2001). This variation is further highlighted in Table 3, with substantial differences shown in the range of the scales, the number of tasks, and the number of responses allowed per task. Considering this, and considering the lack of an agreed-upon definition of ‘humor production ability’ or a measure which reflects one, it is highly questionable whether these studies are directly comparable in such a way as to warrant a meta-analysis.
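To make the comparability concern concrete: heterogeneity across studies in a meta-analysis is commonly quantified with Cochran’s Q and the I² statistic, and when the pooled measures capture different underlying constructs, those values tend to be high and the pooled estimate correspondingly hard to interpret. The sketch below uses made-up effect sizes and variances (not values from Greengross et al.) purely to illustrate the calculation.

```python
import numpy as np

# Hypothetical standardized mean differences (d) and their sampling variances.
# These numbers are illustrative only and are not taken from the meta-analysis.
d = np.array([0.89, 0.12, 0.45, -0.05, 0.60])
v = np.array([0.08, 0.03, 0.15, 0.05, 0.20])

w = 1.0 / v                                  # inverse-variance (fixed-effect) weights
d_pooled = np.sum(w * d) / np.sum(w)         # pooled effect estimate

# Cochran's Q: weighted squared deviations of each study from the pooled estimate.
Q = np.sum(w * (d - d_pooled) ** 2)
df = len(d) - 1

# I^2: the share of total variability attributable to between-study heterogeneity.
I2 = max(0.0, (Q - df) / Q) * 100.0

print(f"pooled d = {d_pooled:.2f}, Q = {Q:.2f} (df = {df}), I^2 = {I2:.0f}%")
```

A high I² on its own does not invalidate a meta-analysis, but when it is driven by studies that measure different things, such as funniness ratings versus mere attempts at humor, pooling them obscures more than it clarifies.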

Thirdly, there are issues with sample quality as well. A meta-analysis without a well-defined sample population is not considered useful, due to the lack of clear interpretability and generalizability (Russo, 2007). Further worry comes in the form of the small-sample effect, whereby effect sizes are exaggerated in smaller samples (Greco et al., 2013). Both play a substantial role in causing concern here. The former comes into play regarding the generalizability of the sample sources: while almost all of the samples are English-speaking, the similarities largely end there, as they range from psychiatric in-patients to psychology undergraduates to Amazon Mechanical Turk participants to comedians to deliberately oversampled religious populations. It is clear from the authors’ language, both in the paper and in the accompanying Psychology Today article, that they intend for this to be representative of the general population; however, not a single sample among the included studies is representative, and as such any firm conclusions are heavily unwarranted.

The latter concern, sample size, also comes into play. The majority of the samples feature fewer than 100 participants, with many featuring fewer than 50. Many more feature a very unequal gender ratio, and the largest sample (Kaufman, 2016) is based entirely on unpublished data. As underpowered studies can substantially bias the results of a meta-analysis (Turner et al., 2013), this becomes a large source of concern. Additionally, the number of judges is often far smaller still, with a median of just 5.5 and a maximum of only 81. As the people receiving a joke are as important as the person telling it, this is a further concern regarding sample characteristics.
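To illustrate why underpowered studies are a concern, here is a minimal simulation of my own (not the authors’ analysis, with all parameters chosen purely for illustration): when only ‘significant’ results make it into the literature, small studies inflate the average published effect far more than large ones, even when the true sex difference is zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def published_effects(n_per_group, true_d=0.0, n_studies=5000, crit=1.96):
    """Simulate two-group studies and keep only the 'significant' ones,
    mimicking selective reporting. Returns the surviving effect estimates."""
    kept = []
    for _ in range(n_studies):
        group_a = rng.normal(true_d, 1.0, n_per_group)   # e.g. men's humor scores
        group_b = rng.normal(0.0, 1.0, n_per_group)      # e.g. women's humor scores
        diff = group_a.mean() - group_b.mean()           # estimated effect (SD = 1, so d ~ diff)
        se = np.sqrt(2.0 / n_per_group)                  # standard error of the difference
        if abs(diff / se) > crit:                        # only 'significant' results survive
            kept.append(diff)
    return np.array(kept)

# With no true difference, small published studies still show a sizeable average effect.
for n in (25, 100, 400):
    d = published_effects(n)
    print(f"n per group = {n:>3}: mean published |d| = {np.abs(d).mean():.2f}")
```

The exact numbers are beside the point; the point is that a meta-analysis dominated by small samples inherits whatever selection pressures shaped which of those small studies were written up and published.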

Lastly, and arguably most concerning, the authors appear to have violated their own inclusion criteria multiple times. Specifically, criteria 3 through 5 each have multiple included studies that violate them, and I shall list them in sequential order.

Criterion #3 states that “Participants must have generated spontaneous new or innovative humor as part of the humor production task. Studies that were based on self-reports, or scales such as the Multidimensional Sense of Humor Scale (MSHS) (Thorson & Powell, 1993), or studies where participants had to complete a joke from multiple possible responses, given the setup, were excluded.” From this criterion one would get the impression that the authors only included instances in which judges rated others’ attempts at humor; this is not so. The following studies violate this criterion.

Lehman et al., 2001 — This study did not assess humor production ability, but merely referred to a scale measuring the use of humor to cope with stress as ‘humor production.’

Nusbaum et al., 2017 — While judges did rate participants’ captions, and while the 1–5 scale was labeled in terms of funniness, this is misleading. The authors instructed the raters to “give higher ratings to responses that struck them as uncommon or unusual, funny, and clever, and lower ratings to responses that they found irrelevant, unsuitable, or boring.” As such, while both the authors of the meta-analysis and the authors of Nusbaum et al. treat this as merely a measure of caption funniness, the Nusbaum methodology clearly extends beyond humor into perceptions of originality and whatever struck the raters as particularly special.

Renner & Manthey, 2017 — While this study included judge ratings of captions intended to be humorous, the judges did not rate based on perceived funniness. Instead, they rated based on the perceived ‘wit’ and ‘fantasy’ of the respective captions. These two criteria may be related to funniness, but they are not conceptually equivalent, and the authors of the Renner & Manthey study do not treat them as such.

Criterion #4 states that “Judges had to be blind to any characteristic of the humor producers.” This would suggest that the judges knew nothing about the participants and were independently judging captions. However, only four studies detail any blinding procedure: Saroglou (2002), Nusbaum et al. (2017), and two studies by the lead author, Greengross et al. (2012) and Greengross and Miller (2011). Every other published study fails to indicate a blinding procedure with regard to gender, and many give good reason to suspect there was none. For instance, in Kim et al. (2013), one of the study’s authors served as a judge of the subjects’ humor.

Criterion #5 states that “Judges rated the humor for funniness. Studies in which there were no judges, and the ratings of HPA were solely based on counting the number of responses or humor attempts, rather than actual ratings of the humor produced, were excluded.” This would mean that judges are rating funniness itself, not merely attempts at humor production. However, the following studies violate this criterion.

Saroglou & Jaspard, 2001 — The judges do not rate humor ability in this study, but merely whether participants attempted to be humorous.

Saroglou, 2002 — Much like the prior Saroglou study, this one only assesses whether participants attempted to be humorous, rather than their ability.

Ziv, 1983 — In this study, the judges only rate the presence of humorous responses, not the degree to which they were funny.

Thus, counting only the published studies, a mere 2 of 24 can be said to meet the authors’ own inclusion criteria; due to the nature of unpublished data, I am unable to check whether the remaining studies meet the criteria. However, even if all of the unpublished studies are assumed to pass, the vast majority of included studies simply fail to meet the authors’ own criteria for inclusion. This alone, even disregarding the other concerns raised here, strongly argues against the methodological soundness of the paper.

All of these issues, taken together, strongly suggest that this study’s conclusions are invalid. Its methodology and conceptual framing are too riddled with flaws to allow for adequate and meaningful interpretation of the results. In addition, considering the misogynistic conclusions that arise from this study, namely the idea that women are innately less inclined to be funny than men, it should have undergone a much more thorough peer review process. As it stands, the study should be retracted, owing to the harmful conclusions it presents and the fundamental issues within its methodology. One can hope that, for its own sake and for the sake of social science research, the Journal of Research in Personality will consider this.

-Mira Lazine
