‘Quant or qual: The great debate, or just a misunderstanding?’

Rebecca Grier · Published in Bootcamp · 8 min read · Sep 19, 2023

An introduction to psychometrics, which focuses on the importance of understanding variability and defining a meaningful effect size, rather than data type, in determining sample size.

A few years back, I started hearing about qual and quant UX research. I was perplexed as to the difference. To me, research is research. This is because I consider UX generally, and UX research in particular, to be an applied behavioural science. One of the foundations of behavioural science is psychometrics. Psychometrics is the study of how behaviour and thought are measured.

In my grad school courses on psychometrics, I learned that the definition of measurement generally agreed upon in all science is “the assignment of (symbols) to objects or events according to some rule.” This definition was written by Stanley Stevens and published in the journal Science in 1946. In the same article, Dr. Stevens identified different kinds of measurement based on the qualities of numbers afforded by the rules. Nunnally & Bernstein (1994), in their textbook Psychometric Theory, expanded upon Stevens’ definition and described these rules as falling along a continuum between scaling and classifying. Scaling means “represent(ing) quantities of attributes numerically” and classification means “(defining) whether objects fall in the same or different categories with respect to a given attribute.” To be clear, measurement does not mean quantitative. Classification of observations into categories is also measurement.

If we expand this idea, that measurement is rules for assigning symbols to objects or events, to research more broadly, we can describe research as having 4 dimensions instead of just qualitative and quantitative. Each dimension maps to 1 of these 4 questions.

1. When were the rules determined? Were they determined before data collection (i.e., deductive analysis) or determined after data collection (i.e., inductive analysis)?

2. Have the rules been vetted and scientifically agreed upon (i.e., a scientific construct) or were they created specifically for this study (i.e., operational definition)?

3. Were the rules applied during data collection/observation (i.e., empirical) or were they applied after data collection/observation (i.e., analytical)?

4. What qualities of numbers do the symbols have? Do the symbols represent order and magnitude (i.e., scaling), order only (i.e., ranking), or neither order nor magnitude (i.e., classification)?

[Figure: the 4 dimensions of research, a visual representation of the 4 continua described in the article]

In the pictorial representation of these 4 dimensions, the right side (i.e., inductive analysis, operational definitions, analytical, and classification) is typically called Qualitative Research, whereas the left side (i.e., deductive analysis, scientific construct, empirical, and scaling) is typically called Quantitative Research. It is important to note, though, that these dimensions are not dichotomies; they are continua. There is a lot in between. Furthermore, even if they were only dichotomies, it is entirely possible to conduct a study that uses the left side of some dimensions and the right side of other dimensions.

I’ll use the Olympics to explain how most (but not all) combinations of these 4 dimensions of research/measurement are possible. At the Olympics, medals are awarded to the top 3 performers in hundreds of sporting competitions. The rules for each competition are different, but they are all defined prior to competition (deductive analysis). They are also written down and agreed to by an international sporting body (construct). In some competitions the rules for ranking are mostly empirical (e.g., time, length, height, number of goals), and in others they are purely analytical, or judged (e.g., gymnastics, diving, figure skating, dressage). I say that sports like athletics, basketball, swimming, and football are only mostly empirical, because there is always a referee who judges whether the rules are being followed.

Regardless of whether the sport has empirical or analytical assessment, the recorded results of each competitor are scaled. That is, the points, times, distances, etc. represent who performed best, second best, and third best (order) and how much better they performed than their competition (magnitude). However, only order is considered when the medals are awarded (ranking as gold, silver, bronze).

Thus far we have described both deductive-construct-empirical and deductive-construct-analytical as having both scaling and ranking. Once the Olympics are over, knowing someone is a medallist is a classification and conveys no numerical qualities. That is, there is no ability to compare medallists across years and sports. As an example, consider Shaquille O’Neal and Simone Biles: both are gold medallists, but in very different sports and many years apart. Thus classification applies to both deductive-construct-empirical and deductive-construct-analytical measurement as well.

To continue the sports metaphor of research, not all sporting events are the Olympics. Many of us are engaged in much more casual sports that often use house rules. Whereas the rules of sport at the Olympics are equivalent to scientific constructs, house rules are the equivalent of operational definitions. That is, they are not agreed upon by an international body, but by the people who are playing. They are also usually agreed upon before the game begins, so house rules are similar to deductive-operational definition measurement. Whether they are empirical or analytical, and whether they involve scaling, ranking, or classification, depends upon the game being played.

Thus far we have described 12 possible classifications of research, all falling into the deductive category. Deductive research defines the rules before the research has started. Inductive research defines the rules after data collection has begun. As such, it is unlikely that those rules would represent a construct (i.e., standardized rules verified by science) or empirical measurement (i.e., rules applied during data collection). So inductive research is typically synonymous with operational definitions and analytical assessment. That said, the assessment could still involve scaling, ranking, or classification, which adds 3 more classifications. Thus there are 15 different classifications of research.

Of these 15 classifications, Inductive-Operational Definition-Analytical-Classification is clearly qualitative research and Deductive-Construct-Empirical-Scaling is clearly quantitative. However, it is less clear whether the other 13 classifications fall under quantitative or qualitative research.

[Table: the 15 possible classifications of measurement on a spectrum from quantitative to qualitative. Deductive-Construct-Empirical-Scaling is the most quantitative, followed by Deductive-Construct-Empirical Ranking and Classification, then the Deductive-Construct-Analytical group, then the Deductive-Operational Definition group, with the Inductive-Operational Definition-Analytical group as the most qualitative.]

This is important because qualitative or quantitative is often used as shorthand for the number of participants needed to be confident in the research results. It is said that 5 participants are all that is needed for qualitative studies, but that one often needs 50 or more participants for quantitative studies. While I agree that 5 participants (per user role) is a great starting point for inductive analyses, I disagree that it is possible to set a universal minimum number of participants for deductive analyses. A lot depends on what a meaningful effect size is for your goals and on the variability of your population.

Effect size is the magnitude of the result observed in a study. A meaningful effect size is how large a result needs to be before you would make a decision based on the data.

For example, let’s say I asked participants to rate how much they liked two different features (A & B) on a 10-point scale. The goal of this data is to provide a recommendation as to which feature to prioritize in development. Let’s say Feature A has a mean rating of 7, and Feature B has a mean rating of 7.2. The effect size observed is 0.2. Is that 0.2 a sufficient difference to prioritize Feature B over Feature A? What if I said Feature B costs twice as much money to develop as A? Does that change your interpretation of the 0.2 difference?

Let’s change things a bit and say that the two feature sets cost roughly the same amount to develop. However, you learn that all participants rated A between 6 and 8, whereas the ratings on B ranged from 4 to 10. Are you still confident in that 0.2 difference?
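To make this concrete, here is a minimal sketch in Python using made-up ratings that match the scenario above (means of 7 and 7.2). The arrays, the pooled standard deviation, and the Cohen's d calculation are illustrative assumptions rather than data from a real study; the point is only that the same 0.2 raw difference looks very different once variability is taken into account.

```python
import numpy as np
from scipy import stats

# Hypothetical 10-point ratings (not real study data). Both scenarios have
# means of 7.0 for Feature A and 7.2 for Feature B.
a_tight = np.array([6, 7, 7, 7, 8, 7, 7, 7, 6, 8])   # A rated between 6 and 8
b_tight = np.array([7, 7, 8, 7, 7, 7, 8, 7, 7, 7])   # B tightly clustered too

a_wide = np.array([6, 7, 7, 7, 8, 7, 7, 7, 6, 8])    # A rated between 6 and 8
b_wide = np.array([4, 10, 5, 9, 6, 8, 7, 10, 4, 9])  # B spread from 4 to 10

def describe(a, b, label):
    diff = b.mean() - a.mean()                        # raw difference in means (0.2)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = diff / pooled_sd                              # Cohen's d: standardized effect size
    t, p = stats.ttest_ind(a, b)                      # t-test of the difference
    print(f"{label}: mean diff={diff:.2f}, Cohen's d={d:.2f}, p={p:.2f}")

describe(a_tight, b_tight, "Low variability")
describe(a_wide, b_wide, "High variability")
```

With the tightly clustered ratings, the same 0.2 raw difference yields a noticeably larger standardized effect (and a smaller p-value) than with the widely spread ratings, which is exactly why the variability changes how confident you should be.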

Often in UX, meaningful effect sizes are quite large: we want clear answers that one path is the best path. If an effect is large, then it can be observed with a small sample size, sometimes as small as 5 participants.

Though it is easiest to talk about effect sizes when the data collection is scaling, one can also discuss effect size in ranking and classification. For example, if instead of a 10-point scale we coded the verbatims of people on the two features, and 72% of people had positive things to say about Feature A while 64% had positive things to say about Feature B, then your effect size is 8 percentage points. If we were comparing several different features and asked people to rank them, it takes some statistical knowledge to calculate effect size, but it is possible.
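As a rough illustration of effect size for classification data, the sketch below works with the two hypothetical proportions from that example. The difference in proportions and Cohen's h are standard ways to express such an effect; the numbers themselves are just the made-up ones above.

```python
import numpy as np

# Hypothetical coding results: share of participants with positive comments.
p_a, p_b = 0.72, 0.64

# Simplest effect size for classification data: the difference in proportions.
diff = p_a - p_b                                     # 0.08, i.e. 8 percentage points

# Cohen's h: a standardized effect size for comparing two proportions.
h = 2 * np.arcsin(np.sqrt(p_a)) - 2 * np.arcsin(np.sqrt(p_b))

print(f"difference = {diff:.2f}, Cohen's h = {h:.2f}")
```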

Regardless of the nature of the study, the variability is important to examine. Recall how your confidence in that 0.2 difference changed when you learned that all participants rated A between 6 and 8, whereas the ratings on B ranged from 4 to 10. Sometimes understanding the variability is more important than the synthesis. If there is a great deal of variability in the data, then there is something about the user population that is not understood. Thus, it is always a good idea to start research with 5 people per user role and see what you have learned.

In behavioural science, this initial research with 5 people is referred to as a pilot study. In inductive research you may find that 5 is all you need, or you may learn that there is more research to be done. That is, if after 5 people you have recorded similar data each time, then you can be reasonably confident that you will not learn much more by adding more participants.

If, however, the data are not similar, then there is a variable at play that you do not understand. Looking at the differences among the participants can provide hypotheses about that variable, which will help to determine next steps. When the study is deductive, the behavioural scientist typically uses the results of the pilot study to do a power analysis. The power analysis tells one how many participants are needed to find a meaningful effect size with a certain amount of statistical confidence.
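Here is a minimal sketch of what such a power analysis could look like in Python with statsmodels. The meaningful difference of 1 point and the pilot standard deviation of 2 are assumptions chosen for illustration, not values from any real pilot study.

```python
from statsmodels.stats.power import TTestIndPower

# Assumed inputs (for illustration only): the smallest difference on the
# 10-point scale we would act on, and the spread seen in a pilot study.
meaningful_diff = 1.0
pilot_sd = 2.0
effect_size = meaningful_diff / pilot_sd             # Cohen's d = 0.5

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=0.05,       # accepted false-positive rate
                                   power=0.80)       # chance of detecting a true effect

print(f"About {n_per_group:.0f} participants per group are needed.")  # roughly 64
```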

When I worked on a team evaluating the operational effectiveness of DOD systems for the US Congress, we used power analyses to determine the number of data points to be collected to have certain levels of confidence that the results would be neither false positives nor false negatives. Power analysis is a statistical calculation that considers the meaningful effect size and the anticipated variability. Higher confidence typically requires more observations. We would compare the cost of collecting more data to the confidence that could be achieved to determine how many data points needed to be collected.
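As a rough illustration of that trade-off, the sketch below shows how the required sample size per group grows as the meaningful effect shrinks or the desired power rises. The effect sizes and power levels are illustrative choices, not values from the evaluations described above.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required participants per group for large, medium, and small effects
# at two levels of statistical power (alpha fixed at 0.05).
for effect_size in (0.8, 0.5, 0.2):
    for power in (0.80, 0.95):
        n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=power)
        print(f"d={effect_size:.1f}, power={power:.2f}: about {n:.0f} per group")
```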

So, the process of research is to assign symbols to objects, observations, or events in the hopes of making a decision or recommendation. The majority of research we are conducting is somewhere between qualitative and quantitative. Furthermore, to make a decision or recommendation with confidence, we need a meaningful effect size. How we define a meaningful effect determines the number of observations we need. If we are looking for a large effect, we can find it with as few as 5 participants, regardless of whether the study is qualitative, quantitative, or something in between. There is far more that can be discussed in terms of psychometrics and designing effective user research studies. I will be posting more on the topic. I also recommend “Quantifying the User Experience” by Sauro & Lewis.

Note: The views expressed are solely those of the author and do not necessarily represent the views, opinions, or positions of any organization, institution, or entity with which the author may be affiliated or associated.


Rebecca Grier

UX researcher who has worked across many business sectors on technologies as varied as augmented reality, AI, medical devices, and autonomous vehicles.