A Basic Guide to Evaluating Validity of Experiments & Data (Part 2)

Marc Ryan
11 min read · Feb 6, 2022


Photo by Jeremy Bezanger on Unsplash

In the first part of this article, I talked about the concept of External Validity in measurement. External Validity issues are to be expected in all data and measurements; they're immensely difficult to avoid, but understanding them allows us to moderate decision making appropriately. On the other hand, we have Internal Validity.

External issues are kind of like a drunk friend at a party who says something inappropriate that everyone ignores because… well, that's just Steve. Internal issues are more insidious: they're the party equivalent of the house catching on fire; the party's effectively over. What does this mean in data and measurement? An internal validity problem means that a fundamental flaw in how the data was generated invalidates the research in some way.

Internal issues are the main reason that we hire professionals to create data sets. When you hire a research company, you're effectively bringing in a specialist whose job is to ensure no internal validity issues exist in your data collection methods. As a starting point, here's a simple example: imagine I plan to run an ad campaign for a new yogurt. I want to evaluate the changes in consumer opinion as a result of the ad campaign, so I run a pre-test of my audience to gauge their feelings about my brand. Later, after the campaign is over, I'll run another measurement of my audience and compare the pre-campaign and post-campaign results to see how opinions changed. Sounds like a good idea, right? Well, it's not bad, but it presents a high risk of Historical bias. Historical bias is what happens when additional events occur between the pre-test and the post-test. Imagine that between the first and second measurements the FDA announces a product recall of my yogurt due to food poisoning. In that case I'd be hard pressed to use the data from the second measurement as a reflection of my ad's performance; it's more likely a reflection of people's reaction to the recall. This is an internal validity issue: there's a problem in the measurement set-up that makes the data invalid for the purpose.
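To make the risk concrete, here's a minimal sketch in Python (with entirely made-up numbers and effect sizes, so treat it as an illustration rather than a real analysis) of how a naive pre/post comparison folds an unrelated event into the measured "campaign effect":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scenario: 1,000 consumers rate the yogurt brand on a 1-10 scale.
pre = rng.normal(6.0, 1.5, 1_000).clip(1, 10)   # pre-campaign opinions

ad_effect = 0.4       # assumed true lift from the ad campaign
recall_effect = -1.5  # assumed opinion drop caused by the recall announcement

post = (pre + ad_effect + recall_effect
        + rng.normal(0, 0.5, 1_000)).clip(1, 10)  # post-campaign opinions

naive_lift = post.mean() - pre.mean()
print(f"Naive pre/post 'campaign lift': {naive_lift:+.2f}")
# Prints a figure close to -1.1: the pre/post design attributes the recall's
# damage to the ad, because nothing in the comparison separates the two events.
```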

Internal validity issues are well cataloged in the aforementioned work from Campbell & Stanley. As with my summary of external validity issues, I'll provide a similar summary for internal validity issues:

Historical (Event Bias)

These biases are clearly illustrated in the example above. This is a common bias that is often ignored, especially when using pre/post designs. Running a pre/post design is effectively gambling that nothing else is changing in the time between your first and second measurements. And while the example above is an obvious change, there are many other changes that are hard to enumerate, such as seasonality, cross-promotions, and competitive behavior. As a product person I see this all the time: a change in a product implementation doesn't result in a clearly delineated change in metrics because multiple systems are deploying changes at the same time, making root cause analysis difficult.

Maturation (Perception Shift Biases)

You'll likely find these biases primarily in experiments that run over a period of time. It's where the passage of time changes the result of the experiment rather than what's being measured. Let's use a simple example that I see frequently. In this case we have a consumer beverage manufacturer that wants a deep understanding of how people consume their beverages. To accomplish this, they enlist a series of consumers to download an app and log into that app daily to record their beverage consumption habits. In exchange, the consumers are given money daily for participation. Now imagine you're one of those consumers. At first it's fun to log in and record your habits, but after a week it feels more like a chore, and ten days in you're not really answering accurately, just doing enough to get your incentive. If we're lucky you'll have stuck around for two weeks. This is maturation bias in action: over time your perception of the experiment changes, and as a result the data you create changes.

Testing (Response Bias)

This is as simple as collecting data from someone who has previously participated in an experiment. In the world of surveys, it's the same person taking the same survey twice. Having taken the survey once already, the respondent knows what's expected of them and will likely produce a different set of data based more on their first experience of the survey than on the questions themselves.

Instrumentation

This is perhaps the most common form of bias in the field of data and experimentation. While Campbell & Stanley use one definition for Instrumentation bias, I'm going to subdivide it into two sections: Instrument Bias and Instrument Change Bias.

Instrument Bias

Very simple to comprehend, you can boil this one down to a base concept: your research instrument is incorrect, ambiguous, or leaves out critical data. Imagine I conduct a survey of a randomly selected population in the U.S. I'm using the survey to determine the most popular color amongst Americans, so I ask the following question: "Please choose your favorite color from the list below". Respondents are given the following choices: Red, Orange, Yellow, Blue, and Purple. I assume you've spotted the problem? The list leaves out Green entirely. Given that, how would you answer if your favorite color was Green? There's no "Other" option, and presumably there's no way to continue with the survey if you don't answer this question. This is a perfect example of instrumentation bias, something which plagues surveys mainly because everyone feels qualified to author a questionnaire. But there are nuances in how questions are asked that are important. Two questions with the same intent can produce wildly different results. Take the following two questions:

Is Joe Biden doing a good job? Yes or No

How would you rate Joe Biden’s performance? Good or Bad

Both questions are trying to get to the same place, but the first question builds in the assumption that Joe Biden is doing a good job. As a rule, people are generally agreeable, and as such it's likely that the first question will result in more people reporting Joe Biden's performance as good. Looking at this simple example, you can see why it's important to understand the measurement instrument being used.

Outside of the survey world, instrument biases exist all over the place in big data. We've seen this in the past when players such as Facebook ended up having to restate their internal analytics. In a world driven by algorithms, sometimes those algorithms don't measure what we think they measure.

Instrument Change Bias

A secondary form of instrumentation bias that's also quite easy to understand: it's the step change in the data that's observed when you make changes to the underlying measurement tool. For example, if I'm running a daily survey over the course of a month and I change the wording of a question halfway through, that change will likely produce a shift in the underlying data. The same thing happens when we change analytical algorithms. As in the case of the Facebook data error mentioned above, once Facebook fixed the issue in their algorithm for calculating video view time, all of the data shifted 60–80%.
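Here's a small illustrative sketch (Python, with invented numbers) of what an instrument change does to a daily metric: the underlying opinion is flat, but a mid-study wording change produces a step that's easy to misread as real movement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily survey metric over a 30-day study,
# e.g. the share of respondents giving a favorable answer.
days = np.arange(30)
true_rate = 0.40 + rng.normal(0, 0.01, 30)  # underlying opinion is flat

# Assume the question wording changes on day 16 and the new wording reads as
# more favorable, adding ~6 points to the measured rate (illustrative only).
measured = true_rate + np.where(days >= 15, 0.06, 0.0)

print(f"Days 1-15 average:  {measured[:15].mean():.1%}")
print(f"Days 16-30 average: {measured[15:].mean():.1%}")
# The ~6-point jump is an artifact of the instrument change, not a real shift
# in opinion; unless the wording change is flagged, it looks like genuine growth.
```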

Statistical Regression (Extreme Selection Bias)

It sounds like a complicated problem, and well, maybe it is. Let me try to describe it simply. It's based on the idea that if you select users for your experiments from the extremes, you'll get an artificially big response. If that's still confusing, I'll use an example related to Campbell & Stanley's original book which makes it easier to understand. If I take a group of students, some of them are going to be A students, some B/C students, and some are going to be D/F students. If I plotted all of the students' scores on a chart, we'd likely get the familiar bell curve where most students are B/C students and fewer are A or D/F students. In this case the A and D/F students are at the statistical extremes. If I run an experiment with the students at the extremes, the chance that their grades change based on my intervention is higher than if I select students randomly. Think of it this way: if I select a D/F student and try a new way of teaching them the material, there's a higher likelihood that they'll get better because the current system isn't working for them. Likewise, if I do the same for an A student, they're more likely to get worse because the current system is working for them. Extreme scores also tend to drift back toward the average on a second measurement regardless of any intervention, simply because some of what put them at the extreme was chance. If my research shows a 30% improvement in scores for D/F students, I might be inclined to think my new teaching method is great, but without studying the effect against A students (or a random sample) I'm effectively inflating the effect of my process.
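This drift back toward the average, usually called regression to the mean, is easy to see in a simulation. The sketch below (Python, with a made-up model of "true ability plus test-day noise") selects the bottom 10% of students on one test and simply re-tests them with no intervention at all:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model: each student has a stable "true ability", and any single
# test score is that ability plus noisy day-to-day variation.
true_ability = rng.normal(70, 10, 10_000)
test_1 = true_ability + rng.normal(0, 8, 10_000)
test_2 = true_ability + rng.normal(0, 8, 10_000)  # no intervention between tests

# Select only the "D/F students": the bottom 10% on the first test.
bottom = test_1 < np.percentile(test_1, 10)

print(f"Bottom 10% on test 1:  {test_1[bottom].mean():.1f}")
print(f"Same students, test 2: {test_2[bottom].mean():.1f}")
# The extreme group improves by several points with no change in teaching:
# part of their low first score was bad luck that doesn't repeat. Any real
# intervention effect measured only on this group gets that drift added on top.
```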

Selection (Selection Drift Bias)

Wait, wasn't selection bias listed as a bias in the article on External Validity? How can it be on both lists? Well, there's a lot that goes into selecting populations to measure, and problems with that selection process can result in either threats to external validity (how representative the work is of the population being studied) or threats to internal validity (whether the measurement is factually valid). To simplify things, for those that read the previous article, we already know that if you select the wrong audience, it can change the context of the results. However, within the study, if you change the audience definition in the middle you threaten internal validity. Go back to my yogurt example: if I choose 500 women to answer my pre-test, but then choose 500 men to do the post-test, I've effectively invalidated my experiment. This would be a failure of internal validity.

Experimental Mortality (Abandonment Biases)

Personally, I think the original name of Mortality Bias sounds quite morbid, but when it comes to medical research it's a real challenge that must be dealt with. If I'm testing a new drug and have my test group and my placebo group, I'll want to track the effects of the drug in both groups over time. Over a long enough period, I may have a situation where participants either pass away or quit the study. This "mortality" in participants results in a shifting mix of respondents that can inadvertently create a selection drift bias in the study. More commercially, we see this all the time in longitudinal research using consumers. A common research technique used in business is a recontact approach: I send a survey to a group of consumers asking for some input, I take some action based on that survey, and then I recontact the original survey participants to solicit feedback on the action I took. In most surveys of consumers you'd be lucky to get 10% of people to come back for a second survey; there's a high degree of abandonment. Even in point-in-time measurement, abandonment is a risk. You can do all of the heavy lifting to avoid selection biases and then lose your participants by asking them to take an overly long or complicated survey, with a large proportion of participants abandoning the research.
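A quick illustrative simulation (Python, assumed numbers) shows why abandonment isn't just a sample-size problem: if the people who drop out differ systematically from the people who stay, the second wave looks different even when nothing has changed.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical recontact study: 1,000 consumers rate satisfaction 1-10 in wave 1.
wave_1 = rng.normal(6.0, 2.0, 1_000).clip(1, 10)

# Assume unhappy participants are less likely to return for wave 2:
# the probability of returning rises with the wave-1 score (illustrative model).
p_return = 0.05 + 0.04 * wave_1
returned = rng.random(1_000) < p_return

# Suppose nothing actually changed: wave-2 scores are wave-1 scores plus noise.
wave_2 = (wave_1 + rng.normal(0, 0.5, 1_000)).clip(1, 10)

print(f"Wave 1 average (everyone):       {wave_1.mean():.2f}")
print(f"Wave 2 average (returners only): {wave_2[returned].mean():.2f}")
print(f"Recontact rate:                  {returned.mean():.0%}")
# The wave-2 average looks higher only because unhappy participants dropped
# out, not because satisfaction actually improved.
```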

Selection-Maturation Interaction (Faster Maturation Bias)

Another type of selection bias that could be considered less common is the selection-maturation interaction. This occurs very specifically when selecting multiple groups for an experiment. A selection-maturation interaction occurs when one group you selected is likely to mature (see above) at a faster rate than the others. This means that even if you know there's a maturation bias in your data, you may have a hard time correcting it because it's not equally present in all the groups you measured. As an example, let's say I want to measure the happiness of two groups of college students. To pick the groups, I pick two classes taken by students in the same academic program (let's say finance): one is a Spanish class of 100 students, the other a psychology class of 100 students. All of the students are in the finance program, so ostensibly both groups are picked from the same pool of people, and choosing the two classes gives me an easy way to create two roughly random populations of finance students. At the beginning of the school year, I administer the happiness questionnaire and see there's no difference between the classes. Throughout the school quarter I conduct happiness seminars and activities amongst all the students. At the end of the first quarter, I administer the questionnaire again and see a difference, but strangely a bigger difference amongst the psychology students. It could be that the difference I see in the results is not because of the seminars alone, but because I selected psychology students who spent the quarter introspecting on the psychology of human behavior. That group has matured in its perspective on happiness relative to the students in the Spanish class. So, by selecting two groups that mature at different rates, it's difficult to understand the true impact of the program.

Internal validity is about knowing your experiment is working properly. You can work exhaustively to make sure your measurement is representative of the audiences you're looking to measure, but if the basic tools or methods you're using are flawed, it's all for naught. The good news is there are ways to try to control for internal validity issues, and the first objective is simply to be aware that they exist. Similarly to how I tackled External Validity, I've put together a list of questions you can use to understand Internal Validity in data.

Event Biases: Are you comparing groups from different time periods? The change you see in the groups could be from something that changed over those time frames other than what you’re measuring.

Perception Shift Biases: Does your experiment run over a long time frame with the same group of people? Is there a risk their excitement about participating will change over time?

Response Bias: Is the person participating in your research familiar with the research? Have they participated recently?

Instrument Bias: Are you asking questions in your research that are leading the participant to answer in a specific way?

Instrument Change Bias: Did you change your data collection methods (e.g. survey) in the middle of your project?

Extreme Selection Bias: Are you selecting participants from the statistical extremes of a group and generalizing the effect to everyone?

Selection Drift Bias: Did the definition of who you were measuring change in the middle of your study?

Abandonment Bias: Is a large proportion of participants leaving the experiment partway through?

Faster Maturation Bias: Are you comparing two groups where something outside of the measurement is changing one group's responses faster than the other's?

Reading through this, it might seem like a lot to know and remember when designing an experiment or research study. The truth is that all research suffers from biases in some way, and you'll never be able to avoid them entirely in your research project. The best you can do is enumerate them and think about how they relate to the conclusions you're drawing from the data. Understanding internal and external biases is part of the critical thinking process, and the more people who understand these concepts, the more useful experiments and data will become. The worst thing you can do is be ignorant of the biases in your data and make reckless decisions from it.

See Part 1 of this article:
A Basic Guide to Evaluating Validity of Experiments & Data (Part 1)


Marc Ryan

Long time media and adtech researcher. Recovering Chief Product Officer & Chief Data Officer. Now Chief Technology Officer. https://www.linkedin.com/in/marcryan