A Practical Guide to the Scientific Method and the Evaluation of Research Articles
Science is not a moral framework, a claim on human values, or the opinion of a specific person. Science is a philosophy of deep skepticism. It is our collective best guess at what is true, given the information that we have, under the assumption that anything that cannot be absolutely verified is not true. The systems we have designed to make these best guesses as accurate as possible are astounding and should be known to every educated person. The limits and potential corrupters of science should also be known, so that it is clear what science can and cannot tell us. I am not by any means the foremost authority on the scientific method, but I am in the business of running research trials and analyzing research in order to make the best decisions for my patients. In medicine, we need to know that our best guesses are as close to the truth as possible before administering a treatment, or the results can be disastrous. So let's get into how we can be so sure.
The Black Plague was a bacterial infection that claimed the lives of approximately 50 million people during the 14th century. At the time, scholars were split between believing that the infection was a punishment for nonbelievers and believing that it was caused by a misalignment of the planets. Given what we know about the bacterium that caused the plague, Yersinia pestis, and our current ability to knock it out with antibiotics, we know now that those explanations were not true. But these claims illustrate what smart people thought was happening before we had the ability to gather and examine evidence in a standardized fashion.
Philosophy, though always aimed at figuring out what was true, became much more systematic in that endeavor over the following centuries. René Descartes noticed that the only thing he could say for certain was true was his own consciousness (I think, therefore I am). From there, he adopted a method of extreme skepticism and doubt toward everything else. He found doubt itself to be self-evident, because if you doubt a doubt, it's still a doubt. He then built his philosophy from there, leaving the following maxims to the scientific community: doubt everything, break every problem into smaller parts, solve the simplest problem first, and be thorough.
The first of these is the most important and the basis of modern science: a high degree of skepticism for every claim, with the default being to believe that the claim is not true. This was pretty smart, looking back, because it reduces the veracity of every claim to only what can withstand the highest level of scrutiny. In scientific research, we codified this as the null hypothesis. The null hypothesis says that there is no difference between two sets of observations. So if I were to say that drug X increases blood pressure, the null hypothesis would be that no, it does not: the observations I made with drug X are no different from observations without it. The alternative hypothesis is the claim that there is a difference between these observations. Because the null hypothesis assumes that a claim is not true, we use it as the default.
To reject the null hypothesis, we need a level of evidence that assures us that what we are seeing is not due to chance. How do we do this? By determining how likely we would be to see this effect if it were random. A P-Value, expressed as a number between 0 and 1, is exactly that: how likely we would be to see an effect at least this large if it were just coincidence. For instance, a P-Value of .1 indicates a 1 in 10 chance, while a P-Value of .01 indicates a 1 in 100 chance. As P becomes lower, we become more and more certain that we are seeing a true effect that is not due to chance. When a research article states that the threshold was set at P = .05, it is saying the authors will accept up to a 1 in 20 chance that what they found is bullshit, a fluke that would have shown up even if nothing were going on. This is why you should always look at the P-Value in research articles.
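If you like seeing this concretely, here is a minimal sketch in Python of how a P-Value gets computed for our hypothetical drug X. Every number is invented for illustration, and I am using a simple two-sample t-test as one common way to do it, not the only way:

```python
# A minimal sketch: testing the null hypothesis that drug X does not
# change blood pressure. Drug X and every number here are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate 100 systolic readings per group; in this toy world the drug
# truly adds about 5 mmHg on top of a baseline of 120.
control = rng.normal(loc=120, scale=15, size=100)  # sugar pill group
treated = rng.normal(loc=125, scale=15, size=100)  # drug X group

# The two-sample t-test asks: if the null hypothesis were true, how
# likely is a difference at least this large? That probability is P.
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
# If P falls below .05, we reject the null hypothesis at the
# conventional threshold described above.
```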
How do we determine that an effect would only rarely be found by coincidence? One of the ways is to ensure that we have a good sample size. A sample size, signified by (n), is the number of observations we have. So if I enroll 1000 people into a study, that would be n=1000. This is also important to look at when examining a research article: the larger the sample size, the more robust the evidence. How we get from a sample of observations to a P-Value is a matter of statistics and beyond the scope of this article. But a large sample makes it far easier to detect a true effect, and a low P-Value backed by a large sample indicates that the evidence against the null hypothesis and for the alternative hypothesis is robust.
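As a rough illustration of how statisticians pick a sample size before a trial even starts, here is a sketch using the statsmodels library. The effect size of 0.3 is a made-up assumption, not a number from any real trial:

```python
# A sketch of a power calculation: how many subjects do we need per
# group to reliably detect a modest effect? The effect size is assumed.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the group size that gives an 80% chance of detecting a
# standardized effect of 0.3 at the P = .05 threshold.
n_per_group = analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"n per group: {n_per_group:.0f}")  # roughly 175 per group
```

The point of the calculation is the direction of the relationship: the smaller the effect you hope to detect, the larger the sample you need.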
Once we know that an effect is there, it then becomes important to determine how big the effect is. We call this the effect size or correlation. It is very important to look at the effect size in a research article, because if the effect size is small but the P-Value is sufficiently low, it means that while we are relatively sure the effect exists, it is a small effect overall. Expressed as a correlation, effect size is a number between -1 and 1, with the following rough parameters: 0 to .2 is small, .2 to .5 is medium, and anything above .5 is high. A positive number indicates a positive correlation; a negative value, a negative correlation. For instance, if an article found a small increase in blood pressure for every 10mg of drug X we give someone, that might show up as a correlation of .1. A large increase in blood pressure might show up as an effect size of .8.
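Here is a small simulated sketch of what a small-but-real effect looks like, again with invented numbers for our hypothetical drug X:

```python
# A sketch of effect size as a correlation. Dose and blood pressure
# values are simulated; drug X is hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000
dose = rng.uniform(0, 50, size=n)                  # mg of drug X
bp = 120 + 0.1 * dose + rng.normal(0, 15, size=n)  # weak true effect

r, p = stats.pearsonr(dose, bp)
print(f"r = {r:.2f}, P = {p:.4f}")
# With this many observations, even a correlation near .1 comes back
# with a very low P: we are confident the effect is real, and equally
# confident that it is small.
```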
However, just because two variables are correlated, that doesn't mean that one causes the other. For instance, maybe people who take drug X are also more likely to drink caffeine, and it's actually the caffeine that's making their blood pressure rise. Here we would say that caffeine is a confounding variable: a hidden variable that explains the effect we are seeing. This is what we mean when we say that correlation does not necessarily equal causation. For instance, there is a correlation between the divorce rate in Maine and the per capita consumption of margarine. Common sense tells us that one probably did not cause the other.
So how do we make sure that what we are seeing is causal and not due to a confounding variable? We do this through experiments that include a control group and an experimental group. A control group is a group of people who are not given the intervention, and an experimental group is a group of people who are. We design experiments to have both so that we can see the effect of an intervention directly while controlling for all other variables. So maybe I design a study that gives one group 10mg of drug X and the other an identical-looking sugar pill, while controlling the amount of caffeine everyone drinks. We then have a better sense of the direct effect of drug X on blood pressure.
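To see why this matters, here is a simulation in which caffeine, not drug X, is doing all the work. Everything here is invented, but it shows how a confounder can fake an effect and how randomization removes it:

```python
# A sketch of confounding: heavy caffeine drinkers are more likely to
# take drug X, and caffeine (not the drug) raises blood pressure.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1000

caffeine = rng.uniform(0, 400, size=n)  # mg per day, varies by person
# Observational world: the choice to take drug X tracks caffeine habits.
takes_drug = (rng.random(n) < caffeine / 400).astype(float)
# Blood pressure depends on caffeine only; drug X does nothing here.
bp = 120 + 0.03 * caffeine + rng.normal(0, 10, size=n)

r_obs, _ = stats.pearsonr(takes_drug, bp)

# Experimental world: we randomize who gets drug X, so caffeine habits
# are spread evenly across both groups.
randomized = (rng.random(n) < 0.5).astype(float)
r_rct, _ = stats.pearsonr(randomized, bp)

print(f"observational correlation: {r_obs:.2f}")  # clearly positive
print(f"randomized correlation:    {r_rct:.2f}")  # near zero
```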
Why would we want to give people in the control group a sugar pill rather than just giving them nothing? This is to make sure that the effect isn't just the placebo effect, which in this case would be however much someone's blood pressure rises simply because they received any sort of intervention at all (even a sugar pill). The placebo effect can be very powerful, so we must be sure that the drug's effect is greater than the placebo effect before claiming a causal relationship.
However, what if I were to tell people in the experimental group that they were receiving the actual treatment while telling the placebo group they were just getting sugar pills? Maybe my telling them that they are getting the real drug is what's making their blood pressure increase. This is why a good experiment includes blinding: making sure that whoever receives an intervention in an experiment does not know whether they are getting the real treatment or the placebo. We call an experiment double-blinded when neither the experimenter nor the subject knows who got the real drug and who got the placebo. We then further eliminate bias by randomizing who gets the placebo and who gets the real drug, so we can be sure that, of all the people enrolled in the experiment, there was nothing special about those given the real drug.
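A toy sketch of what random assignment might look like in code follows. The subject IDs are hypothetical, and in a real trial a third party would generate and hold this key so that neither the experimenters nor the subjects ever see it:

```python
# A sketch of double-blind random assignment. Subject IDs are made up;
# in practice the assignment key is held by a third party.
import random

random.seed(42)
participants = [f"subject_{i:02d}" for i in range(1, 21)]
random.shuffle(participants)

half = len(participants) // 2
assignment = {pid: "drug_x" for pid in participants[:half]}
assignment.update({pid: "placebo" for pid in participants[half:]})

# Both arms receive identical-looking pills labeled only with the
# subject ID, so nobody administering them knows who got which.
for pid in sorted(assignment):
    print(pid, "->", assignment[pid])
```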
This is why randomized controlled studies are very good at eliminating types of bias, of which there are many. A good study should control for the various biases that can contaminate it, and should include a discussion at the end of which biases may have been overlooked and how they may have affected the results. Observer bias comes from the experimenter knowing who got the real drug and perhaps giving those subjects subtle cues. Selection bias comes from enrolling a sample of people that is not representative of the population you're trying to study. For instance, if I were to ask college students whether or not they believed in God, I could not take those results and generalize them to everyone, college-educated or not. You therefore want to look at the selection criteria in an experiment and make sure they match the population the experiment is trying to study. Another type is measurement bias, where the tool with which we measure the outcome is itself skewed; in our case, we would want to make sure that the device measuring each subject's blood pressure is accurate. Publication bias is another big one to watch out for. It is the tendency for experiments that show an effect to be published more often than those that do not. As we saw above, at a threshold of P = .05 there is a 1 in 20 chance of a positive finding even when no effect exists. If the experiment is performed 20 times around the country, one of those experiments may show an effect just by chance, and that may be the only one that gets published.
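You can watch publication bias in miniature with a simulation. Here the drug does nothing at all, and we simply run the same null experiment 20 times:

```python
# A sketch of publication bias: 20 experiments on a drug with no effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
false_positives = 0
for _ in range(20):
    # Both groups are drawn from the same distribution: no real effect.
    control = rng.normal(120, 15, size=50)
    treated = rng.normal(120, 15, size=50)
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of 20 null experiments crossed P < .05")
# On average about 1 in 20 will. If only the "positive" one gets
# published, the literature shows an effect that does not exist.
```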
A type of bias that many people are concerned about is financial bias. Every study published in a reputable journal requires that the researchers disclose, at the end of the article, any financial interests that could interfere with the overall validity of the study. This is taken very seriously in the scientific and medical community, to the point that even PowerPoint presentations given during casual medical lectures disclose any financial interests the presenter may have.
A good study should also be replicable, meaning that a third party who runs the experiment in exactly the same way should find the same results. Since studies done the same way may still show slightly different results, the best form of evidence is the meta-analysis. A meta-analysis takes data from many different studies of the same type and finds the pattern across all of them. This lets us take the smaller sample sizes of individual studies and combine them into one big sample.
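For a sense of the arithmetic, here is a bare-bones sketch of a fixed-effect meta-analysis with three invented studies. Real meta-analyses are considerably more careful about study quality and heterogeneity, but the core idea of precision-weighted pooling looks like this:

```python
# A sketch of fixed-effect meta-analysis via inverse-variance weighting.
# The three studies and all their numbers are hypothetical.
import numpy as np

# (effect estimate in mmHg, standard error) from three small studies
studies = [(4.0, 2.5), (6.0, 3.0), (3.5, 2.0)]

effects = np.array([e for e, _ in studies])
weights = np.array([1 / se**2 for _, se in studies])  # precision weights

pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))
print(f"pooled effect = {pooled:.2f} mmHg, SE = {pooled_se:.2f}")
# Each small study is noisy on its own; pooling them behaves like one
# study with a much larger sample, shrinking the standard error.
```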
In general, when evaluating evidence, anecdotal evidence is seen as the least reliable form, as it involves an n of 1. For instance, we know that smoking generally decreases life expectancy, even though some people have met someone who lived to be 100 and smoked a pack a day. Above that, in terms of evidence, is the case series (literally a series of individual cases). Above that are cohort and case-control studies, which look for correlations between treatment and effect in a block of existing data but don't actually administer treatment in a randomized, controlled fashion. A randomized controlled trial is therefore the best level of evidence below a meta-analysis.
Peer review is also an important part of the scientific process and includes other experts in the field looking for bias, holes, or bad research procedures in a study before it is published. Remember, science is built on a philosophy of skepticism, and researchers enjoy poking holes in other people’s studies if they are not up to snuff.
Some other things to watch out for…
Beware of studies that imply that the correlations they have found are causal unless they demonstrate causation directly, as in a randomized controlled trial. Beware of studies that try to correlate too many things at once; this is often done in the soft sciences such as psychology and sociology. An example would be a study where I claim that frowning at people causes them to lose their jobs: maybe I found that frowning at people makes them unhappy, and then I reference another article that found a correlation between unhappiness and a proclivity for losing one's job. Beware of studies that make moral claims. Remember, science is not a claim on human values, and it is not a prescription for how people should live their lives.

Science also has a hard time answering questions that are too idiosyncratic, for practical reasons; for instance, there is no experiment on how the twitching of the buccinator muscle of my face correlates with how often I think of Westworld. Science cannot make claims about things for which we have no information. And scientific consensus can and often does change with new information. This is not a bug, it is a feature: the scientific community, when working effectively, adapts to information we did not previously have.
Science is the perfection of skepticism and questioning. It is a deep meditation on how we can say whether a claim is true or false. It is what builds rockets that don't blow up, and it is what helps us find life-saving treatments. It is the system on which every reputable doctor bets their patients' lives. It is what I have bet my life on, and the way I make practical decisions as a doctor. I hope this article has been illuminating. Please respond with any questions you may have or corrections to mistakes I may have made.