Imagine you wake up in a different universe and visit a lab. The scientists are busy injecting mice and pipetting reagents. You ask them to describe their work and they respond, “we’re putting various things together and looking for something interesting”. When you try to explain the concept of the scientific method (forming hypotheses, using controls, recording procedures), they reply, “oh, ‘science’ is just a term people invented to justify getting paid more”.
Although this sounds like a ridiculous story, it often describes discussions of data science. When I hear “data science is just statistics”, I feel the same unease you would feel on hearing “bench science is just pipetting”.
Data science is the application of the scientific method to data: making testable hypotheses and evaluating them with sound, reproducible design, using controls and baselines, accounting for how many tests you ran to get a positive outcome, compiling multiple lines of evidence, and presenting that evidence clearly.
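To make the point about counting your tests concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a recipe from any particular study: the function names `false_positive_rate` and `bonferroni_threshold` are hypothetical, the tests are assumed to be independent, and every hypothesis is assumed to be truly null, so each p-value is uniform on [0, 1]. The Bonferroni correction shown is just one common, conservative way to adjust for multiple tests.

```python
import random


def false_positive_rate(num_tests, alpha=0.05, trials=10_000, seed=0):
    """Estimate how often at least one of `num_tests` null tests looks 'significant'.

    Assumption: every hypothesis is truly null and the tests are independent,
    so each p-value is simulated as a uniform draw on [0, 1].
    """
    rng = random.Random(seed)
    studies_with_a_hit = 0
    for _ in range(trials):
        # One simulated "study": run num_tests tests, report if any p < alpha.
        if any(rng.random() < alpha for _ in range(num_tests)):
            studies_with_a_hit += 1
    return studies_with_a_hit / trials


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / num_tests


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(f"tests={n:2d}  "
              f"chance of a false positive ~ {false_positive_rate(n):.3f}  "
              f"Bonferroni threshold = {bonferroni_threshold(n):.4f}")
```

Under these assumptions, the chance of at least one false positive across 20 tests at a 0.05 threshold is 1 - 0.95^20, or roughly 0.64, which is why a single "positive outcome" means little without knowing how many tests were run to find it.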
In bench science, you apply these principles using reagents and machines to make discoveries. Science doesn’t just happen when you pour reagents together. You need to structure a sound design based on what you know, with a well-considered plan to test what you don’t. In data science, your reagent is data. Data science is about making sure you have the right reagents (data), that they are not contaminated or expired, and that you use them in well-designed experiments.
Though I would think this would seem self-evident, and the distinction between science and data science irrelevant, the number of times I hear “data science is statistics” or “data science is just something to get you paid” makes me think it’s quite important to clearly articulate this definition of data science.
This article is an outgrowth of a discussion with good friend and former colleague David Shaywitz, who asked “what is a data science mindset” and “how is it different from what pharma does”. I think David found this answer unsatisfying because he feels that this is what pharma already does.
Generally, I care more about what data science is. I’m not here to throw shade on pharma practices. I’ve seen both well-designed and poorly designed research at the bench and at the terminal, in pharma, tech, and academia. However, I think I can articulate some important distinctions:
Data as an Input Versus Data as an Output
I think that there is one large cultural difference between “data science” and traditional bench science (and this difference holds true in many but not all pharma settings). Traditionally, people are used to considering data as an output of experiments.
It’s not like the concept of the scientific method is foreign in pharma. David’s example of randomized clinical trials is one of the most rigorous manifestations of scientific design and practice you can have. But in a trial, look at where all of the emphasis on experimental design goes. It all goes into designing the enrollment, procedures, and collection. It is true that managing data is an essential part of a trial, but not in a way that affords any agency or discovery in the data. In fact, a trial is specifically designed (and rightly so) to limit the ability to do anything but yield a single, fixed, statistically valid outcome.
The difference in data science is that data is an input. The problem is that many are conditioned to think of data as the object of value which comes out of experiments, so there is an attitude that just by compiling a big enough warehouse of these valuable outputs, discoveries will naturally arise. Unfortunately, it’s very difficult to design effective studies when you don’t have control over the processes that give you your inputs (imagine if, in enrolling a clinical trial, you had to take any random person, whether they had the disease of interest or not).
Although it would be an amazing TV experience to put six scientists in a lab full of randomly assembled reagents with an “Iron Chef” challenge to make a discovery, it would be pretty hard to do good science this way. (I would definitely watch this, though!)
As an aside, I think some of this mentality explains the tension around “data parasites”. When you think of data as the valuable output of your experiments, which naturally contains your valuable papers, you are protective of it. When you see data as the starting input for well-structured science, this mentality seems weird. Since you are doing the same type of scientific design, you feel more like a data symbiont than a parasite.
When “data science” began to be a widely used term, I was initially quite hesitant. I had some of the confusion and objections that I often hear (e.g., “What is it?”). But as I settled on this understanding of it, I became much more of an advocate for the term.
These thoughts are certainly incomplete. There is far more to say about what data science is than what is here, but I hope this makes the case for what we should strive for data science to be.
It is very hard to do good, rigorous science. It also isn’t a binary thing; you can always design your experiments a little bit better. There is always room for growth and improvement. Whether we are at the bench or at the terminal, it’s good to be reminded that the processes that have led to huge leaps in understanding, innovation, and quality of life in the last several hundred years are not magic; they are driven by a well-tested set of principles. That is science, whether data or otherwise.