Observational Vs Experimental Data: What’s The Difference?
In a previous article we talked about data types in a dataset, but there’s more to it than that. In this article we’ll talk about how data is collected and how it’s structured.
Depending on how data is collected, we can describe it as observational or experimental. Understanding the difference is important because it impacts how we interpret and analyze the data. When we don’t understand the difference, we end up making common mistakes (e.g. treating correlation as causation). If we want to be serious about working with data (especially as data scientists or statisticians), we have to be serious about the methods for collecting data.
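To make the correlation vs. causation mistake concrete, here is a small simulation sketch. Everything in it is made up for illustration: the variable names (temperature, ice_cream_sales, sunburn_cases), the coefficients and the noise levels are all assumptions, not real data. The idea is that a hidden third variable drives two others, so they end up strongly correlated even though neither one causes the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical hidden driver: daily temperature.
temperature = [rng.uniform(10, 35) for _ in range(1000)]

# Both series depend on temperature plus noise, but not on each other.
ice_cream_sales = [5 * t + rng.gauss(0, 10) for t in temperature]
sunburn_cases = [2 * t + rng.gauss(0, 5) for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation between sales and sunburns: {r:.2f}")
```

The printed correlation comes out strongly positive, yet in this toy setup ice cream has no effect on sunburns by construction. If we only had the two observed columns, we could not tell that from the numbers alone.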
Observational Data
Observational data is collected through (you guessed it) observation. This means we record what happens without intervening in it. Anything that can be heard or seen can be collected, and hearing and seeing also applies to computers (e.g. website trackers).
It’s very likely that most of the data in your company is observational, mostly because it’s easier to obtain. If you’re already generating data, why not store it, right? Observational data is the easiest to collect (and it’s practically free).
Observational data might be things like website data (visits, clicks, time spent on site, etc.), sales, emails, number of calls, etc. Most of the time this data is stored for a reason (we might not know how to analyze sales data, but we know it’s important enough to store it), but there are cases where we collect data without a specific reason (we don’t know how to use it now, but it might come in handy later).
Observational data can also be called found data, although found data mostly refers to data that is a byproduct of other things. For example, a social media post is data by itself, but if we could also collect the relationships found in that post (with other users, pages, etc.), we would also have found data.
Experimental Data
Experimental data is a tougher cookie. It’s collected by doing experiments through the scientific method with a prescribed methodology. This means that experimental data is not passively collected.
That sounds like a mouthful, but it’s basically doing experiments to get data.
Experimental data is collected for a specific purpose or question. You don’t just stumble upon it. That’s why it’s the Wagyu beef of statisticians and data scientists. The thing is that, just like Wagyu beef, it’s also harder and more expensive to collect.
A treatment must be randomly assigned to the subjects of the experiment (to avoid any bias that might make the data unreliable). Depending on what the experiment is about, the guidelines for collecting the data might change, but in essence they all aim to increase reliability. Common mistakes with experimental data include not controlling for confounding variables, using small sample sizes and using the wrong statistical tests.
Clinical drug trials are a classic example. Remember how they always split the patients into groups, and one group is given placebos? Well, that’s part of increasing reliability. Patients are chosen based on similar characteristics and without other conditions that could interfere, to get the best possible “clean slate”.
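The random split into groups can be sketched in a few lines. This is only a minimal illustration, not how a real trial is run: the assign_groups function, the patient IDs and the 50/50 split are all hypothetical, and real studies often use more careful schemes (for example, balancing groups by age or sex).

```python
import random


def assign_groups(unit_ids, seed=None):
    """Randomly split units into a treatment group and a control group.

    Hypothetical helper for illustration: shuffles a copy of the IDs
    and cuts the list in half, so chance alone decides who gets what.
    """
    rng = random.Random(seed)
    shuffled = list(unit_ids)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Made-up patient IDs, not real data.
patients = [f"patient_{i:02d}" for i in range(1, 21)]
groups = assign_groups(patients, seed=42)

print("treatment:", groups["treatment"])
print("control:  ", groups["control"])
```

Because the shuffle decides who lands in each group, neither the researcher nor the patient can steer the assignment, which is the bias the randomization is meant to avoid. The fixed seed is only there so the example gives the same split every time it runs.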