What are data?

Erin Fahrenkopf
Jul 24, 2017 · 3 min read

Data are a lens to see the world. They are systematic and realized. Data capture codified aspects of what happened. So data can be anything articulated or measured and then documented and captured.

Data on an Indian restaurant’s performance may include records of all customer orders, so that each observation is one customer order. These data may include variables on the dishes purchased, the date and time the order was placed, which dishes were ordered together, whether a dish was sent back to the kitchen, and how many people were in the party. All of these can easily be documented and turned into numeric variables. The deliciousness customers enjoyed, however, is harder to capture in data form because deliciousness lacks discrete, tangible observables.
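
As a rough illustration of "one observation per order", here is a minimal Python sketch. The `Order` class, its field names, the `to_numeric` helper, and the three sample orders are all made up for this example; they are not from any real restaurant system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass
class Order:
    """One observation: a single customer order (hypothetical schema)."""
    placed_at: datetime
    dishes: Tuple[str, ...]
    party_size: int
    sent_back: Tuple[str, ...] = ()


# Invented sample records, for illustration only.
orders = [
    Order(datetime(2017, 7, 21, 19, 5), ("saag paneer", "naan"), 2),
    Order(datetime(2017, 7, 21, 19, 40), ("chana masala", "naan", "samosa"), 3,
          sent_back=("samosa",)),
    Order(datetime(2017, 7, 22, 12, 15), ("saag paneer",), 1),
]


def to_numeric(order):
    """Turn one documented order into purely numeric variables."""
    return {
        "hour": order.placed_at.hour,
        "n_dishes": len(order.dishes),
        "party_size": order.party_size,
        "n_sent_back": len(order.sent_back),
    }


rows = [to_numeric(o) for o in orders]
```

Notice that there is no obvious field for deliciousness: everything in `rows` is something a cashier or a point-of-sale system could write down directly.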

If we want to use the data, analyze them, and learn something useful, we can keep in mind a few characteristics of data that will help us learn from them.

First, we can reflect on the extent to which the data are observational, that is, lack manipulation. Data are considered observational when they record what was observed or happened without any interference designed to optimize their content for later analysis. Data from an experiment involve manipulation and are not observational, while most “internet data” (aside from A/B tests, which produce experimental data) are observational. Observational data require more assumptions than experimental data to identify causal mechanisms.
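
To see why observational data need extra assumptions, here is a small simulation sketch. Everything in it is invented for illustration: a hypothetical restaurant offers some parties a free appetizer, the true effect on the bill is set to 5, and in the observational case larger parties are more likely to get the offer.

```python
import random
from statistics import mean

TRUE_EFFECT = 5.0  # assumed effect of the offer on the bill (made up)


def estimated_effect(randomized, n=20000, seed=0):
    """Difference in mean bill between offered and not-offered parties."""
    rng = random.Random(seed)
    offered_bills, other_bills = [], []
    for _ in range(n):
        party_size = rng.randint(1, 6)
        if randomized:
            # Experiment: a coin flip decides who gets the offer.
            offered = rng.random() < 0.5
        else:
            # Observational: bigger parties get the offer more often.
            offered = rng.random() < party_size / 7
        bill = 15.0 * party_size + rng.gauss(0, 5)
        if offered:
            bill += TRUE_EFFECT
        (offered_bills if offered else other_bills).append(bill)
    return mean(offered_bills) - mean(other_bills)


experimental = estimated_effect(randomized=True)
observational = estimated_effect(randomized=False)
print(f"experimental estimate:  {experimental:.1f}")
print(f"observational estimate: {observational:.1f}")
```

In the randomized version the simple difference in means should land close to the assumed effect. In the observational version it mixes the effect of the offer with the effect of party size, so recovering the true effect would require assuming we know about, and can adjust for, party size.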

Second, we can reflect on the distance between what is captured in the data and what we want the data to capture. This gap has two sources: measurement error, when our data collection method does not precisely capture the variables of interest, and reliance on proxies, when we use variables that are more readily available than the concepts or parts of the world we actually want in the data.

For example, we could use data on incomes to understand whether people have enough resources to thrive. We could use individuals’ salaries to measure their income, but this introduces error because people also receive income from investments and gifts. There is also a gap between income and access to resources, because income is only a proxy: people may spend their income in ways that do not add to their own resources, or they may get access to resources from wealthy friends who open up their houses, pools, and pantries.

We can run into problems when the variance of the distance between what is in the data and what we want to know is not constant, or when the size of this error is correlated with a variable we are interested in. The best case is when the difference looks like “white noise”: random error that is unrelated to anything else in the data.
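
The contrast between white-noise error and correlated error can be sketched with simulated numbers. This is a stylized example under made-up assumptions: salaries are drawn uniformly between 20 and 120 (in thousands), and the "error" is the part of income that salary misses. The `pearson` helper is written out by hand so the sketch needs only the standard library.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)
salary = [rng.uniform(20, 120) for _ in range(5000)]  # hypothetical salaries

# Case A: the missed income is white noise, unrelated to salary.
error_white = [rng.gauss(0, 5) for _ in salary]

# Case B: the missed income (say, investments) grows with salary.
error_correlated = [0.2 * s + rng.gauss(0, 5) for s in salary]

corr_white = pearson(salary, error_white)
corr_correlated = pearson(salary, error_correlated)
print(f"white-noise error vs salary: {corr_white:.2f}")
print(f"correlated error vs salary:  {corr_correlated:.2f}")
```

In case A, comparing people by salary gives roughly the same picture as comparing them by income. In case B, salary systematically understates the income of high earners, so any comparison across salary levels is distorted by the error itself.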

Third, we can reflect on what pie-in-the-sky distribution the data are likely drawn from. We can think of our current dataset as one realization of many possible datasets, all drawn from the same pie-in-the-sky distribution. If we know which distribution our dataset comes from (maybe a Normal or a Power Law distribution), then we can better understand which data values (such as the number of customer orders) are likely in the future and which are highly improbable.
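
As a minimal sketch of this idea, suppose (purely as an assumption) that daily order counts come from a Normal distribution. The ten counts below are invented, and the thresholds 95, 115, and 130 are arbitrary; `NormalDist` is part of Python's standard `statistics` module.

```python
from statistics import NormalDist

# Hypothetical daily order counts: one realized dataset.
daily_orders = [112, 98, 105, 120, 101, 95, 110, 108, 99, 115]

# Estimate the assumed Normal distribution from the sample.
model = NormalDist.from_samples(daily_orders)

# Probability of a future day with more than 130 orders.
p_very_busy = 1 - model.cdf(130)

# Probability of a future day with between 95 and 115 orders.
p_typical = model.cdf(115) - model.cdf(95)

print(f"estimated mean: {model.mean:.1f}, std dev: {model.stdev:.1f}")
print(f"P(orders > 130)       = {p_very_busy:.4f}")
print(f"P(95 <= orders <= 115) = {p_typical:.2f}")
```

These probabilities are only as good as the distributional assumption. If the data really came from something heavy-tailed like a Power Law, a Normal model would badly understate how often extreme days occur.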


Originally published at ablifeing.blogspot.com.
