What Type of Data Is It Anyway?

Published in Technology @ Prospa · 6 min read · Aug 31, 2023

Overview

Meet Squeek and Jinx (on the left and right, respectively). Aside from being a shameless plug for my furbabies, I’m enlisting their help to describe the different data types we all encounter daily. This article aims to give everyone who interacts with data (data practitioners and non-data muggles alike) a deeper understanding of what sort of data we’re looking at, which operations we can and cannot use on it, and some common pitfalls to avoid.

The table below summarises what this article covers and serves as a quick go-to reference once we’ve gained an intuition into the different types:

The Nitty Gritty

So let’s say that, for <insert reason>, we decide to capture the details of Squeek and Jinx in a data table, as shown below:

Nominal Data

nominal [adjective]: (of a role or status) existing in name only.

Let’s start with the Name column in this table, which is a perfect example of nominal data. Intuitively we all know that we can confidently say that Squeek <> Jinx. But does it make sense to say Squeek > Jinx? Or what about Squeek + Jinx = ? Or what about Jinx / Squeek = ? And can we average this column??

This list of questions illustrates the defining characteristic of nominal data: it serves only to label a record (or cat). We can neither infer an order from it nor aggregate it. When analysing this sort of data, the most we can do is count occurrences. And if we want a sense of its central tendency (i.e. the typical value), we can’t use averages, but we can use the mode.

mode: the most frequently recurring value

In this example, there is no meaningful mode because each furball appears exactly once. But if we recorded the same data for a rescue shelter, we could use the mode to identify the most frequently occurring name. Other examples of nominal data that we could have recorded are colour and breed.
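As a rough illustration of counting and finding the mode, here is a minimal Python sketch. The shelter names are made up for the example; only Squeek and Jinx come from this article.

```python
from collections import Counter
from statistics import multimode

# Hypothetical shelter records: names are labels only (nominal data).
shelter_names = ["Squeek", "Jinx", "Luna", "Jinx", "Milo", "Luna", "Jinx"]

# Counting occurrences is a valid operation on nominal data.
name_counts = Counter(shelter_names)

# multimode returns every value tied for most frequent, in first-seen order.
most_common_names = multimode(shelter_names)

# With one record per cat, every name is tied, so no single mode stands out.
two_cats = multimode(["Squeek", "Jinx"])

print(name_counts["Jinx"])  # 3
print(most_common_names)    # ['Jinx']
print(two_cats)             # ['Squeek', 'Jinx']
```

Notice that nothing here sorts, adds, or averages the names. Counting is as far as nominal data will take us.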

Watch out! The ‘Id’ column in this table looks temptingly numeric, but it’s just another nominal column. For example, is Jinx’s Id twice as big as Squeek’s Id? What is the average of Ids? While you can certainly apply these operations in a spreadsheet, the results would be nonsensical.

“But what about the order of the Ids, Brad? Surely having a larger Id means something about the time of allocation?” The answer is… maybe. It all depends on how Ids are assigned (numbered in sequence or not) and what type of information you’re trying to uncover. An Id could serve as a proxy for order, but that is probably best avoided because it relies too heavily on potentially false assumptions about how these values are assigned.

Ordinal Data

ordinal [adjective]: relating to the order of something in a series.

Now to the Size column in our table. Seeing that Squeek is small and Jinx is large, we now have data from which we can infer order. In this example, we know that Jinx is larger than Squeek. But what we don’t know from this column is how much larger. This is the key distinction between ordinal data and interval or ratio data: we can see the hierarchical order, but the gap between ‘small’ and ‘large’ is neither evenly spaced nor measurable. For this reason, we can’t add, subtract, divide, multiply, or average this column. We can, however, sort it. And to measure central tendency, we can again use the mode.

Watch out! This data type often catches us out when it is turned into numbers instead of words. Consider a survey that asks us to rate, on a scale of 1–5 with 5 being total agreement and 1 being total disagreement, how much we agree with the statement that cats are better than dogs. We often see researchers report the results of these questions using averages, or describe one result as being x times another, but that is not technically correct. In this example, if I answer 4 and you answer 2, does that mean I like cats twice as much as you do? How can we be sure that the distance between your 2 and your 4 is the same as mine? Simple: we can’t. This is what we mean when talking about evenly spaced distances between values. With ordinal data, we have no concept of the distance between your numbers and mine, but we do know the order.

Conveniently, turning ordinal data into a numeric scale (as in the example above) does afford us one advantage: it makes it easy to measure central tendency using the median (but not the average!), since the median relies only on order. In simple terms, we get the median of a set of values by sorting them and picking the middle one.

median: the middle of a set of numbers
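To make the sort-and-pick-the-middle idea concrete, here is a small Python sketch. The ‘medium’ label, the five cat sizes, and the survey answers are all invented for illustration. The code uses `median_low` from Python’s standard library because it always returns an actual data point instead of averaging the two middle values.

```python
from statistics import median_low

# Assumed ordering of the size labels, smallest to largest.
SIZE_ORDER = ["small", "medium", "large"]
RANK = {label: position for position, label in enumerate(SIZE_ORDER)}

# Hypothetical sizes for five cats.
sizes = ["large", "small", "medium", "small", "large"]

# Sorting needs only the order, not the distance between labels.
sorted_sizes = sorted(sizes, key=RANK.__getitem__)

# Take the median of the ranks, then map it back to its label.
median_size = SIZE_ORDER[median_low(RANK[size] for size in sizes)]

# Hypothetical answers to the 1-5 "cats are better than dogs" question.
survey_answers = [4, 2, 5, 3, 4]
median_answer = median_low(survey_answers)

print(sorted_sizes)   # ['small', 'small', 'medium', 'large', 'large']
print(median_size)    # medium
print(median_answer)  # 4
```

The ranks 0, 1, and 2 are used purely for ordering here. Averaging them would quietly assume the gaps between sizes are equal, which is exactly what ordinal data doesn’t promise.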

Interval Data

interval [noun]: a period between two events or times, or the space between two points.

Now consider the ‘Year Born’ column in our table. As with the previous data types, we can infer:

whether Jinx and Squeek were born in the same year; and
the order in which they were born

But this data type gives us one extra detail: the distance between their birth years. We know that there are 365(-ish) days in a year, and Jinx was born two years before Squeek. Because these distances are evenly spaced, we can meaningfully add or subtract values of this data type as well as calculate their averages.

However, this data type doesn’t have a meaningful zero. Think about it: the year zero doesn’t mean that there are no years (and don’t forget those B.C. years). The same goes for temperature in Celsius or Fahrenheit, where zero doesn’t mean the absence of heat. (Temperature recorded in Kelvin is the exception here because its zero is the coldest cold there can be.) Because there is no true zero, we can’t meaningfully divide, multiply, take square roots of, or exponentiate this data. E.g. is the year 2000 twice the year 1000?
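Here is a minimal Python sketch of why differences and averages hold up for interval data while ratios don’t. The birth years are hypothetical (all we know is that Jinx was born two years before Squeek), and the temperatures are arbitrary example readings.

```python
from statistics import mean

# Hypothetical birth years: Jinx was born two years before Squeek.
jinx_born, squeek_born = 2010, 2012

# Differences and averages are meaningful for interval data.
age_gap_years = squeek_born - jinx_born
average_birth_year = mean([jinx_born, squeek_born])


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9 / 5 + 32


cool_c, warm_c = 10.0, 20.0
cool_f = celsius_to_fahrenheit(cool_c)
warm_f = celsius_to_fahrenheit(warm_c)

# The "ratio" changes with the scale's arbitrary zero, so it tells us nothing.
ratio_in_celsius = warm_c / cool_c     # 2.0
ratio_in_fahrenheit = warm_f / cool_f  # 1.36
```

The same two temperatures look “twice as warm” in Celsius and only 1.36 times as warm in Fahrenheit, which is a good hint that neither ratio means anything.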

Watch out! This is the most difficult data type to differentiate from ratio data. Remember to ask yourself if zero is actually zero for your data.

Ratio Data

ratio [noun]: the quantitative relation between two amounts showing the number of times one value contains or is contained within the other.

Now consider the ‘No. of Teeth’ column. While I have no idea how many teeth Jinx has, Squeek actually has none (aside from being a touch drooly, he’s fine). With this data we can tell that 18 teeth is more than zero, and that the gap between 1 and 2 teeth is the same as the gap between any other consecutive counts: exactly one tooth. What’s more, because this data has a meaningful zero (zero teeth = no teeth), we can turn these values into ratios! If Jinx had 18 teeth and Squeek had half as many, he would have nine. And we can apply all the operations and aggregations to this data, including geometric means (the description of which falls outside the scope of this article).
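A quick Python sketch of the ratio idea follows. Jinx’s count of 18 is purely hypothetical (his real count is unknown), and `share_of_teeth` is a helper name made up for this example.

```python
from fractions import Fraction


def share_of_teeth(teeth: int, reference_teeth: int) -> Fraction:
    """Return `teeth` as an exact fraction of `reference_teeth`."""
    if reference_teeth <= 0:
        raise ValueError("reference_teeth must be a positive count")
    return Fraction(teeth, reference_teeth)


jinx_teeth = 18   # hypothetical: the real number is unknown
squeek_teeth = 0  # from the article: Squeek has none

# Nine teeth really is half of eighteen, because zero teeth means no teeth.
half_as_many = share_of_teeth(9, jinx_teeth)

# Zero is a genuine value here, not an arbitrary point on a scale.
squeek_share = share_of_teeth(squeek_teeth, jinx_teeth)

print(half_as_many)  # 1/2
print(squeek_share)  # 0
```

The only ratio the helper refuses is one with a zero reference count, since “how many times more than nothing” has no answer.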

Conclusion

This article shows the different data types we interact with and how prevalent they are, even if we’re just creating a table of cat characteristics. Understanding the key differences between them will go some way to understanding how to interrogate your data more effectively without needing to be a flash data analyst or stats whizz.

This article is by no means exhaustive - and there are certainly more operations and tests that can only be applied to certain data types - but this knowledge forms the basis that you will need to explore these further…