How to Identify Data Types
Fundamentals Series
Understanding the data at hand is key to any data science endeavour. Every variable in your dataset has a type, but mistyped variables are not uncommon: data types differ between languages and can become quite granular. In this article, we will look at the primary data types in mathematics, R, Python, and SQL.
Data types in mathematics
In mathematics and statistics, we split data according to whether they are quantitative or qualitative. Quantitative data allows inference to be drawn using numerical methods, whereas qualitative data only allows inference through comparison.
Quantitative data
In general, quantitative data refers to numerical information. We can further categorise data according to whether they are continuous or discrete.
Discrete variables present as specific points on a number line in one dimension, whereas continuous variables present as any number within a specified range.
Examples of discrete variables:
- A person’s age in years
- The number of cars on a road
- The difference in scores between teams in a sports match
Examples of continuous variables:
- A person’s height
- The length of a car
- The date and time of an event
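As a rough illustration of the distinction, the following Python sketch treats a variable as discrete when every observed value is a whole number. The function name `looks_discrete` and the sample values are hypothetical, and the check is only a heuristic: a continuous measurement recorded to the nearest unit would also pass it.

```python
def looks_discrete(values):
    """Return True if every value is a whole number (heuristic only)."""
    return all(float(v).is_integer() for v in values)


# Hypothetical sample data for illustration.
ages_in_years = [34, 27, 51]            # counted in whole years
heights_in_metres = [1.72, 1.65, 1.80]  # measured on a continuous scale

print(looks_discrete(ages_in_years))      # True
print(looks_discrete(heights_in_metres))  # False
```

In practice, whether a variable is discrete depends on what it measures rather than on how it happens to be stored, so a check like this is best used only as a first hint.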
In two or more dimensions, discrete functions present as disconnected points, whereas graphical representations of continuous functions are connected. Additionally, a function is called smooth if it is continuous and has a continuous derivative; otherwise, it is called non-smooth.
In mathematics, there are standard number sets that can be represented as a hierarchy of quantitative variables:
Discrete number sets
- ℕ: The set of natural numbers {1, 2, 3, …}
- ℤ: The set of integers {…, -3, -2, -1, 0, 1, 2, 3, …}
Continuous number sets
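- ℚ: The set of rational (quotient/fraction) numbers (strictly speaking, ℚ is countable, but it is dense on the number line, so it is grouped here with the continuous sets)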
- ℝ: The set of real numbers
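Python's standard library mirrors part of this hierarchy through the abstract base classes in its `numbers` module. The sketch below, with hypothetical sample values, checks which classes a few values belong to. Note that Python's `int` includes negative numbers, so it corresponds to ℤ rather than ℕ, and that `float` is only a finite-precision approximation of ℝ.

```python
import numbers
from fractions import Fraction

# Hypothetical sample values for illustration.
samples = {
    "integer 3": 3,
    "fraction 1/3": Fraction(1, 3),
    "float 0.5": 0.5,
}

# Each row prints: label, is Integral?, is Rational?, is Real?
for label, value in samples.items():
    print(
        label,
        isinstance(value, numbers.Integral),
        isinstance(value, numbers.Rational),
        isinstance(value, numbers.Real),
    )
```

Every integer is also reported as rational and real, and every fraction as real, which matches the nesting of the number sets. A `float` such as 0.5 is reported as real but not rational, even though the number 0.5 is mathematically rational; this is a property of how Python classifies the type, not of the number itself.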
Qualitative data
Qualitative data refers to data that requires interpretation, such as audio, visual or textual material, or the Likert scale responses that one might encounter in questionnaires and self-report surveys.
If the data consists of named categories, then it is nominal; if it has an inherent order, it is ordinal, regardless of whether it is textual or numerical.
- Nominal: From the Latin word nominalis, which means “pertaining to a name”. Includes nouns such as countries, cities, colours, brand names, and so on.
- Ordinal: From the Latin word ordinalis, meaning “an order or place in a series”. Includes placings such as 1st, 2nd, 3rd, and sets of words which have an inherent hierarchy such as {good, better, best}.
Ordinal variables are closely related to quantitative variables since numbers have an inherent ordering.
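The practical difference between nominal and ordinal data shows up when sorting. In the minimal Python sketch below, the `responses` list and the `rank` lookup are hypothetical; the point is that ordinal categories need their order stated explicitly, because alphabetical order does not respect it.

```python
# Ordinal: the order is part of the data, so we record it explicitly.
quality_levels = ["good", "better", "best"]  # lowest to highest
rank = {level: position for position, level in enumerate(quality_levels)}

# Hypothetical survey responses.
responses = ["best", "good", "better", "good"]

alphabetical = sorted(responses)                   # treats values as nominal
by_rank = sorted(responses, key=rank.__getitem__)  # respects the hierarchy

print(alphabetical)  # ['best', 'better', 'good', 'good']
print(by_rank)       # ['good', 'good', 'better', 'best']
```

For nominal data such as colours or city names, no such lookup exists, and any sort order is a matter of convenience only.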
A note on dates
Since dates come in various formats (e.g. day, month, year, date, date and time, seconds, seasons, financial quarters), their data type depends on the specific application.
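To make this concrete, the sketch below takes a single hypothetical event and derives three representations of it using Python's standard `datetime` module. The quarter shown is a calendar quarter, which is an assumption; financial quarters vary by organisation and country.

```python
from datetime import datetime, timezone

# Hypothetical event, fixed to UTC so the result does not depend on locale.
event = datetime(2024, 3, 15, 14, 30, tzinfo=timezone.utc)

year = event.year                     # discrete: a whole number of years
quarter = (event.month - 1) // 3 + 1  # ordinal: Q1 < Q2 < Q3 < Q4
seconds = event.timestamp()           # continuous: seconds since 1970-01-01 UTC

print(year, type(year).__name__)
print(quarter, type(quarter).__name__)
print(seconds, type(seconds).__name__)
```

The same moment is therefore an integer, an ordered category, or a floating-point number, depending on what the analysis needs.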
Data types in R
R is a statistical programming language widely used for data analysis and statistical software development. Therefore, the data types of its variables closely resemble those used in mathematics and statistics.
The tidyverse is a popular collection of data analysis packages for R and influences some of the data types shown below.
Data types in Python
Python is a general-purpose programming language used for web applications, data analysis, information security, software development and artificial intelligence. It also influences other programming languages used for data science, such as Julia and Ruby.
Pandas is a popular data analysis software library for Python and may influence some of the data types shown here.
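A quick way to identify the type of a value in plain Python is the built-in `type()` function. The values in this sketch are hypothetical, and it covers only a few built-in types rather than those added by Pandas.

```python
# Hypothetical values covering several of Python's built-in types.
values = [42, 3.14, True, "Paris", None, 2 + 3j]

for value in values:
    print(repr(value), type(value).__name__)

# bool is a subclass of int, so True also passes an integer check.
print(isinstance(True, int))  # True
```

Because `bool` is a subclass of `int`, a check such as `isinstance(x, int)` will also accept `True` and `False`, which is worth keeping in mind when separating logical variables from discrete numeric ones.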
Data types in SQL
SQL is a database programming language used for building and querying relational databases. It comes in many flavours, including MySQL, T-SQL (used with MS SQL Server), PostgreSQL, Oracle and SQLite, and these may influence some of the data types shown here.
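As one example of how a flavour reports types, the sketch below uses Python's standard `sqlite3` module to ask an in-memory SQLite database for the storage class of a few hypothetical literal values via SQLite's `typeof()` function. This is specific to SQLite, which types individual values dynamically; other flavours typically declare a fixed type per column and use different type names.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
connection = sqlite3.connect(":memory:")

# typeof() returns the SQLite storage class of each value.
row = connection.execute(
    "SELECT typeof(42), typeof(3.14), typeof('Paris'), typeof(NULL), typeof(x'00')"
).fetchone()
connection.close()

print(row)  # ('integer', 'real', 'text', 'null', 'blob')
```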
Summary
We derive data types from mathematical categorisations of data, so understanding the main number sets helps to convert data types between the most common data science languages, as shown in the following table.