Untidy data

Kim Rodgers
Published in Analytics Vidhya
Nov 8, 2019
Photo by Sharon McCutcheon on Unsplash

Why tidy data?

It is hard to manipulate, model and visualize raw data, because raw data is messy. A lot of time and effort is spent preparing data to make it ready for manipulation, modelling and visualization. I am going to talk about handling untidy data, i.e. tidying untidy data. The idea of tidying data is well covered by Hadley Wickham in his Tidy Data paper.

This is an introduction to the topic of tidy data. If you are already familiar with it, feel free to jump to a follow-up post that goes into the details of tidying up untidy data.

What is tidy data?

Data is said to be tidy if it meets the following conditions:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Data is therefore messy if rows, columns and tables are matched up with observations, variables and types in a way that violates one or more of the above conditions. This is better shown with an example.

Untidy data example

I am going to use a hypothetical dataset of five students’ performance in two tests across three terms.

Untidy data

This data is untidy due to the following common issues of untidy data:

  • Some column headers are values, not variable names.

The column headers term 1, term 2 and term 3 are values. They should not be column headers but values within another column, such as term.
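To make this concrete, here is a minimal sketch in plain Python of moving the term headers into a term column. The scores are invented for illustration, and the `score` column name and the `melt_terms` helper are my own assumptions, not something taken from the original table.

```python
# Hypothetical rows shaped like the untidy table; the scores are invented.
untidy = [
    {"id": 1, "name": "Mike", "test number": "test 1",
     "term 1": 76, "term 2": 84, "term 3": 87},
    {"id": 1, "name": "Mike", "test number": "test 2",
     "term 1": 85, "term 2": 80, "term 3": 90},
]

TERM_COLUMNS = ["term 1", "term 2", "term 3"]


def melt_terms(rows):
    """Turn each term column into its own row with a term and a score."""
    long_rows = []
    for row in rows:
        # Every column that is not a term header is kept unchanged.
        fixed = {k: v for k, v in row.items() if k not in TERM_COLUMNS}
        for column in TERM_COLUMNS:
            # "term 1" -> 1: the header becomes a value in the term column.
            term = int(column.split()[1])
            long_rows.append({**fixed, "term": term, "score": row[column]})
    return long_rows


long_rows = melt_terms(untidy)
print(long_rows[0])
```

Each input row produces three output rows, one per term, so the two rows above become six.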

  • Multiple variables are stored in one column

The sex and age column holds two variables: sex and age. For example, Mike’s value for this column is m_12, meaning that he is a 12-year-old male. The sex and age values need to be in separate columns.
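As a rough sketch, splitting such a value in plain Python could look like the following. It assumes every value follows the same sex_age pattern as m_12; the `split_sex_and_age` helper is a hypothetical name used only for illustration.

```python
def split_sex_and_age(value):
    """Split a combined value such as "m_12" into a sex code and an integer age."""
    sex, age = value.split("_")
    return sex, int(age)


# Mike's value, as described above.
row = {"name": "Mike", "sex and age": "m_12"}

sex, age = split_sex_and_age(row["sex and age"])
tidy_row = {"name": row["name"], "sex": sex, "age": age}
print(tidy_row)
```

Converting the age to an integer is a design choice here: once it is in its own column it can be treated as a number rather than as text.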

  • Variables are stored in both rows and columns

The test number column is not a variable; it stores the names of two variables, test 1 and test 2. Therefore, test 1 and test 2 need to be column headers, since they are variables.
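One possible way to do this in plain Python is sketched below. It assumes the data is already in a long form with one score per row; the `term` and `score` column names, the `spread_tests` helper and the scores themselves are assumptions made for the example.

```python
# Hypothetical long-form rows: one score per student, term and test.
long_rows = [
    {"id": 1, "term": 1, "test number": "test 1", "score": 76},
    {"id": 1, "term": 1, "test number": "test 2", "score": 85},
    {"id": 1, "term": 2, "test number": "test 1", "score": 84},
    {"id": 1, "term": 2, "test number": "test 2", "score": 80},
]


def spread_tests(rows):
    """Give each test its own column, with one row per student and term."""
    wide = {}
    for row in rows:
        key = (row["id"], row["term"])
        # Create the output row the first time this student and term appear.
        record = wide.setdefault(key, {"id": row["id"], "term": row["term"]})
        # The value in "test number" becomes the name of a column.
        record[row["test number"]] = row["score"]
    return list(wide.values())


wide_rows = spread_tests(long_rows)
print(wide_rows)
```

The four input rows collapse into two, one per term, each with a test 1 and a test 2 column.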

  • Multiple types of observational units are stored in the same table

The table above holds data for two types of observational units: the student, and the student’s performance in two tests for each term. These observational units need to be in their own tables for consistency. The student table should therefore have the id, name, phone, sex and age columns. The performance table should have the rest of the data, plus the id column, which is common to both tables.
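A minimal sketch of that split in plain Python follows. The second student, the phone numbers and the scores are invented, the rows are assumed to have had the other issues fixed already, and `split_tables` is a hypothetical helper.

```python
# Hypothetical rows in which student details repeat on every performance row.
rows = [
    {"id": 1, "name": "Mike", "phone": "555-0101", "sex": "m", "age": 12,
     "term": 1, "test 1": 76, "test 2": 85},
    {"id": 1, "name": "Mike", "phone": "555-0101", "sex": "m", "age": 12,
     "term": 2, "test 1": 84, "test 2": 80},
    {"id": 2, "name": "Linda", "phone": "555-0102", "sex": "f", "age": 13,
     "term": 1, "test 1": 91, "test 2": 88},
]

STUDENT_COLUMNS = ["id", "name", "phone", "sex", "age"]


def split_tables(rows):
    """Separate student details from performance, keeping id in both tables."""
    students = {}
    performance = []
    for row in rows:
        # One entry per id, so repeated student details are stored only once.
        students[row["id"]] = {c: row[c] for c in STUDENT_COLUMNS}
        # Everything that is not a student detail, plus id to link the tables.
        performance.append(
            {k: v for k, v in row.items()
             if k not in STUDENT_COLUMNS or k == "id"}
        )
    return list(students.values()), performance


students, performance = split_tables(rows)
print(students)
print(performance)
```

Keeping id in both tables is what allows the two to be joined back together later.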

This marks the end of our exploration of an untidy dataset. However, these are not the only issues that make data untidy; there are many more that I have not covered here.

You are now ready to get your hands dirty and attempt to make this data tidy! For a guide on how to do this, have a look at this follow-up post.

References

  1. Tidy data paper by Hadley Wickham
