Tidying data using Python

Kim Rodgers
Analytics Vidhya
Published in
4 min readNov 9, 2019
Photo by Mika Baumeister on Unsplash

This post is a continuation of untidy data blog I did previously. To understand what an untidy data is, kindly have a look at the post. As a recap, data is said to be tidy if it has the following properties:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Here I am going to tidy up an untidy data I introduced in that post. I am going to use Python to do this, to be more specific pandas library. I will then evaluate the final ‘tidy data’ to check if it meets these requirements.

Original untidy data

Untidy data. For example, its has both student and performance observational units

For full solution kind have a look at this github repository.

I am first going to import pandas

Import python as its common alias

Reading the data into a pandas dataframe and getting the columns of the data. The following code achieves this.

Read the csv file into a dataframe and check the columns

Now that we have the dataframe, we can use powerful pandas techniques to tidy up the data. This data had the untidy data issues as discussed before. I will now go through the issues and discuss their fixes. The following were the issues with this data as explained in the previous post:

  1. Some column headers are values, not variable names
  2. Multiple variables are stored in one column
  3. Variables are stored in both rows and columns
  4. Multiple types of observational units are stored in the same table

To discuss the fixes, I will not follow the order of the issues above. I will will resolve all of them by first splitting the data into different tables(for student and performance) and tidying up each dataset.

  • Multiple types of observational units are stored in the same table

This data has data for both student and performance. The first thing I will do is to split these different observational units for each to have its own table. The columns id, name, phone and sex and age will be in the student table. test number, term 1, term 2 and term 3 will be in the performance table. id column will also be added to the performance table to identify the performance’s student.

We initialize two dataframes student_df and performance_df having relevant columns from the original dataframe.

Splitting original dataset into student and performance datasets. This solves issue number 4 above.

The student_df now looks as shown below

Student table has data probably recorded when the details for a students were being collected

The performance_df looks as below

Performance table has data probably recorded after an exam

It can clearly be seen that the dataframe now contain different observational units. It is almost safe to assume that data in the student table was recorded when registering students. For performance table, it is obvious that this data was recorded after examinations, although spread over a period of time. This does not mean it is impossible to collect for example sex and marks of a student as one observation. This is probable after doing an exam before admission to a school for example. However, the chance of this happening for a student in multiple test spread over three terms is quite low.

  • Multiple variables are stored in one column

The sex and age column in student_df has data for both sex and age. The sex and age for each student is extracted and added to the relevant column. The original sex and age column is the dropped from the dataframe. This is achieved using an apply and drop functions of the dataframe.

Extracting sex and age then add put them in separate columns. The original composite column is then dropped.

The student table now looks as shown below.

Tidy student table

From the table above,

  1. id, name, phone, sex and age are all variables that form columns of the table.
  2. Each student data is an observation that forms a row.
  3. The table has only one type of observational unit i.e for a student

Now that we have a tidy student data, the attention is going to shift to performance data.

  • Some column headers are values, not variable names

term 1, term 2 and term 3 are values but are used as column headers. This is fixed by using pandas melt.

Melting performance dataframe

In the molten dataset term 1, term 2 and term 3 columns are now in the term column and their previous values are in the marks column.

Molten performance data

The term number is extracted from the term values.

Extracting term number
  • Multiple variables are stored in one column

test 1 and test 2 are different variables but are stored in test number column. test 1 and test 2 columns are added with their respective marks as the values.

Add test 1 marks and test 2 marks

We are almost there! We now drop marks and test number columns as their data is in test marks columns. We then drop duplicate performance entries into the table and sort the the data by student id.

The final tidy data for performance is as shown below.

Tidy performance data

From the table above,

  1. id, term, test 1 marks,and test 2 marks are all variables that form columns of the table.
  2. Each student performance data in a test is an observation that forms a row.
  3. The table has only one type of observational unit i.e for performance of a student in tests.

Conclusion

We now have separated data for student and performance into separate tables. For student dataset, the sex and age column that initially held data for sex and age was split into separate columns. For performance dataset, term 1, term 2 and term 3 values that we initially used as column headers were changed to be values of a term column. Finally test 1 and test 2 variables were changed from being values to column headers.

This marks the end of this second part blog of data tidiness.

References

  1. My first post on untidy data.
  2. Tidy data paper by Hadley Wickham
  3. Pandas library
  4. Pandas dataframe
  5. Pandas dataframe apply
  6. Pandas dataframe drop
  7. Pandas melt.

--

--