Tidying data using Python
This post is a continuation of untidy data blog I did previously. To understand what an untidy data is, kindly have a look at the post. As a recap, data is said to be tidy if it has the following properties:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Here I am going to tidy up an untidy data I introduced in that post. I am going to use Python to do this, to be more specific pandas library. I will then evaluate the final ‘tidy data’ to check if it meets these requirements.
Original untidy data
For full solution kind have a look at this github repository.
I am first going to import pandas
Reading the data into a pandas dataframe and getting the columns of the data. The following code achieves this.
Now that we have the dataframe, we can use powerful pandas techniques to tidy up the data. This data had the untidy data issues as discussed before. I will now go through the issues and discuss their fixes. The following were the issues with this data as explained in the previous post:
- Some column headers are values, not variable names
- Multiple variables are stored in one column
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
To discuss the fixes, I will not follow the order of the issues above. I will will resolve all of them by first splitting the data into different tables(for student and performance) and tidying up each dataset.
- Multiple types of observational units are stored in the same table
This data has data for both student and performance. The first thing I will do is to split these different observational units for each to have its own table. The columns id, name, phone and sex and age will be in the student table. test number, term 1, term 2 and term 3 will be in the performance table. id column will also be added to the performance table to identify the performance’s student.
We initialize two dataframes student_df and performance_df having relevant columns from the original dataframe.
The student_df now looks as shown below
The performance_df looks as below
It can clearly be seen that the dataframe now contain different observational units. It is almost safe to assume that data in the student table was recorded when registering students. For performance table, it is obvious that this data was recorded after examinations, although spread over a period of time. This does not mean it is impossible to collect for example sex and marks of a student as one observation. This is probable after doing an exam before admission to a school for example. However, the chance of this happening for a student in multiple test spread over three terms is quite low.
- Multiple variables are stored in one column
The sex and age column in student_df has data for both sex and age. The sex and age for each student is extracted and added to the relevant column. The original sex and age column is the dropped from the dataframe. This is achieved using an apply and drop functions of the dataframe.
The student table now looks as shown below.
From the table above,
- id, name, phone, sex and age are all variables that form columns of the table.
- Each student data is an observation that forms a row.
- The table has only one type of observational unit i.e for a student
Now that we have a tidy student data, the attention is going to shift to performance data.
- Some column headers are values, not variable names
term 1, term 2 and term 3 are values but are used as column headers. This is fixed by using pandas melt.
In the molten dataset term 1, term 2 and term 3 columns are now in the term column and their previous values are in the marks column.
The term number is extracted from the term values.
- Multiple variables are stored in one column
test 1 and test 2 are different variables but are stored in test number column. test 1 and test 2 columns are added with their respective marks as the values.
We are almost there! We now drop marks and test number columns as their data is in test marks columns. We then drop duplicate performance entries into the table and sort the the data by student id.
The final tidy data for performance is as shown below.
From the table above,
- id, term, test 1 marks,and test 2 marks are all variables that form columns of the table.
- Each student performance data in a test is an observation that forms a row.
- The table has only one type of observational unit i.e for performance of a student in tests.
Conclusion
We now have separated data for student and performance into separate tables. For student dataset, the sex and age column that initially held data for sex and age was split into separate columns. For performance dataset, term 1, term 2 and term 3 values that we initially used as column headers were changed to be values of a term column. Finally test 1 and test 2 variables were changed from being values to column headers.
This marks the end of this second part blog of data tidiness.
References
- My first post on untidy data.
- Tidy data paper by Hadley Wickham
- Pandas library
- Pandas dataframe
- Pandas dataframe apply
- Pandas dataframe drop
- Pandas melt.