First Post: Pandas Data Cleaning
Time to try out some of the new skills I’ve acquired.
For my first project as a participant in General Assembly’s Data Science Immersive, I examined datasets detailing state participation rates and scores for the 2017 and 2018 SAT and ACT. After spending the first couple of weeks of the course getting comfortable with Python, Pandas, and Jupyter Notebooks, this project provided a good opportunity to practice some simple dataset cleaning and Exploratory Data Analysis on a very manageable (and thus unintimidating) amount of data. In this post, I’ll describe some of the data cleaning I went through, and I’ll follow up in a future post to describe some of the EDA.
Dataframe Cleaning
Four datasets on statewide participation and scoring rates were provided for the project (2017 ACT, 2017 SAT, 2018 ACT, 2018 SAT).
Using the pandas .describe() method, it was easy to spot issues with the provided data. The 2017 SAT dataset, for example, showed a minimum math score of 52, even though the minimum possible score on that section of the test is 200.
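As a minimal sketch of this check (the dataframe and column names here are my assumptions, not necessarily those from the original notebook), .describe() surfaces out-of-range values at a glance:

```python
import pandas as pd

# Illustrative dataframe: one math score (52) sits far below the
# SAT section minimum of 200, so it must be a data-entry error.
sat_2017 = pd.DataFrame({
    "State": ["Maryland", "Ohio"],
    "Math": [52, 570],
})

print(sat_2017["Math"].describe())  # the min of 52 flags the bad entry
```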
Reference to actual scores for the 2017 SAT confirmed that this was in error, and I was then able to replace the incorrect data:
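The replacement can be done with a boolean mask and .loc. This is a sketch under assumptions: the dataframe and column names, the state involved, and the corrected value (524) are all illustrative, not taken from the original notebook.

```python
import pandas as pd

sat_2017 = pd.DataFrame({
    "State": ["Maryland", "Ohio"],
    "Math": [52, 570],
})

# .loc with a boolean mask targets only the bad row; 524 stands in
# here for the score looked up from the published 2017 SAT results.
sat_2017.loc[sat_2017["State"] == "Maryland", "Math"] = 524

print(sat_2017)
```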
Looking at the result, it turns out that the correct Math score could have been imputed as the difference between the Total score and the Writing score. But since the actual data was readily available, and only one value needed fixing, the safest choice was to go straight to the source.
Similarly, the 2017 ACT dataset showed a minimum Science score of 2.3, which, while technically possible, seemed highly unlikely for a state average, especially when compared to the other minimum scores shown.
Once again, reference to the actual scores confirmed that the data was entered incorrectly. Here, too, the data had to be corrected:
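The fix follows the same mask-and-replace pattern. Again, the names and the corrected value (23.2) below are assumptions for illustration:

```python
import pandas as pd

act_2017 = pd.DataFrame({
    "State": ["Maryland", "Ohio"],
    "Science": [2.3, 22.0],  # 2.3 is implausibly low for a state average
})

# Replace the implausible entry with the published figure
# (23.2 here is illustrative, not a verified value).
act_2017.loc[act_2017["Science"] == 2.3, "Science"] = 23.2
```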
Looking at the .dtypes attribute of my dataframes, I discovered that SAT Participation, ACT Participation, and ACT Composite scores were all stored as objects rather than numerical types (which explains why no summary statistics for these columns showed up when I called .describe() on the datasets).
In order to work with these columns, I’d need to do some datatype conversion.
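A quick sketch of what that check looks like (dataframe and column names are illustrative): columns that pandas could not parse as numbers show up with the catch-all object dtype.

```python
import pandas as pd

act_2017 = pd.DataFrame({
    "State": ["Maryland", "Ohio"],
    "Participation": ["100%", "75%"],  # object: strings with a '%' symbol
    "Composite": ["23.6", "22.0"],     # object: numbers stored as text
})

print(act_2017.dtypes)  # Participation and Composite both read as 'object'
```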
For both datasets, the participation rates were interpreted as non-numerical data because each value contained a percentage symbol. Before converting the values, I had to remove that symbol. To do so, I defined a simple function that finds the symbol, replaces it with an empty string (effectively erasing it), and then converts the clean, symbol-less percentage to an integer.
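A sketch of that function (the function and column names are my own, not necessarily the original notebook's):

```python
import pandas as pd

def pct_to_int(value):
    """Strip the '%' symbol and convert the remaining digits to an integer."""
    return int(value.replace("%", ""))

sat_2017 = pd.DataFrame({
    "State": ["Maryland", "Ohio"],
    "Participation": ["69%", "12%"],  # illustrative values
})

# .map applies the conversion to every value in the column
sat_2017["Participation"] = sat_2017["Participation"].map(pct_to_int)

print(sat_2017["Participation"].dtype)  # now a numeric dtype
```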
When I attempted to change the ACT Composite score to the correct datatype using the pandas to_numeric() method, it initially failed.
Luckily, the ValueError indicated very clearly why the code wouldn't work: to_numeric() doesn't like getting non-numbers as input!
After re-assigning that value, to_numeric() ran without complaint, and the datatypes were all set:
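An illustrative reconstruction of that fix (the stray value "20.2x" and all names are made-up examples, standing in for whatever non-numeric entry the real dataset contained):

```python
import pandas as pd

act_2017 = pd.DataFrame({
    "State": ["Maryland", "Ohio"],
    "Composite": ["23.6", "20.2x"],  # one entry has a stray character
})

# pd.to_numeric(act_2017["Composite"]) would raise a ValueError here,
# so re-assign the offending value first, then convert.
act_2017.loc[act_2017["Composite"] == "20.2x", "Composite"] = "20.2"
act_2017["Composite"] = pd.to_numeric(act_2017["Composite"])

print(act_2017["Composite"].dtype)  # float64
```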
Now, my 2017 datasets were both clean, and almost ready to be merged. But first, I needed to change the column names, so that the columns would stay meaningful in the context of the merged dataset. To do so, I prepended ‘2017_sat_’ to all of the column names other than ‘State’ in the 2017 SAT dataset, and ‘2017_act_’ to all of the column names other than ‘State’ in the 2017 ACT dataset (I also set the column names to lowercase, to make them easier to work with going forward):
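The renaming step above can be sketched like this (dataframe and column names are assumptions for illustration):

```python
import pandas as pd

sat_2017 = pd.DataFrame({
    "State": ["Maryland"],
    "Participation": [69],
    "Math": [524],
    "Total": [1060],
})

# Lowercase every name; prepend the year/test prefix to all but 'State',
# which stays un-prefixed so it can serve as the merge key.
sat_2017.columns = [
    "state" if c == "State" else "2017_sat_" + c.lower()
    for c in sat_2017.columns
]

print(list(sat_2017.columns))
# ['state', '2017_sat_participation', '2017_sat_math', '2017_sat_total']
```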
I did not prepend anything to the 'State' columns, since those columns would serve as the merge key between the two datasets. I joined the datasets on that key and saved the result to a CSV file, so I could use the clean dataset alongside the 2018 data.
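A sketch of that merge (names and values are illustrative): joining on the shared 'state' key lines up each state's SAT and ACT rows in a single dataframe.

```python
import pandas as pd

sat_2017 = pd.DataFrame({
    "state": ["Maryland", "Ohio"],
    "2017_sat_total": [1060, 1149],
})
act_2017 = pd.DataFrame({
    "state": ["Maryland", "Ohio"],
    "2017_act_composite": [23.6, 22.0],
})

# Merge on the shared key, then persist the clean dataset for later use
combined_2017 = pd.merge(sat_2017, act_2017, on="state")
combined_2017.to_csv("combined_2017.csv", index=False)
```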
After cleaning the 2018 datasets and merging them with 2017, I was finally ready to do some EDA. I’ll detail that in my next post.