Week 2 and Useful Pandas Techniques
In week two of General Assembly’s Data Science Immersive course, the cohort covered basic statistical principles (such as the Central Limit Theorem, confidence intervals, and z-scores), in addition to pandas, data visualization, and sqlite3 / PostgreSQL.
Instead of doing a full recap of the week (which would have to be a long one), I think it would be best to go through some of the useful things I picked up with pandas. These were my solutions to problems I encountered while processing data using what I learned in class.
Print the head and tail of a DataFrame with one command using np.r_
Feeding np.r_ slices for the first and last five positions generates a single array of integers. This array can be passed to iloc, the pandas integer-based indexer, which returns the head and tail of the DataFrame in one call.
I find this to be useful when looking at the data for the first time, or when chaining methods such as value_counts() and sort_values().
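As a minimal sketch of the idea, assuming a made-up 20-row DataFrame (the `df` and `value` names are only for illustration), the slices `0:5` and `-5:0` can be combined like this:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 rows so that head and tail do not overlap.
df = pd.DataFrame({"value": np.arange(20) ** 2})

# np.r_ concatenates the two slices into one integer array:
# the first five positions followed by the last five (as negative positions).
positions = np.r_[0:5, -5:0]

# iloc accepts the array, including the negative positions.
head_and_tail = df.iloc[positions]
print(head_and_tail)

# The same indexer can end a method chain, e.g. after value_counts().
counts = (df["value"] % 7).value_counts().sort_index()
print(counts.iloc[np.r_[0:2, -2:0]])
```

Note that with fewer than ten rows the two slices overlap, so some rows would appear twice.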
Generate multiple plots in one cell using loops and matplotlib’s subplot function
Here I loop through the columns in the iris dataset (excluding species) and plot each column as a violin plot grouped by species.
I find this especially useful because the alternative, reproducing the four violin plots one at a time, would require copying and pasting code (and modifying some values) into separate cells in the notebook. This produces clutter, and the added scrolling means that the plots cannot be compared as easily. Additionally, should the code not run the first time, something will need to be altered in all four cells.
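The loop can be sketched roughly as follows. This is not the original notebook code: to stay self-contained it uses a synthetic stand-in for the iris data (`iris_like`, with invented values) and matplotlib's own `Axes.violinplot`, where the original may well have used seaborn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the iris dataset (values are invented for illustration).
rng = np.random.default_rng(0)
iris_like = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 90),
    "sepal_width": rng.normal(3.0, 0.4, 90),
    "petal_length": rng.normal(3.7, 1.7, 90),
    "petal_width": rng.normal(1.2, 0.7, 90),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 30),
})

# Every column except the grouping column gets its own subplot.
feature_cols = [c for c in iris_like.columns if c != "species"]

fig = plt.figure(figsize=(10, 8))
for i, col in enumerate(feature_cols, start=1):
    ax = plt.subplot(2, 2, i)  # 2x2 grid, subplot positions are 1-based
    grouped = iris_like.groupby("species")[col]
    labels = list(grouped.groups)
    # One violin per species, drawn at x = 1, 2, 3.
    ax.violinplot([grouped.get_group(name).to_numpy() for name in labels],
                  showmedians=True)
    ax.set_xticks(range(1, len(labels) + 1))
    ax.set_xticklabels(labels)
    ax.set_title(col)

plt.tight_layout()
```

Because all four plots come from one loop body, a fix to the plotting code only has to be made once.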
Masking with DataFrame.loc and pandas.to_numeric()
Prior to masking, I would conditionally select the values I wanted as a Series, and use pd.Series.to_dict() to create a dictionary. I would then map the dictionary to a column in my DataFrame. While this is very useful, masking can be more efficient and straightforward.
Consider the DataFrame I created for this example.
I used a boolean mask with label-based indexing (.loc) to select the rows in the DataFrame I wanted to alter, and then modified the values in the specified columns.
However, adding strings to integer columns causes pandas to change those columns’ datatype to object, which does not support math operations. Fortunately, I found on Stack Overflow that pd.to_numeric has an argument called errors which, when set to ‘coerce’, converts non-numeric values to null values.
By passing ‘coerce’ to the errors argument in pd.to_numeric, I was able to both change the two ‘unknown’ string values to null values and convert the column to float64.
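A minimal sketch of masking with `.loc`, using a hypothetical `people` DataFrame (the names, columns, and values are invented, not the original data):

```python
import pandas as pd

# Hypothetical example data.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan"],
    "city": ["NYC", "Boston", "NYC", "Chicago"],
    "age": [31, 45, 27, 52],
})

# The mask is a boolean Series: True for the rows we want to alter.
mask = people["city"] == "NYC"

# .loc[rows, column] assigns only to the masked rows of the named column.
people.loc[mask, "city"] = "New York"
people.loc[mask, "age"] = people.loc[mask, "age"] + 1

print(people)
```

Compared with building a dictionary and mapping it onto a column, the mask states the condition and the assignment in one place.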
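The conversion can be sketched as below, again with invented data. The column is built with the ‘unknown’ strings already in place, since recent pandas versions may warn about or reject assigning a string into an integer column directly.

```python
import pandas as pd

# Hypothetical data: a numeric column polluted by two 'unknown' strings,
# which makes pandas store it with the object dtype.
ages = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Dan"],
    "age": [31, "unknown", 27, "unknown"],
})
dtype_before = ages["age"].dtype

# errors="coerce" turns anything that cannot be parsed as a number into NaN,
# and the resulting column is float64.
ages["age"] = pd.to_numeric(ages["age"], errors="coerce")

print(dtype_before, "->", ages["age"].dtype)
print(ages["age"].mean())  # NaN values are skipped by default
```

The result is float64 rather than an integer type because NaN is a floating-point value.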
I hope anyone reading this was able to learn something, or was perhaps even reminded of a pivotal moment of understanding with regard to working with pandas. Until next week.