Unpacking Pandas for Data Science
Using the classic Titanic dataset to unleash the power of Pandas.
If you are already familiar with NumPy, Pandas is a package built on top of it. Pandas provides more flexibility than NumPy for working with data. A NumPy array holds values of a single data type (dtype), whereas a Pandas DataFrame can hold a different data type in each column. Hence, we say Pandas is heterogeneous. We will unpack several more advantages of Pandas today.
Since we will be referring to NumPy in every section, I’m assuming you have some knowledge of NumPy. If not, I have dropped links to resources at the end of the article.
I’m using the very popular Titanic dataset to unpack the abilities of Pandas. Don’t worry: I will still introduce the concepts of Pandas step by step, keeping in mind that you are new to the package.
Let’s quickly import Pandas and NumPy, and load the Titanic dataset.
I know, that is a lot to digest at once, but we will break it down over the course of the article. For now, don’t worry about line 3 and line 5 (highlighted). Just understand that the seaborn package ships the dataset and we loaded it, that’s all. You might have already figured out that ‘df’ holds our entire dataset, but wait: what is the data type of ‘df’? And what on Earth is ‘df.head(10)’? This brings us to our first topic, Pandas objects: Series and DataFrame.
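The original listing is not reproduced in this text. Here is a minimal sketch of the same steps, assuming seaborn is installed and can fetch its sample data. The helper name load_titanic and the tiny df_small stand-in (hand-typed, illustrative values) are assumptions added here so that the snippet also runs offline.

```python
import numpy as np
import pandas as pd


def load_titanic() -> pd.DataFrame:
    """Load seaborn's copy of the Titanic data (needs seaborn and, on first use, a network connection)."""
    import seaborn as sns  # imported here so the rest of the snippet runs without seaborn

    return sns.load_dataset("titanic")


# Hypothetical offline stand-in: a few Titanic-style rows typed by hand.
df_small = pd.DataFrame(
    {
        "survived": [0, 1, 1, 1, 0, 0],
        "pclass": [3, 1, 3, 1, 3, 3],
        "sex": ["male", "female", "female", "female", "male", "male"],
        "age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan],
    }
)

print(type(df_small))    # the DataFrame class
print(df_small.head(3))  # first three rows only
```

With seaborn available, df = load_titanic() followed by df.head(10) should give the full dataset and its first ten rows.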
Pandas Series object
Series is a fundamental data structure of Pandas. We can think of it as a one-dimensional array of indexed data.
I have grabbed the ‘survived’ column from our dataset, which is of type Series, and converted its values to floating point so that the index and the values are easy to tell apart. We can see that it holds a sequence of index and value pairs. A Series belongs to the class ‘pandas.core.series.Series’. We can access the index and the values separately through the attributes index and values: values is a plain NumPy array, and index is an array-like object of type pd.Index.
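A short sketch of what this looks like, using a hand-typed stand-in for the ‘survived’ column (the five values are illustrative, not read from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative values only; in the article this column comes from df["survived"].
survived = pd.Series([0, 1, 1, 1, 0], name="survived").astype(float)

print(survived)         # index on the left, float values on the right
print(type(survived))   # the pandas Series class
print(survived.index)   # default integer index
print(survived.values)  # plain NumPy array
```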
Just as in NumPy, we can access the values of a Series by their associated index using square-bracket notation.
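For instance, with the same illustrative stand-in values as before (an assumption, not the real column):

```python
import pandas as pd

survived = pd.Series([0.0, 1.0, 1.0, 1.0, 0.0], name="survived")

one_value = survived[1]          # the value stored at index 1
two_values = survived[[0, 3]]    # several index labels at once, passed as a list
print(one_value)
print(two_values)
```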
The essential difference between an array and a Series is that an array has only an implicit index for accessing values, while a Series has an explicit index as well. This explicit indexing gives Series an advantage: the index can be of any type, with integers as the default, as we have seen. Let’s see how we can change the integer index of our survived Series to strings. Additionally, we will learn how to create a Series object from scratch using an array (the values of ‘survived’, which are a NumPy array).
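A minimal sketch of both steps. The values and the string labels "a" to "e" are hypothetical choices made for illustration:

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 1.0, 1.0, 1.0, 0.0])  # illustrative stand-in for survived.values
labels = ["a", "b", "c", "d", "e"]            # hypothetical string labels

# Option 1: start with the default integer index, then swap in string labels.
survived = pd.Series(values)
survived.index = labels

# Option 2: build the Series from scratch with the labels attached.
from_scratch = pd.Series(values, index=labels)

print(survived)
print(survived["b"])  # access by explicit string label
```

Both routes end up with the same Series; the second is usually the tidier one when the labels are known up front.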
This also lets us think of a Series as an upgraded version of the Python dictionary: the index plays the role of the keys.
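To make the dictionary analogy concrete, here is a small sketch that builds a Series directly from a dict (the keys and values are made up for illustration):

```python
import pandas as pd

survived_dict = {"a": 0.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 0.0}
survived = pd.Series(survived_dict)  # dict keys become the index

print(survived["c"])      # key lookup, like a dict
print("c" in survived)    # membership is tested against the index (the keys)
print(survived["b":"d"])  # label slicing, which a plain dict cannot do
```

Note that slicing by label includes the end label, so "b":"d" returns three entries.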
Pandas DataFrame object
Remember we raised a question in Figure-0, “what is the data type of ‘df’?” You got it, ‘df’ is of type DataFrame. DataFrame is another fundamental data structure of Pandas. Just as we said a Series is analogous to a one-dimensional array with flexible indices, a DataFrame is analogous to a two-dimensional array with both flexible row and column indices. As multiple one-dimensional arrays make up a two-dimensional array, multiple Series make up a DataFrame. Scrolling back to Figure-0, we can see that our entire dataset is of type DataFrame, which holds multiple Series, commonly referred to as the columns of the DataFrame.
Like Series, DataFrame has an index attribute, and it also has a columns attribute.
Notice that columns is also of type pd.Index, which means we can access a column by its name, much as we access a value of a Series by its index. Exciting, isn’t it! This is the power of DataFrame. Indeed, this is how I grabbed the values of column ‘survived’ in the Pandas Series object section.
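A small sketch of these attributes, using a three-row stand-in frame with illustrative values rather than the full dataset:

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative.
df = pd.DataFrame(
    {
        "survived": [0, 1, 1],
        "pclass": [3, 1, 3],
        "sex": ["male", "female", "female"],
    }
)

print(df.index)    # row labels (default integers)
print(df.columns)  # column labels, also a pd.Index

survived = df["survived"]  # selecting one column by name returns a Series
print(type(survived))
```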
Let me reveal another secret: neither the Series ‘survived’ nor the DataFrame ‘df’ has only 10 rows. They actually have 891 rows. This brings us back to our question from Figure-0, “and what on Earth is ‘df.head(10)’?” The Pandas method head() displays only the number of rows we pass as a parameter, counting from the first row; the default is 5. This is why we have seen only 10 rows so far. Its opposite is the method tail(), which shows the last rows. Our entire dataset looks like this:
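The behaviour of head() and tail() can be checked on any frame. Here is a sketch on a hypothetical 20-row frame, where the column name passenger is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"passenger": range(1, 21)})  # hypothetical 20-row frame

first_default = df.head()   # no argument: first 5 rows
first_ten = df.head(10)     # first 10 rows
last_default = df.tail()    # no argument: last 5 rows

print(len(df), len(first_default), len(first_ten), len(last_default))
print(last_default)
```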
Although we can create a DataFrame from scratch, we are not going to discuss it here, because as data scientists we rarely have to build a dataset by hand.
While we can use the Python-style square-bracket notation to index and select values from our DataFrame, Pandas provides more powerful attributes called indexers: loc and iloc. Pandas indexers have an advantage over the regular square-bracket style: they give us the flexibility to index a DataFrame like a NumPy array.
iloc[ <row>, <column> ]
Every DataFrame is indexed in two styles: explicit and implicit. We already know that the column names survived, pclass, sex,… act as an explicit index, but the columns are also indexed internally with integers starting from 0, just like a Python list, and these integers act as an implicit index. The indexer iloc always refers to implicit indices, so although we have column names, we can slice the columns using integer positions. Let’s extract rows 1–6 and the columns from ‘sex’ to ‘who’.
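A sketch of the iloc slice on a hand-typed stand-in with fewer columns than the real dataset (values are illustrative). In this stand-in ‘sex’ sits at position 2 and ‘who’ at position 5; in seaborn's full dataset, assuming its usual column order, ‘who’ sits at position 9, so the column slice there would be 2:10.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with fewer columns than the real dataset.
df = pd.DataFrame(
    {
        "survived": [0, 1, 1, 1, 0, 0, 0, 0],
        "pclass": [3, 1, 3, 1, 3, 3, 1, 3],
        "sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
        "age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
        "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08],
        "who": ["man", "woman", "woman", "woman", "man", "man", "man", "child"],
        "alone": [False, False, True, False, True, True, True, False],
    }
)

# Rows at positions 1..6 (the stop, 7, is excluded) and columns at positions 2..5.
subset = df.iloc[1:7, 2:6]
print(subset)
```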
loc[ <row>, <column> ]
In the same way, loc always refers to explicit indices. Let’s do the same thing as above with loc. Note that, unlike an iloc slice, a loc slice includes its end label.
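The same selection by label, again sketched on a small hand-typed stand-in (illustrative values, fewer columns than the real dataset), with a check that it matches the positional version:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; row labels are the default integers 0..7.
df = pd.DataFrame(
    {
        "survived": [0, 1, 1, 1, 0, 0, 0, 0],
        "sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
        "age": [22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0],
        "who": ["man", "woman", "woman", "woman", "man", "man", "man", "child"],
        "alone": [False, False, True, False, True, True, True, False],
    }
)

by_label = df.loc[1:6, "sex":"who"]  # both end labels (6 and "who") are included
by_position = df.iloc[1:7, 1:4]      # stop positions (7 and 4) are excluded

print(by_label)
print(by_label.equals(by_position))
```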
If you pass a column name to iloc, or integer positions for the columns to loc, Pandas will throw an error.
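A quick way to see this for yourself. The helper try_indexing is a hypothetical name introduced for this sketch, and the exact exception type may vary between Pandas versions, so several types are caught:

```python
import pandas as pd

df = pd.DataFrame(
    {"sex": ["male", "female"], "age": [22.0, 38.0], "who": ["man", "woman"]}
)


def try_indexing(description, func):
    """Run func and report whether pandas rejected the indexer."""
    try:
        func()
    except (TypeError, ValueError, KeyError, IndexError) as err:
        print(f"{description}: {type(err).__name__}")
        return True
    print(f"{description}: no error")
    return False


# Column names handed to the position-based indexer.
iloc_failed = try_indexing("iloc with column names", lambda: df.iloc[:, "sex":"who"])
# Integer positions handed to the label-based indexer for string-named columns.
loc_failed = try_indexing("loc with column positions", lambda: df.loc[:, 0:2])
```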
We can perform many useful and complex operations using iloc and loc. Suppose we need to know the total number of kids and teenagers (aged under 20) who survived the disaster and were travelling alone on the Titanic.
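One way to answer such a question is to combine boolean conditions and pass the result to loc. This is a sketch on ten hand-typed rows with illustrative values, so the count it prints says nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; values are illustrative.
df = pd.DataFrame(
    {
        "age": [22.0, 38.0, 26.0, 15.0, 35.0, np.nan, 19.0, 2.0, 14.0, 17.0],
        "survived": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
        "alone": [False, False, True, True, True, True, True, False, False, True],
    }
)

# Each condition is a boolean Series; & combines them row by row.
mask = (df["age"] < 20) & (df["survived"] == 1) & df["alone"]
young_solo_survivors = df.loc[mask]

print(young_solo_survivors)
print(len(young_solo_survivors))
```

Rows with a missing age never satisfy age < 20, so they drop out of the count automatically.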
Handling missing values/data
In real-world datasets, we almost always find missing values. This happens because not everyone provides all the information we need for our analysis or prediction, such as a personal phone number. We also cannot simply discard the entire dataset or the phone number column. Pandas represents such missing values as NaN (not a number) and provides several methods for detecting, removing, and replacing them.
The method isnull() returns a boolean mask of the entire dataset in just one line of code: True where a value is missing and False otherwise.
It is more insightful to aggregate the mask, for example to see the total number of missing values in each column.
The method notnull() is the exact opposite of isnull(): False where a value is missing and True otherwise.
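A compact sketch of all three steps on a three-row stand-in with illustrative values:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with some gaps; values are illustrative.
df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0],
        "deck": ["C", None, None],
        "fare": [7.25, 71.28, 7.93],
    }
)

missing_mask = df.isnull()          # True where a value is missing
missing_per_column = df.isnull().sum()  # True counts as 1, so this counts gaps per column
present_mask = df.notnull()         # the mirror image of isnull()

print(missing_mask)
print(missing_per_column)
print(present_mask)
```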
The method dropna() drops missing values. The catch is that we cannot drop only the missing value itself; we have to drop the whole row or the whole column that contains it. The axis parameter specifies whether to drop rows (axis=0) or columns (axis=1).
We just dropped all the columns having NaN values, that is, ‘age’, ‘embarked’, ‘deck’, and ‘embark_town’. Note that the number of columns went down from 15 to 11.
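The effect of the axis parameter is easy to see on a tiny frame. This sketch uses a three-row stand-in with illustrative values, not the full dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; values are illustrative.
df = pd.DataFrame(
    {
        "age": [22.0, np.nan, 26.0],
        "deck": ["C", None, None],
        "fare": [7.25, 71.28, 7.93],
    }
)

without_nan_columns = df.dropna(axis=1)  # drop every column that contains a NaN
without_nan_rows = df.dropna(axis=0)     # drop every row that contains a NaN

print(without_nan_columns)
print(without_nan_rows)
```

dropna() returns a new DataFrame and leaves df itself unchanged.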
We just lost four critical features of our data: ‘age’, ‘embarked’, ‘deck’, and ‘embark_town’. We cannot afford to lose such features just like that. So here comes the lifesaver method fillna(). With fillna() we can replace NaN values with a value of our choice.
The critical decision here is what value to fill the missing entries with. In our case, a sensible approach is to fill the NaN values of age with the mean of all the ages. Now we don’t have any missing values in our age column. The ability to decide what value to replace with comes with practice.
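A minimal sketch of mean imputation on a four-row stand-in; the ages are illustrative, chosen so the mean is easy to check by hand:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, np.nan, 30.0, 40.0]})  # hypothetical ages

mean_age = df["age"].mean()               # mean() skips NaN: (20 + 30 + 40) / 3
df["age"] = df["age"].fillna(mean_age)    # replace each NaN with that mean

print(mean_age)
print(df)
print(df["age"].isnull().sum())
```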
Sorting is another powerful tool in Pandas. Unlike the positions of a list or an array, the index of a DataFrame is not necessarily in sorted order. Hence, there are two types of sorting.
- Sort by index - sort_index()
- Sort by value - sort_values()
Let us take a small section of our Titanic dataset, using the indexing we have learned, to demonstrate sorting. I’m going to introduce another way of indexing called vector indexing (often called fancy indexing), where we pass the row labels and column names we want, in any order, as lists.
We notice that both our row index and column index are unsorted. Let’s sort them both. As we already know, axis=0 refers to rows, which is the default, and axis=1 refers to columns.
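A sketch of the selection and the two index sorts. The stand-in frame and the particular rows and columns picked are assumptions for illustration; only the name smallData follows the article.

```python
import pandas as pd

# Hypothetical stand-in; values are illustrative.
df = pd.DataFrame(
    {
        "survived": [0, 1, 1, 1, 0, 0, 0, 0],
        "sex": ["male", "female", "female", "female", "male", "male", "male", "male"],
        "age": [22.0, 38.0, 26.0, 35.0, 35.0, 30.0, 54.0, 2.0],
        "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46, 51.86, 21.08],
    }
)

# Lists of row labels and column names, deliberately out of order.
smallData = df.loc[[5, 2, 7, 0], ["sex", "fare", "age"]]

rows_sorted = smallData.sort_index()        # axis=0 is the default: sort row labels
cols_sorted = smallData.sort_index(axis=1)  # sort column names alphabetically
both_sorted = smallData.sort_index(axis=0).sort_index(axis=1)

print(smallData)
print(both_sorted)
```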
Sorting by values is pretty self-explanatory: we just have to decide which column’s values to sort on, which we specify with the ‘by’ parameter.
When values tie, for example if two persons have the same age, we can decide who goes on top by adding another level of sorting: pass a list of column names to the ‘by’ parameter (e.g. smallData.sort_values(by=['age', 'fare'])). We can also specify the order of sorting with the parameter ‘ascending=True’ or ‘ascending=False’.
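A sketch with a deliberate tie on age, so the second sort key has something to do. The four rows are illustrative values, not taken from the dataset:

```python
import pandas as pd

# Hypothetical smallData with two passengers aged 35.
smallData = pd.DataFrame(
    {"age": [35.0, 22.0, 35.0, 2.0], "fare": [53.10, 7.25, 8.05, 21.08]},
    index=[3, 0, 4, 7],
)

by_age = smallData.sort_values(by="age")                  # single key, ascending
by_age_fare = smallData.sort_values(by=["age", "fare"])   # ties on age broken by fare
descending = smallData.sort_values(by=["age", "fare"], ascending=False)

print(by_age)
print(by_age_fare)
print(descending)
```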
We have been seeing rankings since 1st grade. We are always being ranked, be it by our marks or our quarterly performance. Since ranking is so widely used, Pandas provides the rank() method to ease our work. There are a few standard ways to rank: minimum, maximum, dense, and average. Let’s explore them.
Suppose I wish to rank the people who boarded the Titanic based on the fare they paid. For ease of understanding, we will look only at the ‘survived’ and ‘fare’ columns of our data, with every ranking method next to them.
By default, ranking is done in ascending order. The method parameter plays its part only when repeated values occur.
- The default method is average. When values repeat, the average of their positions is taken. In our case value 7 is repeated thrice and is ranked 135: (134+135+136)/3.
- In the max method, the highest of the tied positions is assigned. Since 7 is repeated thrice, there is a competition between ranks 239, 240 and 241; finally, 241 is assigned. Ranks 239 and 240 are given to nobody.
- The exact opposite is the min method. For value 7 there is a competition between ranks 29, 30 and 31; finally, 29 is assigned, and positions 30 and 31 are given to nobody.
- In the dense method, there is no competition between positions. It works exactly like the min method, except that the succeeding positions are not skipped.
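The four methods are easiest to compare side by side on a toy column. The five fares below are made up for illustration (three passengers tie at 7.0), so the ranks differ from the Titanic numbers quoted above:

```python
import pandas as pd

ranks = pd.DataFrame({"fare": [7.0, 8.0, 7.0, 7.0, 10.0]})  # hypothetical fares

# One extra column per ranking method, all in the default ascending order.
for method in ["average", "min", "max", "dense"]:
    ranks[f"rank_{method}"] = ranks["fare"].rank(method=method)

print(ranks)
```

The three tied fares occupy positions 1, 2 and 3, so they get 2.0 under average, 1.0 under min and 3.0 under max. Under dense, the next distinct fare (8.0) is ranked 2.0 instead of 4.0.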
If you are reading this sentence, congratulations! You have learned more than the basics of Pandas. There are many more concepts to be covered which will be done in Unpacking Pandas for Data Science: Part 2. I will try my best to make it available as soon as possible.
NumPy - The very basics!
This article is for people who have zero knowledge of NumPy so that they can get a little hang of it to kick start.
Advanced NumPy for Data Science
We will be covering some of the advanced concepts of NumPy specifically functions and methods required to work on a…