What is Pandas?
Pandas is an open-source data analysis and manipulation tool for Python.
The name? It comes from the econometrics term “panel data”, which is multi-dimensional data with measurements over time. It’s also pretty cute so that’s a bonus!
At its core, it allows us to easily work with spreadsheet-like data. From there, you can clean the data, perform any additional modifications, and analyse it to gain some insight.
Installation and Importing
If you have Python installed, you likely already have pip. If not, there are easy instructions to install it here. Pip allows us to easily install packages from the Python Package Index from the command line. Pandas can be installed with pip, along with its dependency NumPy.
Whenever you need to use pandas for one of your projects, it can be imported like so:
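As a minimal sketch, the conventional import looks like this. The version printout is just an optional check I added to confirm the install worked.

```python
# Install from the command line first if needed: pip install pandas
import numpy as np
import pandas as pd

# Printing the versions is a quick way to confirm both packages are available
print(np.__version__)
print(pd.__version__)
```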
We import using the abbreviations np and pd to make it easier to call upon the functions and classes in the module. If you want to read more about how the import system works in Python or how to use NumPy, you can check out my blog posts on either subject.
What Is the Difference Between a Module, a Package, a Library, and a Dependency?
Reading files with Pandas
If you’ve got data you want to do something with, you probably don’t just have it memorized. Most likely, you’ve got it in some sort of file. Conveniently, Pandas lets us easily read data from a wide variety of file formats! This brings the data into Pandas as one of Pandas’ own objects, which we will talk about later.
For example, the most common ways are:
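Here is a rough sketch of a few reader functions. The sample data and the filenames in the comments are made up for illustration, and I use in-memory text in place of real files so the snippet runs anywhere.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; with a real file you would pass its path,
# e.g. pd.read_csv("my_data.csv") (a hypothetical filename)
csv_text = "name,age\nAda,36\nGrace,45\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON works the same way through read_json
json_text = '[{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]'
df_json = pd.read_json(io.StringIO(json_text))

# Excel files use read_excel, which needs an extra engine such as openpyxl:
# df_xlsx = pd.read_excel("my_data.xlsx")

print(df_csv)
print(df_json)
```

Each reader returns a DataFrame, so everything after this step works the same no matter which format the data started in.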
As always, all of the code for this blog post can be found on my GitHub.
Seeing our data
If you want to get a basic overview of what your data looks like, you can use the head and tail methods. They allow you to see a snippet of the first few and last few (default 5) entries of your data respectively.
For the purposes of this guide, I’ll be using a tool known as Jupyter Notebook so that we can easily see the results of our code. I used a data set from the Food Network show Chopped, courtesy of Jeffrey Braun. All of the data is stored in a CSV file called “chopped.csv.”
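To give a feel for head and tail without the actual file, here is a small sketch using stand-in data. The column names are my own invention and are not the real columns of chopped.csv.

```python
import pandas as pd

# Hypothetical stand-in data; the real file would be loaded with
# pd.read_csv("chopped.csv")
df = pd.DataFrame({
    "episode": list(range(1, 11)),
    "season": [1] * 5 + [2] * 5,
})

first_rows = df.head()    # first 5 rows by default
last_rows = df.tail(3)    # pass a number to change how many rows you get
print(first_rows)
print(last_rows)
```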
According to the Pandas documentation, a Series is “a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.).” Importantly, the Series is built on top of the NumPy ndarray, so much of that functionality also carries over into our Series.
Essentially, it’s like having a spreadsheet with two columns: labels, which we will refer to as the index, and values, which could be anything from countries to prices to MySpace “Top 8” rankings.
Conveniently, there are several ways to create them:
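A few of the common ways to build a Series, as a quick sketch with made-up values:

```python
import numpy as np
import pandas as pd

# From a list: the index defaults to 0, 1, 2, ...
from_list = pd.Series([10, 20, 30])

# From a list with explicit labels
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# From a dictionary: the keys become the index
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a NumPy array
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a scalar, repeated once for every label
from_scalar = pd.Series(7, index=["x", "y"])
```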
Series are ndarray-like
Series act very similarly to NumPy ndarrays, so a lot of the same functionality can be used on a Series.
Additionally, because of this property, Pandas Series can also take advantage of “vectorized operations.” Essentially, this means that we don’t have to loop through each element in a Series to perform operations on it. Instead, our operation can be “scaled” to the size of our Series and be done just like a matrix operation. For example, if we wanted to add 3 to each element in our Series, all we have to do is add 3 to the Series itself:
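For instance, slicing, masking, and NumPy functions all work on a Series much as they would on an array. This is only a small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 4, 9, 16], index=["a", "b", "c", "d"])

first_two = s.iloc[:2]    # positional slicing, like an array
large = s[s > 5]          # filter with a boolean mask
roots = np.sqrt(s)        # NumPy functions apply element-wise and keep the index
total = s.sum()           # familiar reductions are available as methods
```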
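A minimal sketch of that addition, using a small Series I made up:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# No loop needed: the scalar is applied to every element at once
shifted = s + 3
print(shifted.tolist())  # [4, 5, 6, 7]
```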
The scalar value 3 got “stretched” into a vector or array full of 3’s that was the same size as our original Series. Consequently, the computation was much easier and faster.
Series are also dictionary-like
Series also act very similarly to a Python dictionary, with the index label acting as a key.
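Here is a short sketch of that dictionary-like behaviour. The grocery prices are invented for the example.

```python
import pandas as pd

prices = pd.Series({"apple": 0.5, "bread": 2.25, "milk": 1.2})

bread_price = prices["bread"]        # look up a value by its label
has_milk = "milk" in prices          # membership tests check the index
missing = prices.get("eggs", 0.0)    # .get returns a default for a missing label
prices["eggs"] = 3.0                 # assigning to a new label adds an entry
```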
The DataFrame is a two-dimensional data structure with potentially different types of data in each of its columns. It is essentially a spreadsheet or table, just with a lot of added functionality. It is the central object in Pandas.
DataFrame creation
Conveniently, it accepts many different types of data as input:
A DataFrame from several Series:
A DataFrame from a dictionary:
A DataFrame from a 2D array:
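The three cases above can be sketched roughly as follows. The names and ages are placeholder data of my own.

```python
import numpy as np
import pandas as pd

# From several Series: each Series becomes a column, aligned on the index
names = pd.Series(["Ada", "Grace", "Alan"])
ages = pd.Series([36, 45, 41])
from_series = pd.DataFrame({"name": names, "age": ages})

# From a dictionary of lists: keys are column names
from_dict = pd.DataFrame({"name": ["Ada", "Grace", "Alan"], "age": [36, 45, 41]})

# From a 2D array, with the column names supplied separately
from_array = pd.DataFrame(np.array([[1, 2], [3, 4], [5, 6]]), columns=["x", "y"])
```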
DataFrame getting and setting
You can treat a DataFrame much like a Python dictionary, with the keys being the names of each column, and the corresponding values being a Series. For example:
If you want to get a row in the DataFrame as a Series, you can use the built-in Pandas loc indexer. Note that loc is used with square brackets rather than called like a method. Additionally, you can use it to get and set values for specific elements in the DataFrame.
If you need to add an entire row of new data to the DataFrame, this can be done with pd.concat. (Older tutorials use the DataFrame append method, but it was deprecated in pandas 1.4 and removed in pandas 2.0.)
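A small sketch of getting and setting columns with dictionary-style keys, using placeholder data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Grace", "Alan"], "age": [36, 45, 41]})

ages = df["age"]                       # getting a column returns a Series
df["age_next_year"] = df["age"] + 1    # assigning to a new key adds a column
has_name = "name" in df                # membership checks the column names
```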
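As a quick sketch of loc in action (the row labels "a", "b", "c" are just ones I picked for the example):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Grace", "Alan"], "age": [36, 45, 41]},
    index=["a", "b", "c"],
)

row = df.loc["b"]           # one row, returned as a Series
age = df.loc["b", "age"]    # a single element: row label, then column name
df.loc["c", "age"] = 42     # the same syntax sets a single element
```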
However, this can be computationally expensive, so try your best to first gather all your data and then create a DataFrame.
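Here is a sketch of both approaches. I use pd.concat for the single-row case because the DataFrame append method no longer exists in pandas 2.0 and later. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ada", "Grace"], "age": [36, 45]})

# Adding one row with pd.concat builds a whole new DataFrame each time
new_row = pd.DataFrame([{"name": "Alan", "age": 41}])
df = pd.concat([df, new_row], ignore_index=True)

# Cheaper pattern: gather all the records first, then build the DataFrame once
records = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": 45},
    {"name": "Alan", "age": 41},
]
built_once = pd.DataFrame(records)
```

Both routes end with the same three rows, but the second avoids copying the data on every addition.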
One helpful feature of Pandas is the ability to do boolean indexing. Essentially, this allows us to get only the values which meet a certain condition. For example, if we only want to grab people who are over the age of 25:
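A minimal sketch of that filter, with a made-up table of people:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan", "Linus"],
    "age": [36, 45, 41, 22],
})

over_25 = people["age"] > 25    # a boolean Series, one True/False per row
adults = people[over_25]        # keeps only the rows where the mask is True
print(adults)
```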
There are many applications of boolean indexing, and I encourage you to try out several different ideas. (What if you wanted to grab all the values above the mean age?)
Often, data is not very clean. There will be typos, missing information, extraneous rows, mismatched types, and just general weirdness. One of the most important jobs of a data scientist is getting a data set into a state that is “clean” and devoid of all of this weirdness. That way, any insights taken from that data aren’t the result of bad data.
This blog post by Malay Agarwal does an amazing job of explaining how to clean data, and goes into much more depth than I could in this single guide.
Pythonic Data Cleaning With Pandas and NumPy - Real Python
Finally, one of the most important things we can do with a DataFrame is manipulating its data to create new insights.
For example, imagine you work at Amazon and you have sales data from some technology items.
Let’s say you wanted to find out how much revenue each item generated. You could define a new column called “Total Sales”, which is simply the product of “Unit Price” and “Quantity” for each item. Similarly, let’s say you wanted to find out which items made up the largest fraction of sales. You could define a “Ratio of Sales” column, which takes the “Total Sales” for each item and divides it by the sum of “Total Sales” for all items.
This is straightforward to accomplish with DataFrames:
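A sketch of those two new columns follows. The items and figures are invented for illustration, and the final sort is an extra step I added to rank the results.

```python
import pandas as pd

# Hypothetical sales figures, made up for illustration
sales = pd.DataFrame({
    "Item": ["Laptop", "Phone", "Headphones"],
    "Unit Price": [1000.0, 500.0, 100.0],
    "Quantity": [2, 4, 10],
})

# Column arithmetic is vectorized, so no loop over rows is needed
sales["Total Sales"] = sales["Unit Price"] * sales["Quantity"]
sales["Ratio of Sales"] = sales["Total Sales"] / sales["Total Sales"].sum()

# Sort so the biggest contributors come first
ranked = sales.sort_values("Ratio of Sales", ascending=False)
print(ranked)
```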
And now you can generate some insight as to which items are the most important for your overall sales! Look at you, a whole Data Scientist!
Pandas is a powerful tool for dealing with data. Whether you want to draw insights from Food Network shows, or drive a small business, Pandas is an integral part of the Python data science ecosystem.
I hope you were able to glean some information from this post! It was an interesting topic to look into! If there is any aspect of this post that you would like me to go into more depth with, please contact me.
Thanks for reading!