A Gentle Introduction to Python’s Pandas Library — The First 5 Functions You Need to Know

Chris Bruehl
Learning Data


Learning Pandas is like a moderately challenging hike: It’s not easy, but you’ll make it if you’re willing to sweat. Photo by Francesco Califano on Unsplash

What is Pandas?

Pandas is Python’s primary data analysis library and one of the most complete data analysis tools out there. With hundreds of functions and methods for data manipulation and analysis, it has become one of the most widely used tools in the world. It can be intimidating to learn, and personally it took me a few tries to really learn it and start using it regularly.

How does it compare to other tools?

While not a perfect comparison, I like to view it as a hybrid of SQL and Excel. It has a familiar tabular data structure, the DataFrame, which is similar to an SQL table (or a very well formatted Excel worksheet), and we can perform standard data manipulation tasks like sorting, filtering, pivoting, aggregating, and more.

Additionally, DataFrames have a host of built-in analytical and visualization methods and functions similar to (and often surpassing) those of Excel. These would be almost impossible to replicate in SQL.

I will dive more deeply into the nuts and bolts of the DataFrame in another post, but I wanted to share the first functions you should learn to begin exploring your data.

What do I need to know right now?

First, a brief vocabulary lesson. Functions and methods are similar ideas; both perform sets of instructions, but methods are tied to specific objects, e.g. strings, lists, or DataFrames, while functions are not.

One major tell in terms of syntax is that functions can be called in a vacuum, e.g. my_function(a, b, c), while methods are chained to an object via dot notation: my_dataframe.method(a, b, c). They are similar enough that I didn’t want to clutter my title!

To explore these functions and methods, I’ve grabbed oil price data from January 2013 to August 2017.

pd.read_csv()

The first thing we need to do is read in our data. It is possible to create DataFrames from Python data types like dictionaries and lists, but this is rarely done in practice. More likely, as an analyst, you’ll be reading in data from a flat file like a CSV, an Excel worksheet, or from an SQL database. We’ll cover reading in data in a separate post later, but it’s worth being aware that in addition to read_csv, the functions read_excel, read_sql, and others can be used to read from other sources.

Let’s take a look:
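Here’s a minimal sketch of what that looks like (the file name matches the oil_prices.csv file mentioned below; adjust the path to wherever your copy lives):

import pandas as pd

# Read the CSV file into a DataFrame
oil_prices = pd.read_csv("oil_prices.csv")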

First we import the pandas library as pd, then read in our csv, storing it in the variable oil_prices.

That’s all there is to it! We’ve successfully read in oil_prices.csv and stored it in the variable ‘oil_prices’… but what do we do now?

.head() and .tail()

As a next step, let’s inspect our data. There are 1218 rows in this data, so there’s no way we can look at them all, nor would we want to. The .head() and .tail() DataFrame methods allow us to look at the first n and last n rows of our DataFrame, respectively. By default n=5.
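In code, this is a single call on the DataFrame we loaded above (output not shown here):

# Show the first 5 rows (the default)
oil_prices.head()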

Our DataFrame has 4 columns. NaN is used to represent missing values.

Our data here looks OK so far: date, dcoilwtico (the oil price), price_change, and change_category are our column headers, and the bold numbers on the left represent our index. Both can be used to access data.

Looking at the data itself, there are no obvious errors here: NaN is one of two values in Pandas used to represent missing data. We won’t dive into this now, but at this point we’ll leave these as is. Missing data won’t cause us problems for our work in this article.

If we want to inspect more (or less) data, we can pass an integer into the method to display the number of rows specified:
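For example, to see the first 10 rows instead of the default 5:

# Pass an integer to control how many rows are displayed
oil_prices.head(10)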

I don’t usually deviate from the default of 5, but occasionally looking at more data is helpful.

The tail method, perhaps not surprisingly, works similarly to head, but shows us the last n rows of data. This can be especially helpful for time series data: we’ll be able to view the final dates in our data set (if sorted) and can spot-check whether things like the frequency of missing values or the formatting differ significantly from our first few rows.
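Again, a single call on our DataFrame:

# Show the last 5 rows
oil_prices.tail()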

Nothing looks off in the tail end of our data.

.info()

Inspecting our data visually can help reveal obvious problems quickly, but in order to really know what’s going on, we need to look under the hood. The .info() method gives us a useful summary of our DataFrame, including its size in terms of rows and columns, memory usage, the data type of each column, the number of non-null values in each column, and the type of index we have. Let’s break it down:
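The call itself takes no arguments; the commented output below is an abridged, illustrative reconstruction based on the counts discussed in this article (exact dtypes and memory usage may differ for your copy of the data):

# Print a concise summary of the DataFrame
oil_prices.info()

# Illustrative (abridged) output:
# RangeIndex: 1218 entries, 0 to 1217
#  #   Column           Non-Null Count  Dtype
#  0   date             1218 non-null   object
#  1   dcoilwtico       1175 non-null   float64
#  2   price_change     1216 non-null   float64
#  3   change_category  1216 non-null   object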

Here we can quickly see that there are missing values in every column except date. Reporting the non-null count is a bit counterintuitive, and there are more specific methods for looking at missing data, but if the value there doesn’t equal the number of rows, there is missing data in that column.

For example, we can see there are 1218 entries (rows) based on the output, and there are 1216 non-null values in the ‘price_change’ and ‘change_category’ columns, indicating two missing values. These are the same missing values we saw with our call to the .head() method.

One other potential data issue that we can spot here is that our date column has type ‘object’. Object is the default type for string data, and when reading in date columns, Pandas will default to object unless told otherwise. But Pandas has a dedicated datetime data type that allows us to do things like time series aggregation and extracting date parts.

So we should change the datatype here. There are a few ways to do this, but I’m going to use the parse_dates argument in read_csv, which instructs the function to cast the columns specified as datetime64 if possible.
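One way to do that, assuming the column is named date as shown above:

# Re-read the CSV, asking Pandas to parse the date column as datetime64
oil_prices = pd.read_csv("oil_prices.csv", parse_dates=["date"])

# Confirm the new dtype
oil_prices.info()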

And voila, our column dtype is now datetime64. Now let’s take a look at our last function, which serves as both an analytical summary and additional data QA step.

.describe()

The describe method returns a set of summary statistics for our columns. This is really helpful for getting a sense of the distributions in your data and summary statistics like the mean and median. Additionally, these statistics can reveal things like extreme outliers or other potential issues that require treatment before we can confidently report out any metrics.
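As with the other methods, the basic call is a one-liner on our DataFrame:

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
oil_prices.describe()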

Note that by default, only numeric columns are ‘described’ by this method. We get some great info on the distribution of data in these columns. Taking a look at min and max can tell us if there are any extreme values like negative prices (except that one time during Covid) or absurd values like 99999. We can also see that our mean price change was -0.037, indicating that the average change was a 3.7 cent decline in price. Our median (50th percentile), however, is 0, indicating that the most typical day saw no change.

Finally, our count can also tell us about missing values. Since count only counts non-null values, once again we can see that our dcoilwtico column has 1175 rows with non-missing data.

What about our other columns? It is possible to tell describe to focus on columns of specific data types. I prefer other methods for inspecting categorical columns, but let’s take a look at include="all", which will include all columns.
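Something like the following; note that the datetime_is_numeric argument applies to Pandas 1.x (in Pandas 2.0 and later it was removed, since datetime columns are summarized numerically by default):

# Describe every column, not just the numeric ones
oil_prices.describe(include="all", datetime_is_numeric=True)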

datetime_is_numeric=True allows us to look at the distribution for dates.

The output starts to get pretty ugly, but if we look at change_category, we can see that there were 1216 non-null values, 3 unique values, with the top, or most common value being negative. This value had a frequency (‘freq’) of 592, indicating 592 rows saw price declines.

The handful of methods we looked at here will get you oriented to your data quickly and allow you to identify many of the most common issues you’ll encounter with your data. There are hundreds of methods and functions in Pandas, but getting familiar with a handful will get you productive quickly. We’ll continue to explore this amazing Python library over time. Thanks for reading!

Interested in learning more about Python from Chris B. and the Maven pros?

Master Python in just 10 weeks:

Admissions for our immersive programs are now open!!

Key Deadlines:

Super Early Bird — Save 25%: Friday, March 29th

Early Bird — Save 10%: Friday, April 12th

Admissions Close: Friday, April 19th

Here’s what you can expect from each immersive…

✔️ Guided portfolio projects!

✔️ Access to your own private learning community!

✔️ Weekly, 1-hour live sessions with recordings!

✔️ Live support from expert instructors!

✔️ Unlimited Maven access for an entire year!

…and all of this is built into a program that’s designed to fit into your busy life, at a more affordable price.

Space is limited; secure your seat today!
