Pandas 101: Indexing

Hello Everyone,

It’s been a while, I’m sorry; I’ve been lazy! I am currently traveling in China & Southeast Asia. The internet at my grandma’s is not the best…and all they do is play mahjong.

Today, I am going to talk about Pandas! No, not the cuddly bears (I haven’t even seen any in China yet), but the awesome Python library that helps you manipulate data! I’m going to go into a bit of depth on its two main concepts, Series & Dataframes. Then I’ll cover what I have found most helpful: becoming fluent in indexing dataframes. Feel free to download the Jupyter notebook exercises from my GitHub; I’ll go through them later here. If you don’t know how to use Jupyter notebooks, here is their guide on how to get started.

Pandas

So what is Pandas? Pandas is yet another Python library, and it is mostly built on top of NumPy. It keeps the fast array operations of NumPy and adds labeled rows and columns on top. If you don’t remember what NumPy is, you can read my earlier blog post on it. Pandas is conventionally imported under the alias pd:

import pandas as pd

And really, the two things you need to know about the Pandas library are the Dataframe object & the Series object.

Series

I think of a Series object as basically a one-dimensional ndarray. The primary difference is that a Series pairs each value with a data label, or “index”. By default, the index is incrementing integers starting at 0. However, as you can see below, you can set it to whatever you want, and you can also manipulate and sort by the index. This concept becomes more relevant and important when we talk about Dataframes.

series_example = pd.Series([1,2,3,4], index=['d', 'b', 'a', 'c'])
output:
d    1
b    2
a    3
c    4
dtype: int64
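To make the index idea a bit more concrete, here is a small sketch that reuses series_example from above. The names by_label, sorted_series, and default_series are just ones I made up for illustration.

```python
import pandas as pd

series_example = pd.Series([1, 2, 3, 4], index=['d', 'b', 'a', 'c'])

# Look up a value by its label rather than by its position
by_label = series_example['a']

# Sort the rows by index label; each value travels with its label
sorted_series = series_example.sort_index()

# Without an explicit index, pandas falls back to 0, 1, 2, ...
default_series = pd.Series([1, 2, 3, 4])

print(by_label)
print(sorted_series)
print(list(default_series.index))
```

Notice that sorting by the index reorders the values too, because every value stays attached to its label.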

Dataframes

A Dataframe object can be described as a bunch of Series that share the same row index, or as something like a 2-dimensional ndarray. For those coming from the good old days of Excel spreadsheets, another way of thinking about it is that it’s basically a spreadsheet (rows & columns).

Dataframes can be created in a couple of ways. However, the three primary arguments you should at least remember are “data”, “columns”, and “index”. Data is typically passed in as a dictionary of lists or Series, keyed by column name. Columns are the column names, or column “indexes”. Index is the row labels, or row “indexes”. By default, just like when creating a Series, the row index will be incrementing integers.

dataset = pd.DataFrame(data = {'age': [10,28,30], 'weight': [120,133,155],'height': [160,165,175], 'color':['blue','green','pink']}, columns = ["age","weight", "height","color"])
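Here is a rough sketch of the "bunch of Series sharing a row index" idea, plus the index argument that the example above leaves at its default. The row labels 'a', 'b', 'c' and the names data, age, and labeled are my own made-up ones for illustration.

```python
import pandas as pd

data = {
    'age': [10, 28, 30],
    'weight': [120, 133, 155],
    'height': [160, 165, 175],
    'color': ['blue', 'green', 'pink'],
}
dataset = pd.DataFrame(data=data, columns=['age', 'weight', 'height', 'color'])

# With no index argument, the row labels default to 0, 1, 2
print(list(dataset.index))

# Pulling out one column gives back a Series that shares the row index
age = dataset['age']
print(type(age).__name__)
print(age.index.equals(dataset.index))

# Passing index sets custom row labels instead (hypothetical labels)
labeled = pd.DataFrame(data=data, index=['a', 'b', 'c'])
print(labeled)
```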

Manipulating Dataframes: Indexing

So here is the foundation of manipulating dataframes: understanding how you can retrieve parts of your dataframe. I am going to go through three main ways to do that.

Original Standard Indexing

The standard way of indexing is through [] notation. To call upon columns you can use the following:

Through dict-like notation

dataset["age"]

Through attribute notation

dataset.age

To retrieve certain rows you can slice by row position

dataset[0:2]

To be fancy and select only certain rows & certain columns, you can chain the two

dataset[0:2]["age"]

You can see here that with [] I can only select columns by their label and rows by a slice of positions.
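Putting the [] variations side by side, a minimal sketch could look like this. The variable names are mine, and the list-of-names form (which gives back a smaller dataframe instead of a Series) is an extra I added that is not shown above.

```python
import pandas as pd

dataset = pd.DataFrame(
    data={'age': [10, 28, 30], 'weight': [120, 133, 155],
          'height': [160, 165, 175], 'color': ['blue', 'green', 'pink']},
    columns=['age', 'weight', 'height', 'color'],
)

# One column name gives back a Series; attribute access is equivalent here
age_series = dataset['age']
age_attr = dataset.age

# A list of column names gives back a smaller dataframe
subset = dataset[['age', 'color']]

# A slice selects rows by position (end position excluded)
first_two = dataset[0:2]

# Chaining the two: first slice the rows, then pick a column
first_two_ages = dataset[0:2]['age']

print(first_two_ages.tolist())
```

Attribute notation only works when the column name is a valid Python name and does not clash with an existing dataframe attribute, so the dict-like form is the safer habit.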

To add a condition on the rows you want you can do something like this

dataset[dataset["age"] > 15]

You can read that as dataset["age"] > 15 returning a boolean Series of True and False values, one per row. Indexing with it then returns the entire rows where the condition evaluated to True.
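If it helps, you can split the condition into two steps and look at the intermediate result. This is only a sketch; mask, older, and tall_and_older are names I picked, and the second condition on height is my own addition to show how conditions combine.

```python
import pandas as pd

dataset = pd.DataFrame(
    data={'age': [10, 28, 30], 'weight': [120, 133, 155],
          'height': [160, 165, 175], 'color': ['blue', 'green', 'pink']},
    columns=['age', 'weight', 'height', 'color'],
)

# Step 1: the comparison produces one True/False per row
mask = dataset['age'] > 15
print(mask.tolist())

# Step 2: indexing with the mask keeps only the rows marked True
older = dataset[mask]

# Conditions combine with & (and) or | (or); wrap each one in parentheses
tall_and_older = dataset[(dataset['age'] > 15) & (dataset['height'] > 170)]

print(older)
print(tall_and_older)
```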

Position Indexing: .iloc

.iloc is an indexer on the dataframe that lets you retrieve data based on position. Its syntax is dataframe.iloc[row_indexer, column_indexer]. Note: this way of retrieving data is EXCLUSIVE of the end position, just like usual array indexing.

dataset.iloc[0:3,0:2]

This gives me the first 3 rows and the first 2 columns of my dataframe. This is pretty useful when you don’t have the energy to spell out all your column names, and it is convenient for looping.
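A few more position-based lookups, sketched on the same dataset. The variable names are mine, and the loop is just one way the "convenient for looping" point could play out.

```python
import pandas as pd

dataset = pd.DataFrame(
    data={'age': [10, 28, 30], 'weight': [120, 133, 155],
          'height': [160, 165, 175], 'color': ['blue', 'green', 'pink']},
    columns=['age', 'weight', 'height', 'color'],
)

# Rows 0, 1, 2 and columns 0, 1 (both end positions are excluded)
first_block = dataset.iloc[0:3, 0:2]

# A single cell: second row, third column (height)
single_value = dataset.iloc[1, 2]

# Negative positions count from the end, as with Python lists
last_row = dataset.iloc[-1]

# Looping over columns by position, without typing any column names
for position in range(dataset.shape[1]):
    column = dataset.iloc[:, position]
    print(column.name, column.tolist())
```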

Label Indexing: .loc

.loc retrieves data using the row and column index labels: dataframe.loc[row_label, column_label]. Note that it is actually INCLUSIVE of the end label. Therefore the below will return the same as the above.

dataset.loc[0:2,["age","weight"]]

Beware: this actually grabs the first three rows and the first two columns of my dataframe. In this case, my row index happens to be numerical, and therefore the “position” and “label” are the same. But if my row labels were something like “a, b, c”, calling “0:2” would throw an error.
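To see the difference, here is a sketch with made-up string row labels 'a', 'b', 'c' (the dataset above does not have them). The exact exception type may depend on your pandas version, so the example catches both of the likely ones instead of promising a specific one.

```python
import pandas as pd

labeled = pd.DataFrame(
    data={'age': [10, 28, 30], 'weight': [120, 133, 155],
          'height': [160, 165, 175], 'color': ['blue', 'green', 'pink']},
    index=['a', 'b', 'c'],  # hypothetical row labels for illustration
)

# Label slice: INCLUSIVE of the end label, so this returns rows 'a' and 'b'
by_label = labeled.loc['a':'b', ['age', 'weight']]

# Position slice: EXCLUSIVE of the end position, so this is rows 0 and 1
by_position = labeled.iloc[0:2, 0:2]

print(by_label.equals(by_position))

# Integer positions are not valid labels once the index holds strings
raised = False
try:
    labeled.loc[0:2, ['age', 'weight']]
except (TypeError, KeyError) as error:
    raised = True
    print(type(error).__name__)
```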

Conclusion

If you have made it this far, I am super proud of you. The key takeaway here is to understand Series & Dataframes and how everything builds on top of each other. It’s also fairly important to get familiar with the basics of calling different parts of your dataframe. This blog post was getting a little too long, but in my exercises there is a bit more on nifty tricks with indexes. As always, the learning can sometimes feel like a black hole, and there are endless ways you can do things. My pro tip is to just get the basics down and google everything else later! Please comment below on what else you would like to see next, where I messed up, and how to make these posts better!