Indexing and Querying data frames using Pandas

In the previous article we saw how we can use pandas to load a CSV file and perform basic cleaning tasks. In this article, I want to talk about how pandas can be used to index and query data frames. This can be useful while performing exploratory analysis of data. I found the Coursera Data Analysis with Python course very useful and would encourage everyone to check out the course. I'm using the olympics.csv file, which is available as part of the Course 1 downloads from the course. Link to files.

The first step is loading the file into pandas.
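A minimal sketch of the load, assuming olympics.csv sits in the working directory:

import pandas as pd

# read the raw file as-is; the real header is actually in the second row
df = pd.read_csv('olympics.csv')
df.head()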

The first row has a NaN value in the index position, and it is clear that the values it contains are really column names. We can use the index_col parameter to indicate which column should be the index, and the header parameter to indicate which row of the data file should be used as the header. Let's re-import the data, set the index column to 0 (the first column), and have the column headers read from the second row of data. We can do this by using the skiprows parameter to tell pandas to ignore the first row, which was made up of numeric column names.
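A sketch of that re-import:

# use the first column as the index and skip the numeric header row
df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)
df.head()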

Since our column names are not properly named, we should clean up the column headers. Pandas stores a list of all of the columns in the .columns attribute. We can change the values of the column names by iterating over this list and calling the rename method of the data frame. Let's print our column names to see what we are dealing with here.
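Inspecting the raw headers is just a matter of printing the attribute:

print(df.columns)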

Here we just iterate through all of the columns, looking to see if they start with 01, 02, 03 or the № character. If they do, we can call rename and set the columns parameter to a dictionary with the key being the column we want to replace and the value being the new name we want. The inplace parameter is used to update the data frame directly.
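A sketch of that loop, assuming the raw headers follow the course file's pattern (medal columns beginning with '01 !', '02 !', '03 !' and games-count columns beginning with '№'):

for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)    # summer/winter gold counts
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)       # number of games
df.head()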

Boolean mask

A Boolean mask is an array which can be of one dimension like a series, or two dimensions like a data frame, where each of the values in the array is either true or false. This array is essentially overlaid on top of the data structure that we're querying. Any cell aligned with a true value will be admitted into our final result, and any cell aligned with a false value will not.

For instance, say I want to see only those countries which have achieved a gold medal during the Summer Olympics. To do this, we first build a boolean mask for the query.
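The mask is simply a comparison on the Gold column, which returns a series of true/false values:

df['Gold'] > 0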

Now let's impose this boolean mask onto our data frame to see the results. Only countries which satisfy this condition should now be seen.
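We can overlay the mask with the where function, which keeps the original shape of the data frame:

only_gold = df.where(df['Gold'] > 0)
only_gold.head()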

We see that the resulting data frame keeps the original index values, and only data from countries that met the condition is retained. All of the countries which did not meet the condition have NaN data instead. When we use the only_gold['Gold'].count() command, this returns a value of 100, which is what we need. However, when we use df['Gold'].count(), this returns a value of 147. We might want to drop these additional rows which have NaN values. We can use the dropna function to achieve this.
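A quick sketch of the counting and dropping described above:

only_gold['Gold'].count()        # 100 -- count ignores NaN values
df['Gold'].count()               # 147 -- every country in the original frame
only_gold = only_gold.dropna()   # drop the rows that did not meet the condition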

We can even query this data frame with the following two lines and omit the where function entirely, by passing the boolean mask directly as an indexer.

only_gold = df[df['Gold'] > 0]
only_gold.head()

Now let's say we want to look at all the countries which received a gold in the Summer Olympics or a gold in the Winter Olympics. In other words, we want to see all countries which have won a gold at some point.

len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])

When we apply this expression to the data frame, we can see that it returns 101, which means that 101 countries have won a gold at some point.

Have there been any countries which have won a gold only in the Winter Olympics and never in the Summer Olympics? We can query that with

df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]

There is just one country which is returned by this query.

In our original data frame the default index is the country name. Using pandas we can re-index the data frame by any column we wish. Let's index the data frame by the number of gold medals won in the summer games. To do this, we need to preserve our original index, the country name, else it will be lost. Once we save it, let's change the data frame to be indexed by the number of gold medals won during the Summer Olympics.

df['country'] = df.index
df = df.set_index('Gold')
df.head()

Let's move to another data file, census.csv. This file is part of the Course 1 downloads from the course. Link to the files. Let's load the data frame and take a look at the data.
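Loading it looks the same as before, assuming the file name census.csv:

df = pd.read_csv('census.csv')
df.head()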

Just like SQL, we can query for the unique values in a column using the unique function.

df['SUMLEV'].unique()

There are just two unique values in this column, 40 and 50. We can also query for a particular value within a column.
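For example, to keep only the county-level rows (SUMLEV equal to 50), we can use a boolean mask again:

df = df[df['SUMLEV'] == 50]
df.head()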

Also, we can keep a list of the columns we want and remove all other columns. Let's just keep the total population estimates and total births. We can do this by creating a list of the column names that we want to keep, then projecting those columns and assigning the resulting data frame to our df variable.
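A sketch of that projection; the exact column names below are assumed to follow the census file's BIRTHS/POPESTIMATE naming pattern:

columns_to_keep = ['STNAME', 'CTYNAME',
                   'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012',
                   'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012']
df = df[columns_to_keep]   # project just these columns
df.head()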

We can load the data and set the index to be a combination of the state and county values and see how pandas handles it in a DataFrame. We do this by creating a list of the column identifiers we want to have indexed.
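Setting a multi-level index is just set_index with a list of columns, here the state and county name columns:

df = df.set_index(['STNAME', 'CTYNAME'])
df.head()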

We can also query the data frame using a specific value to see all rows for that value or query a combination of values.
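With a two-level index, loc takes the index values in order; the state and county names below are only illustrative:

# all data for one state/county combination
df.loc['Michigan', 'Washtenaw County']

# a combination of values, passed as a list of tuples
df.loc[[('Michigan', 'Washtenaw County'),
        ('Michigan', 'Wayne County')]]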

Moving on to another file, log.csv, also available as part of the Course 2 downloads. Link to the files. As usual, the first step is to load the data into pandas.
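Loading it follows the same pattern:

df = pd.read_csv('log.csv')
df.head()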

We can see that the rows are not ordered by time. Let's sort on the time column. To do this we can set the time column as the index and then sort by the index.

df = df.set_index('time')
df = df.sort_index()
df

We can see that we have more than one user logged in at the same time. Let's create a dual index on time and user.

df = df.reset_index()
df = df.set_index(['time', 'user'])
df

This is very useful because we can see that at the same time two users, cheryl and sue, logged in. We can also see that we have many missing values in our data frame. Pandas has a range of options: we can either fill an individual missing value with a different value, or we can iterate through a column and fill a missing value with the value from the previous row. df.fillna is the function we can use to fill in missing values. Pandas also lets you look at the various options for a specific function; you can just type df.fillna?

Let's try the forward-fill method on our data frame.

method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid

ffill will replace the missing value with the last valid observation.

df = df.fillna(method='ffill')
df

We can see that all the NaN values after 10.0, up until the next valid value of 5.0, were replaced with 10.0.

Thanks for reading and hope you enjoyed this short article about indexing and querying data frames using pandas.