DataFrames with Pandas: Part 1 (Introduction & Slicing)

Md Aminul Islam
4 min read · May 12, 2023

Data scientists often work with large datasets, and most of the time that data is stored in tables. In this part, we discuss DataFrames, which are used to represent, analyse, and manipulate tabular data. We will use Pandas, the Python package for working with DataFrames.

First, we will import the pandas package. Then, we will load an elections dataset stored in a CSV file and show it. The elections variable is an instance of a Pandas DataFrame.

import pandas as pd

elections = pd.read_csv("elections.csv")
elections

A DataFrame holds two-dimensional tabular data made up of rows and columns. Each row represents a single record and has a row label; in this case 0, 1, 2, 3, and 4 are the row labels. Row labels need not be integers; they can be strings too. Together, the row labels are called the index. By default, Pandas assigns row labels as increasing integers starting from 0. Each column represents a single feature of the data. Each column has a label and a single data type shared by all of its values. Column names should be unique: Pandas does not strictly enforce this, but duplicate names make column selection ambiguous.
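To make the index, column labels, and per-column data types concrete, here is a minimal sketch. It builds a tiny stand-in table by hand instead of reading elections.csv; the values and the Party column are invented placeholders, not taken from the real dataset.

```python
import pandas as pd

# Toy stand-in for the elections table; every value here is an invented placeholder.
toy = pd.DataFrame({
    "Candidate": ["A", "B", "C"],
    "Party": ["X", "Y", "X"],        # hypothetical column, for illustration only
    "Popular vote": [100, 250, 175],
})

row_labels = list(toy.index)            # default row labels start at 0
column_labels = list(toy.columns)       # one label per column
vote_dtype = toy["Popular vote"].dtype  # a single dtype for the whole column

# Row labels can be strings too: here the Candidate column becomes the index.
by_name = toy.set_index("Candidate")

print(row_labels)
print(column_labels)
print(vote_dtype)
print(list(by_name.index))
```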

Slicing DataFrame

Slicing is a way to create a new DataFrame by taking some rows and columns from an existing DataFrame. Pandas provides the loc and iloc properties for slicing.

The loc property takes two arguments inside square brackets. The first is the row label and the second is the column label. The example below takes the row with index label 1 and the Candidate column.

elections.loc[1, 'Candidate']

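We can also slice out multiple rows and columns. Unlike ordinary Python slices, label slices with loc include both endpoints, so 0:3 selects the rows labelled 0 through 3.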

elections.loc[0:3, 'Candidate': 'Popular vote']

To get all the rows or all the columns, pass a bare colon (:) instead of labels.

elections.loc[:, 'Candidate': 'Popular vote']
elections.loc[0:3, :]
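The label slices above can be tried on a small hand-made table. This is only a sketch: the Year and Party columns and all values are invented placeholders, and the column order is an assumption chosen so that the Candidate to Popular vote slice spans three columns.

```python
import pandas as pd

# Toy stand-in for the elections table; all values are invented placeholders.
toy = pd.DataFrame({
    "Year": [2000, 2004, 2008, 2012, 2016],
    "Candidate": ["A", "B", "C", "D", "E"],
    "Party": ["X", "Y", "X", "Y", "X"],
    "Popular vote": [100, 250, 175, 300, 225],
})

# loc slices by label and includes both endpoints:
# rows 0, 1, 2, 3 and every column from Candidate through Popular vote.
block = toy.loc[0:3, "Candidate":"Popular vote"]
print(block.shape)          # (4, 3)
print(list(block.columns))  # ['Candidate', 'Party', 'Popular vote']

# A bare colon keeps everything along that axis.
all_rows = toy.loc[:, "Candidate":"Popular vote"]
all_cols = toy.loc[0:3, :]
print(all_rows.shape)       # (5, 3)
print(all_cols.shape)       # (4, 4)
```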

To get specific columns, pass the column names as a list.

elections.loc[:, ['Candidate', 'Popular vote']]

Now, let’s look at this:

elections.loc[:, 'Popular vote']

The output above is not a DataFrame. It is another Pandas data structure called a Series. If we select a single column by its label, the output is a pd.Series object. But if we pass the single column inside a list, the result is a DataFrame.

elections.loc[:, ['Popular vote']]

Similarly, selecting a single row also creates a Series object.

elections.loc[0, ['Candidate', 'Popular vote']]

But if you pass the single row as a list, it will produce a DataFrame object.

elections.loc[[0], ['Candidate', 'Popular vote']]
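The Series versus DataFrame distinction is easy to check with isinstance. The sketch below uses a small hand-made table with invented placeholder values rather than the real elections data.

```python
import pandas as pd

# Toy stand-in for the elections table; all values are invented placeholders.
toy = pd.DataFrame({
    "Candidate": ["A", "B", "C"],
    "Popular vote": [100, 250, 175],
})

single_col = toy.loc[:, "Popular vote"]                        # bare label -> Series
single_col_df = toy.loc[:, ["Popular vote"]]                   # list of one label -> DataFrame
single_row = toy.loc[0, ["Candidate", "Popular vote"]]         # bare row label -> Series
single_row_df = toy.loc[[0], ["Candidate", "Popular vote"]]    # list of one row label -> DataFrame

for result in (single_col, single_col_df, single_row, single_row_df):
    print(type(result).__name__, result.shape)
```

In short, a bare label drops that dimension, while a list keeps it, even when the list has only one element.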

In Pandas, we frequently select columns by name, so there is a shorthand for the loc operation.

elections['Candidate']
elections[['Candidate', 'Popular vote']]
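As a quick check that the bracket shorthand returns the same result as the longer loc form, the sketch below compares the two on a toy table. The values are invented placeholders, not rows from the real dataset.

```python
import pandas as pd

# Toy stand-in for the elections table; all values are invented placeholders.
toy = pd.DataFrame({
    "Candidate": ["A", "B", "C"],
    "Popular vote": [100, 250, 175],
})

# Single column: the shorthand and loc both return the same Series.
same_series = toy["Candidate"].equals(toy.loc[:, "Candidate"])

# List of columns: the shorthand and loc both return the same DataFrame.
same_frame = toy[["Candidate", "Popular vote"]].equals(
    toy.loc[:, ["Candidate", "Popular vote"]]
)

print(same_series, same_frame)
```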

Slicing with iloc is similar to slicing with loc. The difference is that iloc uses the integer positions of rows and columns, whereas loc uses their labels. Also, iloc slices exclude the end position, as in ordinary Python slicing. The example below takes the rows at positions 0 to 3 and the columns at positions 0 to 2.

elections.iloc[0:4, 0:3]
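The difference between positions and labels is easiest to see when the index is not the default 0, 1, 2, and so on. The sketch below assumes a toy table with string row labels; the labels, the Year and Party columns, and all values are invented for illustration.

```python
import pandas as pd

# Toy table with string row labels; all values are invented placeholders.
toy = pd.DataFrame(
    {
        "Year": [2000, 2004, 2008, 2012, 2016],
        "Candidate": ["A", "B", "C", "D", "E"],
        "Party": ["X", "Y", "X", "Y", "X"],
        "Popular vote": [100, 250, 175, 300, 225],
    },
    index=["a", "b", "c", "d", "e"],
)

# iloc: integer positions, end excluded -> rows 0..3, columns 0..2.
by_position = toy.iloc[0:4, 0:3]

# loc: labels, end included -> rows 'a'..'d', columns Year..Party.
by_label = toy.loc["a":"d", "Year":"Party"]

print(by_position.shape)             # (4, 3)
print(by_position.equals(by_label))  # True
```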

A shorthand for iloc row slicing is also possible. The example below takes the rows at positions 0 to 3 and all the columns.

elections[0:4]
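A small sketch, again on a toy table with invented placeholder values, showing that this row-slice shorthand matches the equivalent iloc call when the table has the default integer index:

```python
import pandas as pd

# Toy stand-in for the elections table; all values are invented placeholders.
toy = pd.DataFrame({
    "Candidate": ["A", "B", "C", "D", "E"],
    "Popular vote": [100, 250, 175, 300, 225],
})

first_four = toy[0:4]  # rows at positions 0..3, all columns
matches_iloc = first_four.equals(toy.iloc[0:4])

print(first_four.shape)  # (4, 2)
print(matches_iloc)      # True
```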

Today, we were introduced to the DataFrame and Series objects of the Pandas package. We saw how to load data into a DataFrame from a CSV file. We also learnt how to slice across rows and columns using the iloc and loc properties of a DataFrame. You can see this GitHub repository where all the examples and dataset mentioned here are available. You can experiment with loc and iloc using this repository. In the next part, we will see how to filter rows from a DataFrame, along with some other useful methods of DataFrame and Series. You can read part 2 from here. Please follow or subscribe if you want to be notified of future posts related to Data Science, Machine Learning, and Causal Inference.

Md Aminul Islam

I am a PhD student in Computer Science at the University of Illinois Chicago. My interests are in reasoning with data using Data Science and Machine Learning.