5 Fundamental Operations on a Pandas DataFrame

Published in School of ML · 5 min read · Aug 30, 2020

Tips and Tricks for Data Science

Pandas is a powerful and easy-to-use software library written in the Python programming language, and is used for data manipulation and analysis.

Installing pandas: https://pypi.org/project/pandas/

pip install pandas

What is a Pandas DataFrame?

A pandas DataFrame is a two-dimensional data structure that stores data in tabular form. Its rows and columns are labeled, and different columns can hold different data types.

Here is an example:

First 3 rows of the Titanic: Machine Learning from Disaster dataset

1. Creating a pandas DataFrame

The pandas.DataFrame constructor:

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

data
This parameter is the input used to build the DataFrame. It can be a NumPy ndarray, an iterable, a dict or another DataFrame. An ndarray is a multidimensional container of items of the same type and size. An iterable is any Python object that can return its members one at a time, so it can be looped over in a for-loop; lists and tuples are common examples. When data is a dict, its values can be pandas Series, arrays, constants or list-like objects.

index
This parameter accepts an Index or an array-like object and supplies the row labels of the resulting DataFrame. If the input data carries no row labels and no index is provided, it defaults to a RangeIndex (0, 1, 2, and so on).

columns
This parameter accepts an Index or an array-like object and supplies the column labels of the resulting DataFrame. If the input data carries no column labels and none are provided, it defaults to a RangeIndex (0, 1, 2, and so on).

dtype
Each column in the DataFrame can only have a single data type. This parameter forces one data type on the whole DataFrame. By default, the data type of each column is inferred from the data.

copy
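When this parameter is set to True and the input data is a DataFrame or a 2D ndarray, the data is copied into the resulting DataFrame. At the time of writing, copy defaults to False; in pandas 1.3 and later the default is None, which behaves like False for DataFrame or 2D ndarray input.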
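The effect of the index, columns and dtype parameters can be seen in a minimal sketch like the one below. The array values and the labels r1, r2, a, b and c are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: a 2x3 NumPy array of integers
arr = np.array([[1, 2, 3], [4, 5, 6]])

# No index or columns given: both default to a RangeIndex
default_df = pd.DataFrame(arr)

# Explicit row labels, column labels and a forced data type
labeled_df = pd.DataFrame(
    arr,
    index=['r1', 'r2'],
    columns=['a', 'b', 'c'],
    dtype='float64',
)

print(default_df)
print(labeled_df)
print(labeled_df.dtypes)
```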

Creating a Pandas DataFrame from a Python Dictionary

import pandas as pd

d = {'Name' : ['John', 'Adam', 'Jane'], 'Age' : [25, 18, 30]}
pd.DataFrame(d)

The index parameter can be used to change the default row labels, and the columns parameter can be used to select and reorder the keys. Names passed to columns must match the dictionary keys; a name that is not a key produces a column of missing values (NaN):

d = {'Name' : ['John', 'Adam', 'Jane'], 'Age' : [25, 18, 30]}
pd.DataFrame(d, index=[10, 20, 30], columns=['Age', 'Name'])

Creating a Pandas DataFrame from a list:

l = [['John', 25], ['Adam', 18], ['Jane', 30]]
pd.DataFrame(l, columns=['Name', 'Age'])

Creating a Pandas DataFrame from a File

In most Data Science workflows, the dataset is stored in files with formats like CSV (Comma Separated Values). The function pandas.read_csv() loads a CSV file into a DataFrame, using the header row of the file as the column labels by default.
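As a minimal sketch of reading CSV data, the example below uses an in-memory string in place of a file on disk, so it runs without any external file. The CSV content is hypothetical; with a real file, a path would be passed to pandas.read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "Name,Age\nJohn,25\nAdam,18\nJane,30\n"

# read_csv accepts a file path or any file-like object;
# the first line is used as the column labels by default
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```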

2. Selecting Rows and Columns from a Pandas DataFrame

Selecting Columns from a Pandas DataFrame

Columns can be selected using their column names.

df[[column_1, column_2]]

Selecting column ‘Name’ from DataFrame df
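A short sketch of column selection, using a small made-up DataFrame: a single column name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]})

# A single label returns a pandas Series
names = df['Name']

# A list of labels returns a DataFrame with those columns, in that order
subset = df[['Age', 'Name']]

print(names)
print(subset)
```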

Selecting Rows from a Pandas DataFrame

Pandas provides two attributes for selecting rows from a DataFrame: loc and iloc.

loc is label-based, which means that the row label has to be specified, and iloc is integer-based, which means that the zero-based integer position of the row has to be specified.

Using loc and iloc for selecting rows from DataFrame df
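The difference between the two is easiest to see when the row labels are not 0, 1, 2. The sketch below assumes a small made-up DataFrame with row labels 10, 20 and 30.

```python
import pandas as pd

# Illustrative data with non-default row labels
df = pd.DataFrame(
    {'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]},
    index=[10, 20, 30],
)

# loc selects by row label
row_by_label = df.loc[20]

# iloc selects by zero-based integer position
row_by_position = df.iloc[1]

# Slices: loc includes the end label, iloc excludes the end position
label_slice = df.loc[10:20]
position_slice = df.iloc[0:1]

print(row_by_label)
print(label_slice)
print(position_slice)
```

Here df.loc[20] and df.iloc[1] return the same row, because the label 20 sits at position 1.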

3. Inserting Rows and Columns to a Pandas DataFrame

Inserting Rows to a Pandas DataFrame

One method of inserting a row into a DataFrame is to create a pandas.Series object whose index matches the column labels of the DataFrame and add it at the end of the DataFrame. The pandas.DataFrame.append() method originally used for this was deprecated in pandas 1.4 and removed in pandas 2.0; in current versions, pandas.concat() does the same job.
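Because DataFrame.append() is no longer available in pandas 2.0 and later, the sketch below uses pandas.concat() to add a Series as a new last row. The data and the new row for Kelly are made up for illustration.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]})

# The Series index matches the column labels of the DataFrame
new_row = pd.Series({'Name': 'Kelly', 'Age': 22})

# Turn the Series into a one-row DataFrame and concatenate it at the end;
# ignore_index=True renumbers the rows 0..n-1
df = pd.concat([df, new_row.to_frame().T], ignore_index=True)

# The mixed-type Series is stored as object, so re-infer the column dtypes
df = df.infer_objects()

print(df)
```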

Inserting Columns to a Pandas DataFrame

One easy method of adding a column to a DataFrame is by just referring to the new column and assigning values.

Inserting columns ID, Score and Country to DataFrame df
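A minimal sketch of column assignment follows. The column names ID, Score and Country come from the caption above; the values are made up for illustration.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]})

# A list must have one value per row
df['ID'] = [101, 102, 103]
df['Score'] = [88.5, 92.0, 79.5]

# A single value is repeated for every row
df['Country'] = 'USA'

print(df)
```

New columns are added at the right-hand side of the DataFrame, in the order they are assigned.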

4. Deleting Rows and Columns from a Pandas DataFrame

Deleting Rows from a Pandas DataFrame

A row can be deleted by passing its row label to the method pandas.DataFrame.drop().

Deleting row with label 1 from DataFrame df
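As a sketch with made-up data, dropping the row labeled 1 looks like this. Note that drop() returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Illustrative data with the default row labels 0, 1, 2
df = pd.DataFrame({'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]})

# Drop the row whose label is 1; df itself is not modified
without_row_1 = df.drop(1)

print(without_row_1)
```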

To delete a row based on a value in a column, the matching rows are selected first, their labels are read from the DataFrame.index attribute, and those labels are then passed to the pandas.DataFrame.drop() method.

Deleting row with Name Kelly from DataFrame df
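The two steps can be sketched as follows, using a small made-up DataFrame that contains a row for Kelly. The boolean filter at the end is an alternative that gives the same result here; it is not part of the approach described above.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'Name': ['John', 'Kelly', 'Jane'], 'Age': [25, 22, 30]})

# Step 1: get the labels of the rows where Name equals 'Kelly'
labels = df[df['Name'] == 'Kelly'].index

# Step 2: drop the rows with those labels
result = df.drop(labels)

# Alternative: keep only the rows that do not match
filtered = df[df['Name'] != 'Kelly']

print(result)
```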

Deleting Columns from a Pandas DataFrame

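A column can be deleted from a DataFrame with the method pandas.DataFrame.drop(), either by its label or by its position. drop() itself works with labels, so a position is first translated into a label through the DataFrame.columns attribute.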

Deleting column with label ‘Country’ from DataFrame df

Deleting column with position 2 from DataFrame df

The axis argument is set to 1 when dropping columns, and 0 when dropping rows.
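Both ways of dropping a column can be sketched as below, with made-up data. Position 2 refers to the third column, since positions start at 0.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({
    'Name': ['John', 'Adam', 'Jane'],
    'Age': [25, 18, 30],
    'Country': ['USA', 'UK', 'Canada'],
})

# By label, with axis=1 to indicate columns
no_country = df.drop('Country', axis=1)

# Equivalent spelling using the columns keyword
no_country_kw = df.drop(columns='Country')

# By position: look up the label at position 2, then drop it
no_third = df.drop(df.columns[2], axis=1)

print(no_country)
```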

5. Sorting a Pandas DataFrame

A Pandas DataFrame can be sorted using the pandas.DataFrame.sort_values() method. The by parameter takes the label of the column to sort by (or a list of labels), and ascending is set to True for ascending order, which is the default, or to False for descending order.
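A short sketch of sorting in both directions, using made-up data. sort_values() returns a new, sorted DataFrame and keeps the original row labels.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({'Name': ['John', 'Adam', 'Jane'], 'Age': [25, 18, 30]})

# Ascending order (the default)
by_age = df.sort_values(by='Age')

# Descending order
by_age_desc = df.sort_values(by='Age', ascending=False)

print(by_age)
print(by_age_desc)
```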