Dataframe in Python
Introduction
The Python programming language, created by Guido van Rossum and first released in 1991, was ranked the third most popular programming language in the Stack Overflow Developer Survey 2021. The NumPy and pandas packages are the preferred choices of 33.84% and 28.12% of developers globally and are primarily used in Data Science and Machine Learning. There is so much you can do with Python and its frameworks. This blog introduces the DataFrame in Python along with various examples and programs. The DataFrame is the primary data structure of the pandas package. pandas provides fast, flexible, and expressive data structures designed to make working with relational or labeled data easier. Its two primary data structures, the Series (a one-dimensional labeled array) and the DataFrame (two-dimensional), are frequently used in fields such as Finance and Statistics.
What is a DataFrame in Python?
A DataFrame is defined as a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. (Size-mutable means columns can be inserted into and deleted from a DataFrame; potentially heterogeneous means different columns may hold different data types.) The DataFrame is the main pandas data structure. It acts as a container for Series objects, and columns are inserted into and deleted from it in a dictionary-like fashion.
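To make the definition above more concrete, here is a minimal sketch showing that a column of a DataFrame is a Series, that columns can hold different data types, and that columns are inserted and deleted like dictionary entries. The sample names and cities are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one text column and one numeric column
df = pd.DataFrame({"Name": ["ABC", "DEF"], "Age": [12, 13]})

# Selecting a column by its label returns a pandas Series
name_column = df["Name"]
print(type(name_column).__name__)

# Each column keeps its own data type (heterogeneous data)
print(df.dtypes)

# Columns are inserted and deleted like dictionary entries
df["City"] = ["Delhi", "Pune"]   # insert a column
del df["Age"]                    # delete a column
print(list(df.columns))
```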
The constructor for creating a DataFrame is:
DataFrame([data, index, columns, dtype, copy])
The parameters of the constructor are as follows:
- data: the values to store, such as a NumPy ndarray, a list, a dictionary, a Series, or another DataFrame
- index: the row labels; if the data carries none, they default to 0, 1, 2, ...
- columns: the column labels; if the data carries none, they default to 0, 1, 2, ...
- dtype: a single data type to force on the data
- copy: whether to copy the input data
The three main components of a pandas DataFrame are the data, the index, and the columns. The data can be a pandas Series, another pandas DataFrame, a two-dimensional NumPy ndarray, a dictionary of one-dimensional ndarrays, lists, dictionaries, or Series. Basic operations on a DataFrame such as creation, insertion, deletion, and renaming are discussed in this blog.
Creation of DataFrame
In real-world projects, a DataFrame is usually created by loading data from a database, a CSV file, or an Excel file. As a starting point, let's see how a DataFrame can be created in different ways apart from directly importing tabular data:
- Creating a DataFrame using List
A DataFrame can be created using a single list or a list of lists.
Example 1
This example uses a single list for the creation of a DataFrame.
import pandas as pd
# This will create an empty DataFrame
df1 = pd.DataFrame()
print(df1)
listOne = ['DataFrame', 'in','Python']
# Calling DataFrame constructor on list
df2 = pd.DataFrame(listOne)
print(df2)
The above code when executed will generate a Pandas DataFrame as shown below
Empty DataFrame
Columns: []
Index: []
0
0 DataFrame
1 in
2 Python
Explanation
In the above example, a Python list is created. The list is then converted to a DataFrame using the DataFrame() constructor of the pandas package. The DataFrame is then printed. Note that as we didn't specify the column name or the row indices, pandas assigns them by default and gives zero-based integer labels to the rows and to the column.
You can also specify index and column names by doing a minor change in the code above.
Example 2
import pandas as pd
listOne = ['DataFrame', 'in', 'Python']
# Calling DataFrame constructor on list
df = pd.DataFrame(listOne, index=['i1', 'i2', 'i3'], columns=['Names'])
print(df)
The above code when executed will generate a Pandas DataFrame as shown below:
Names
i1 DataFrame
i2 in
i3 Python
Explanation:
In the above example, a Python list, listOne, is created. The list is then converted to a pandas DataFrame using the DataFrame() constructor. The argument index=['i1', 'i2', 'i3'] specifies the row indices and columns=['Names'] specifies the column name. The DataFrame is then printed. Note that the DataFrame now has the row indices and the column name that we specified while calling the DataFrame constructor.
Example 3
This example uses a multi-dimensional list with column names and index specified.
import pandas as pd
listTwo = [['Alexa', 10], ['Siri', '20'], ['Echo', 30]]
# Calling DataFrame constructor on list
df2 = pd.DataFrame(listTwo, index=['a', 'b', 'c'], columns=['Name', 'Score'])
print(df2)
Its output is as follows:
Name Score
a Alexa 10
b Siri 20
c Echo 30
Explanation:
In the above example, a list of lists is used. Each inner list is treated as an individual row of the DataFrame. The column names, columns=['Name', 'Score'], and the indices, index=['a', 'b', 'c'], are also specified.
Example 4
This example uses a multi-dimensional list with the column names and index specified, and then converts the Score column to a floating-point type.
import pandas as pd
listTwo = [['Alexa', 10], ['Siri', '20'], ['Echo', 30]]
# Calling DataFrame constructor on list
df2 = pd.DataFrame(listTwo, index=['a', 'b', 'c'], columns=['Name', 'Score'])
# Converting only the Score column to float
df2['Score'] = df2['Score'].astype(float)
print(df2)
Its output is as follows:
Name Score
a Alexa 10.0
b Siri 20.0
c Echo 30.0
Notice that the Score column now holds floating-point values.
Explanation:
In the above example, a list of lists is used, and the row indices and column names are specified when calling the constructor. The constructor also accepts a dtype parameter, but it applies to every column: because the Name column holds strings, passing dtype=float here raises an error in recent pandas versions (older versions silently left Name unchanged). Converting the Score column with astype(float) after construction gives the same result on any version.
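Since the dtype argument of the constructor is applied to the whole frame rather than to one column, it is most predictable when every column is numeric. The sketch below, using made-up scores, contrasts that case with converting a single column of a mixed frame through astype(). The column names Test1 and Test2 are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical all-numeric data, so every column can be cast to float
scores = [[10, 20], [30, 40]]
df_float = pd.DataFrame(scores, columns=["Test1", "Test2"], dtype=float)
print(df_float.dtypes)

# For mixed data, convert just the numeric column after construction
df_mixed = pd.DataFrame([["Alexa", 10], ["Siri", 20]], columns=["Name", "Score"])
df_mixed["Score"] = df_mixed["Score"].astype(float)
print(df_mixed.dtypes)
```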
- Creating a DataFrame from dictionary of lists
Dictionaries store data as key-value pairs. In the examples below, the keys become the column names and the values become the data inside those columns.
Example 1
import pandas as pd
data = {'Name': ['ABC', 'DEF', 'GHI'], 'Age': [12, 13, 14]}
df = pd.DataFrame(data)
print(df)
The output of the above code is
Name Age
0 ABC 12
1 DEF 13
2 GHI 14
Explanation
In the above example, a dictionary of lists is created. Note that the values corresponding to the keys are lists. The dictionary of lists is then converted to a DataFrame using the DataFrame() constructor. The DataFrame is then printed.
Example 2:
The above example was quite straightforward. Let's consider another example in which we are given three lists: we will convert those lists to a dictionary and then the dictionary to a DataFrame.
import pandas as pd
# Three Lists
name = ['ABC','DEF','GHI','JKL']
age = [20, 22, 26, 28]
qualifications = ['BA', 'B.Tech','B.Tech + MBA', 'CA']
# Defining a dictionary containing the name, age
# and qualifications
data_persons = {
'Name': name,
'Age' : age,
'Qualifications' : qualifications
}
# Converting the dictionary into a DataFrame
df = pd.DataFrame(data_persons)
print(df)
The output of the above code is:
Name Age Qualifications
0 ABC 20 BA
1 DEF 22 B.Tech
2 GHI 26 B.Tech + MBA
3 JKL 28 CA
Explanation:
In the above example, firstly three lists are created, name, age, and qualifications respectively. A dictionary is then created with the Keys, ‘Name’, ‘Age’ and ‘Qualifications’. The values corresponding to these keys are the lists, name, age, and qualifications respectively. A dictionary of lists is thus created. This is then converted to a DataFrame using the DataFrame() constructor and printed. Note that each key of the dictionary is a column in a DataFrame. The row indexing is zero-based.
- Creating a DataFrame from List of Dictionaries
A list of dictionaries can be passed as input data to create a DataFrame. By default, the dictionary keys will be taken as the column names.
Example 1
import pandas as pd
# {'a': 1, 'b': 2} is first dictionary
# {'a': 3, 'b' : 4, 'c' : 7} is second dictionary
data = [{'a': 1, 'b': 2}, {'a': 3, 'b' : 4, 'c' : 7}]
# Converting the list of dictionaries to a DataFrame
df = pd.DataFrame(data)
print(df)
Its output is as follows:
a b c
0 1 2 NaN
1 3 4 7.0
Explanation:
In the above example, a list of dictionaries is created. Within each dictionary, a colon (:) separates each key from its value. The list of dictionaries is then converted to a DataFrame using the DataFrame() constructor.
Note that on conversion to a DataFrame, each key becomes a column name and each dictionary becomes one row of the DataFrame. Also, it is worth noticing that the first dictionary, {'a': 1, 'b': 2}, has only two key-value pairs, unlike the second dictionary, {'a': 3, 'b': 4, 'c': 7}. So in the resultant DataFrame, the value in the third column of the first row is NaN. NaN stands for Not a Number and is one of the common ways to represent a missing value in the data. Because NaN is a floating-point value, the whole column c is stored as float, which is why 7 is printed as 7.0.
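As a small sketch of how such missing values can be inspected and handled, the snippet below reuses the same list of dictionaries. The fill value 0 is an arbitrary choice for illustration, not a recommendation.

```python
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 7}]
df = pd.DataFrame(data)

# isna() marks the cells where a value is missing
print(df.isna())

# fillna() returns a new DataFrame with the gaps replaced
# (0 is an arbitrary fill value chosen for this sketch)
filled = df.fillna(0)
print(filled)
```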
It’s always recommended to specify the row indices. The row indices and column indices are very handy in other manipulation operations.
Example 2
The below example shows how to create a DataFrame with a list of dictionaries, row indices, and column indices.
import pandas as pd
# {'a': 1, 'b': 2} is first dictionary
# {'a': 3, 'b': 4, 'c': 7} is second dictionary
data = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4, 'c': 7}]
# Converting the list of dictionaries to a DataFrame
df = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b', 'c'])
print(df)
Its output is as follows:
a b c
first 1 2 NaN
second 3 4 7.0
Explanation:
In the above example, a list of dictionaries is created. The list of dictionaries is then converted to a DataFrame using the DataFrame() constructor. We have also specified the row indices using index=['first', 'second'] and the column names using columns=['a', 'b', 'c']. Note that on conversion to a DataFrame, each key becomes a column name and each dictionary becomes one row of the DataFrame. The DataFrame is then printed. Note that the column names and row indices are now the ones we specified and not the default zero-based ones.
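The list of accepted inputs earlier in this post also mentions NumPy ndarrays, which the examples above do not cover. A minimal sketch, assuming a small two-dimensional array of made-up values and illustrative row and column labels:

```python
import numpy as np
import pandas as pd

# A 2 x 3 array of made-up values
array = np.array([[1, 2, 3], [4, 5, 6]])

# Each row of the array becomes a row of the DataFrame
df = pd.DataFrame(array, index=['r1', 'r2'], columns=['x', 'y', 'z'])
print(df)
```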
Fundamental DataFrame Operations
- Dimensions of DataFrame
The shape attribute is used to get the height and width of the DataFrame.
- Head of DataFrame
The head() method is used to get the first five rows of the DataFrame and the tail() method is used to get the last five rows.
- Locate Row
The loc attribute is used to return one or more rows by their index labels.
- Get the Column Names
You can get the column names by using the columns attribute of a DataFrame object.
- Get the Data types of all the columns
To fetch the data type of each column in the DataFrame, use dtypes.
There are many other fundamental DataFrame operations as well. The below program illustrates the fundamental operations discussed above.
import pandas as pd
# Three Lists
name = ['ABC','DEF','GHI','JKL','MNO','PQR']
age = [20, 22, 26, 28, 29, 30]
qualifications = ['BA', 'B.Tech','B.Tech + MBA', 'CA', 'Intermediate', 'MBBS']
# Defining a dictionary containing the name, age
# and qualifications
data_persons = {
'Name': name,
'Age' : age,
'Qualifications' : qualifications
}
# Converting the dictionary into a DataFrame
df = pd.DataFrame(data_persons)
print(df)
# Shape
print("\n\nPrinting the shape of dataframe")
print(df.shape)
# Head
print("\n\nPrinting the head of dataframe")
print(df.head())
# Tail
print("\n\nPrinting the tail of dataframe")
print(df.tail())
# Location of row
print("\n\nPrinting the third row of dataframe")
print(df.loc[2])
# Column Names
print("\n\nPrinting the column names of dataframe")
print(df.columns)
# Data types
print("\n\nPrinting the Data Types of dataframe")
print(df.dtypes)
The output of the above program is:
Name Age Qualifications
0 ABC 20 BA
1 DEF 22 B.Tech
2 GHI 26 B.Tech + MBA
3 JKL 28 CA
4 MNO 29 Intermediate
5 PQR 30 MBBS

Printing the shape of dataframe
(6, 3)

Printing the head of dataframe
Name Age Qualifications
0 ABC 20 BA
1 DEF 22 B.Tech
2 GHI 26 B.Tech + MBA
3 JKL 28 CA
4 MNO 29 Intermediate

Printing the tail of dataframe
Name Age Qualifications
1 DEF 22 B.Tech
2 GHI 26 B.Tech + MBA
3 JKL 28 CA
4 MNO 29 Intermediate
5 PQR 30 MBBS

Printing the third row of dataframe
Name GHI
Age 26
Qualifications B.Tech + MBA
Name: 2, dtype: object

Printing the column names of dataframe
Index(['Name', 'Age', 'Qualifications'], dtype='object')

Printing the Data Types of dataframe
Name object
Age int64
Qualifications object
dtype: object
Explanation
In the above example, the DataFrame is created in the same way as in Example 2 of creating a DataFrame from a dictionary of lists. The sequence of output is explained below:
- Firstly the DataFrame is printed, then its shape i.e. the number of rows and columns in a tuple.
- The head() of the DataFrame is the first five rows of the DataFrame.
- The tail() of the DataFrame is the last five rows of the DataFrame.
- df.loc[2] prints the third row of the DataFrame. loc selects rows by index label; because the default labels are zero-based integers, the label 2 refers to the third row.
- df.columns is used to print the column labels of the DataFrame.
- The df.dtypes is used to print the Data types of all the columns in the DataFrame.
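The difference between label-based and position-based selection is easier to see when the row labels are not the default integers. The sketch below uses made-up row labels 'x', 'y', 'z', and also shows that head() and tail() accept the number of rows to return.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['ABC', 'DEF', 'GHI'], 'Age': [20, 22, 26]},
    index=['x', 'y', 'z'],   # custom row labels, chosen for illustration
)

# head() and tail() accept the number of rows to return
print(df.head(2))
print(df.tail(1))

# loc selects by label, iloc selects by integer position
by_label = df.loc['z']
by_position = df.iloc[2]
print(by_label.equals(by_position))

# shape is an attribute, so it is read without parentheses
rows, cols = df.shape
print(rows, cols)
```

Here df.loc[2] would fail, because no row carries the label 2; df.iloc[2] still returns the third row.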
Operations on Columns in a DataFrame
In an existing DataFrame, we can add, rename, and delete columns very easily. All three basic operations on columns are explained below:
Addition of Columns
To add a new column to a DataFrame, create a Series and assign it as a new column to the original DataFrame. The following example demonstrates this:
import pandas as pd
listTwo = [['Alice', 10], ['Bob', '20'], ['Charlie', 30]]
# Calling DataFrame constructor on list
df = pd.DataFrame(listTwo, index=['a', 'b', 'c'], columns=['Name', 'Age'])
# Converting the Age column to float
df['Age'] = df['Age'].astype(float)
print("Printing the original DataFrame")
print(df)
# Creating a pandas Series whose index matches the DataFrame's row labels
states = pd.Series(['New York', 'Seattle', 'Washington'], index=['a', 'b', 'c'])
# Assigning the pandas Series as a new column
df['state'] = states
print("\n\n Printing the modified DataFrame")
print(df)
The output of the above program is:
Printing the original DataFrame
Name Age
a Alice 10.0
b Bob 20.0
c Charlie 30.0

Printing the modified DataFrame
Name Age state
a Alice 10.0 New York
b Bob 20.0 Seattle
c Charlie 30.0 Washington
Explanation:
In the above example, a DataFrame is first created from a list of lists, with the row indices and column names specified in the DataFrame() constructor, and the Age column is converted to float. The DataFrame is then printed.
A pandas Series named states is created with the same index labels ('a', 'b', 'c') as the DataFrame. pandas aligns a Series on its index labels when it is assigned as a column, so a Series left with the default index (0, 1, 2) would match none of the rows and the new column would contain only NaN. The Series is assigned as a new column to the DataFrame, and the modified DataFrame is printed.
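When a Series is assigned as a column, pandas matches it to the rows by index label rather than by position. The following sketch, with a made-up two-row frame and hypothetical column names, compares a Series with the default index, a Series with matching labels, and a plain list.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob']}, index=['a', 'b'])

# Default index (0, 1) shares no labels with the frame, so nothing lines up
unaligned = pd.Series(['New York', 'Seattle'])
df['state_unaligned'] = unaligned

# Matching index labels: each value lands on the intended row
aligned = pd.Series(['New York', 'Seattle'], index=['a', 'b'])
df['state_aligned'] = aligned

# A plain list is assigned by position and needs no index at all
df['state_list'] = ['New York', 'Seattle']
print(df)
```

The first new column contains only NaN, while the other two contain the expected values.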
Renaming of Columns
We can change the name of a single column as well as multiple columns using the .rename() method of pandas. The following example illustrates the renaming of single and multiple columns:
import pandas as pd
df = pd.DataFrame({
'Name' : ['Alice','Bob','Charlie'],
'Age' : [12, 13, 14]
})
print("Printing the original DataFrame")
print(df)
# Renaming of single column 'Name'
df.rename(columns={'Name':'FirstName'}, inplace=True)
print("\n\nPrinting the DataFrame after changing a single column\n")
print(df)
# Renaming of multiple columns
df.rename(columns={'FirstName':'fName', 'Age':'Years'}, inplace=True)
print("\n\nPrinting the DataFrame after changing multiple column names\n")
print(df)
The output is as follows:
Printing the original DataFrame
Name Age
0 Alice 12
1 Bob 13
2 Charlie 14

Printing the DataFrame after changing a single column

FirstName Age
0 Alice 12
1 Bob 13
2 Charlie 14

Printing the DataFrame after changing multiple column names

fName Years
0 Alice 12
1 Bob 13
2 Charlie 14
Explanation:
In the above example, firstly a DataFrame is created using the Dictionary of Lists. The original DataFrame is then printed.
The df.rename() method is used for renaming columns by passing the original and new column names as key:value pairs of a dictionary, columns={'Name': 'FirstName'}. The argument inplace=True specifies that the DataFrame is modified in place, which means the method returns nothing and the original DataFrame itself is changed. The modified DataFrame is then printed.
The df.rename() method is then used again. This time two columns are renamed. Both the original and new column names are passed as key:value pairs in the dictionary, columns={'FirstName': 'fName', 'Age': 'Years'}. The modified DataFrame is then printed.
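For comparison, here is a short sketch of what happens when inplace=True is left out: rename() then returns a new DataFrame and the original keeps its column names. The variable name renamed is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [12, 13]})

# Without inplace=True, rename() returns a new DataFrame
renamed = df.rename(columns={'Name': 'FirstName'})

print(list(df.columns))       # the original is unchanged
print(list(renamed.columns))  # the copy carries the new name
```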
The rename() method is quite handy when renaming a single column or a few columns. However, in the case of large datasets containing hundreds of columns, specifying every old and new name becomes tedious. In such cases, renaming can be done by assigning a list of new column names.
import pandas as pd
df = pd.DataFrame({
'Name' : ['Alice', 'Bob', 'Charlie'],
'Age' : [12, 13, 14]
})
print("Printing the DataFrame column names")
print(df.columns)
# Modifying the column names by passing a list of new column names.
df.columns = ['FName', 'Years']
print("Printing the DataFrame column names after modification")
print(df.columns)
The output is as follows:
Printing the DataFrame column names
Index(['Name', 'Age'], dtype='object')
Printing the DataFrame column names after modification
Index(['FName', 'Years'], dtype='object')
Explanation:
In the above example, a DataFrame is created using a dictionary of lists. The print(df.columns) is used to print the column names of the DataFrame.
The column names are modified by assigning a list of new column names, df.columns = ['FName', 'Years']. The list must contain exactly one name per column, in column order. The modified column names are then printed.
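When many columns need the same kind of change, rename() can also take a function that is applied to every column label. The sketch below assumes made-up column names and a hypothetical helper called tidy that lower-cases each label and replaces spaces with underscores.

```python
import pandas as pd

df = pd.DataFrame({'First Name': ['Alice'], 'Age In Years': [12]})

# Hypothetical helper: lower-case each label and replace spaces
def tidy(label):
    return label.lower().replace(' ', '_')

# The function is called once per column label
df = df.rename(columns=tidy)
print(list(df.columns))
```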
Deletion of Columns
A column can be deleted using the del statement or the pop() method. The following example illustrates the deletion of columns.
import pandas as pd
# Three Lists
name = ['ABC','DEF','GHI','JKL']
age = [20, 22, 26, 28]
qualifications = ['BA', 'B.Tech','B.Tech + MBA', 'CA']
# Defining a dictionary containing the name, age
# and qualifications
data_persons = {
'Name': name,
'Age' : age,
'Qualifications' : qualifications
}
# Converting the dictionary into a DataFrame
df = pd.DataFrame(data_persons)
print(df)
# Using the del statement
print("\n\nDeleting the first column")
del df['Name']
print(df)
# Using the pop function
print("\n\nDeleting the second column")
df.pop('Age')
print(df)
Its output is as follows:
Name Age Qualifications
0 ABC 20 BA
1 DEF 22 B.Tech
2 GHI 26 B.Tech + MBA
3 JKL 28 CA

Deleting the first column
Age Qualifications
0 20 BA
1 22 B.Tech
2 26 B.Tech + MBA
3 28 CA

Deleting the second column
Qualifications
0 BA
1 B.Tech
2 B.Tech + MBA
3 CA
Explanation:
In the above example, first, the DataFrame is created using a Dictionary of lists. The original DataFrame is printed.
A single column is then deleted using del df['Name'], by passing the column name. Again the DataFrame is printed.
Another column is then deleted using df.pop('Age'), by passing the column name. Again the DataFrame is printed.
Apart from the difference in syntax between del and pop(), another difference is that pop() returns the removed column as a Series, whereas del does not return anything.
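The return value of pop() can be seen in the sketch below, which also shows drop() with the columns argument as a way to remove columns without changing the original DataFrame. The data and the City column are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['ABC', 'DEF'], 'Age': [20, 22], 'City': ['X', 'Y']})

# pop() removes the column and hands it back as a Series
ages = df.pop('Age')
print(ages.tolist())

# drop() with columns= returns a new DataFrame and leaves df as it was
without_city = df.drop(columns=['City'])
print(list(df.columns))
print(list(without_city.columns))
```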
Operation on rows in a DataFrame
In an existing DataFrame, we can add and delete rows very easily. Both operations are discussed below:
Insertion of Rows
We can add new rows to the end of a DataFrame by concatenating it with another DataFrame. Older pandas versions offered a DataFrame.append() method for this, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below uses pd.concat(), which works on both old and new versions:
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['col1', 'col2'])
print("Original DataFrame")
print(df1)
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['col1', 'col2'])
# Rows of df2 are added after the rows of df1
df1 = pd.concat([df1, df2])
print("\n\nDataFrame df1 after concatenation with df2")
print(df1)
The output is as follows:
Original DataFrame
col1 col2
0 1 2
1 3 4

DataFrame df1 after concatenation with df2
col1 col2
0 1 2
1 3 4
0 5 6
1 7 8
Explanation
In the above example, a DataFrame, df1, is created using a list of lists, and the column names are also specified. The original DataFrame is then printed.
A new DataFrame, df2, is created in a similar manner. pd.concat() takes a list of DataFrames and stacks their rows in the order given, so the rows of df2 are placed after those of df1. Note that each DataFrame keeps its original row labels, which is why the labels 0 and 1 appear twice in the result. The modified DataFrame is then printed.
The order of the DataFrames can be changed and the row labels renumbered by chaining reset_index(). This is shown in the example below:
import pandas as pd
df1 = pd.DataFrame([[1, 2], [3, 4]], columns = ['col1','col2'])
print("Original DataFrame")
print(df1)
df2 = pd.DataFrame([[5, 6],[7, 8]], columns=['col1','col2'])
df1 = pd.concat([df2, df1]).reset_index(drop = True)
print(df1)
The output is as follows:
Original DataFrame
col1 col2
0 1 2
1 3 4
col1 col2
0 5 6
1 7 8
2 1 2
3 3 4
Explanation
In the above example, first, a DataFrame, df1, is created using a list of lists. The column names are also specified. The original DataFrame is then printed. Another DataFrame, df2, is created in a similar manner.
The concat() function concatenates pandas objects along a particular axis (rows by default). Here df2 is listed first, so its rows come before those of df1, and reset_index(drop=True) replaces the repeated row labels with a fresh zero-based index. The modified DataFrame is then printed.
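As a variation on the example above, the sketch below renumbers the rows directly through the ignore_index argument of concat(), and then adds one more row through loc. The row values are made up, and the loc step assumes the DataFrame has a default integer index.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=['col1', 'col2'])
df2 = pd.DataFrame([[5, 6]], columns=['col1', 'col2'])

# ignore_index=True renumbers the rows 0..n-1 in one step
combined = pd.concat([df1, df2], ignore_index=True)

# With a default integer index, len(combined) is the next unused label,
# so assigning to it through loc appends one row
combined.loc[len(combined)] = [7, 8]
print(combined)
```

For adding many rows, collecting them first and calling concat() once is usually preferable to growing the DataFrame one row at a time.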
Deletion of Rows
Rows can be deleted from a DataFrame by passing their index labels to the .drop() method. This is illustrated in the example below:
import pandas as pd
# Three Lists
name = ['ABC','DEF','GHI','JKL','MNO','PQR']
age = [20, 22, 26, 28, 29, 30]
qualifications = ['BA', 'B.Tech','B.Tech + MBA', 'CA', 'Intermediate', 'MBBS']
# Defining a dictionary containing the name, age
# and qualifications
data_persons = {
'Name': name,
'Age' : age,
'Qualifications' : qualifications
}
# Converting the dictionary into a DataFrame
df = pd.DataFrame(data_persons)
print(df)
# Drop row with label 2
df = df.drop(2)
print("\n\n Printing DataFrame after deletion of a row")
print(df)
The output is as follows:
Name Age Qualifications
0 ABC 20 BA
1 DEF 22 B.Tech
2 GHI 26 B.Tech + MBA
3 JKL 28 CA
4 MNO 29 Intermediate
5 PQR 30 MBBS

Printing DataFrame after deletion of a row
Name Age Qualifications
0 ABC 20 BA
1 DEF 22 B.Tech
3 JKL 28 CA
4 MNO 29 Intermediate
5 PQR 30 MBBS
Explanation
In the above example, first, the DataFrame is created using the dictionary of lists, the DataFrame is then printed.
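The df.drop() method deletes a row based on its index label, which in this example is the default zero-based integer label. Note that the remaining rows keep their original labels, so the label 2 is simply missing from the modified DataFrame. The modified DataFrame is then printed.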
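A few related patterns are sketched below under the same assumptions (default integer labels, made-up data): dropping several rows at once, keeping only the rows that satisfy a condition, and renumbering the rows afterwards. The age cutoff of 21 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['ABC', 'DEF', 'GHI', 'JKL'], 'Age': [20, 22, 26, 28]})

# Drop several rows at once by passing a list of index labels
trimmed = df.drop([0, 2])

# Keep only rows that satisfy a condition (Age above 21, an arbitrary cutoff)
older = df[df['Age'] > 21]

# reset_index(drop=True) renumbers the remaining rows from zero
trimmed = trimmed.reset_index(drop=True)
print(trimmed)
print(older)
```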
Conclusion
This article discusses the DataFrame in Python, how to create one, and various operations on it with examples. It is highly recommended to study these operations and implement them on your own.
Explore other operations as well. The more you explore, the more knowledge you gain!