The Power of Pandas Library in Python

Rina Mondal · Python’s Gurus · May 31, 2024

In this blog, we will study the Pandas library in Python. First of all, to do any machine learning project, we need data. Now, the question is: where can we collect the data from?

Data can exist in numerous formats, ranging from lists and dictionaries to CSV, Excel, JSON, and databases, or you can even design your own format.

Another way is to download datasets from the scikit-learn or seaborn libraries. Both are very popular libraries that provide different types of datasets, from toy datasets to real-life ones.
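
For example, here is a minimal sketch of loading a dataset from seaborn (load_dataset fetches the data from seaborn’s online repository, so it needs an internet connection the first time):

import seaborn as sns

# Load a built-in example dataset directly as a pandas DataFrame
iris = sns.load_dataset("iris")
print(iris.head())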

Now, machine learning models cannot process most raw data sources directly. A common first step is to convert the dataset into a DataFrame. Using the Pandas library, we can create a DataFrame from various types of data.

A DataFrame is a 2D tabular data structure with labeled axes (rows and columns).

First, to perform any Pandas operation, you have to import pandas:

import pandas as pd  # pd is the conventional alias, just like a nickname

Let’s understand how to download datasets from the sklearn library (a complete Pandas explanation is provided on my YouTube channel):

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame from the features
df_features = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Create a DataFrame for the target column
df_target = pd.DataFrame(data=iris.target, columns=['target'])

# Concatenate the feature and target DataFrames horizontally
df_combined = pd.concat([df_features, df_target], axis=1)

# Display the combined DataFrame
print(df_combined.head())

Let’s dive into creating DataFrames from various data formats commonly encountered in data analysis and machine learning projects.

1. Creating a DataFrame from a Dictionary: A dictionary can easily be converted into a DataFrame. Each key in the dictionary becomes a column name, and the associated value becomes the column data.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

2. Creating a DataFrame from a List of Dictionaries: Each dictionary in the list represents a row of data.

data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)

3. Creating a DataFrame from a List of Lists: You can specify the column names when creating a DataFrame from a list of lists.

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

4. Creating a DataFrame from a NumPy Array: Pandas integrates well with NumPy, allowing you to create DataFrames directly from NumPy arrays.

import numpy as np

# Note: a NumPy array holds a single dtype, so mixing strings and numbers
# here makes every value (including Age) a string in the resulting DataFrame
data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

5. Creating a DataFrame from a CSV File: Pandas can read data from a variety of file formats, including CSV files.

df = pd.read_csv('data.csv')
print(df)

6. Creating a DataFrame from an Excel File: Pandas can also read data from Excel files.

df = pd.read_excel('data.xlsx')
print(df)

Now, let’s understand different functions that can be applied with Pandas.

1. pd.DataFrame(): Creates a DataFrame from data.

import pandas as pd

data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)  # create the DataFrame
df  # display the DataFrame

2. read_csv(): If the data is already in CSV format, this function reads it into a DataFrame.

gold_data = pd.read_csv("gld_price_data.csv")  # the gold_data DataFrame is created from the gld_price_data.csv file
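
read_csv() also accepts many useful parameters. A small sketch (the column names below are assumptions for illustration and may not match the actual file):

gold_data = pd.read_csv(
    "gld_price_data.csv",
    usecols=["Date", "GLD"],  # read only selected columns (assumed names)
    parse_dates=["Date"],     # parse the Date column as datetime
    nrows=100                 # read only the first 100 rows
)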

3. head() and tail(): Display the first and last rows of your DataFrame, respectively.

data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

df.head(1)  # display the first row
df.tail(1)  # display the last row

4. info(): Provides a concise summary of your dataset, including data types, non-null counts, and memory usage.

# Getting some basic information about the data
df.info()  # uses the DataFrame from the previous example

# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       4 non-null      int64
 1   B       4 non-null      int64
 2   C       4 non-null      int64
dtypes: int64(3)
memory usage: 228.0 bytes

5. describe(): Extracts valuable information such as the mean, standard deviation, and quartiles for numerical columns.

# Use describe to get summary statistics of the DataFrame
summary_stats = df.describe()

# Display the result
print("Summary Statistics:\n", summary_stats)

# Output:
Summary Statistics:
               A         B          C
count  4.000000  4.000000   4.000000
mean   2.500000  6.500000  10.500000
std    1.290994  1.290994   1.290994
min    1.000000  5.000000   9.000000
25%    1.750000  5.750000   9.750000
50%    2.500000  6.500000  10.500000
75%    3.250000  7.250000  11.250000
max    4.000000  8.000000  12.000000

6. value_counts(): Explore categorical variables effortlessly using value_counts(). It calculates the frequency distribution of unique values.

# Create a sample Series
data = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'B'])

# Use value_counts to count the occurrences of each unique value in the Series
value_counts_result = data.value_counts()

# Display the result
print("Value Counts:\n", value_counts_result)

# Output:
Value Counts:
A    4
B    4
C    2
Name: count, dtype: int64
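
If you want relative frequencies instead of raw counts, value_counts() accepts a normalize flag. A quick sketch using the same Series:

# Proportions instead of counts
proportions = data.value_counts(normalize=True)
print(proportions)  # A: 0.4, B: 0.4, C: 0.2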

7. groupby(): The power of grouping. Segment and analyze your data by specific columns so that advanced aggregation and transformation operations can be performed per group.

# Create a sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'B'],
    'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
}
df = pd.DataFrame(data)

# Group the DataFrame by the 'Category' column
grouped = df.groupby('Category')

# Calculate the mean value for each group
mean_values = grouped.mean()

# Display the result
print("Mean values for each category:\n", mean_values)

# Output:
Mean values for each category:
           Value
Category
A           27.5
B           37.5
C           32.5
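
groupby() is not limited to a single statistic; agg() computes several aggregations in one pass. A short sketch on the same grouped data:

# Compute several statistics per category at once
stats = df.groupby('Category')['Value'].agg(['mean', 'min', 'max', 'count'])
print(stats)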

8. merge() and concat(): Combine datasets using merge() and concat().

# Create two sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'ID': [4, 5, 6],
    'Name': ['David', 'Eve', 'Frank']
})

# Concatenation using the concat function
concatenated_df = pd.concat([df1, df2], ignore_index=True)

# Display the concatenated DataFrame
print("Concatenated DataFrame:\n", concatenated_df)

# Create another DataFrame for merging
df3 = pd.DataFrame({
    'ID': [2, 4, 6],
    'Score': [85, 90, 78]
})

# Merge using the merge function
merged_df = pd.merge(concatenated_df, df3, on='ID', how='right')

# Display the merged DataFrame
print("\nMerged DataFrame:\n", merged_df)

# Output:
Concatenated DataFrame:
    ID     Name
0    1    Alice
1    2      Bob
2    3  Charlie
3    4    David
4    5      Eve
5    6    Frank

Merged DataFrame:
    ID   Name  Score
0    2    Bob     85
1    4  David     90
2    6  Frank     78
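
Note that merge() defaults to how='inner', which keeps only keys present in both frames; 'left', 'right', and 'outer' control which side’s keys survive. A quick sketch with the frames above:

# With how='inner', only IDs found in BOTH DataFrames are kept
inner_df = pd.merge(concatenated_df, df3, on='ID', how='inner')
print(inner_df)  # same three rows here, since every ID in df3 also appears in concatenated_df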

9. apply(): Applies a function along an axis of a DataFrame, or element-wise over a Series.

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [10, 11, 12, 13]
})

# Define a custom function
def square_root(x):
    return x ** 0.5

# Apply the custom function to a specific column in the DataFrame
result_series = df['A'].apply(square_root)

# Display the result
print("Result Series:\n", result_series)

# Output:
Result Series:
0    1.000000
1    1.414214
2    1.732051
3    2.000000
Name: A, dtype: float64
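
apply() also works row-wise on a whole DataFrame when you pass axis=1. A small sketch on the same df:

# Sum the three columns for each row
row_sums = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)
print(row_sums)  # 16, 19, 22, 25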

10. set_index() and reset_index(): Use set_index() to set a specific column as the DataFrame index and reset_index() to revert to the default integer index.

# Create a sample DataFrame
data = {
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22]
}
df = pd.DataFrame(data)

# Set the 'ID' column as the index
df.set_index('ID', inplace=True)

# Display the DataFrame with the new index
print("DataFrame with 'ID' as index:\n", df)

# Output:
DataFrame with 'ID' as index:
       Name  Age
ID
1     Alice   25
2       Bob   30
3   Charlie   22

# Reset the index, adding a new default integer index
df.reset_index(inplace=True)

# Display the DataFrame with the reset index
print("DataFrame with reset index:\n", df)

# Output:
DataFrame with reset index:
   ID     Name  Age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   22

11. loc[] and iloc[]:
loc[]: label-based indexing.

iloc[]: integer-position-based indexing.

These indexers allow you to access specific rows and columns with ease.

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Set 'Name' column as the index
df.set_index('Name', inplace=True)

# Display the original DataFrame
print("Original DataFrame:\n", df)

# Use loc[] to select data based on labels
selected_data_loc = df.loc[['Alice', 'Charlie'], ['Age', 'City']]

# Use iloc[] to select data based on integer indices
selected_data_iloc = df.iloc[0:3, 0:2]

# Display the results
print("\nSelected Data using loc[]:\n", selected_data_loc)
print("\nSelected Data using iloc[]:\n", selected_data_iloc)
# Output:
Selected Data using loc[]:
         Age         City
Name
Alice     25     New York
Charlie   22  Los Angeles

Selected Data using iloc[]:
         Age           City
Name
Alice     25       New York
Bob       30  San Francisco
Charlie   22    Los Angeles
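
loc[] also accepts boolean masks, which makes filtering very natural. A quick sketch on the same df:

# Select the Age and City of everyone older than 25
older_than_25 = df.loc[df['Age'] > 25, ['Age', 'City']]
print(older_than_25)  # Bob and David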

12. drop(): Removes unwanted rows or columns from a DataFrame.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Set 'Name' column as the index
df.set_index('Name', inplace=True)

# Display the original DataFrame
print("Original DataFrame:\n", df)

# Drop the row with index label 'Bob'
df = df.drop('Bob', axis=0)

# Display the DataFrame after dropping the row
print("\nDataFrame after dropping 'Bob':\n", df)
# Output:
Original DataFrame:
         Age           City
Name
Alice     25       New York
Bob       30  San Francisco
Charlie   22    Los Angeles
David     28        Chicago

DataFrame after dropping 'Bob':
         Age         City
Name
Alice     25     New York
Charlie   22  Los Angeles
David     28      Chicago
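
drop() removes columns in the same way; pass axis=1 or use the columns keyword. A short sketch on the same df:

# Drop the 'City' column (both forms are equivalent)
df_no_city = df.drop('City', axis=1)
df_no_city = df.drop(columns=['City'])
print(df_no_city)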

13. isnull() and notnull():
isnull(): identifies where values are null (NaN).

notnull(): identifies where values are not null.

import pandas as pd

# Create a sample DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', None]
}

df = pd.DataFrame(data)

# Check for null values in the DataFrame
null_values = df.isnull()

# Check for non-null values in the DataFrame
notnull_values = df.notnull()

# Display the original DataFrame and the results
print("Original DataFrame:\n", df)
print("\nDataFrame indicating Null Values:\n", null_values)
print("\nDataFrame indicating Non-null Values:\n", notnull_values)
# Output:
Original DataFrame:
    Name   Age           City
0  Alice  25.0       New York
1    Bob   NaN  San Francisco
2   None  22.0    Los Angeles
3  David  28.0           None

DataFrame indicating Null Values:
    Name    Age   City
0  False  False  False
1  False   True  False
2   True  False  False
3  False  False   True

DataFrame indicating Non-null Values:
    Name    Age   City
0   True   True   True
1   True  False   True
2  False   True   True
3   True   True  False
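
A very common follow-up is to count the missing values per column by chaining sum() onto isnull(). A quick sketch:

# Number of nulls in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)  # Name: 1, Age: 1, City: 1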

14. sort_values(): Sorts a DataFrame or Series by the values along a specified axis.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Sort the DataFrame by the 'Age' column in ascending order
df_sorted_age_asc = df.sort_values(by='Age')

# Display the original DataFrame and the sorted DataFrame
print("Original DataFrame:\n", df)
print("\nDataFrame sorted by Age in ascending order:\n", df_sorted_age_asc)

# Output:
Original DataFrame:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles
3    David   28        Chicago

DataFrame sorted by Age in ascending order:
      Name  Age           City
2  Charlie   22    Los Angeles
0    Alice   25       New York
3    David   28        Chicago
1      Bob   30  San Francisco
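
sort_values() can also sort in descending order, or by several columns at once. A short sketch on the same df:

# Sort descending by Age
df_desc = df.sort_values(by='Age', ascending=False)

# Sort by City ascending, then by Age descending within each city
df_multi = df.sort_values(by=['City', 'Age'], ascending=[True, False])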

15. pivot_table() : Transform your data with pivot_table(). This function enables you to reshape your DataFrame, aggregating and summarizing information to gain a different perspective on your data.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Math', 'Physics', 'Physics', 'Physics'],
    'Score': [85, 90, 75, 88, 92, 89, 78]
}

df = pd.DataFrame(data)

# Use pivot_table to create a pivot table
pivot_table_result = pd.pivot_table(df, values='Score', index='Name', columns='Subject')

# Display the original DataFrame and the pivot table result
print("Original DataFrame:\n", df)
print("\nPivot Table Result:\n", pivot_table_result)


# Output:
Original DataFrame:
      Name  Subject  Score
0    Alice     Math     85
1      Bob     Math     90
2  Charlie     Math     75
3    David     Math     88
4    Alice  Physics     92
5      Bob  Physics     89
6  Charlie  Physics     78

Pivot Table Result:
Subject  Math  Physics
Name
Alice    85.0     92.0
Bob      90.0     89.0
Charlie  75.0     78.0
David    88.0      NaN
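
By default, pivot_table() aggregates with the mean; the aggfunc parameter changes that, and fill_value replaces the NaN cells. A quick sketch on the same df:

# Take the maximum score per cell and fill missing cells with 0
pivot_max = pd.pivot_table(df, values='Score', index='Name',
                           columns='Subject', aggfunc='max', fill_value=0)
print(pivot_max)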

16. duplicated() and drop_duplicates(): Identify and manage duplicate rows with duplicated() and drop_duplicates(). Keep your dataset clean and avoid skewed analysis results.

import pandas as pd

# Sample DataFrame with duplicate rows
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 35, 25, 40, 30],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Chicago', 'San Francisco']
}

df = pd.DataFrame(data)

# Identify duplicate rows using duplicated()
duplicates_mask = df.duplicated()

# Display original DataFrame and duplicated rows
print("Original DataFrame:")
print(df)

print("\nDuplicated Rows:")
print(df[duplicates_mask])

# Remove duplicate rows using drop_duplicates()
df_no_duplicates = df.drop_duplicates()

# Display DataFrame after removing duplicates
print("\nDataFrame after Removing Duplicates:")
print(df_no_duplicates)
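
Both functions accept a subset of columns to consider and a keep argument to choose which occurrence survives. A small sketch on the same df:

# Judge duplicates by Name only, and keep the LAST occurrence of each
df_unique_names = df.drop_duplicates(subset=['Name'], keep='last')
print(df_unique_names)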

17. astype(): Convert data types efficiently using astype(). Ensure that your columns have the appropriate data type for analysis and visualization purposes.

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', '30', '35'],
    'Salary': ['50000', '60000', '70000'],
}

df = pd.DataFrame(data)

# Display the original DataFrame with data types
print("Original DataFrame:")
print(df.dtypes)

# Convert 'Age' and 'Salary' columns to numeric using astype()
df['Age'] = df['Age'].astype(int)
df['Salary'] = df['Salary'].astype(float)

# Display the DataFrame after data type conversion
print("\nDataFrame after Data Type Conversion:")
print(df.dtypes)
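
Note that astype() raises an error if any value cannot be converted. When the data may be dirty, pd.to_numeric with errors='coerce' is a more forgiving alternative. A self-contained sketch:

# Unparseable entries become NaN instead of raising an error
dirty = pd.Series(['10', '20', 'oops'])
clean = pd.to_numeric(dirty, errors='coerce')  # 10.0, 20.0, NaN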

18. to_datetime(): Transform string representations of dates into datetime objects using to_datetime(). This is crucial for time-series analysis and plotting.

import pandas as pd

# Sample DataFrame with string representations of dates
data = {
    'Date': ['2022-01-01', '2022-02-15', '2022-03-20'],
    'Value': [10, 15, 20],
}

df = pd.DataFrame(data)

# Display the original DataFrame with data types
print("Original DataFrame:")
print(df.dtypes)

# Convert the 'Date' column to datetime using to_datetime()
df['Date'] = pd.to_datetime(df['Date'])

# Display the DataFrame after datetime conversion
print("\nDataFrame after Datetime Conversion:")
print(df.dtypes)
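
Once a column is a datetime, the .dt accessor exposes its components, which is where the conversion really pays off. A short sketch on the same df:

# Extract parts of the dates
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month_name()
print(df)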

19. corr(): Understand the relationships between variables by calculating correlations with corr(). This function is essential for exploring patterns and dependencies in your data.

import pandas as pd

# Sample DataFrame with numeric variables
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6],
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Calculate correlations between variables using corr()
correlation_matrix = df.corr()

# Display the correlation matrix
print("\nCorrelation Matrix:")
print(correlation_matrix)
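
You can also correlate just two specific columns with Series.corr(), or switch to a rank-based method. A quick sketch on the same df:

# Pearson correlation between two columns (exactly -1.0 for this data)
print(df['A'].corr(df['B']))

# Spearman (rank-based) correlation matrix
print(df.corr(method='spearman'))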

Converting any DataFrame to an Excel file:

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Specify the Excel file path (you can change it to your desired path)
excel_file_path = 'example.xlsx'

# Write the DataFrame to an Excel file
# (writing .xlsx files requires the openpyxl package to be installed)
df.to_excel(excel_file_path, index=False)

print(f"DataFrame has been exported to {excel_file_path}")

In this journey through essential Pandas functions, we’ve only scratched the surface of what this powerful library can offer. By mastering these functions, you’ll gain the skills needed to efficiently clean, explore, and analyze diverse datasets, paving the way for impactful data-driven insights.

Remember that the key to becoming a proficient data analyst lies in practice and continuous learning. Happy coding!


If you found this guide helpful, why not show some love? Give it a clap 👏, and if you have questions or topics you’d like to explore further, drop a comment 💬 below 👇
