The Power of Pandas Library in Python

Rina Mondal · Python’s Gurus · May 31, 2024

In this blog, we will study the Pandas library in Python. First of all, to do any machine learning project, we need data. Now, the question is: where can we collect the data from?

Data can exist in numerous formats, ranging from lists and dictionaries to CSV, Excel, JSON, and databases, or you can even design your own format.

Another way is to download datasets from the scikit-learn or seaborn libraries. Both are very popular libraries that provide different types of datasets, from toy datasets to real-life ones.
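
For example, here is a minimal sketch of loading a dataset from seaborn (load_dataset fetches the data from seaborn’s online repository, so it needs an internet connection the first time):

import seaborn as sns

# Load a built-in example dataset directly as a pandas DataFrame
iris = sns.load_dataset("iris")
print(iris.head())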

Now, machine learning models cannot process most raw data sources directly. A common first step is to convert the dataset into a DataFrame. Using the Pandas library, we can create a DataFrame from various types of data.

A DataFrame is a 2D tabular data structure with labeled axes (rows and columns).

First, to perform any Pandas operation, you have to import pandas:

import pandas as pd  # pd is the conventional alias, just like a nickname

Let’s understand how to download datasets from the sklearn library (a complete Pandas explanation is provided on my YouTube channel):

import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()

# Create a DataFrame from the features
df_features = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Create a DataFrame for the target column
df_target = pd.DataFrame(data=iris.target, columns=['target'])

# Concatenate the feature and target DataFrames horizontally
df_combined = pd.concat([df_features, df_target], axis=1)

# Display the combined DataFrame
print(df_combined.head())

Let’s dive into creating DataFrames from various data formats commonly encountered in data analysis and machine learning projects.

1. Creating a DataFrame from a Dictionary: A dictionary can easily be converted into a DataFrame. Each key in the dictionary becomes a column name, and the associated value becomes the column data.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

2. Creating a DataFrame from a List of Dictionaries: Each dictionary in the list represents a row of data.

data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)

3. Creating a DataFrame from a List of Lists: You can specify the column names when creating a DataFrame from a list of lists.

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

4. Creating a DataFrame from a NumPy Array: Pandas integrates well with NumPy, allowing you to create DataFrames directly from NumPy arrays.

import numpy as np

# Note: a NumPy array holds a single dtype, so mixing strings and numbers
# here makes every value (including Age) a string in the resulting DataFrame
data = np.array([
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
])
df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

5. Creating a DataFrame from a CSV File: Pandas can read data from a variety of file formats, including CSV files.

df = pd.read_csv('data.csv')
print(df)

6. Creating a DataFrame from an Excel File: Pandas can also read data from Excel files.

df = pd.read_excel('data.xlsx')
print(df)

Now, let’s understand different functions that can be applied with Pandas.

1. pd.DataFrame(): Creates a DataFrame from data.

import pandas as pd

data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)  # create the DataFrame
df  # display the DataFrame

2. read_csv(): If the data is already in CSV format, this function reads it into a DataFrame.

gold_data = pd.read_csv("gld_price_data.csv")  # the gold_data DataFrame is created from the gld_price_data.csv file
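
read_csv() also accepts many useful parameters. A small sketch (the column names below are assumptions for illustration and may not match the actual file):

gold_data = pd.read_csv(
    "gld_price_data.csv",
    usecols=["Date", "GLD"],  # read only selected columns (assumed names)
    parse_dates=["Date"],     # parse the Date column as datetime
    nrows=100                 # read only the first 100 rows
)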

3. head() and tail(): Display the first and last rows of your DataFrame, respectively.

data = {
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

df.head(1)  # display the first row
df.tail(1)  # display the last row

4. info(): Provides a concise summary of your dataset, including data types, non-null counts, and memory usage.

# Getting some basic information about the data
df.info()  # uses the DataFrame from the previous example

# Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       4 non-null      int64
 1   B       4 non-null      int64
 2   C       4 non-null      int64
dtypes: int64(3)
memory usage: 228.0 bytes

5. describe(): Extracts valuable information such as the mean, standard deviation, and quartiles for numerical columns.

# Use describe to get summary statistics of the DataFrame
summary_stats = df.describe()

# Display the result
print("Summary Statistics:\n", summary_stats)

# Output:
Summary Statistics:
               A         B          C
count  4.000000  4.000000   4.000000
mean   2.500000  6.500000  10.500000
std    1.290994  1.290994   1.290994
min    1.000000  5.000000   9.000000
25%    1.750000  5.750000   9.750000
50%    2.500000  6.500000  10.500000
75%    3.250000  7.250000  11.250000
max    4.000000  8.000000  12.000000

6. value_counts(): Explore categorical variables effortlessly using value_counts(). It calculates the frequency distribution of unique values.

# Create a sample Series
data = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'B'])

# Use value_counts to count the occurrences of each unique value in the Series
value_counts_result = data.value_counts()

# Display the result
print("Value Counts:\n", value_counts_result)

# Output:
Value Counts:
A    4
B    4
C    2
Name: count, dtype: int64
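
If you want relative frequencies instead of raw counts, value_counts() accepts a normalize flag. A quick sketch using the same Series:

# Proportions instead of counts
proportions = data.value_counts(normalize=True)
print(proportions)  # A: 0.4, B: 0.4, C: 0.2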

7. groupby(): The power of grouping. Segment and analyze your data by specific columns so that advanced aggregation and transformation operations can be performed per group.

# Create a sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'B'],
    'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
}
df = pd.DataFrame(data)

# Group the DataFrame by the 'Category' column
grouped = df.groupby('Category')

# Calculate the mean value for each group
mean_values = grouped.mean()

# Display the result
print("Mean values for each category:\n", mean_values)

# Output:
Mean values for each category:
           Value
Category
A           27.5
B           37.5
C           32.5
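
groupby() is not limited to a single statistic; agg() computes several aggregations in one pass. A short sketch on the same grouped data:

# Compute several statistics per category at once
stats = df.groupby('Category')['Value'].agg(['mean', 'min', 'max', 'count'])
print(stats)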

8. merge() and concat(): Combine datasets using merge() and concat().

# Create two sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'ID': [4, 5, 6],
    'Name': ['David', 'Eve', 'Frank']
})

# Concatenation using the concat function
concatenated_df = pd.concat([df1, df2], ignore_index=True)

# Display the concatenated DataFrame
print("Concatenated DataFrame:\n", concatenated_df)

# Create another DataFrame for merging
df3 = pd.DataFrame({
    'ID': [2, 4, 6],
    'Score': [85, 90, 78]
})

# Merge using the merge function
merged_df = pd.merge(concatenated_df, df3, on='ID', how='right')

# Display the merged DataFrame
print("\nMerged DataFrame:\n", merged_df)

# Output:
Concatenated DataFrame:
    ID     Name
0    1    Alice
1    2      Bob
2    3  Charlie
3    4    David
4    5      Eve
5    6    Frank

Merged DataFrame:
    ID   Name  Score
0    2    Bob     85
1    4  David     90
2    6  Frank     78
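
Note that merge() defaults to how='inner', which keeps only keys present in both frames; 'left', 'right', and 'outer' control which side’s keys survive. A quick sketch with the frames above:

# With how='inner', only IDs found in BOTH DataFrames are kept
inner_df = pd.merge(concatenated_df, df3, on='ID', how='inner')
print(inner_df)  # same three rows here, since every ID in df3 also appears in concatenated_df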

9. apply(): Applies a function along an axis of a DataFrame, or element-wise over a Series.

# Create a sample DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [10, 11, 12, 13]
})

# Define a custom function
def square_root(x):
    return x ** 0.5

# Apply the custom function to a specific column in the DataFrame
result_series = df['A'].apply(square_root)

# Display the result
print("Result Series:\n", result_series)

# Output:
Result Series:
0    1.000000
1    1.414214
2    1.732051
3    2.000000
Name: A, dtype: float64
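
apply() also works row-wise on a whole DataFrame when you pass axis=1. A small sketch on the same df:

# Sum the three columns for each row
row_sums = df.apply(lambda row: row['A'] + row['B'] + row['C'], axis=1)
print(row_sums)  # 16, 19, 22, 25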

10. set_index() and reset_index(): Use set_index() to set a specific column as the DataFrame index and reset_index() to revert to the default integer index.

# Create a sample DataFrame
data = {
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22]
}
df = pd.DataFrame(data)

# Set the 'ID' column as the index
df.set_index('ID', inplace=True)

# Display the DataFrame with the new index
print("DataFrame with 'ID' as index:\n", df)

# Output:
DataFrame with 'ID' as index:
       Name  Age
ID
1     Alice   25
2       Bob   30
3   Charlie   22

# Reset the index, adding a new default integer index
df.reset_index(inplace=True)

# Display the DataFrame with the reset index
print("DataFrame with reset index:\n", df)

# Output:
DataFrame with reset index:
   ID     Name  Age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   22

11. loc[] and iloc[]:
loc[]: label-based indexing.

iloc[]: integer-position-based indexing.

These indexers allow you to access specific rows and columns with ease.

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Set 'Name' column as the index
df.set_index('Name', inplace=True)

# Display the original DataFrame
print("Original DataFrame:\n", df)

# Use loc[] to select data based on labels
selected_data_loc = df.loc[['Alice', 'Charlie'], ['Age', 'City']]

# Use iloc[] to select data based on integer indices
selected_data_iloc = df.iloc[0:3, 0:2]

# Display the results
print("\nSelected Data using loc[]:\n", selected_data_loc)
print("\nSelected Data using iloc[]:\n", selected_data_iloc)
# Output:
Selected Data using loc[]:
         Age         City
Name
Alice     25     New York
Charlie   22  Los Angeles

Selected Data using iloc[]:
         Age           City
Name
Alice     25       New York
Bob       30  San Francisco
Charlie   22    Los Angeles
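
loc[] also accepts boolean masks, which makes filtering very natural. A quick sketch on the same df:

# Select the Age and City of everyone older than 25
older_than_25 = df.loc[df['Age'] > 25, ['Age', 'City']]
print(older_than_25)  # Bob and David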

12. drop(): Removes unwanted rows or columns from a DataFrame.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Set 'Name' column as the index
df.set_index('Name', inplace=True)

# Display the original DataFrame
print("Original DataFrame:\n", df)

# Drop the row with index label 'Bob'
df = df.drop('Bob', axis=0)

# Display the DataFrame after dropping the row
print("\nDataFrame after dropping 'Bob':\n", df)
# Output:
Original DataFrame:
         Age           City
Name
Alice     25       New York
Bob       30  San Francisco
Charlie   22    Los Angeles
David     28        Chicago

DataFrame after dropping 'Bob':
         Age         City
Name
Alice     25     New York
Charlie   22  Los Angeles
David     28      Chicago
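
drop() removes columns in the same way; pass axis=1 or use the columns keyword. A short sketch on the same df:

# Drop the 'City' column (both forms are equivalent)
df_no_city = df.drop('City', axis=1)
df_no_city = df.drop(columns=['City'])
print(df_no_city)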

13. isnull() and notnull():
isnull(): identifies where values are null (NaN).

notnull(): identifies where values are not null.

import pandas as pd

# Create a sample DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', None]
}

df = pd.DataFrame(data)

# Check for null values in the DataFrame
null_values = df.isnull()

# Check for non-null values in the DataFrame
notnull_values = df.notnull()

# Display the original DataFrame and the results
print("Original DataFrame:\n", df)
print("\nDataFrame indicating Null Values:\n", null_values)
print("\nDataFrame indicating Non-null Values:\n", notnull_values)
# Output:
Original DataFrame:
    Name   Age           City
0  Alice  25.0       New York
1    Bob   NaN  San Francisco
2   None  22.0    Los Angeles
3  David  28.0           None

DataFrame indicating Null Values:
    Name    Age   City
0  False  False  False
1  False   True  False
2   True  False  False
3  False  False   True

DataFrame indicating Non-null Values:
    Name    Age   City
0   True   True   True
1   True  False   True
2  False   True   True
3   True   True  False
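
A very common follow-up is to count the missing values per column by chaining sum() onto isnull(). A quick sketch:

# Number of nulls in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)  # Name: 1, Age: 1, City: 1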

14. sort_values(): Sorts a DataFrame or Series by the values along a specified axis.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Sort the DataFrame by the 'Age' column in ascending order
df_sorted_age_asc = df.sort_values(by='Age')

# Display the original DataFrame and the sorted DataFrame
print("Original DataFrame:\n", df)
print("\nDataFrame sorted by Age in ascending order:\n", df_sorted_age_asc)

# Output:
Original DataFrame:
      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   22    Los Angeles
3    David   28        Chicago

DataFrame sorted by Age in ascending order:
      Name  Age           City
2  Charlie   22    Los Angeles
0    Alice   25       New York
3    David   28        Chicago
1      Bob   30  San Francisco
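
sort_values() can also sort in descending order, or by several columns at once. A short sketch on the same df:

# Sort descending by Age
df_desc = df.sort_values(by='Age', ascending=False)

# Sort by City ascending, then by Age descending within each city
df_multi = df.sort_values(by=['City', 'Age'], ascending=[True, False])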

15. pivot_table() : Transform your data with pivot_table(). This function enables you to reshape your DataFrame, aggregating and summarizing information to gain a different perspective on your data.

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Math', 'Math', 'Physics', 'Physics', 'Physics'],
    'Score': [85, 90, 75, 88, 92, 89, 78]
}

df = pd.DataFrame(data)

# Use pivot_table to create a pivot table
pivot_table_result = pd.pivot_table(df, values='Score', index='Name', columns='Subject')

# Display the original DataFrame and the pivot table result
print("Original DataFrame:\n", df)
print("\nPivot Table Result:\n", pivot_table_result)


# Output:
Original DataFrame:
      Name  Subject  Score
0    Alice     Math     85
1      Bob     Math     90
2  Charlie     Math     75
3    David     Math     88
4    Alice  Physics     92
5      Bob  Physics     89
6  Charlie  Physics     78

Pivot Table Result:
Subject  Math  Physics
Name
Alice    85.0     92.0
Bob      90.0     89.0
Charlie  75.0     78.0
David    88.0      NaN
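
By default, pivot_table() aggregates with the mean; the aggfunc parameter changes that, and fill_value replaces the NaN cells. A quick sketch on the same df:

# Take the maximum score per cell and fill missing cells with 0
pivot_max = pd.pivot_table(df, values='Score', index='Name',
                           columns='Subject', aggfunc='max', fill_value=0)
print(pivot_max)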

16. duplicated() and drop_duplicates(): Identify and manage duplicate rows with duplicated() and drop_duplicates(). Keep your dataset clean and avoid skewed analysis results.

import pandas as pd

# Sample DataFrame with duplicate rows
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 35, 25, 40, 30],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'New York', 'Chicago', 'San Francisco']
}

df = pd.DataFrame(data)

# Identify duplicate rows using duplicated()
duplicates_mask = df.duplicated()

# Display original DataFrame and duplicated rows
print("Original DataFrame:")
print(df)

print("\nDuplicated Rows:")
print(df[duplicates_mask])

# Remove duplicate rows using drop_duplicates()
df_no_duplicates = df.drop_duplicates()

# Display DataFrame after removing duplicates
print("\nDataFrame after Removing Duplicates:")
print(df_no_duplicates)
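
Both functions accept a subset of columns to consider and a keep argument to choose which occurrence survives. A small sketch on the same df:

# Judge duplicates by Name only, and keep the LAST occurrence of each
df_unique_names = df.drop_duplicates(subset=['Name'], keep='last')
print(df_unique_names)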

17. astype(): Convert data types efficiently using astype(). Ensure that your columns have the appropriate data type for analysis and visualization purposes.

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': ['25', '30', '35'],
    'Salary': ['50000', '60000', '70000'],
}

df = pd.DataFrame(data)

# Display the original DataFrame with data types
print("Original DataFrame:")
print(df.dtypes)

# Convert 'Age' and 'Salary' columns to numeric using astype()
df['Age'] = df['Age'].astype(int)
df['Salary'] = df['Salary'].astype(float)

# Display the DataFrame after data type conversion
print("\nDataFrame after Data Type Conversion:")
print(df.dtypes)
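
Note that astype() raises an error if any value cannot be converted. When the data may be dirty, pd.to_numeric with errors='coerce' is a more forgiving alternative. A self-contained sketch:

# Unparseable entries become NaN instead of raising an error
dirty = pd.Series(['10', '20', 'oops'])
clean = pd.to_numeric(dirty, errors='coerce')  # 10.0, 20.0, NaN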

18. to_datetime(): Transform string representations of dates into datetime objects using to_datetime(). This is crucial for time-series analysis and plotting.

import pandas as pd

# Sample DataFrame with string representations of dates
data = {
    'Date': ['2022-01-01', '2022-02-15', '2022-03-20'],
    'Value': [10, 15, 20],
}

df = pd.DataFrame(data)

# Display the original DataFrame with data types
print("Original DataFrame:")
print(df.dtypes)

# Convert the 'Date' column to datetime using to_datetime()
df['Date'] = pd.to_datetime(df['Date'])

# Display the DataFrame after datetime conversion
print("\nDataFrame after Datetime Conversion:")
print(df.dtypes)
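
Once a column is a datetime, the .dt accessor exposes its components, which is where the conversion really pays off. A short sketch on the same df:

# Extract parts of the dates
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month_name()
print(df)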

19. corr(): Understand the relationships between variables by calculating correlations with corr(). This function is essential for exploring patterns and dependencies in your data.

import pandas as pd

# Sample DataFrame with numeric variables
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6],
}

df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Calculate correlations between variables using corr()
correlation_matrix = df.corr()

# Display the correlation matrix
print("\nCorrelation Matrix:")
print(correlation_matrix)
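
You can also correlate just two specific columns with Series.corr(), or switch to a rank-based method. A quick sketch on the same df:

# Pearson correlation between two columns (exactly -1.0 for this data)
print(df['A'].corr(df['B']))

# Spearman (rank-based) correlation matrix
print(df.corr(method='spearman'))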

Converting any DataFrame to an Excel file:

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Specify the Excel file path (you can change it to your desired path)
excel_file_path = 'example.xlsx'

# Write the DataFrame to an Excel file
# (writing .xlsx files requires the openpyxl package to be installed)
df.to_excel(excel_file_path, index=False)

print(f"DataFrame has been exported to {excel_file_path}")

In this journey through essential Pandas functions, we’ve only scratched the surface of what this powerful library can offer. By mastering these functions, you’ll gain the skills needed to efficiently clean, explore, and analyze diverse datasets, paving the way for impactful data-driven insights.

Remember that the key to becoming a proficient data analyst lies in practice and continuous learning. Happy coding!


If you found this guide helpful, why not show some love? Give it a clap 👏, and if you have questions or topics you’d like to explore further, drop a comment 💬 below 👇
