Data Wrangling in Pandas vs SQL: A Comprehensive Comparison

shubham badaya
5 min read · Jan 13, 2024


I frequently move between SQL and Pandas in my work. Some days I write queries directly against the database; on other days I work with flat files. Although I'm comfortable with SQL, when working with flat files in pandas I sometimes hesitate over how to translate a familiar SQL operation into its pandas equivalent. So I'm putting together a guide to help beginners in Pandas, especially those transitioning from a SQL background.

To lay the groundwork for this guide, let’s first establish a fundamental understanding.

There are multiple ways to perform each of the operations below; I am showing the most basic ones.

Why do we need pandas when we can run queries on a database?

  1. Pandas holds data in memory, so once the data is loaded, querying it is very fast, whereas in SQL every query makes a round trip to the database.
  2. In a data analysis project we need to view results, iterate, and visualize, and pandas makes that loop easy.
  3. When the data is too large to fit in memory, use SQL or PySpark instead.
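The snippets in this guide use a DataFrame with columns ShopID, product, and price. Here is a small illustrative DataFrame you can create to follow along (the values are made up for demonstration):

```python
import pandas as pd

# A small sample DataFrame matching the column names used below
# (one product is deliberately left as None to illustrate NULL handling)
df = pd.DataFrame({
    'ShopID': [1, 1, 2, 2, 3],
    'product': ['scissor', 'pen', 'scissor', 'notebook', None],
    'price': [2, 4, 6, 8, 10],
})
print(df)
```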

1. How to read CSV in Pandas?

import pandas as pd
df = pd.read_csv('data.csv')  # replace 'data.csv' with the path to your CSV file

print(df)

2. How to use the SQL SELECT statement in pandas?

SQL:

SELECT ShopID, price FROM df

Python:

Method 1: Use [ ] to select desired columns.

# Select all rows and the columns 'ShopID' and 'price'
df[['ShopID', 'price']]

Method 2: Use loc to access rows and columns by label.

# Select all rows and columns 'ShopID' and 'price'
df.loc[:, ['ShopID', 'price']]

Method 3: Use iloc to access rows and columns by integer position.

# Select all rows for the columns at integer positions 0 and 1
df.iloc[:, [0, 1]]

Method 4: Use a boolean mask.

mask = [True, True, False]  # Select first and second columns
df.loc[:, mask]

3. How to apply the SQL WHERE condition in Python?

SQL: Select all rows where price is greater than 4

SELECT * FROM df WHERE price > 4

Python:


df[df['price'] > 4]

SQL: Select all rows where price is greater than 5 and product is ‘scissor’

SELECT * FROM df WHERE price > 5 AND product = 'scissor'

Python:

df[(df['product'] == 'scissor') & (df['price'] > 5)]

SQL: Select all rows where price is 2 or 4.

SELECT * FROM df WHERE price IN (2, 4)

Python:

values_to_keep = [2, 4]
df[df['price'].isin(values_to_keep)]
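If you prefer syntax closer to SQL, pandas also offers DataFrame.query, which takes the filter condition as a string. A minimal sketch, assuming the same product/price columns used above:

```python
import pandas as pd

df = pd.DataFrame({'product': ['scissor', 'pen', 'scissor'],
                   'price': [3, 6, 9]})

# query() accepts a SQL-like boolean expression as a string
result = df.query("price > 5 and product == 'scissor'")
print(result)
```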

4. How to do SQL GROUP BY in pandas?

Method 1:

Find the average price by ShopID:

df.groupby("ShopID")["price"].mean()

Method 2:

Find the count of products in each shop excluding NULL.

# count will not include null (NaN) values
df.groupby(["ShopID", "product"]).count()

Method 3:

Find the count of products in each shop including NULL.

# size will include count of null values
df.groupby(["ShopID", "product"]).size()

Method 4:

Use as_index=False to keep the grouping keys as regular DataFrame columns.

df.groupby(["ShopID", "product"], as_index = False).size()

Method 5:

df.groupby(["ShopID", "product"])['price'].agg(['min', 'max']).reset_index()

Note: reset_index() converts the grouped result (which carries the group keys in its index) back into a flat DataFrame.
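Named aggregation is another handy option here: it lets you alias the output columns in one step, much like SELECT MIN(price) AS min_price in SQL. A sketch using the same ShopID/price columns:

```python
import pandas as pd

df = pd.DataFrame({'ShopID': [1, 1, 2],
                   'price': [2, 4, 10]})

# Named aggregation: new_column_name=(source_column, aggregation)
summary = df.groupby('ShopID', as_index=False).agg(
    min_price=('price', 'min'),
    max_price=('price', 'max'),
    avg_price=('price', 'mean'),
)
print(summary)
```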

5. How to apply GROUP BY with a HAVING clause?

In SQL we can use HAVING directly after GROUP BY. In pandas it is a two-step process:

Step 1: Calculate a grouped DataFrame.

Step 2: Apply a filter to the aggregated result.

# Group by "shopID" and calculate the avg price for each shopID
grouped_df = df.groupby('ShopID')['price'].mean()


# # Filter groups where the sum is greater than a threshold (e.g., 5)
filtered_groups = grouped_df[grouped_df > 5]

# # Display the result
filtered_groups
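If you want to keep the original rows of the qualifying groups (rather than the aggregated values), groupby().filter() does both steps in one call. A sketch, assuming the same ShopID/price columns:

```python
import pandas as pd

df = pd.DataFrame({'ShopID': [1, 1, 2, 2],
                   'price': [2, 4, 8, 10]})

# Keep all rows belonging to shops whose average price exceeds 5
kept = df.groupby('ShopID').filter(lambda g: g['price'].mean() > 5)
print(kept)
```

This is the pandas analogue of joining a HAVING-filtered subquery back to the original table.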

6. How to use a CASE WHEN statement in pandas?

Method 1: For a single condition

import numpy as np

# Create a new column 'Price Group' using np.where
df['Price Group'] = np.where(df['price'] >= 6, 'High', 'Low')

print(df)

Method 2: For multiple conditions

conditions = [
    df['price'] < 4,
    (df['price'] >= 4) & (df['price'] < 8),
    df['price'] >= 8,
]
choices = ['low', 'med', 'high']

df['Price Group'] = np.select(conditions, choices, default='Invalid Price')

print(df)

7. How to apply window functions in pandas?

# rolling mean over a 3-row window (like AVG() OVER with a sliding frame)
df['rolling_mean'] = df['price'].rolling(window=3).mean()

print(df)
# rank in pandas (note: pandas' default is method='average';
# SQL's RANK() corresponds to method='min')
df['rank'] = df.groupby('ShopID')['price'].rank(method='min')
# dense rank in pandas, like SQL's DENSE_RANK()
df['Dense_Rank'] = df.groupby('ShopID')['price'].rank(method='dense')

print(df)
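One more common window pattern: SQL's ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) can be reproduced with cumcount after sorting. A sketch, assuming the same ShopID/price columns:

```python
import pandas as pd

df = pd.DataFrame({'ShopID': [1, 1, 2, 2],
                   'price': [4, 2, 10, 8]})

# ROW_NUMBER() OVER (PARTITION BY ShopID ORDER BY price)
df = df.sort_values(['ShopID', 'price'])
df['row_number'] = df.groupby('ShopID').cumcount() + 1
print(df)
```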

8. How to join DataFrames in pandas?

Create a DataFrame (grouped_df) with the average price per shop and join it with the initial DataFrame df.

# find the avg price per shop and join it with df
grouped_df = df.groupby('ShopID', as_index = False)['price'].mean()

# column rename {oldname : newname} with the use of inplace argument
grouped_df.rename(columns={'price': 'avg_price_on_shop'}, inplace = True)

Note:

  1. Many Pandas methods have inplace as an optional parameter, and its default value is often False. This means that, by default, Pandas operations return a new modified object.
  2. Using inplace=True can be memory-efficient when dealing with large datasets, as it avoids creating unnecessary copies of the data.
  3. Modifying objects inplace can make the code less readable and more error-prone, especially for complex operations. Consider using the non-inplace version of methods if readability and clarity are important.

# join the dataframes to find the avg price per shop
result = pd.merge(df, grouped_df, on='ShopID', how='inner')

print(result)
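When the goal is simply to attach a group aggregate to every row, transform avoids the separate groupby-and-merge round trip entirely. A sketch with the same ShopID/price columns:

```python
import pandas as pd

df = pd.DataFrame({'ShopID': [1, 1, 2],
                   'price': [2, 4, 10]})

# transform broadcasts the group aggregate back to every row,
# so no separate merge is needed
df['avg_price_on_shop'] = df.groupby('ShopID')['price'].transform('mean')
print(df)
```

This mirrors SQL's AVG(price) OVER (PARTITION BY ShopID).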

9. How to read a pandas DataFrame from an Oracle database?


import pandas as pd
import cx_Oracle

# Set up Oracle connection parameters
oracle_connection = {
    'user': 'your_username',
    'password': 'your_password',
    'dsn': 'your_oracle_dsn',  # e.g., 'localhost:1521/your_service_name'
}

# Connect to Oracle database
connection = cx_Oracle.connect(**oracle_connection)

# SQL Query to retrieve data from a table
sql_query = "SELECT * FROM your_table_name"

# Read data into a Pandas DataFrame
# (recent pandas versions warn that read_sql officially supports
# SQLAlchemy connectables; a raw DBAPI connection still works)
df = pd.read_sql(sql_query, connection)

# Close the database connection
connection.close()

# Display the DataFrame
print(df)
