Data Wrangling in Pandas vs SQL: A Comprehensive Comparison
Frequently, I find myself navigating between SQL and Pandas in my work. Some days I write queries directly against the database; on others I work with flat files in pandas. Although I'm comfortable with SQL, I sometimes face uncertainty about translating SQL operations into the corresponding pandas operations. So I'm putting together a comprehensive guide for Pandas beginners, especially those transitioning from a SQL background.
To lay the groundwork for this guide, let’s first establish a fundamental understanding.
There are multiple ways to perform each of the operations below; I am showing the most basic ones.
Why do we need pandas when we can run queries on a database?
- As pandas works in-memory, once the data is loaded, querying it is very fast, while in SQL each query makes a round trip to the database.
- When working on a data analysis project, we need to view results, iterate, and visualize, which is where pandas shines.
- When the data is too large to fit in memory, use SQL or PySpark instead.
1. How to read CSV in Pandas?
import pandas as pd
df = pd.read_csv('data.csv')  # replace 'data.csv' with the path to your CSV file
print(df)
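The snippets below assume a DataFrame with columns ShopID, product, and price. To follow along without a CSV file, here is a small hypothetical dataset built directly in code (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data matching the columns used throughout this guide
df = pd.DataFrame({
    'ShopID': [1, 1, 2, 2, 3],
    'product': ['pen', 'scissor', 'pen', 'scissor', 'notebook'],
    'price': [2, 6, 4, 9, 5],
})
print(df)
```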
2. How to use the SQL SELECT statement in pandas?
SQL:
SELECT ShopID, price FROM df
Python:
Method 1: Use [ ] to select desired columns.
#Select all rows and columns 'ShopID' and 'price'
df[['ShopID', 'price']]
Method 2: Use loc to access rows and columns by label.
# Select all rows and columns 'ShopID' and 'price'
df.loc[:, ['ShopID', 'price']]
Method 3: Use iloc to access rows and columns by integer position.
# Select all rows for the columns at positions 0 and 1
df.iloc[:, [0, 1]]
Method 4: Use a boolean mask. The mask must have one entry per column.
mask = [True, True, False]  # keep the first and second of three columns
df.loc[:, mask]
3. How to apply a SQL WHERE condition in pandas?
SQL: Select all rows where price is greater than 4
select * from df where price > 4
Python:
df[df['price'] > 4]
SQL: Select all rows where price is greater than 5 and product is 'scissor'
select * from df where price > 5 and product = 'scissor'
Python:
df[(df['product'] == 'scissor') & (df['price'] > 5)]
SQL: Select all rows where price is either 2 or 4.
select * from df where price in (2,4)
Python:
values_to_keep = [2, 4]
df[df['price'].isin(values_to_keep)]
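As a side note, pandas also offers DataFrame.query(), which takes a SQL-like boolean expression as a string and can read more naturally for those coming from SQL. A minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['pen', 'scissor', 'scissor'],
    'price': [2, 6, 9],
})

# Equivalent of: select * from df where price > 5 and product = 'scissor'
result = df.query("price > 5 and product == 'scissor'")
print(result)
```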
4. How to do a SQL GROUP BY in pandas?
Method 1:
Find the average price by ShopID.
df.groupby("ShopID")["price"].mean()
Method 2:
Find the count of products in each shop, excluding NULLs.
# count will not include null values
df.groupby(["ShopID", "product"]).count()
Method 3:
Find the count of products in each shop including NULL.
# size will include count of null values
df.groupby(["ShopID", "product"]).size()
Method 4:
Use as_index=False to keep the group keys as regular DataFrame columns.
df.groupby(["ShopID", "product"], as_index=False).size()
Method 5:
df.groupby(["ShopID", "product"])['price'].agg(['min', 'max']).reset_index()
Note: The reset_index() function is used to convert the grouped result back to a DataFrame.
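A related option is named aggregation, which labels the output columns directly, much like SELECT MIN(price) AS min_price in SQL. A sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'ShopID': [1, 1, 2],
    'price': [2, 6, 4],
})

# Equivalent of: select ShopID, min(price) as min_price, max(price) as max_price
#                from df group by ShopID
summary = df.groupby('ShopID', as_index=False).agg(
    min_price=('price', 'min'),
    max_price=('price', 'max'),
)
print(summary)
```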
5. How to apply GROUP BY with a HAVING clause?
In SQL we can use HAVING directly after GROUP BY. In pandas, it is a two-step process:
Step 1: Calculate a grouped DataFrame.
Step 2: Apply a filter.
# Group by 'ShopID' and calculate the average price per shop
grouped_df = df.groupby('ShopID')['price'].mean()
# Filter groups where the average is greater than a threshold (e.g., 5)
filtered_groups = grouped_df[grouped_df > 5]
# Display the result
filtered_groups
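If the goal is to keep the original rows of the shops that pass the HAVING condition, rather than the aggregated values, GroupBy.filter can do both steps at once. A sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'ShopID': [1, 1, 2, 2],
    'price': [2, 4, 8, 10],
})

# Keep all rows belonging to shops whose average price is greater than 5
kept = df.groupby('ShopID').filter(lambda g: g['price'].mean() > 5)
print(kept)
```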
6. How to use a CASE WHEN statement in pandas?
Method 1: A single condition
import numpy as np
# Create a new column 'Price Group' using np.where
df['Price Group'] = np.where(df['price'] >= 6, 'High', 'Low')
print(df)
Method 2: Multiple conditions
conditions = [
df['price'] < 4,
(df['price'] >= 4) & (df['price'] < 8),
df['price'] >= 8
]
choices = ['low', 'med', 'high']
df['Price Group'] = np.select(conditions, choices, default='Invalid Price')
print(df)
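For ordered numeric ranges like the one above, pd.cut is another option: it bins a column by a list of edges and attaches labels in one call. A sketch with the same hypothetical thresholds:

```python
import pandas as pd

df = pd.DataFrame({'price': [2, 5, 9]})

# Bin prices into labelled ranges: price < 4 -> low, 4 <= price < 8 -> med, price >= 8 -> high
df['Price Group'] = pd.cut(
    df['price'],
    bins=[float('-inf'), 4, 8, float('inf')],
    labels=['low', 'med', 'high'],
    right=False,  # left-closed intervals, matching the conditions above
)
print(df)
```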
7. How to apply window functions in pandas?
df['rolling_mean'] = df['price'].rolling(window=3).mean()
print(df)
## rank in pandas
df['rank'] = df.groupby('ShopID')['price'].transform('rank')
## dense rank in pandas
df['Dense_Rank'] = df.groupby('ShopID')['price'].transform(lambda x: x.rank(method='dense'))
print(df)
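SQL's LAG and LEAD window functions can be approximated with shift inside a group: shift(1) looks at the previous row in each partition, shift(-1) at the next. A sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'ShopID': [1, 1, 1, 2, 2],
    'price': [2, 4, 6, 3, 5],
})

# Equivalent of: LAG(price) OVER (PARTITION BY ShopID)
df['prev_price'] = df.groupby('ShopID')['price'].shift(1)
# Equivalent of: LEAD(price) OVER (PARTITION BY ShopID)
df['next_price'] = df.groupby('ShopID')['price'].shift(-1)
print(df)
```

The first row of each group gets NaN for prev_price, and the last row NaN for next_price, just as LAG/LEAD return NULL at partition edges.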
8. How to join DataFrames in pandas?
Create a DataFrame (grouped_df) with the average price per shop, and join it with the initial DataFrame df.
# find the avg price per shop and join it with df
grouped_df = df.groupby('ShopID', as_index = False)['price'].mean()
# rename a column ({old_name: new_name}) using the inplace argument
grouped_df.rename(columns={'price': 'avg_price_on_shop'}, inplace=True)
Note:
- Many Pandas methods have inplace as an optional parameter, and its default value is often False. This means that, by default, Pandas operations return a new modified object.
- Using inplace=True can be memory-efficient when dealing with large datasets, as it avoids creating unnecessary copies of the data.
- Modifying objects inplace can make the code less readable and more error-prone, especially for complex operations. Consider using the non-inplace version of methods if readability and clarity are important.
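To illustrate the two styles side by side with a hypothetical rename:

```python
import pandas as pd

df = pd.DataFrame({'price': [1, 2]})

# Non-inplace: returns a new DataFrame; the original df is untouched
renamed = df.rename(columns={'price': 'unit_price'})

# Inplace: modifies df directly and returns None
df.rename(columns={'price': 'unit_price'}, inplace=True)
print(df.columns.tolist(), renamed.columns.tolist())
```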
# join the dataframes to find the avg price per shop
result = pd.merge(df, grouped_df, on='ShopID', how='inner')
print(result)
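When the aggregated value only needs to come back as a column on the original rows, groupby().transform() achieves the same result without an explicit join: it returns a Series aligned to the original index. A sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    'ShopID': [1, 1, 2],
    'price': [2, 6, 4],
})

# Same outcome as grouping, renaming, and merging back: one aligned column
df['avg_price_on_shop'] = df.groupby('ShopID')['price'].transform('mean')
print(df)
```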
9. How to read data from an Oracle database into a pandas DataFrame?
import pandas as pd
import cx_Oracle
# Set up Oracle connection parameters
oracle_connection = {
    'user': 'your_username',
    'password': 'your_password',
    'dsn': 'your_oracle_dsn'  # e.g., 'localhost:1521/your_service_name'
}
# Connect to Oracle database
connection = cx_Oracle.connect(**oracle_connection)
# SQL Query to retrieve data from a table
sql_query = "SELECT * FROM your_table_name"
# Read data into a Pandas DataFrame
df = pd.read_sql(sql_query, connection)
# Close the database connection
connection.close()
# Display the DataFrame
print(df)
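pd.read_sql also accepts a SQLAlchemy engine, which pandas recommends over a raw DBAPI connection. Here is a self-contained sketch using an in-memory SQLite database as a stand-in for Oracle, so it runs without any server; the Oracle URL shown in the comment is a hypothetical example:

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the real database; for Oracle, the URL would
# look something like 'oracle+cx_oracle://user:password@dsn' (hypothetical)
engine = create_engine('sqlite://')

# Create and populate a throwaway table, then read it back into a DataFrame
pd.DataFrame({'ShopID': [1, 2], 'price': [4, 9]}).to_sql(
    'shops', engine, index=False
)
df = pd.read_sql('SELECT * FROM shops WHERE price > 5', engine)
print(df)
```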