Data Integration with Python
Integrating supermarket data using Python
Data integration is more than just merging tables or combining CSV files;
It’s about combining different pieces of information to create a clear and meaningful picture.
In this project, the focus is on a multi-faceted scenario involving four intricate CSV files.
Objective
The sole aim of this project is to showcase the capabilities of Python in the realm of data integration by merging these files to create a unified dataset.
The CSV files are:
- sales_data.csv
- customer_data.csv
- products_data.csv
- shipping_data.csv
sales_data.csv:
customer_data:
products_data:
shipping_data:
Data integration process
import pandas as pd
# Reading data from CSV files
sales_df = pd.read_csv('sales_data.csv')
customer_df = pd.read_csv('customer_data.csv')
products_df = pd.read_csv('products_data.csv')
shipping_df = pd.read_csv('shipping_data.csv')
# Performing multi-step data integration process
merged_df = pd.merge(sales_df, customer_df, on='Customer_ID', how='left')
# Extracting year, month, and quarter from the date
merged_df['YearMonth'] = pd.to_datetime(merged_df['Date']).dt.to_period('M')
merged_df['Quarter'] = pd.to_datetime(merged_df['Date']).dt.quarter
# Merging with product data
merged_df = pd.merge(merged_df, products_df, on='Product', how='left')
# Merging with shipping data
merged_df = pd.merge(merged_df, shipping_df, on='Order_ID', how='left')
# Calculating discounted amount based on customer segment, loyalty level, and dynamic pricing
merged_df['Segment_Discount'] = merged_df.apply(
lambda row: 0.1 if row['Segment'] == 'Gold' else (0.05 if row['Segment'] == 'Silver' else 0),
axis=1
)
merged_df['Loyalty_Discount'] = merged_df.apply(
lambda row: 0.15 if row['Loyalty_Level'] == 'Platinum' else (0.1 if row['Loyalty_Level'] == 'Gold' else 0),
axis=1
)
merged_df['Dynamic_Pricing'] = merged_df.apply(
lambda row: row['Base_Price'] * (0.1 * row['Popularity_Score']),
axis=1
)
merged_df['Discounted_Amount'] = merged_df['Amount'] * (1 - merged_df['Discount'] - merged_df['Segment_Discount'] - merged_df['Loyalty_Discount']) - merged_df['Dynamic_Pricing']
# Calculating total cost including shipping and discounted amount
merged_df['Total_Cost'] = merged_df['Discounted_Amount'] + merged_df['Shipping_Cost']
# Display the final integrated data with customer names in the second column
output_df = merged_df[['Order_ID', 'Customer_Name', 'Date', 'Product', 'Amount', 'Region', 'Gender', 'Age', 'Segment', 'Loyalty_Level', 'Category', 'Base_Price', 'Discount', 'Supplier', 'Popularity_Score', 'Shipping_Cost', 'Shipping_Method', 'Shipping_Region', 'Discounted_Amount', 'Total_Cost']]
print(output_df)
- The ‘sales_df’ and ‘customer_df’ DataFrames were merged based on the ‘Customer_ID’ column. This adds the ‘Customer_Name’ column to the resulting DataFrame (merged_df), making it available for further processing.
- The year and month were extracted from the date to create a new ‘YearMonth’ column for analysis.
- The result was merged with product data based on the ‘Product’ column.
- The shipping data was merged using the ‘Order_ID’ column.
A new column, ‘Discounted_Amount,’ was calculated by applying discounts to the sales amount. - Finally, the ‘Total_Cost,’ which includes the discounted amount and shipping cost was calculated.
Output
Order_ID Customer_Name Date Product Amount Region Gender Age Segment Loyalty_Level Category Base_Price Discount Supplier Popularity_Score Shipping_Cost Shipping_Method Shipping_Region Discounted_Amount Total_Cost
0 Order_1 John Doe 2023-01-01 Product_A 120 North Male 32 Gold Platinum Electronics 150 0.1 Supplier_X 8 10 Express North 94.5 104.5
1 Order_2 Jane Smith 2023-01-02 Product_B 180 South Female 28 Silver Gold Home & Kitchen 120 0.05 Supplier_Y 6 15 Standard South 170.1 185.1
2 Order_3 Bob Johnson 2023-01-03 Product_A 150 West Male 40 Platinum Silver Electronics 150 0.1 Supplier_X 8 8 Express West 123.0 131.0
3 Order_4 Alice Brown 2023-01-04 Product_C 90 East Female 35 Gold Bronze Clothing 80 0.2 Supplier_Z 9 12 Standard East 47.2 59.2
4 Order_5 John Doe 2023-01-05 Product_B 200 North Male 32 Gold Platinum Home & Kitchen 120 0.05 Supplier_Y 6 18 Express North 189.5 207.5
5 Order_6 Jane Smith 2023-01-06 Product_A 110 South Female 28 Silver Gold Electronics 150 0.1 Supplier_X 8 9 Standard South 100.5 109.5
6 Order_7 Bob Johnson 2023-01-07 Product_C 130 West Male 40 Platinum Silver Clothing 80 0.2 Supplier_Z 9 14 Express West 100.8 114.8
7 Order_8 Alice Brown 2023-01-08 Product_A 160 East Female 35 Gold Bronze Electronics 150 0.1 Supplier_X 8 20 Standard East 141.0 161.0
8 Order_9 John Doe 2023-01-09 Product_B 220 North Male 32 Gold Platinum Home & Kitchen 120 0.05 Supplier_Y 6 11 Express North 207.9 218.9
9 Order_10 Jane Smith 2023-01-10 Product_C 120 South Female 28 Silver Gold Clothing 80 0.2 Supplier_Z 9 16 Standard South 82.8 98.8
Column descriptions
Order_ID: The unique identifier for each order.
Customer_Name: The name of the customer who placed the order.
Date: The date when the order was placed.
Product: The product that was ordered.
Amount: The total amount of the order.
Region: The region associated with the customer.
Gender: The gender of the customer.
Age: The age of the customer.
Segment: The customer segmentation category.
Loyalty_Level: The loyalty level of the customer.
Category: The category of the ordered product.
Base_Price: The base price of the product.
Discount: The discount applies to the product.
Supplier: The supplier of the product.
Popularity_Score: The popularity score of the product.
Shipping_Cost: The cost associated with shipping.
Shipping_Method: The method used for shipping.
Shipping_Region: The region to which the order is shipped.
Discounted_Amount: The amount after applying discounts.
Total_Cost: The total cost, including the discounted amount and shipping cost.