Data Integration with Python

Integrating supermarket data using Python

Nnamdi Samuel
Art of Data Engineering
4 min readDec 20, 2023

--

Photo by Khamkéo Vilaysing on Unsplash

Data integration is more than just merging tables or combining CSV files;
It’s about combining different pieces of information to create a clear and meaningful picture.

In this project, the focus is on a multi-faceted scenario involving four intricate CSV files.

Objective

The sole aim of this project is to showcase the capabilities of Python in the realm of data integration by merging these files to create a unified dataset.

The CSV files are:

  • sales_data.csv
  • customer_data.csv
  • products_data.csv
  • shipping_data.csv

sales_data.csv:

Screenshot by author

customer_data:

Screenshot by author

products_data:

Screenshot by author

shipping_data:

Screenshot by author

Data integration process

import pandas as pd

# Reading data from CSV files
sales_df = pd.read_csv('sales_data.csv')
customer_df = pd.read_csv('customer_data.csv')
products_df = pd.read_csv('products_data.csv')
shipping_df = pd.read_csv('shipping_data.csv')

# Performing multi-step data integration process
merged_df = pd.merge(sales_df, customer_df, on='Customer_ID', how='left')

# Extracting year, month, and quarter from the date
merged_df['YearMonth'] = pd.to_datetime(merged_df['Date']).dt.to_period('M')
merged_df['Quarter'] = pd.to_datetime(merged_df['Date']).dt.quarter

# Merging with product data
merged_df = pd.merge(merged_df, products_df, on='Product', how='left')

# Merging with shipping data
merged_df = pd.merge(merged_df, shipping_df, on='Order_ID', how='left')

# Calculating discounted amount based on customer segment, loyalty level, and dynamic pricing
merged_df['Segment_Discount'] = merged_df.apply(
lambda row: 0.1 if row['Segment'] == 'Gold' else (0.05 if row['Segment'] == 'Silver' else 0),
axis=1
)

merged_df['Loyalty_Discount'] = merged_df.apply(
lambda row: 0.15 if row['Loyalty_Level'] == 'Platinum' else (0.1 if row['Loyalty_Level'] == 'Gold' else 0),
axis=1
)

merged_df['Dynamic_Pricing'] = merged_df.apply(
lambda row: row['Base_Price'] * (0.1 * row['Popularity_Score']),
axis=1
)

merged_df['Discounted_Amount'] = merged_df['Amount'] * (1 - merged_df['Discount'] - merged_df['Segment_Discount'] - merged_df['Loyalty_Discount']) - merged_df['Dynamic_Pricing']

# Calculating total cost including shipping and discounted amount
merged_df['Total_Cost'] = merged_df['Discounted_Amount'] + merged_df['Shipping_Cost']

# Display the final integrated data with customer names in the second column
output_df = merged_df[['Order_ID', 'Customer_Name', 'Date', 'Product', 'Amount', 'Region', 'Gender', 'Age', 'Segment', 'Loyalty_Level', 'Category', 'Base_Price', 'Discount', 'Supplier', 'Popularity_Score', 'Shipping_Cost', 'Shipping_Method', 'Shipping_Region', 'Discounted_Amount', 'Total_Cost']]
print(output_df)
  • The ‘sales_df’ and ‘customer_df’ DataFrames were merged based on the ‘Customer_ID’ column. This adds the ‘Customer_Name’ column to the resulting DataFrame (merged_df), making it available for further processing.
  • The year and month were extracted from the date to create a new ‘YearMonth’ column for analysis.
  • The result was merged with product data based on the ‘Product’ column.
  • The shipping data was merged using the ‘Order_ID’ column.
    A new column, ‘Discounted_Amount,’ was calculated by applying discounts to the sales amount.
  • Finally, the ‘Total_Cost,’ which includes the discounted amount and shipping cost was calculated.

Output

   Order_ID Customer_Name     Date     Product  Amount   Region   Gender  Age Segment Loyalty_Level     Category  Base_Price  Discount  Supplier  Popularity_Score  Shipping_Cost  Shipping_Method Shipping_Region    Discounted_Amount       Total_Cost
0 Order_1 John Doe 2023-01-01 Product_A 120 North Male 32 Gold Platinum Electronics 150 0.1 Supplier_X 8 10 Express North 94.5 104.5
1 Order_2 Jane Smith 2023-01-02 Product_B 180 South Female 28 Silver Gold Home & Kitchen 120 0.05 Supplier_Y 6 15 Standard South 170.1 185.1
2 Order_3 Bob Johnson 2023-01-03 Product_A 150 West Male 40 Platinum Silver Electronics 150 0.1 Supplier_X 8 8 Express West 123.0 131.0
3 Order_4 Alice Brown 2023-01-04 Product_C 90 East Female 35 Gold Bronze Clothing 80 0.2 Supplier_Z 9 12 Standard East 47.2 59.2
4 Order_5 John Doe 2023-01-05 Product_B 200 North Male 32 Gold Platinum Home & Kitchen 120 0.05 Supplier_Y 6 18 Express North 189.5 207.5
5 Order_6 Jane Smith 2023-01-06 Product_A 110 South Female 28 Silver Gold Electronics 150 0.1 Supplier_X 8 9 Standard South 100.5 109.5
6 Order_7 Bob Johnson 2023-01-07 Product_C 130 West Male 40 Platinum Silver Clothing 80 0.2 Supplier_Z 9 14 Express West 100.8 114.8
7 Order_8 Alice Brown 2023-01-08 Product_A 160 East Female 35 Gold Bronze Electronics 150 0.1 Supplier_X 8 20 Standard East 141.0 161.0
8 Order_9 John Doe 2023-01-09 Product_B 220 North Male 32 Gold Platinum Home & Kitchen 120 0.05 Supplier_Y 6 11 Express North 207.9 218.9
9 Order_10 Jane Smith 2023-01-10 Product_C 120 South Female 28 Silver Gold Clothing 80 0.2 Supplier_Z 9 16 Standard South 82.8 98.8

Column descriptions

Order_ID: The unique identifier for each order.
Customer_Name: The name of the customer who placed the order.
Date: The date when the order was placed.
Product: The product that was ordered.
Amount: The total amount of the order.
Region: The region associated with the customer.
Gender: The gender of the customer.
Age: The age of the customer.
Segment: The customer segmentation category.
Loyalty_Level: The loyalty level of the customer.
Category: The category of the ordered product.
Base_Price: The base price of the product.
Discount: The discount applies to the product.
Supplier: The supplier of the product.
Popularity_Score: The popularity score of the product.
Shipping_Cost: The cost associated with shipping.
Shipping_Method: The method used for shipping.
Shipping_Region: The region to which the order is shipped.
Discounted_Amount: The amount after applying discounts.
Total_Cost: The total cost, including the discounted amount and shipping cost.

Thank you for reading! If you found this article interesting, do consider following me and subscribing to my latest articles. Catch me on LinkedIn and follow me on Twitter

--

--

Nnamdi Samuel
Art of Data Engineering

Data Engineer💥Voracious Reader and a Writer || Chemical Engineer