SQL Challenge : Pizza Runner (data cleaning) by Mariam Salvador
Introduction
This is the challenge 2 — Pizza Runner of the 8 Weeks SQL Challenge by Danny Ma.
Note: this is the data cleaning section of this project. The exploratory analysis can be found here pizza runner exploratory analysis. Kindly go through the data cleaning process here first before the exploratory analysis.
Problem Statement
Danny was scrolling through his Instagram feed when something really caught his eye — “80s Retro Styling and Pizza Is The Future!”
Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire — so he had one more genius idea to combine with it — he was going to Uberize it — and so Pizza Runner was launched!
Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.
Danny has created an entity relationship diagram (ERD) for his database design in the Pizza Runner challenge. However, he needs additional support to clean and refine his data, as well as perform basic calculations. This will enable him to make more informed decisions regarding the direction of his runners and optimize the operations of Pizza Runner.
The Entity relationship Diagram can be seen below
runners
table
The runners
table shows the registration_date
for each new runner.
Customer_orders table
The pizza_id
relates to the type of pizza which was ordered whilst the exclusions
are the ingredient_id
values which should be removed from the pizza and the extras
are the ingredient_id
values which need to be added to the pizza.
Note that customers can order multiple pizzas in a single order with varying exclusions
and extras
values even if the pizza is the same type!
The exclusions
and extras
columns will need to be cleaned up before using them in your queries.
Runner_orders table:
After each orders are received through the system — they are assigned to a runner — however not all orders are fully completed and can be cancelled by the restaurant or the customer.
The pickup_time
is the timestamp at which the runner arrives at the Pizza Runner headquarters to pick up the freshly cooked pizzas. The distance
and duration
fields are related to how far and long the runner had to travel to deliver the order to the respective customer.
Pizza_names table:
At the moment — Pizza Runner only has 2 pizzas available the Meat Lovers or Vegetarian!
Pizza_recipes table:
Each pizza_id
has a standard set of toppings
which are used as part of the pizza recipe.
Pizza_toppings table:
This table contains all of the topping_name
values with their corresponding topping_id
value
Data Cleaning
The data provided has to be cleaned before carrying out analysis. I assessed the tables in PostgreSQL to identify data quality issues and abnormalities. PostgreSql was used for this case study.
I noticed that the customer_orders, pizza_toppings and runner_orders table has data quality issues and needs to be resolved.
Cleaning Process
After identifying the issues, the customer_orders table had these data quality issues
· The exclusion column had null value and blank cells. The exclusion data type is also in text but rather, it’s data type should be integer. This column also had values with comma (e.g 2,3) which can affect calculation in this column.
· The extra column also had the same issues with exclusion column
Sql to resolve this:
The extra and extension column has some comma values and needs to be separated into rows.
I created a new column called id to give the rows in the customer orders table a unique id in a temp table called new_customer_orders. I created this temp table to keep the original table while working on this new table.
In order to generate the id from 1–14 (there are 14 rows in the customer_order table), I created a temporary sequence starting from 1–14.
The nextval function helps to retrieves the next value from the sequence. Next, I dropped the sequence table since I won’t be using it again.
You must be wondering why I created the id column….well, read along with me, I’ll explain why.
Remember the extras & exclusion table has some comma values that needs to be separated into rows. That’s the reason for this whole process. Read along, I’ll explain further.
To split the extra & exclusion column, I created a new temp table called cleaned_customer_orders which we will make use of in section C exploratory analysis of this project. Next, I used the regexp_split_to_table function in postgreSQL to split the extra & extension column with commas into rows. Then I created a cte for both extra and exclusion.
Next, I joined the both cte using the id column as the join condition. I noticed that the customer order table didn’t have a unique column which will hinder us from joining these two cte to derive our result, that’s why the id column was created. The id column will serve as the join condition for these two cte.
The extras and exclusion data type should be converted to integer. I also dropped the previous temp table, new_customer_orders which we used for cleaning. This temp table will not be needed so I deleted it.
Next table to be cleaned is the pizza recipes table.
This toppings in this table should be separated into rows.
I created a temp table for the cleaned pizza recipes and used regexp_split_to_table function to split the values into rows. I also changed the topping_id data type.
The cleaned pizza recipes table is ready for analysis
Next table to be cleaned is the runner_orders table
The pickup_time, distance & duration column data type needs to be changed
The runner_order table is ready for analysis.
This comes to the end of the data cleaning process of this case study. Thank you for reading.
Next step is the exploratory analysis which has been documented here pizza runner exploratory analysis .
you can reach out to me via twitter on linkedln
Twitter handle: @SalvadorMariam6 . Link: https://twitter.com/salvadormariam6?s=21&t=KlR9Wysi0L7CAaTJZl72FQ