SQL Challenge 2: Pizza Runner (Data Wrangling)

Okonkwo Chukwuebuka Malcom
7 min readMar 7, 2023

--

By Okonkwo Chukwuebuka Malcom

Introduction

This is the challenge 12— Pizza Runner of the 8 Weeks SQL Challenge by Danny Ma. I am playing the role of a data analyst at Pizza Runner.

Problem Statement

Danny was scrolling through his Instagram feed when something really caught his eye — “80s Retro Styling and Pizza Is The Future!”

Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire — so he had one more genius idea to combine with it — he was going to Uberize it — and so Pizza Runner was launched!

Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.

He has prepared for us an entity relationship diagram of his database design but requires further assistance to clean his data and apply some basic calculations so he can better direct his runners and optimise Pizza Runner’s operations.

All datasets exist within the pizza_runner database schema - be sure to include this reference within your SQL scripts as you start exploring the data and answering the case study questions.

The Entity relationship Diagram can be seen below

For the runners table

The runners table shows the registration_date for each new runner

For the customer_orders:

Customer pizza orders are captured in the customer_orders table with 1 row for each individual pizza that is part of the order.

The pizza_id relates to the type of pizza which was ordered whilst the exclusions are the ingredient_id values which should be removed from the pizza and the extras are the ingredient_id values which need to be added to the pizza.

Note that customers can order multiple pizzas in a single order with varying exclusions and extras values even if the pizza is the same type!

The exclusions and extras columns will need to be cleaned up before using them in your queries.

For the runner_orders table:

After each orders are received through the system — they are assigned to a runner — however not all orders are fully completed and can be cancelled by the restaurant or the customer.

The pickup_time is the timestamp at which the runner arrives at the Pizza Runner headquarters to pick up the freshly cooked pizzas. The distance and duration fields are related to how far and long the runner had to travel to deliver the order to the respective customer.

For the pizza_names table:

At the moment — Pizza Runner only has 2 pizzas available the Meat Lovers or Vegetarian!

For the pizza_recipes table:

Each pizza_id has a standard set of toppings which are used as part of the pizza recipe.

For the pizza_toppings table:

This table contains all of the topping_name values with their corresponding topping_id value

Data Wrangling Task

There are three processes in Data Wrangling. Gathering, Assessing, and Cleaning are the three processes involved in data wrangling, and these steps will be used to solve the problem at hand.

Gathering Data

The data and entity relationships of the case study may be found here: https://8weeksqlchallenge.com/case-study-2/.

Assessing the Data

Before we start the cleaning process, it is best practice to go through the data to discover the data quality issues and abnormalities in the data. Since the data is not a big data, Visual Assessment works just fine for the data

To Assess the tables in the database:

  SELECT * FROM customer_orders;
SELECT * FROM pizza_names;
SELECT * FROM pizza_recipes;
SELECT * FROM pizza_toppings;
SELECT * FROM runner_orders;
SELECT * FROM runners;

From Visually Assessing the tables, Data Quality issues where discovered in the the runners table, the pizza_names table and the pizza toppings table.

Cleaning the Data

After identifying the issues with the tables, this is the stage where the data quality issues are dealt with

For the customer_orders table: Data Quality issues observed are

  • In the exclusions column, the presense of ‘null’ and ‘ ‘ instead of NULL
  • In the extras column, the presence of ‘null’ and ‘ ‘ instead of NULL

The Code used to take of it

-- for the null and blank values in the extrusions column
UPDATE customer_orders
SET exclusions = CASE WHEN exclusions = '' THEN NULL
WHEN exclusions = 'null' THEN NULL
ELSE exclusions END;

-- for the null and blank values in the extras column
UPDATE customer_orders
SET extras = CASE WHEN extras = '' THEN NULL
WHEN extras = 'null' THEN NULL
ELSE extras END;

To Check if the code works

SELECT * FROM customer_orders;

Output:

For the runner_orders table: Data Quality issues observed are

  • the pickup_time column has ‘null’ values
  • the distance column has the appearance of ‘km’, ‘null’ values and ‘ km’
  • the duration column has the appearance of ‘minutes’, ‘null’, ‘mins’, ‘minute’
  • the cancellation column has the appearance of ‘null’ and ‘ ‘
  • Change the data types of the corrected columns

The Code that would take care of these quality issues

UPDATE runner_orders
SET pickup_time = CASE WHEN pickup_time = 'null' THEN NULL
ELSE pickup_time END;

UPDATE runner_orders
SET distance = CASE WHEN distance = 'null' THEN NULL
WHEN distance LIKE '%km' THEN
TRIM(REPLACE(distance, 'km', '')) ELSE distance END;

UPDATE runner_orders
SET duration = CASE WHEN duration LIKE '%min%' THEN LEFT(duration, 2)
WHEN duration = 'null' THEN NULL
ELSE duration END ;

UPDATE runner_orders
SET cancellation = CASE WHEN cancellation IN ('null', '') THEN NULL
ELSE cancellation END;

The Output would look like this

Next step is to check the data type of the column

DESCRIBE runner_orders;

Output:

To change the data types of the pickup_time , distance & duration

-- To change the datatype of the columns
ALTER TABLE runner_orders
MODIFY COLUMN pickup_time DATETIME;

ALTER TABLE runner_orders
MODIFY COLUMN distance INT;

ALTER TABLE runner_orders
MODIFY COLUMN duration INT;

Output:

This Wraps up all the data cleaning requirements for the challenge.

Further Cleaning Steps

  1. For the pizza_recipes , the toppings column contains comma separated values. This can be separated into each rows of its own.

The pizza_recipes table looks like the following

Since this is an optional cleaning step, I would be storing this in a Temporary Table. This is to avoid messing with your original data

The Code to fix it:

CREATE TEMPORARY TABLE pizza_recipes_clean as (
SELECT p.pizza_id,
REPLACE(SUBSTRING_INDEX(SUBSTRING_INDEX(p.toppings,
',', numbers.n), ',', -1), ' ', '') AS toppings
FROM (select 1 AS n UNION ALL
select 2 UNION ALL
select 3 UNION ALL
select 4 UNION ALL
select 5 UNION ALL
select 6 UNION ALL
select 7 UNION ALL
select 8 ) AS numbers
INNER JOIN pizza_recipes p
ON CHAR_LENGTH(p.toppings) - CHAR_LENGTH(REPLACE(p.toppings,
',', '')) >= numbers.n-1
ORDER BY p.pizza_id, n);

Output:

2. For the customer_orders table, the exclusions & extras column contains comma separated values. The table is shown below

To split the values into separate rows,

-- to handle the commas in the exclusions column
CREATE TEMPORARY TABLE customer_orders_pre_clean as
select order_id,
customer_id,
pizza_id,
order_time,
extras,
SUBSTRING_INDEX(SUBSTRING_INDEX(exclusions, ',', numbers.n),
',', -1) AS exclusions
FROM (SELECT 1 AS n UNION ALL
SELECT 2 UNION ALL
SELECT 3) numbers
right JOIN customer_orders
ON CHAR_LENGTH(exclusions) - CHAR_LENGTH(replace(exclusions,
',', '')) >= numbers.n-1
order by order_id, n;
-- to handle the commas in the extras column
CREATE TEMPORARY TABLE customer_orders_cleaned AS
select order_id,
customer_id,
pizza_id,
order_time,
exclusions,
SUBSTRING_INDEX(SUBSTRING_INDEX(extras, ',', numbers.n),
',', -1) AS extras
FROM (SELECT 1 AS n UNION ALL
SELECT 2 UNION ALL
SELECT 3) numbers
RIGHT JOIN customer_orders_pre_clean
ON CHAR_LENGTH(extras) - CHAR_LENGTH(REPLACE(extras, ',', ''))
>= numbers.n-1
order by order_id, n;

Output:

Thanks For reading and Feel free to comment, share and correct the codes in case of an error. I would also love feedbacks. The full Data Wrangling process is documented on my GitHub Repository.

Thank you… Time to answer the Case Study Questions in the next post.

Feel free to reach out to me on LinkedIn and Twitter

--

--

Okonkwo Chukwuebuka Malcom

A Data Analyst, Problem solver and Graduate Chemical Engineer. Check out my Articles if you like to see how data can be used to solve problems & make decisions