Data Analysis — Case Study I

TiFFium. ✿
7 min read5 days ago

--

Introduction

TheLook is a fictitious eCommerce clothing site developed by the Looker team. The dataset contains information about customers, products, orders, logistics, web events and digital marketing campaigns. The contents of this dataset are synthetic, and are provided to industry practitioners for the purpose of product discovery, testing, and evaluation.

Analyze

Customers by gender

SELECT
COUNT(id) as users,
country
FROM
`bigquery-public-data.thelook_ecommerce.users`
WHERE gender = "M"
GROUP BY country
ORDER BY users DESC;
  • The country with the most male customers is China, followed by the United States.
  • The gender ratio of platform users is about the same, with a concentration in China, the United States, and Belgium.
SELECT
us.gender,
SUM(oit.sale_price)
FROM
`bigquery-public-data.thelook_ecommerce.users` us
JOIN
`bigquery-public-data.thelook_ecommerce.order_items` oit
ON
us.id = oit.user_id
WHERE
oit.status NOT IN ('Cancelled', 'Returned')
GROUP BY gender
HAVING gender = "M" OR gender = "F" ;

The total order amount for male customers on the platform is greater than that for female customers.

SELECT
us.gender,
oit.product_id,
inv.product_brand,
inv.product_category,
inv.product_name,
RANK() OVER (ORDER BY SUM(oit.sale_price) DESC) as rank,
SUM(oit.sale_price) as total_sales
FROM
`bigquery-public-data.thelook_ecommerce.users` us
JOIN
`bigquery-public-data.thelook_ecommerce.order_items` oit
ON
us.id = oit.user_id
JOIN
`bigquery-public-data.thelook_ecommerce.inventory_items` inv
ON
oit.product_id = inv.product_id
WHERE
oit.status NOT IN ('Cancelled', 'Returned')
AND gender = "F"
GROUP BY
us.gender, oit.product_id, inv.product_brand, inv.product_category, inv.product_name
ORDER BY
total_sales DESC
LIMIT 10;
  • Top 10 best-selling brands and products for female customers.
SELECT
us.gender,
oit.product_id,
inv.product_brand,
inv.product_category,
inv.product_name,
RANK() OVER (ORDER BY SUM(oit.sale_price) DESC) as rank,
SUM(oit.sale_price) as total_sales
FROM
`bigquery-public-data.thelook_ecommerce.users` us
JOIN
`bigquery-public-data.thelook_ecommerce.order_items` oit
ON
us.id = oit.user_id
JOIN
`bigquery-public-data.thelook_ecommerce.inventory_items` inv
ON
oit.product_id = inv.product_id
WHERE
oit.status NOT IN ('Cancelled', 'Returned')
AND gender = "M"
GROUP BY
us.gender, oit.product_id, inv.product_brand, inv.product_category, inv.product_name
ORDER BY
total_sales DESC
LIMIT 10;
  • Top 10 best-selling brands and products for male customers.
  • Male and Female have different purchasing preferences.

Customers by Age

SELECT
CASE
WHEN us.age < 15 THEN '<15'
WHEN us.age BETWEEN 15 AND 19 THEN '15-19'
WHEN us.age BETWEEN 20 AND 29 THEN '20-29'
WHEN us.age BETWEEN 30 AND 39 THEN '30-39'
WHEN us.age BETWEEN 40 AND 49 THEN '40-49'
WHEN us.age BETWEEN 50 AND 59 THEN '50-59'
WHEN us.age > 60 THEN '>60'
END AS age_group,
COUNT(DISTINCT oi.user_id) AS total_customer
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
INNER JOIN `bigquery-public-data.thelook_ecommerce.users` AS us
ON oi.user_id = us.id
GROUP BY age_group
ORDER BY total_customer DESC;
  • The average age of male customers on the platform is between 30–39 years and >60 years.
  • The number of new generation young customers is relatively low.
SELECT
CASE
WHEN us.age < 15 THEN '<15'
WHEN us.age BETWEEN 15 AND 19 THEN '15-19'
WHEN us.age BETWEEN 20 AND 29 THEN '20-29'
WHEN us.age BETWEEN 30 AND 39 THEN '30-39'
WHEN us.age BETWEEN 40 AND 49 THEN '40-49'
WHEN us.age BETWEEN 50 AND 59 THEN '50-59'
WHEN us.age > 60 THEN '>60'
END AS age_group,
gender,
COUNT(DISTINCT oi.user_id) AS total_customer
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oi
INNER JOIN `bigquery-public-data.thelook_ecommerce.users` AS us
ON oi.user_id = us.id
GROUP BY age_group, gender
HAVING gender = "F"
ORDER BY total_customer DESC;
  • The proportion of customers in each age group is similar between males and females, with the number of female customers aged 20–60 being very balanced.

Product Analysis

Brand Sales

---Top 10 best-selling brands---
SELECT
p.brand,
SUM(oit.sale_price) AS total_sale
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oit
JOIN `bigquery-public-data.thelook_ecommerce.products`AS p
ON oit.product_id = p.id
GROUP BY p.brand
ORDER BY total_sale DESC
LIMIT 10;
Top 10 best-selling brands
--- Top 10 worst-selling brands ---
SELECT
p.brand,
SUM(oit.sale_price) AS total_sale
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oit
JOIN `bigquery-public-data.thelook_ecommerce.products`AS p
ON oit.product_id = p.id
GROUP BY p.brand
ORDER BY total_sale ASC
LIMIT 10;
Top 10 worst-selling brands

Product Category Sales

--- Calculate the best-selling brand in each category ---
WITH ranked_brands AS (
SELECT
p.category,
p.brand,
RANK() OVER (PARTITION BY p.category ORDER BY SUM(oit.sale_price) DESC) AS rank,
SUM(oit.sale_price) AS total_sale
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oit
JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
ON oit.product_id = p.id
GROUP BY p.category, p.brand
)
SELECT *
FROM ranked_brands
WHERE rank = 1
ORDER BY total_sale DESC;
Calculate the best-selling brand in each category
--- Calculate the worst-selling brand in each category ---
WITH ranked_brands AS (
SELECT
p.category,
p.brand,
RANK() OVER (PARTITION BY p.category ORDER BY SUM(oit.sale_price) ASC) AS rank,
SUM(oit.sale_price) AS total_sale
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oit
JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
ON oit.product_id = p.id
GROUP BY p.category, p.brand
)
SELECT *
FROM ranked_brands
WHERE rank = 1
ORDER BY total_sale ASC;
Calculate the worst-selling brand in each category

Which brands and product categories have the most cancellations and returns?

SELECT
p.category,
p.brand,
oit.status,
COUNT(oit.order_id) AS order_count
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oit
JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
ON oit.product_id = p.id
GROUP BY p.category, p.brand, oit.status
HAVING oit.status = "Cancelled"
ORDER BY order_count DESC
LIMIT 10;
  • The top ten most canceled categories are Swim, with two brands, Speedo and Quiksilver, followed by Jeans, with three brands, 7 For All Mankind, Levi’s, and Wrangler.
SELECT
p.category,
p.brand,
oit.status,
COUNT(oit.order_id) AS order_count
FROM `bigquery-public-data.thelook_ecommerce.order_items` AS oit
JOIN `bigquery-public-data.thelook_ecommerce.products` AS p
ON oit.product_id = p.id
GROUP BY p.category, p.brand, oit.status
HAVING oit.status = "Returned"
ORDER BY order_count DESC
LIMIT 10;
  • The top ten most returned orders, with the first being Swim brand Speedo. Among them, three brands of Jeans, 7 For All Mankind, Levi’s, and Wrangler, also have high return rates, suggesting that swimwear and jeans may be difficult to purchase in the right size.

Traffic Analysis

Which marketing channels are we performing well?

SELECT
traffic_source,
event_type,
COUNT(event_type) AS event_count
FROM `bigquery-public-data.thelook_ecommerce.events`
GROUP BY traffic_source, event_type
ORDER BY event_count DESC;
  • Compared to other sources, most browsing is done through Adword and Email.
SELECT
us.country,
COUNT(DISTINCT ev.user_id) AS users
FROM `bigquery-public-data.thelook_ecommerce.users` us
JOIN `bigquery-public-data.thelook_ecommerce.events` ev
ON us.state = ev.state
GROUP BY us.country
ORDER BY users DESC;
  • The number of users browsing and purchasing in each country is proportional. Surprisingly, the browsing volume in Colombia is not low, but the purchasing power is quite poor.

Count the total number of customer age groups purchasing through each traffic source.

SELECT
ev.traffic_source,
CASE
WHEN us.age < 15 THEN '<15'
WHEN us.age BETWEEN 15 AND 19 THEN '15-19'
WHEN us.age BETWEEN 20 AND 29 THEN '20-29'
WHEN us.age BETWEEN 30 AND 39 THEN '30-39'
WHEN us.age BETWEEN 40 AND 49 THEN '40-49'
WHEN us.age BETWEEN 50 AND 59 THEN '50-59'
WHEN us.age > 60 THEN '>60'
END AS age_group,
COUNT(DISTINCT ev.user_id) AS total_customer
FROM `bigquery-public-data.thelook_ecommerce.users` us
JOIN `bigquery-public-data.thelook_ecommerce.events` ev
ON us.state = ev.state
GROUP BY ev.traffic_source, age_group
ORDER BY total_customer;
  • The age distribution of E-mail and Award is even, but the number of teenage users of Youtube and Facebook is small.

Strategy Recommendations

About Sales

  • Strengthen Sales Promotions During Peak Hours: Conduct promotional activities during this time, such as flash sales, member-exclusive discounts, etc., to further boost sales.
  • Increase Engagement Among Younger Demographics: Consider designing targeted promotional activities or use social media for targeted marketing to attract more young customers.
  • Differentiate Marketing Strategies for Specific Countries: Since China and the U.S. are the top countries for both male and female customers, consider creating different marketing strategies to these two markets to meet local consumer preferences.

About Product

  • Optimize Products with High Return Rates: For categories with high return rates (such as swimwear and jeans), consider more detailed size guides, increasing product fit recommendations to enhance customer satisfaction.
  • Focus on Underperforming Brands: Phasing out underperforming products or repositioning their market strategy.

About Marketing

  • Increase Investment in AdWords and Email Marketing: Continue increasing investment in these two channels and manage them.
  • Optimize Content for Specific Channels: For platforms like YouTube and Facebook, which have fewer teenage users, consider launching more content that appeals to younger audiences, such as short videos, influencer collaborations.

Conclusion

This is my first case study. I referenced some analysis methods from experts and used ChatGPT to help with debugging and translation. I know I’m not perfect, so I need to learn more. Keep it up!

Reference:

https://tuanchris.com/blog/2022-09-07-Learn-SQL-with-the-e-commerce-dataset-on-Google-BigQuery/

https://medium.com/@ardanarizky/e-commerce-analysis-project-using-big-query-and-tableau-b88c32d99a22

https://medium.com/@chisomnnamani/the-look-e-commerce-data-analysis-28342b8da868

--

--