Online Business Reporting with SQL

From creating a CSV import pipeline with Python to analyzing the data with PostgreSQL.

thejasmine
Analytics Vidhya
6 min read · Dec 23, 2020


Photo by Luke Chesser on Unsplash

As an analyst, it is essential to be able to extract the data we want from a large database, and SQL is the tool for this!

SQL stands for Structured Query Language, and it is a language for creating, deleting, updating, and extracting data in a database. Since a large amount of data is stored in databases, knowing how to extract useful information from them is important.

I want to share how to write queries to get some important metrics for an eCommerce business, such as month-over-month (MoM) active users and top sales categories for different years.

The Brazilian E-Commerce dataset from Kaggle contains about 100k orders placed between 2016 and 2018 at multiple marketplaces in Brazil.

But first, I need to solve a problem: importing several CSV files into PostgreSQL. Previously, I introduced a way to import CSVs using only pgAdmin, but it requires creating each table manually, which is time-consuming.

This is where Python comes into play! I can use the `sqlalchemy` toolkit to import the CSV files into my PostgreSQL database by running a script. This way, the tables are created automatically, with the CSV headers used as column names. You can learn how to use the package here.

The function I used for importing the files:
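
A minimal sketch of such an import function, assuming pandas and SQLAlchemy; the connection string, folder path, and table naming below are placeholders, so adjust them to your own setup:

```python
import glob
import os

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- replace with your own PostgreSQL credentials.
engine = create_engine("postgresql://user:password@localhost:5432/olist")

def import_csv_files(folder: str) -> None:
    """Load every CSV in `folder` into PostgreSQL, one table per file."""
    for path in glob.glob(os.path.join(folder, "*.csv")):
        # Use the file name (without extension) as the table name.
        table_name = os.path.splitext(os.path.basename(path))[0]
        df = pd.read_csv(path)
        # to_sql creates the table automatically and uses the CSV header as column names.
        df.to_sql(table_name, engine, if_exists="replace", index=False)
        print(f"Imported {table_name}: {len(df)} rows")

import_csv_files("data")
```
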
Now I have an Olist database with all the data in place.

Before jumping into writing a query, it is important to understand the relationship between each table.

Database schema of the Olist dataset. Image source: Kaggle

In my opinion, window functions, common table expressions (CTEs), and extracting date/time parts are the three most useful techniques for reporting. Window functions let us perform aggregate calculations over a specific range of rows, and CTEs let us break a complex query into small pieces and build the final result step by step.

The following solutions might not be the most optimized, but building common table expressions step by step is a clear way to understand the logic.

There are 10 questions I am interested in to better understand Olist's sales performance and its customers. For some questions, I add notes and points to keep in mind when writing the query.

  1. Update product category name to the English version.
  2. Top 10 product categories by sales
  3. What are the top 10 product categories for 2016, 2017, and 2018, respectively?
  4. On which day of the week do customers tend to go shopping?
  5. What are the top 3 product categories for each state?
  6. What is the number of monthly active users for 2017? What is the MoM growth rate? What is the retention rate?
  7. What is the most popular payment method?
  8. How many orders fall into each order price range?
  9. Average time for a review to be answered
  10. Order amount distribution (25%, 50%, 75%, and average order amount)

1. Update product category name to the English version.

Sometimes a table may contain values that you want to replace with something else. In this case, we can use `UPDATE` to update the table.

  • Use a subquery to set the English value, and a WHERE condition to make sure each row is matched to the same product category name in the translation table (see the sketch below).
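
A sketch of the update, assuming the table and column names from the Kaggle schema (`products` and `product_category_name_translation`); adjust them to match your import:

```sql
-- Replace each Portuguese category name with its English translation.
UPDATE products
SET product_category_name = (
    SELECT t.product_category_name_english
    FROM product_category_name_translation t
    WHERE t.product_category_name = products.product_category_name
)
WHERE product_category_name IN (
    SELECT product_category_name
    FROM product_category_name_translation
);
```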

2. Top 10 product categories by sales

Let’s start with something simple!

  • A GROUP BY clause is needed when performing an aggregate function!
  • Use the LIMIT clause to select a limited number of records.
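
One way to write it (table names such as `order_items` and `products` are shortened placeholders for the imported Olist tables):

```sql
-- Total sales per product category, top 10
SELECT p.product_category_name,
       SUM(oi.price) AS total_sales
FROM order_items oi
JOIN products p ON p.product_id = oi.product_id
GROUP BY p.product_category_name
ORDER BY total_sales DESC
LIMIT 10;
```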

From the result, I can see that the health_beauty category has the highest sales!

3. What are the top 10 product categories for 2016, 2017, and 2018, respectively?

Obviously, just using the LIMIT clause cannot get the top 10 for each year separately; it only gives the overall top 10 for 2016–2018. As a result, I need to use a common table expression to store the intermediate result and a window function to get the top 10 for each year!

  • First, get each category’s total sales grouped by year, and store the result in a CTE.
  • Use the RANK window function to create a ranking column within each year’s sales.
  • Use a WHERE clause to keep only rankings less than or equal to 10.
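
Putting the steps together, a sketch of the query (with the same shortened table names, and assuming the timestamp columns were imported as timestamps) might look like:

```sql
-- Top 10 categories per year: CTE for yearly totals, then RANK() within each year
WITH yearly_sales AS (
    SELECT EXTRACT(YEAR FROM o.order_purchase_timestamp) AS order_year,
           p.product_category_name,
           SUM(oi.price) AS total_sales
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.order_id
    JOIN products p ON p.product_id = oi.product_id
    GROUP BY 1, 2
)
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY order_year ORDER BY total_sales DESC) AS sales_rank
    FROM yearly_sales
) ranked
WHERE sales_rank <= 10
ORDER BY order_year, sales_rank;
```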

In 2017, the health and beauty category is in third place, and in 2018 it has the most sales. The bed, bath and table category drops from first to third place between 2017 and 2018.

4. On which day of the week do customers tend to go shopping?

  • Use EXTRACT and pass the isodow field, which returns the ISO day of the week (Monday = 1, Sunday = 7).
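
For example (assuming `order_purchase_timestamp` was imported as a timestamp):

```sql
-- Order counts per day of the week (isodow: Monday = 1 ... Sunday = 7)
SELECT EXTRACT(ISODOW FROM order_purchase_timestamp) AS day_of_week,
       COUNT(*) AS order_count
FROM orders
GROUP BY 1
ORDER BY order_count DESC;
```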

Surprisingly, Monday has the most orders and Saturday the fewest. It shows that customers prefer to shop on weekdays.

5. What are the top 3 product categories for each state?

The way to solve this problem is similar to the top-10-per-year question. Here, I need to join four tables to get the result. It is always useful to look back at the schema when connecting tables.
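
A sketch joining `customers`, `orders`, `order_items`, and `products` (shortened placeholder names), then ranking within each state:

```sql
-- Top 3 categories per customer state
WITH state_sales AS (
    SELECT c.customer_state,
           p.product_category_name,
           SUM(oi.price) AS total_sales
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    JOIN order_items oi ON oi.order_id = o.order_id
    JOIN products p ON p.product_id = oi.product_id
    GROUP BY 1, 2
)
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY customer_state ORDER BY total_sales DESC) AS sales_rank
    FROM state_sales
) ranked
WHERE sales_rank <= 3
ORDER BY customer_state, sales_rank;
```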

6. What is the number of monthly active users for 2017? What is the MoM growth rate? What is the retention rate?

Monthly active users & MoM growth rate

  • Count DISTINCT customer_id to get a unique customer count for each month.
  • Use the LAG window function to get the previous month’s customer count for calculating the growth rate.
  • Use GREATEST to avoid a denominator of 0: the previous-month value for the first month of the dataset is null, and once it is filled with 0 it would cause a division-by-zero error in the later MoM growth rate calculation.
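
A sketch of the calculation (the COALESCE/GREATEST handling of the first month is one way to apply the note above; table and column names are the same placeholders as before):

```sql
-- Monthly active users for 2017 and month-over-month growth rate
WITH monthly_users AS (
    SELECT DATE_TRUNC('month', order_purchase_timestamp) AS order_month,
           COUNT(DISTINCT customer_id) AS active_users
    FROM orders
    WHERE EXTRACT(YEAR FROM order_purchase_timestamp) = 2017
    GROUP BY 1
)
SELECT order_month,
       active_users,
       LAG(active_users) OVER (ORDER BY order_month) AS prev_month_users,
       ROUND(100.0 * (active_users - LAG(active_users) OVER (ORDER BY order_month))
             / GREATEST(COALESCE(LAG(active_users) OVER (ORDER BY order_month), 0), 1), 2) AS mom_growth_pct
FROM monthly_users
ORDER BY order_month;
```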

From the result, the number of active users increases by 800% at the beginning of 2017.

Customer retention rate

Retention rate measures how many of the previous month’s active users still place orders this month.

  • First, get the unique customer ids for each month.
  • Use a self left join, matching on the same customer id and pairing the current month with the previous month.
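
A sketch of the retention calculation using a self left join on the month-by-customer CTE:

```sql
-- Of the customers active in a given month, how many place another order the following month
WITH monthly_customers AS (
    SELECT DISTINCT DATE_TRUNC('month', order_purchase_timestamp) AS order_month,
           customer_id
    FROM orders
)
SELECT prev.order_month,
       COUNT(DISTINCT prev.customer_id) AS active_users,
       COUNT(DISTINCT cur.customer_id)  AS retained_users,
       ROUND(100.0 * COUNT(DISTINCT cur.customer_id) / COUNT(DISTINCT prev.customer_id), 2) AS retention_pct
FROM monthly_customers prev
LEFT JOIN monthly_customers cur
       ON cur.customer_id = prev.customer_id
      AND cur.order_month = prev.order_month + INTERVAL '1 month'
GROUP BY prev.order_month
ORDER BY prev.order_month;
```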

7. What is the most popular payment method?
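
A simple aggregation over the payments table answers this (`order_payments` is a placeholder name for the imported payments table):

```sql
-- Count of payments per payment method
SELECT payment_type,
       COUNT(*) AS payment_count
FROM order_payments
GROUP BY payment_type
ORDER BY payment_count DESC;
```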

The credit card is the most popular payment method for this business.

8. How many orders fall into each order price range?

A frequency table puts the data into different bins; it provides an insightful perspective on the frequency distribution.

  • Use ROUND with -2 to round the order totals to the nearest hundred, and group by that column to create the sales bins.
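
A sketch of the frequency table, summing payments per order and binning the totals with ROUND(..., -2); the cast to numeric is needed if the column was imported as double precision:

```sql
-- Number of orders in each price bin, rounded to the nearest hundred
WITH order_totals AS (
    SELECT order_id,
           SUM(payment_value) AS order_total
    FROM order_payments
    GROUP BY order_id
)
SELECT ROUND(order_total::numeric, -2) AS price_bin,
       COUNT(*) AS order_count
FROM order_totals
GROUP BY price_bin
ORDER BY price_bin;
```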

The 100–200 range has the highest order count.

9. Average time for a review to be answered

  • Use EXTRACT with epoch to calculate the average time for a customer’s review to be answered.
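
A sketch, assuming the review timestamps were imported as timestamp columns (`order_reviews` is a placeholder table name):

```sql
-- Average time between review creation and answer, in days
-- (epoch gives the interval in seconds; divide by 86400 to convert to days)
SELECT ROUND((AVG(EXTRACT(EPOCH FROM review_answer_timestamp - review_creation_date)) / 86400)::numeric, 2)
       AS avg_answer_days
FROM order_reviews;
```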

10. Order amount distribution (25%, 50%, 75%, and average order amount)

By knowing the percentiles of the data, I can see whether the data is skewed positively or negatively.
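
PostgreSQL's PERCENTILE_CONT makes this straightforward; a sketch over the summed payments per order:

```sql
-- Order amount distribution: quartiles and average
WITH order_totals AS (
    SELECT order_id,
           SUM(payment_value) AS order_total
    FROM order_payments
    GROUP BY order_id
)
SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY order_total) AS p25,
       PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY order_total) AS median,
       PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY order_total) AS p75,
       AVG(order_total) AS avg_order_amount
FROM order_totals;
```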

For the order payments, the median is less than the average, so the data is positively skewed: most orders are below the average, with a long tail of large orders pulling the average up.

Result

By asking the above questions, I know that:

  • Customers prefer using credit cards as the payment method.
  • Customers tend to shop on weekdays.
  • The health and beauty category is the top-selling product category across the three years, taking first place in 2018 and third place in 2017.
  • On average, it takes two days for a customer’s review to get an answer.
  • More…

Takeaways

Using SQL to leverage customer data enables us to understand the business and its customers better, leading to better performance and better strategy. Calculating key metrics and preparing report-ready data with SQL, then analyzing it further with tools such as Python or business intelligence software like Tableau, makes the data analytics process more efficient.

You can access the full project code here.
