Python For Data Analyst

Roni Ramdani
35 min readMay 20, 2023

--

Python is a high-level programming language that is often used in software development, web programming, data analysis, and game development. Developed by Guido van Rossum in 1991, Python was designed with a focus on good code readability and simple syntax, making it easy for even beginners to learn and use.

One of Python’s strengths is its flexibility. It has a wide range of uses spanning various fields such as web development, data science, artificial intelligence (AI), network programming, game development, and many more. Python also supports a variety of programming paradigms, including procedural, object-oriented, and functional programming.
One of the reasons for Python’s popularity is its rich ecosystem. There are many libraries and frameworks available for free, such as Django for web development, NumPy and Pandas for data analysis, TensorFlow and PyTorch for artificial intelligence, and many more. These libraries allow developers to develop applications quickly and efficiently.

For this opportunity, I will use Python to analyze data. The data used is data from Kaggle: Pakistan’s Largest E-Commerce Dataset with some changes. The price listed has been converted to 1 Rupee which is equal to Rp. 58. Regarding the explanation of the dataset, they are as follows: download dataset.

variabel | class | description

let’s study
For the tools that I used on this occasion, I used Google Colaboratory Link

The initial stage in working on this exercise Python is to import the Python library which will be used according to what you want to produce. In this exercise, the library will be imported as follows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.tseries.offsets import BDay

Pandas:
Pandas is one of the most popular libraries for data analysis in the Python environment. This library provides high-level data structures and powerful data analysis tools. Pandas allow users to easily manipulate, transform, and analyze data. This library provides two core data structures, namely Series (one-dimensional data series) and DataFrame (two-dimensional data table). Pandas also has features for importing and exporting data from various file formats, such as CSV, Excel, and SQL.

NumPy:
NumPy (Numerical Python) is the basic library for numeric computing in Python. This library provides efficient multidimensional array objects and various mathematical functions needed to perform numerical operations. NumPy is often used as the basis for other libraries related to data analysis and artificial intelligence. Using NumPy, users can perform mathematical operations, indexing, filtering, and transforming data quickly and efficiently.

Matplotlib. plot:
Matplotlib. pyplot is a module included in the matplotlib library. Matplotlib is a very flexible and powerful data visualization library in Python. This library allows users to create various types of data visualizations, such as line graphs, scatter plots, histograms, bar plots, and more. Matplotlib. pyplot provides an interface similar to MATLAB for plotting easily. Users can control various aspects of the plot’s appearance, such as axes, labels, color, and line style.

Seaborn:
Seaborn is a statistical data visualization library built on top of matplotlib. This library is designed to produce more engaging and informative visualizations by default, with minimal effort on the part of the user. Seaborn provides various types of plots useful for data exploration and analysis, such as scatter plots, correlation plots, distribution plots, and more. Apart from that, Seaborn also has support for beautiful color palettes and customizable display themes.
from pandas. series.offsets import BDay:

In that line of code, “from pandas. series. offsets import BDay” is a statement that imports the BDay class from the offsets module in the Pandas library. BDay (Business Day) is an offset class that is used to make date adjustments based on weekdays in the business calendar. This is useful when working with financial data or data related to workdays. Using BDay, users can calculate dates that fall on the next or previous business day, ignoring weekends and holidays.

Just go to task exercise python let’s goooo!

Soal 1 :

Dear Data Analyst,

At the end of this year, the company will give prizes to customers who win the Year-End Festival competition. The Marketing Team needs help to determine the estimated prize that will be given to the winner of the competition later. These prizes will be taken from the TOP 5 Products from the Mobiles & Tablets Category during 2022, with the highest total sales quantity (valid = 1).

Please help, to send the data before the end of this month to the Marketing Team. For the assistance provided, we thank you.

Regards

data1 = pd.DataFrame(\
#Memfilter data dengan valid=1
df[(df['is_valid']==1) &\
#Memfilter data dengan Category = Mobiles & Tablets
(df['category']=='Mobiles & Tablets') &\
#Memfilter data pada transaksi selama 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
#Mengelompokan data
.groupby(by=["sku_name"])["qty_ordered"]\
#Pengelompokan berdasarkan penjumlahan
.sum()\
#Mengurutkan data
.sort_values(ascending=False)\
#Mengambil TOP 5 Produk
.head(5)\
#Mereset nama header
.reset_index(name='qty_2022'))
data1

Filters data from a “df” data frame with the following conditions:
The data has an “is_valid” value equal to 1.

Data has the same “category” value as “Mobiles & Tablets”.
The data has an “order_date” between “2022–01–01” and “2022–12–31”.
Group data by column “sku_name”.

Calculates the total number of “qty_ordered” for each product group.
Sort data descending (descending) based on the number of “qty_ordered”.
Fetches the top 5 products (the product with the largest number of “qty_ordered”).

Reset the index and change the name of the processed column header to “qty_2022”.

The result is the variable “data1”, which contains a DataFrame containing the top 5 products by quantity “qty_ordered” in 2022.

Output :

sku_name | qty_2022
#Sort data untuk dapat digunakan sebagai grafik
data1.sort_values(\
by=['qty_2022'], \
ascending=True,\
inplace=True)

#Membuat grafik
data1.plot(x='sku_name',
y=['qty_2022'],
kind='barh',
grid = True,
xlabel = 'Quantity',
ylabel = 'Product Name',
figsize=(12,7),
rot = 0,
title = 'TOP 5 Product',
table = False,
sort_columns = False,
secondary_y = False)

Sort data in the variable “data1” by column “qty_2022” ascending by setting the “ascending” parameter to True.
Change the data directly in the “data1” variable by using the “inplace=True” parameter.

Create a graph by using the “plot()” function on the variable “data1”.
Set the parameters of the “plot()” function as follows:
x: Sets the “sku_name” column as the x-axis of the chart.
y: Sets column “qty_2022” as the y-axis of the chart.
kind: Specifies the type of chart you want to create as a horizontal bar (barh).
grid: Displays the grid on the chart.

xlabel: Sets the x-axis label on the chart as “Quantity”.
ylabel: Sets the y-axis label on the chart as “Product Name”.
figsize: Specifies the size (width, height) of the graph.
rot: Sets the x-axis label rotation to 0 degrees (horizontal).
title: Sets the chart title as “TOP 5 Product”.
table: Sets the parameter to display the data table if it is True (in this case it is False).

sort_columns: Sets the sort column if it is True (in this case False).
secondary_y: Specifies whether there will be a secondary y-axis (in this case False).

The end result is a horizontal bar graph showing the top 5 products by quantity “qty_2022” with the x-axis labeled “Quantity” and the y-axis labeled “Product Name”. The chart has the title “TOP 5 Products” and a grid view.

Output :

Soal 2 & 3 :

Dear Data Analyst,

Following up on the joint meeting of the Warehouse Team and Marketing Team, we found that there was still a lot of product stock in the Beauty & Grooming Category at the end of 2022.

  1. We ask for your help to check the sales data for this category for 2021 in terms of sales quantity. Our provisional estimate is that there has been a decrease in sales quantity in 2022 compared to 2021. (Please also display data for the 15 categories)
  2. If there is indeed a decrease in the number of sales in the Beauty & Grooming category, we ask for your assistance in providing data on the TOP 20 product names that have experienced the highest decline in 2022 when compared to 2021. We will use this as material for discussion at the next meeting.

Please help to send the data no later than 4 days from today. For the assistance provided, we thank you.

Regards

No 2 :

kode .1

data2 = pd.DataFrame(\
#Memfilter data dengan valid=1
df[(df['is_valid']==1) &\
#Memfilter data pada transaksi selama 2021
((df['order_date'] >= '2021-01-01') & (df['order_date'] <= '2021-12-31'))]\
#Mengelompokan data
.groupby(by=["category"])["qty_ordered"]\
#Pengelompokan berdasarkan penjumlahan
.sum()\
#Mengurutkan data
.sort_values(ascending=False)\
#Mereset nama header
.reset_index(name='qty_2021'))
data2

Briefly, here is an explanation of the steps performed in the code:
Filters data from a “df” data frame with the following conditions:
The data has an “is_valid” value equal to 1.

The data has an “order_date” between “2021–01–01” and “2021–12–31”.
Group data by column “category”.

Counts the total number of “qty_ordered” for each category.
Sort data descending (descending) based on the number of “qty_ordered”.
Reset the index and change the name of the processed column header to “qty_2021”.

The result is the variable “data2”, which contains a DataFrame containing the product category and the number of “qty_ordered” in 2021 after going through the filtering and processing of the data.

Output :

category | qty-2021

kode .2

data3 = pd.DataFrame(\
#Memfilter data dengan valid=1
df[(df['is_valid']==1) &\
#Memfilter data pada transaksi selama 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
#Mengelompokan data
.groupby(by=["category"])["qty_ordered"]\
#Pengelompokan berdasarkan penjumlahan
.sum()\
#Mengurutkan data
.sort_values(ascending=False)\
#Mereset nama header
.reset_index(name='qty_2022'))
data3

Here is an explanation of the steps done in the code:
Filters data from a “df” data frame with the following conditions:
The data has an “is_valid” value equal to 1.
The data has an “order_date” between “2022–01–01” and “2022–12–31”.
Group data by column “category”.

Counts the total number of “qty_ordered” for each category.
Sort data descending (descending) based on the number of “qty_ordered”.
Reset the index and change the name of the processed column header to “qty_2021”.

The final result is the variable “data3”, which contains a data frame containing the product category and the number of “qty_ordered” in 2022 after going through the filtering and processing of the data.

Output :

category | qty_2022
data4 = data2.merge(data3, left_on = 'category', right_on = 'category')
data4

Merge “data2” and “data3” based on the “category” column.
The parameter “left_on” specifies which column in “data2” is used as the left merge column.
The “right_on” parameter specifies which column in “data3” is used as the right-join column.

The result is the variable “data4”, which is a new DataFrame resulting from concatenating “data2” and “data3” based on the “category” column. The DataFrame “data4” will contain the columns in the two original DataFrames, namely “data2” and “data3”. This merger makes it possible to obtain complete or related information from the two DataFrames in one new DataFrame.

Output :

category | qty_2021 | qty_2022
data4['qty_growth']=data4['qty_2022']-data4['qty_2021']
data4.sort_values(by=['qty_growth'],ascending=True,inplace=True)
data4

Creates a new column named “qty_growth” inside the DataFrame “data4”.
Column “qty_growth” is filled with the difference between the column values “qty_2022” and “qty_2021”.

This difference represents growth in quantity from 2021 to 2022.
Sort data in DataFrame “data4” by column “qty_growth” ascending by setting parameter “ascending” to True.

Change the data directly in the “data4” variable by using the “inplace=True” parameter.

The result is a DataFrame “data4” which already has a new column “qty_growth” which indicates product quantity growth from 2021 to 2022. DataFrame “data4” is also sorted by column “qty_growth” in ascending order, so the rows with the lowest quantity growth will appear first in the DataFrame.

Output :

category | qty_2021 | qty-2022 | qty_growth

No 3 :

2 .If there is indeed a decrease in the number of sales in the Beauty & Grooming category, we ask for your assistance in providing data on the TOP 20 product names that have experienced the highest decline in 2022 when compared to 2021. We will use this as material for discussion at the next meeting.

kode .1

data5 = pd.DataFrame(\
#Memfilter data dengan valid=1
df[(df['is_valid']==1) &\
(df['category']=='Beauty & Grooming') &\
#Memfilter data pada transaksi selama 2021
((df['order_date'] >= '2021-01-01') & (df['order_date'] <= '2021-12-31'))]\
#Mengelompokan data
.groupby(by=["sku_name"])["qty_ordered"]\
#Pengelompokan berdasarkan penjumlahan
.sum()\
#Mengurutkan data
.sort_values(ascending=False)\
#Mereset nama header
.reset_index(name='qty_bg_2021'))
data5

Filters data from a “df” DataFrame with the following conditions:
The data has an “is_valid” value equal to 1.
Data has category “Beauty & Grooming”.
The data has an “order_date” between “2021–01–01” and “2021–12–31”.
Group data by column “sku_name”.

Calculates the total number of “qty_ordered” for each product (by “sku_name”).
Sort data descending (descending) based on the number of “qty_ordered”.
There’s no next step in the code, but usually the next step is to reset the column header names to make them more descriptive and in context.

The final result is the “data5” variable, which contains a DataFrame that contains products from the “Beauty & Grooming” category in 2021 after going through a filtering and data processing process. The “data5” DataFrame will have a “sku_name” column containing the product name, and a “qty_ordered” column containing the total number of products ordered. Data in “data5” will be sorted by “qty_ordered” quantity in descending order.
Because we will see if there is a decrease in sales quantity in the Beauty & Grooming category

Output :

sku_name | qty_bg_2021

kode .2

data6 = pd.DataFrame(\
#Memfilter data dengan valid=1
df[(df['is_valid']==1) &\
(df['category']=='Beauty & Grooming') &\
#Memfilter data pada transaksi selama 2021
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
#Mengelompokan data
.groupby(by=["sku_name"])["qty_ordered"]\
#Pengelompokan berdasarkan penjumlahan
.sum()\
#Mengurutkan data
.sort_values(ascending=False)\
#Mereset nama header
.reset_index(name='qty_bg_2022'))
data6

Filters data from a “df” data frame with the following conditions:
The data has an “is_valid” value equal to 1.
Data has the category “Beauty & Grooming”.
The data has an “order_date” between “2021–01–01” and “2021–12–31”.
Group data by column “sku_name”.

Calculates the total number of “qty_ordered” for each product (by “sku_name”).
Sort data descending (descending) based on the number of “qty_ordered”.
There’s no next step in the code, but usually, the next step is to reset the column header names to make them more descriptive and in context.

The final result is the “data5” variable, which contains a data frame that contains products from the “Beauty & Grooming” category in 2021 after going through a filtering and data processing process. The “data5” DataFrame will have a “sku_name” column containing the product name, and a “qty_ordered” column containing the total number of products ordered. Data in “data5” will be sorted by “qty_ordered” quantity in descending order.

Output :

sku_name | qty_bg_2022

kode .3

data7 = data5.merge(data6, left_on = 'sku_name', right_on = 'sku_name')
data7['qty_bg_growth']=data7['qty_bg_2022']-data7['qty_bg_2021']
data7.sort_values(by=['qty_bg_growth'],ascending=True,inplace=True)
data7 = data7.head(20)
data7

Filters data from a “df” data frame with the following conditions:
The data has an “is_valid” value equal to 1.
Data has the category “Beauty & Grooming”.
The data has an “order_date” between “2021–01–01” and “2021–12–31”.
Group data by column “sku_name”.

Calculates the total number of “qty_ordered” for each product (by “sku_name”).
Sort data descending (descending) based on the number of “qty_ordered”.
There’s no next step in the code, but usually, the next step is to reset the column header names to make them more descriptive and in context.

The final result is the “data5” variable, which contains a data frame that contains products from the “Beauty & Grooming” category in 2021 after going through a filtering and data processing process. The “data5” DataFrame will have a “sku_name” column containing the product name, and a “qty_ordered” column containing the total number of products ordered. Data in “data5” will be sorted by “qty_ordered” quantity in descending order.

  1. The data is initially filtered based on certain criteria, such as data validity, product category, and transaction date range.
  2. Then, the data is grouped by product name (sku_name) and the total ordered product quantity (qty_ordered) is calculated for each of these products.
  3. The data is sorted by the number of product quantities in descending order.
  4. Merged with another DataFrame (data6) based on column sku_name.
  5. Created a new column (qty_bg_growth) that shows the growth in product quantity from 2021 to 2022.
  6. The data is re-sorted by increasing product quantity growth.

Data is limited to the top 20 rows.

With the above steps, we can analyze the products with the lowest quantity growth from 2021 to 2022. The final result is a DataFrame “data7” which contains the 20 products with the lowest quantity growth.

Soal 4 :

Dear Data Analyst,

Regarding the company’s anniversary in the next 2 months, the Digital Marketing Team will provide promo information for customers at the end of this month. The customer criteria that we will need are those who have checked out but have not made a payment (is_gross = 1) during 2022. The data we need is the Customer ID and Registered Date.

Please help, to send the data before the end of this month to the Digital Marketing Team. For the assistance provided, we thank you.

Regards

Tim Digital Marketing

kode .1

data8 = df[\
#Memfilter data dengan gross=1
(df['is_gross']==1) &\
#Memfilter data dengan valid=0
(df['is_valid']==0) &\
#Memfilter data dengan net=0
(df['is_net']==0) &\
#Memfilter data pada transaksi selama 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]
data9 = data8[['customer_id','registered_date']]
data9

Briefly, here is an explanation of the steps performed in the code:
Filters data from a “df” data frame with the following conditions:
The data has the value “is_gross” which is equal to 1.
The data has an “is_valid” value which is equal to 0.
The data has the value “is_net” which is equal to 0.
The data has an “order_date” between “2022–01–01” and “2022–12–31”.
The result is the variable “data8”, which contains a DataFrame that is a subset of the “df” DataFrame with data that satisfies the above filter conditions.

DataFrame “data8” will contain rows of data that have an “is_gross” value equal to 1, “is_valid” value equal to 0, “is_net” value equal to 0, and have transaction dates between “2022– 01–01” and “2022–12–31”
The code generates the variable “data9” which is a new DataFrame consisting of the “customer_id” and “registered_date” columns of the “data8” DataFrame.
Briefly, here is an explanation of the steps performed in the code:

Selects the “customer_id” and “registered_date” columns from the DataFrame “data8” using the slicing syntax with double square brackets “[[‘customer_id’, ‘registered_date’]]”.

Assigns the selection result to the variable “data9”.
The result is the variable “data9”, which is a DataFrame consisting of two columns, namely “customer_id” and “registered_date”, taken from the DataFrame “data8”. DataFrame “data9” contains only data relevant to those selected columns.

Output :

customer_id | registere_date

In the given context, the goal is to send the Digital Marketing Team customer data before the end of this month. The code described earlier can be used to generate a DataFrame “data9” containing customer information that matches the criteria you provide, i.e. those who have checked out but have not made a payment (is_gross = 1) during 2022.

By using the additional code data9.to_csv(‘customer_data.csv’, index=False), the data can be saved in CSV format with the file name “customer_data.csv”. This file can be uploaded or sent to the Digital Marketing Team as the data they need.
Thus, by following these steps, you can fulfill the request of the Digital Marketing Team by providing customer information that meets the criteria before the end of this month.

Soal 5 & 6 :

Dear Data Analyst,

Next month we will make an Annual Report which we will submit to Investors, in this regard please provide the following data:
Overall Profit Growth (%) 2021 vs 2022 as a result of annual sales performance.

Profit Growth (%) by Product Category in 2021 vs 2022.
Please help to send the data by tomorrow. For the assistance provided, we thank you.

Regards

Board of Director (BOD)

no 5 :

df['profit'] = df['after_discount'] - df['cogs']

In the code df[‘profit’] = df[‘after_discount’] — df[‘cogs’], we create a new column in the DataFrame df named “profit”. This field is filled with the difference between the “after_discount” and “cogs” column values.

In more detail, here is an explanation of the components in the code:
df[‘profit’] is a way to create or access the column “profit” in a DataFrame df.
df[‘after_discount’] is the “after_discount” column in the DataFrame df, which contains the value after the discount is applied to a transaction.

df[‘cogs’] is the column “cogs” in the DataFrame df, which is the cost of goods sold, namely the costs incurred to produce or obtain the products sold.
The — operator is used to subtract the “cogs” column values from the “after_discount” column values, returning the difference.
The result of the difference is then stored in the new “profit” column in the DataFrame df.

By executing the code, we can add a “profit” column to the DataFrame df, which describes the profitability or profit of each transaction based on the difference between the sales value after discount and the cost of goods sold.


data10 = df[\
#Memfilter data dengan valid=1
(df['is_valid']==1) &\
#Memfilter data pada transaksi selama 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]

Code data10 = df[(df[‘is_valid’]==1) & ((df[‘order_date’] >= ‘2022–01–01’) & (df[‘order_date’] <= ‘2022–12- 31’))] is used to create a new DataFrame named “data10” by taking a subset (part) of data from DataFrame “df” based on certain criteria.

Here is an explanation of each component in the code:
df[‘is_valid’]==1 is a condition to filter data which has “is_valid” value equal to 1. With this, only data with an “is_valid” value of 1 will be preserved in DataFrame “data10”.

(df[‘order_date’] >= ‘2022–01–01’) & (df[‘order_date’] <= ‘2022–12–31’) is a condition to filter data by date “order_date” in the time range between ‘ 2022–01–01’ and ‘2022–12–31’. Only data that meet these conditions will be stored in a “data10” DataFrame.

The & operator is used to combine the two conditions above. In this case, both conditions must be met simultaneously for the data to fit into the “data10” DataFrame.

By running the code, the DataFrame “data10” will contain a data subset of the DataFrame “df” that satisfies both filter conditions, i.e. it has an “is_valid” value of 1 and an “order_date” date is within the year 2022 (January 1, 2022 to December 31). 2022).

data11 = df[\
#Memfilter data dengan valid=1
(df['is_valid']==1) &\
#Memfilter data pada transaksi selama 2021
((df['order_date'] >= '2021-01-01') & (df['order_date'] <= '2021-12-31'))]

Code data11 = df[(df[‘is_valid’]==1) & ((df[‘order_date’] >= ‘2021–01–01’) & (df[‘order_date’] <= ‘2021–12- 31’))] is used to create a new DataFrame with the name “data11” which takes a subset (part) of data from DataFrame “df” based on certain criteria.

Here is an explanation of each component in the code:
df[‘is_valid’]==1 is a condition to filter data that has “is_valid” value equal to 1. Only data with a “is_valid” value of 1 will be stored in DataFrame “data11”.
(df[‘order_date’] >= ‘2021–01–01’) & (df[‘order_date’] <= ‘2021–12–31’) is a condition to filter data by date “order_date” in the time range between ‘ 2021–01–01’ and ‘2021–12–31’. Only data that meet these conditions will enter the DataFrame “data11”.

The & operator is used to combine the two conditions above. In this case, both conditions must be met simultaneously for data to be included in the “data11” DataFrame.

By running the code, the DataFrame “data11” will contain a data subset of the DataFrame “df” that satisfies both of the filter conditions, i.e. it has an “is_valid” value of 1 and the “order_date” date is within the year 2021 (1 January 2021 to 31 December 2021).

data12 = {\
'Periode Profit':'Total',\
'2021': data11['profit'].sum(), \
'2022': data10['profit'].sum(),\
'Growth (Value)': data10['profit'].sum() - data11['profit'].sum(),\
'Growth': pd.Series(round(((data10['profit'].sum() - data11['profit'].sum())/data11['profit'].sum())*100,2), dtype=str)+'%'
}
pd.DataFrame(data=data12, index=[0])

Data12 code is used to create a dictionary with the name “data12” which contains information regarding total profit and profit growth in a certain period. Then, the “data12” dictionary is used to create a new DataFrame using the function pd.DataFrame(data=data12, index=[0]).

Here is an explanation of each component in the code:
‘Profit Period’: ‘Total’ is a key-value pair in the “data12” dictionary that represents the column label “Profit Period” with the value “Total”.
‘2021’: data11[‘profit’].sum() is the key-value pair in the dictionary “data12” that represents the column label “2021” with the total profit value of the DataFrame “data11” (subset data year 2021).

‘2022’: data10[‘profit’].sum() is a key-value pair in the dictionary “data12” that represents the column label “2022” with the total profit value of the DataFrame “data10” (subset data year 2022).

‘Growth (Value)’: data10[‘profit’].sum() — data11[‘profit’].sum() is the key-value pair in the dictionary “data12” that represents the column label “Growth (Value)” with the value profit growth between 2021 and 2022.
‘Growth’: pd.Series(round(((data10[‘profit’].sum() — data11[‘profit’].sum())/data11[‘profit’].sum())*100,2 ), dtype=str)+’%’ is the key-value pair in the dictionary “data12” which represents the column label “Growth” with the profit growth value in percentage form.

After that, the “data12” dictionary is used as input to create a new DataFrame using the function pd.DataFrame(data=data12, index=[0]). This DataFrame has one row (index=[0]) and columns corresponding to the key in the “data12” dictionary.

Output :

Periode Profit | 2021 | 2022 | Growth ( Value ) | Growth

Based on the code provided, the insights that can be taken for the annual report regarding this question are:

Overall Profit Growth (%) 2021 vs 2022 as a result of annual sales performance.
Using the “data12” DataFrame, you can take the value of the overall profit growth between 2021 and 2022. The “Growth” column in the “data12” DataFrame shows the percentage of profit growth.

Profit Growth (%) by Product Category in 2021 vs 2022.
To get this information, it is necessary to do further analysis using DataFrames “data2” and “data3”. Possible steps that can be taken include:
a. Merge DataFrames “data2” and “data3” based on product category using the merge method.
b. Calculates total profit for 2021 and 2022 by product category.
c. Calculates profit growth by product category between 2021 and 2022.
d. Create graphical or table visualizations that illustrate profit growth by product category.

No 6 :

2. Profit Growth (%) berdasarkan Kategori produk tahun 2021 vs 2022.

data13 = pd.DataFrame(data10\
.groupby(by="category")["profit"].sum()\
.sort_values(ascending=False)\
.reset_index(name='profit_2022'))
data13

The code you provided is a command to create a new DataFrame called data13. The following describes the steps performed in the command:

data10.groupby(by=”category”)[“profit”].sum(): This code groups the data in the data10 DataFrame by the “category” column and then sums the values of the “profit” column for each group. The result is a series of profit values that are summed up for each category.

.sort_values(ascending=False): After grouping and summing, this command sorts the resulting profit values descending (descending), so that the category with the highest profit will be at the top.

.reset_index(name=’profit_2022'): This command resets the DataFrame index and names the resulting new column as ‘profit_2022’. In the resulting data frame, the categories will become the new indexes, and the ‘profit_2022’ column will contain the total profit values for each category.

Thus, data13 will become a new DataFrame containing the unique categories of data10 as an index, and column ‘profit_2022’ will contain the total profit for each category in descending order.

Output :

category | profit-2022
data14 = pd.DataFrame(data11\
.groupby(by="category")["profit"].sum()\
.sort_values(ascending=False)\
.reset_index(name='profit_2021'))
data14

The code you provided is a command to create a new DataFrame called data14. The following describes the steps performed in the command:

data11.groupby(by=”category”)[“profit”].sum(): This code groups the data in the data11 DataFrame by the “category” column and then sums the values of the “profit” column for each group. The result is a series of profit values that are summed up for each category.

.sort_values(ascending=False): After grouping and summing, this command sorts the resulting profit values descending (descending), so that the category with the highest profit will be at the top.

.reset_index(name=’profit_2021'): This command resets the DataFrame index and names the resulting new column as ‘profit_2021’. In the resulting data frame, the categories will become the new index, and the ‘profit_2021’ column will contain the total profit values for each category.

Thus, data14 will become a new DataFrame containing unique categories from data11 as an index, and column ‘profit_2021’ will contain the total profit for each category in descending order.

Output :

category | profit_2021
data15 = data14.merge(data13, left_on = 'category', right_on = 'category')
data15

The command you provided combines the data14 and data13 DataFrames based on the “category” column. Here is an explanation of what the command does:

data15 = data14.merge(data13, left_on=’category’, right_on=’category’)
This command uses the merge method to combine two DataFrames, namely data14 and data13, based on the “category” column. The “category” column is used as the key to join the two DataFrames.

As a result, the DataFrame data15 will contain the concatenated data from data14 and data13 with the column “category” as the index. This data frame will have three columns: “category”, “profit_2021” (total profit in 2021), and “profit_2022” (total profit in 2022)

Output :

category | profit_2021 | profit-2022
#Melakukan kalkulasi
data15['Growth (Value)'] = data15['profit_2022']-data15['profit_2021']
data15['Growth (%)'] = round(data15['Growth (Value)']/data15['profit_2021']*100,2)
data15.sort_values(by=['Growth (%)'], ascending = False, inplace = True)
data15

The given command adds the “Growth (Value)” and “Growth (%)” columns to the DataFrame data15, calculating the value growth and the growth percentage between the “profit_2022” and “profit_2021” columns. Then, the data frame is sorted based on the “Growth (%)” column descending. Here is an explanation of the steps:

data15[‘Growth (Value)’] = data15[‘profit_2022’] — data15[‘profit_2021’]: This command calculates the difference between the “profit_2022” and “profit_2021” columns and adds it as a new column named “Growth (Value)”.

data15[‘Growth (%)’] = round(data15[‘Growth (Value)’] / data15[‘profit_2021’] * 100, 2): This command calculates the percentage growth by dividing the value of the column “Growth (Value)” by “profit_2021”, then multiply by 100. The result is rounded to 2 decimals and added as a new column named “Growth (%)”.

data15.sort_values(by=[‘Growth (%)’], ascending=False, inplace=True): This command sorts the DataFrame data15 based on the “Growth (%)” column descending. This sequence replaces the original data frame with the new sequence.

Thus, DataFrame data15 will contain combined data from data14 and data13 with additional columns “Growth (Value)” and “Growth (%)” which describe the growth in value and the percentage of growth. This data frame will be sorted by percentage growth (Growth (%)) in descending order.

Output :

category | profit_2021 | profit-2022 | Growth(Value) | Growth (%)

Grafik :

#Membuat grafik
data15.plot(x='category',
y=['Growth (%)'],
kind='bar',
grid = True,
title = 'Growth 2021 vs 2022',
xlabel = 'Category',
ylabel = 'Growth (%)',
figsize=(12,7),
rot = 90,
table = False,
sort_columns = False,
secondary_y = False)

Soal 7 & 8:

Dear Data Analyst,

From October to December 2022, we are running a campaign every Saturday and Sunday. We want to assess whether the campaign has enough impact on increasing sales (before_discount). Please help to display the data:
Average daily weekend sales (Saturday and Sunday) vs. average daily sales weekdays (Monday-Friday) per month. Is there an increase in sales in each of these months?
Average daily sales weekends (Saturday and Sunday) vs. average daily sales weekdays (Monday-Friday) for the entire 3 months.
Please help to send the data no later than next week. For the assistance provided, we thank you.

Regards

Tim Campaign

no 7 :

Rata-rata harian penjualan weekends (Sabtu dan Minggu) vs rata-rata harian penjualan weekdays (Senin-Jumat) per bulan tersebut. Apakah ada peningkatan penjualan pada masing-masing bulan tersebut.

Kode 1.

#Membuat kolom tambahan untuk day, month, month_num
df['day']=df['order_date'].dt.day_name()
df['month']=df['order_date'].dt.month_name()
df['month_num']=df['order_date'].dt.month
df.head(5)

df[‘day’] = df[‘order_date’].dt.day_name() This line adds a new column named ‘day’ to the DataFrame df. The value in this column will contain the day of the data contained in the ‘order_date’ column. The dt.day_name() attribute is used to retrieve the day that came from the Pandas DateTime object in the ‘order_date’ column.

df[‘month’] = df[‘order_date’].dt.month_name() This line adds a new column named ‘month’ to the DataFrame df. The value in this column will contain the month name of the data contained in the ‘order_date’ column. The dt.month_name() attribute is used to retrieve the month name from the Pandas DateTime object in the ‘order_date’ column.

df[‘month_num’] = df[‘order_date’].dt.month This line adds a new column named ‘month_num’ to the DataFrame df. The value in this column will contain the month number of the data contained in the ‘order_date’ column. The dt. month attribute is used to retrieve the month number from the Pandas DateTime object in the ‘order_date’ column.

df.head(5) This line is used to display the first five rows of the DataFrame df that have been updated with new columns. The head(5) function returns a DataFrame with the first five rows, allowing you to see the results of the changes you’ve made.

Output

id | customer_id | order_date | sku_id | price | qty_ordered | before_discount | before_amount | after_discount |is_gross | … | payment_id | payment_method | sku_name | base_price | cogs | category | registed_date | day | month | month_num.

Kode 2.

data16 = pd.DataFrame(df[(df['is_valid']==1) \
& (df['day'].isin(['Saturday','Sunday']))\
& (df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31')]\
.groupby(by=["month_num","month"])["before_discount"].mean()\
.round()\
.sort_values(ascending=False)\
.reset_index(name='avg_sales_weekend'))
data16

df[(df[‘valid’] == 1) & (df[‘day’].isin([‘Saturday’, ‘Sunday’])) & (df[‘order_date’] >= ‘2022–01- 01’) & (df[‘order_date’] <= ‘2022–12–31’)] In the first step, we filter the data in the DataFrame df with the following conditions:

df[‘is_valid’] == 1 ensures only valid data will be considered.

df[‘day’].isin([‘Saturday’, ‘Sunday’]) ensures only data with Saturday or Sunday will be considered.

df[‘order_date’] >= ‘2022–01–01’ ensures only records with a date after or on January 1, 2022, will be approved.

df[‘order_date’] <= ‘2022–12–31’ ensures only data with a date before or on December 31, 2022, will be approved.

.groupby(by=[“month_num”, “month”])[“before_discount”].mean() After filtering the data, we group it based on the “month_num” and “month” columns. Then, we select the “before_discount” column and calculate the average using the .mean() method.
.round() After getting the average, we round it to the nearest integer using the .round() method.

.sort_values(ascending=False) After that, we sort the data based on the average value descending using the .sort_values() method with the parameter ascending=False.
.reset_index(name=’avg_sales_weekend’) Lastly, we reset the DataFrame index and name the average calculated column as “avg_sales_weekend” using the .reset_index() method with parameter name=’avg_sales_weekend’.

Then data16 will contain a DataFrame containing the columns “month_num”, “month”, and “avg_sales_weekend” showing the average weekend sales (Saturday and Sunday) for each month in the specified time range.

Output :

month_num | month | avg_sales_weekend

Kode 3.

data17 = pd.DataFrame(df[(df['is_valid']==1) \
& (df['day'].isin(['Monday','Tuesday','Wednesday','Thusday','Friday']))\
& (df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31')]\
.groupby(by=["month_num","month"])["before_discount"].mean()\
.sort_values(ascending=False)\
.round()\
.reset_index(name='avg_sales_weekdays'))
data17

Output :

month_num | month | avg_sales_weekdays

Kode 4.

data18 = data16.merge(data17, left_on = 'month', right_on = 'month')
data18.sort_values(by='month_num_x',ascending=True, inplace=True)
data18 = data18[["month","avg_sales_weekend","avg_sales_weekdays"]]
data18

Here is an explanation of the steps done in the code:

df[(df[‘is_valid’] == 1) & (df[‘day’].isin([‘Monday’, ‘Tuesday’, ‘Wednesday’,

‘Thursday’, ‘Friday’])) & (df [‘order_date’] >= ‘2022–01–01’) & (df[‘order_date’] <= ‘2022–12–31’)] In the first step, we filter the data in the DataFrame df with the following conditions:

df[‘is_valid’] == 1 ensures only valid data will be considered.

df[‘day’].isin([‘Monday’, ‘Tuesday’, ‘Wednesday’, ‘Thursday’, ‘Friday’]) ensures only data with Mondays through Fridays will be considered.

df[‘order_date’] >= ‘2022–01–01’ ensures only data with a date after or on January 1, 2022 will be considered.

df[‘order_date’] <= ‘2022–12–31’ ensures only data with a date before or on December 31, 2022 will be considered.

.groupby(by=[“month_num”, “month”])[“before_discount”].mean() After filtering the data, we group it based on the “month_num” and “month” columns. Then, we select the “before_discount” column and calculate the average using the .mean() method.

.sort_values(ascending=False) After that, we sort the data based on the average value descending using the .sort_values() method with the parameter ascending=False.

.round() After getting the average, we round it to the nearest integer using the .round() method.
.reset_index(name=’avg_sales_weekdays’) Finally, we reset the DataFrame index and name the averaged column as “avg_sales_weekdays” using the .reset_index() method with parameter name=’avg_sales_weekdays’.

Then data17 will contain a DataFrame containing the columns “month_num”, “month”, and “avg_sales_weekdays” representing average sales on weekdays (Monday to Friday) for each month in the specified time range.

Output :

month_num | month | avg_sales_weekend | avg_sales_weekdays

Grafik :

data18.plot(x='month',
y=['avg_sales_weekend','avg_sales_weekdays'],
kind='bar',
grid = True,
xlabel = 'Category',
ylabel = 'Total',
figsize=(12,7),
rot = 90,
table = False,
secondary_y = False)

No 8 :

Average daily sales weekends (Saturday and Sunday) vs. average daily sales weekdays (Monday-Friday) for the entire 3 months.

#Menyimpan data dalam bentuk Pandas DataFrame
data19 = df[\
#Memfilter data dengan valid=1
(df['is_valid']==1) &\
(df['day'].isin(['Saturday','Sunday'])) &\
#Memfilter data pada transaksi selama 2022
((df['order_date'] >= '2022-10-01') & (df['order_date'] <= '2022-12-31'))]
#Menyimpan data dalam bentuk Pandas DataFrame
data20 = df[\
#Memfilter data dengan valid=1
(df['is_valid']==1) &\
(df['day'].isin(['Monday','Tuesday','Wednesday','Thusday','Friday'])) &\
#Memfilter data pada transaksi selama 2022
((df['order_date'] >= '2022-10-01') & (df['order_date'] <= '2022-12-31'))]
data21 = {\
'Periode':'Total 3 months',\
'Avg Weekend Sales': round(data19['before_discount'].mean(),2), \
'Avg Weekdays Sales': round(data20['before_discount'].mean(),2),\
'Diff (Value)': round(data19['profit'].mean() - data20['profit'].mean(),2),\
'Diff (%)': pd.Series(round(((data19['profit'].mean() - data20['profit'].mean())/data19['profit'].mean())*100,2), dtype=str)+'%'
}
pd.DataFrame(data=data21, index=[0])

Here is an explanation of the steps done in the code:

In the first part of the code, we perform data filtering in the DataFrame df with the following conditions:

(df[‘is_valid’]==1) ensures only valid data will be considered.

(df[‘day’].isin([‘Saturday’,’Sunday’])) ensures only data with Saturday or Sunday will be considered.

((df[‘order_date’] >= ‘2022–10–01’) & (df[‘order_date’] <= ‘2022–12–31’)) filters data by date range, i.e. starting from 1st October 2022 until December 31, 2022.

Then, the data that has been filtered in the first step is stored in the variable data19.

In the second part of the code, we also perform data filtering in the DataFrame df with the following conditions:

(df[‘is_valid’]==1) ensures only valid data will be considered.

(df[‘day’].isin([‘Monday’,’ Tuesday’,’ Wednesday’, ‘Thursday’,’ Friday’])) ensures only data with Mondays through Fridays will be considered.

((df[‘order_date’] >= ‘2022–10–01’) & (df[‘order_date’] <= ‘2022–12–31’)) filters data based on the same date range.

The filtered data in the second step is stored in the data20 variable.

Next, we create a dictionary named data21 which contains the calculation result information. This dictionary has several key-value pairs, namely:
‘Period’: ‘Total 3 months’ indicating the period under consideration.

‘Avg Weekend Sales’: round(data19[‘before_discount’].mean(),2) which is the average weekend sales (Saturday and Sunday) in the specified time range.

‘Avg Weekdays Sales’: round(data20[‘before_discount’].mean(),2) which is the average weekday’s sales (Monday to Friday) in the specified time range.

‘Diff (Value)’: round(data19[‘profit’].mean() — data20[‘profit’].mean(),2) which is the difference in value between the weekend’s average profit and the day’s average profit work within the specified timeframe.

‘Diff (%)’: pd.Series(round(((data19[‘profit’].mean() — data20[‘profit’].mean())/data19[‘profit’].mean())* 100,2), dtype=str)+’%’ which is the percentage difference between the average weekend profit and the average weekday profit in the specified time range.

Finally, we use pd.DataFrame() to create a new DataFrame from the data21 dictionary. This data will have one row with an index of 0.

So, data21 will contain a data frame with information on average weekend sales, average weekday sales, differences in profit values, and differences.

Output :

Periode | Avg Weeken Sales | Avg Weekdays Sales | Diff ( Value ) Diff (%)

Soal 9 & 10 :

Dear Data Analyst,

To know a customer’s purchasing power, we need data:
AOV (Average Order Value or Total sales / Total Order) every month during 2021 vs 2022.
AOV 2021 vs 2022.
For the assistance provided, we thank you

Regards

Tim Campaign

No 9 :

data22 = pd.DataFrame(df[(df['is_valid']==1) & ((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
.groupby(by=['month_num','month'])["before_discount"].sum()\
.round()\
.reset_index(name='sales_2022'))\
.sort_values(by=['month_num'],ascending=True)\
.head(12)
data22

Here is an explanation of the steps done in the code:

df[(df[‘is_valid’]==1) & ((df[‘order_date’] >= ‘2022–01–01’) & (df[‘order_date’] <= ‘2022–12–31’) )] In the first step, we perform data filtering in the DataFrame df with the following conditions:

(df[‘is_valid’]==1) ensures only valid data will be considered.

((df[‘order_date’] >= ‘2022–01–01’) & (df[‘order_date’] <= ‘2022–12–31’)) filters data by date range, i.e. starting from January 1, 2022 until December 31, 2022.

.groupby(by=[‘month_num’,’month’])[“before_discount”].sum() After filtering the data, we group it based on the “month_num” and “month” columns. Then, we select the “before_discount” column and sum it using the .sum() method.

.round() After getting the total sales, we round it to the nearest whole number using the .round() method.

.reset_index(name=’sales_2022') We reset the DataFrame index and name the total sales calculated column as “sales_2022” using the .reset_index() method with parameter name=’sales_2022'.

.sort_values(by=[‘month_num’],ascending=True) Next, we sort the data based on the column value “month_num” ascending using the .sort_values() method with parameters by=[‘month_num’] and ascending=True .

.head(12) Finally, we limit the DataFrame to display only the first 12 rows using the .head(12) method.

Then, data22 will contain a DataFrame containing the columns “month_num”, “month”, and “sales_2022” showing the total sales for each month from January 1, 2022 to December 31, 2022, sorted by month with the smallest “month_num” value and only displays the first 12 rows.

Output :

month_num | month | sales_2022
data23 = pd.DataFrame(df[(df['is_valid']==1) & ((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
.groupby(by=['month_num','month'])["id"].nunique()\
.round()\
.reset_index(name='order_2022'))\
.sort_values(by=['month_num'],ascending=True)\
.head(12)
data23

df[(df[‘is_valid’]==1) & ((df[‘order_date’] >= ‘2022–01–01’) & (df[‘order_date’] <= ‘2022–12–31’) )] In the first step, we perform data filtering in the DataFrame df with the following conditions:

(df[‘is_valid’]==1) ensures only valid data will be considered.

((df[‘order_date’] >= ‘2022–01–01’) & (df[‘order_date’] <= ‘2022–12–31’)) filters data by date range, i.e. starting from January 1, 2022, until December 31, 2022.

.groupby(by=[‘month_num’, ‘month’])[“id”].unique () After filtering the data, we group it based on the “month_num” and “month” columns. Then, we select the column “id” (probably the column that stores the identity of the transaction or order) and calculate the number of unique values using the .nunique() method.

.round() After getting the number of unique orders, we round it to the nearest integer using the .round() method.

.reset_index(name=’order_2022') We reset the DataFrame’s index and name the unique order quantity calculated column as “order_2022” using the .reset_index() method with parameter name=’order_2022'.

.sort_values(by=[‘month_num’],ascending=True) Next, we sort the data based on the column value “month_num” ascending using the .sort_values() method with parameters by=[‘month_num’] and ascending=True.

.head(12) Finally, we limit the DataFrame to display only the first 12 rows using the .head(12) method.

Then data23 will contain a DataFrame containing the columns “month_num”, “month” and “order_2022” representing the number of unique orders for each month from 1 January 2022 to 31 December 2022, sorted by month with the smallest value “month_num”, and only shows the first 12 rows.

Output :

month_num | month | order_2022
data24 = pd.DataFrame(df[(df['is_valid']==1) & ((df['order_date'] >= '2021-01-01') & (df['order_date'] <= '2021-12-31'))]\
.groupby(by=['month_num','month'])["before_discount"].sum()\
.round()\
.reset_index(name='sales_2021'))\
.sort_values(by=['month_num'],ascending=True)\
.head(12)
data24

df[(df[‘is_valid’]==1) & ((df[‘order_date’] >= ‘2021–01–01’) & (df[‘order_date’] <= ‘2021–12–31’) )] In the first step, we perform data filtering in the DataFrame df with the following conditions:

(df[‘is_valid’]==1) ensures only valid data will be considered.

((df[‘order_date’] >= ‘2021–01–01’) & (df[‘order_date’] <= ‘2021–12–31’)) filters data by date range, i.e. starting from January 1, 2021, until December 31, 2021.

.groupby(by=[‘month_num’,’month’])[“before_discount”].sum() After filtering the data, we group it based on the “month_num” and “month” columns. Then, we select the “before_discount” column and sum it using the .sum() method.

.round() After getting the total sales, we round it to the nearest whole number using the .round() method.

.reset_index(name=’sales_2021') We reset the DataFrame index and name the total sales calculated column as “sales_2021” using the .reset_index() method with parameter name=’sales_2021'.

.sort_values(by=[‘month_num’],ascending=True) Next, we sort the data based on the column value “month_num” ascending using the .sort_values() method with parameters by=[‘month_num’] and ascending=True.

.head(12) Finally, we limit the DataFrame to display only the first 12 rows using the .head(12) method.

Then data24 will contain a DataFrame containing the columns “month_num”, “month”, and “sales_2021” showing the total sales for each month from January 1, 2021, to December 31, 2021, ordered by month with the smallest “month_num” value, and only shows the first 12 rows.

Output :

month_num | month | sales_2021
data25 = pd.DataFrame(df[(df['is_valid']==1) & ((df['order_date'] >= '2021-01-01') & (df['order_date'] <= '2021-12-31'))]\
.groupby(by=['month_num','month'])["id"].nunique()\
.round()\
.reset_index(name='order_2021'))\
.sort_values(by=['month_num'],ascending=True)\
.head(12)
data25

df[(df[‘is_valid’]==1) & ((df[‘order_date’] >= ‘2021–01–01’) & (df[‘order_date’] <= ‘2021–12–31’) )] In the first step, we perform data filtering in the DataFrame df with the following conditions:

(df[‘is_valid’]==1) ensures only valid data will be considered.

((df[‘order_date’] >= ‘2021–01–01’) & (df[‘order_date’] <= ‘2021–12–31’)) filters data by date range, i.e. starting from January 1, 2021, until December 31, 2021.

.groupby(by=[‘month_num’,’month’])[“id”].unique () After filtering the data, we group it based on the “month_num” and “month” columns. Then, we select the column “id” (probably the column that stores the identity of the transaction or order) and calculate the number of unique values using the .nunique() method.

.round() After getting the number of unique orders, we round it to the nearest integer using the .round() method.

.reset_index(name=’order_2021') We reset the DataFrame index and name the calculated unique order quantity column as “order_2021” using the .reset_index() method with parameter name=’order_2021'.

.sort_values(by=[‘month_num’],ascending=True) Next, we sort the data based on the column value “month_num” ascending using the .sort_values() method with parameters by=[‘month_num’] and ascending=True.

.head(12) Finally, we limit the DataFrame to display only the first 12 rows using the .head(12) method.

Then data25 will contain a DataFrame containing the columns “month_num”, “month” and “order_2021” representing the number of unique orders for each month from 1 January 2021 to 31 December 2021, sorted by month with the smallest value “month_num”, and only shows the first 12 rows.

Output :

month_num | month | order_2021
from sqlite3 import connect
conn = connect(':memory:')
data22.to_sql('sales_2022', conn, index=False, if_exists='replace')
data23.to_sql('order_2022', conn, index=False, if_exists='replace')
data24.to_sql('sales_2021', conn, index=False, if_exists='replace')
data25.to_sql('order_2021', conn, index=False, if_exists='replace')
data26 = pd.read_sql("""
SELECT
sales_2021.*,
order_2021.order_2021,
sales_2022.sales_2022,
order_2022.order_2022
FROM sales_2022
LEFT JOIN order_2022
on order_2022.month_num = sales_2022.month_num
LEFT JOIN sales_2021
on sales_2021.month_num = sales_2022.month_num
LEFT JOIN order_2021
on order_2021.month_num = sales_2022.month_num
""", conn)
data26

Here is an explanation of the steps done in the code:

from sqlite3 import connect The code imports the connect module from the sqlite3 library to connect to an SQLite database.

conn = connect(‘:memory:’) The code establishes a connection to the in-memory (:memory:) SQLite database using the connect() function.

data22.to_sql(‘sales_2022’, conn, index=False, if_exists=’replace’) Data from DataFrame data22 is stored in a table named ‘sales_2022’ in SQLite database using .to_sql() method. The parameter index=False ignores the index when saving to the table, and if_exists=’replace’ will replace the table if it already exists with the same name.

data23.to_sql(‘order_2022’, conn, index=False, if_exists=’replace’) The data from the data23 DataFrame is stored in a table named ‘order_2022’ in the SQLite database in the same way as the previous step.

data24.to_sql(‘sales_2021’, conn, index=False, if_exists=’replace’) The data from the data24 DataFrame is stored in a table named ‘sales_2021’ in the SQLite database in the same way as the previous step.

data25.to_sql(‘order_2021’, conn, index=False, if_exists=’replace’) The data from the data25 DataFrame is stored in a table named ‘order_2021’ in the SQLite database in the same way as the previous step.

data26 = pd.read_sql(“SELECT …”, conn) The code runs an SQL query that concatenates the tables ‘sales_2022’, ‘order_2022’, ‘sales_2021’, and ‘order_2021’ in an SQLite database. The query fetches all the columns from the ‘sales_2021’ table, and also the ‘order_2021.order_2021’, ‘sales_2022.sales_2022’, and ‘order_2022.order_2022’ columns. Joins are performed using a JOIN operation on the column ‘month_num’ between the tables. The query results are retrieved and read into a new DataFrame which is stored in the data26 variable.

So, data26 will contain a DataFrame which is the result of merging the tables ‘sales_2022’, ‘order_2022’, ‘sales_2021’, and ‘order_2021’ in the SQLite database, according to the columns selected in the SQL query.

Output :

data26['AOV 2021'] = round(data26['sales_2021']/data26['order_2021'],2)
data26['AOV 2022'] = round(data26['sales_2022']/data26['order_2022'],2)
data26['Diff AOV (value)']= data26['AOV 2022']-data26['AOV 2021']
data26['Diff AOV (%)'] = round((data26['AOV 2022']-data26['AOV 2021'])/data26['AOV 2021']*100,2)
data26

data26[‘AOV 2021’] = round(data26[‘sales_2021’]/data26[‘order_2021’],2) This code calculates the average order value (AOV) for 2021. In column ‘AOV 2021’ ‘, we divide the total sales of ‘sales_2021’ by the number of orders of ‘order_2021’. The resulting value is rounded to two decimal places using the .round() method. The calculation results are stored in a new column ‘AOV 2021’ in DataFrame data26.

data26[‘AOV 2022’] = round(data26[‘sales_2022’]/data26[‘order_2022’],2) This code calculates the average order value (AOV) for the year 2022. In column ‘AOV 2022 ‘, we divide the total sales of ‘sales_2022’ by the number of orders ‘order_2022’. The resulting value is rounded to two decimal places using the .round() method. The calculation results are stored in a new column ‘AOV 2022’ in DataFrame data26.

data26[‘Diff AOV (value)’] = data26[‘AOV 2022’]-data26[‘AOV 2021’] This code calculates the difference in AOV values between 2022 and 2021. In the ‘Diff AOV (value)’ column, we subtract the 2021 AOV value (‘AOV 2021’) from the 2022 AOV value (‘AOV 2022’). The calculation results are stored in a new column ‘Diff AOV (value)’ in DataFrame data26.

data26[‘Diff AOV (%)’] = round((data26[‘AOV 2022’]-data26[‘AOV 2021’])/data26[‘AOV 2021’]*100,2) This code calculates the difference in the percentage of AOV values between 2022 and 2021. In the ‘Diff AOV (%)’ column, we calculate the percentage difference by dividing the difference between the AOV values (‘AOV 2022’ — ‘AOV 2021’) by the AOV values of 2021 (‘AOV 2021’), then multiplied by 100. The calculation result is rounded to two decimal places using the .round() method. The calculation results are stored in a new column ‘Diff AOV (%)’ in DataFrame data26.

Thus, data26 will contain a DataFrame to which four new columns have been added: ‘AOV 2021’, ‘AOV 2022’, ‘Diff AOV (value)’, and ‘Diff AOV (%)’, each representing the average of the order values (AOV) for 2021 and 2022, the difference in the AOV value between 2022 and 2021, as well as the percentage difference in the AOV value between 2022 and 2021.

Output :

Grafik :

data26.plot(x='month',
y=['AOV 2021','AOV 2022'],
kind='bar',
grid = True,
xlabel = 'Month',
ylabel = 'AOV',
figsize=(12,7),
rot = 90,
table = False,
secondary_y = False)

No 10 :

kode 1.

aov_2021 = round(data26['sales_2021'].sum()/data26['order_2021'].sum(),2)
aov_2022 = round(data26['sales_2022'].sum()/data26['order_2022'].sum(),2)
data27 = {\
'Periode':'Total',\
'AOV 2021': aov_2021, \
'AOV 2022': aov_2022, \
'Growth (value)': aov_2022-aov_2021,\
'Growth': pd.Series(round((aov_2022-aov_2021)/aov_2021*100,2), dtype=str)+'%'
}
data28 = pd.DataFrame(data=data27, index=[0])
data28

Here is an explanation of the steps done in the code:

aov_2021 = round(data26[‘sales_2021’].sum()/data26[‘order_2021’].sum(),2) This code calculates the total Average Order Value (AOV) for 2021. We sum total sales of ‘sales_2021’ from all months and divide by the number of orders ‘order_2021’ from all months. The calculation results are rounded to two decimal places using the .round() method. The 2021 AOV value is stored in the aov_2021 variable.

aov_2022 = round(data26[‘sales_2022’].sum()/data26[‘order_2022’].sum(),2) This code calculates the average total order value (AOV) for 2022. We add up total sales of ‘sales_2022’ from all months and divide by the number of orders ‘order_2022’ from all months. The calculation results are rounded to two decimal places using the .round() method. The 2022 AOV value is stored in the aov_2022 variable.

data27 = {…} This code creates a data27 dictionary containing information on the total AOV for 2021 and 2022, as well as the growth in AOV values between the two years. This dictionary will be used to create a DataFrame data28.
data28 = pd.DataFrame(data=data27, index=[0]) This code creates a DataFrame data28 from dictionary data27. This DataFrame will have one row with index [0] and the columns corresponding to the keys in the data27 dictionary. The dataFrame data28 will contain information on the total AOV for 2021 and 2022, as well as the growth in AOV value between the two years.

Thus, data28 will contain a DataFrame that presents information on the total AOV for 2021 and 2022, as well as the growth in the AOV value between the two years. This DataFrame has one row with index [0] and columns ‘Period’, ‘AOV 2021’, ‘AOV 2022’, ‘Growth (value)’ and ‘Growth’, which respectively represent the period, year AOV value 2021, AOV value in 2022, AOV value growth (in value), and AOV value growth (in percentage).

Output :

Grafik :

data28.plot(x='Periode',
y=['AOV 2021','AOV 2022'],
kind='bar',
grid = True,
xlabel = 'Month',
ylabel = 'AOV',
figsize=(12,7),
rot = 90,
table = False,
secondary_y = False)

Output :

This is the end of the explanation regarding the training assignments given by MySkill to hone skills in using Python to analyze data. Thank you, I hope every explanation is useful.

“Rich library ecosystem: Python has various libraries specially designed for data science, such as NumPy, Pandas, Matplotlib, and Scikit-learn. These libraries provide powerful functions for manipulating data, analyzing data, creating visualizations, and training a machine learning model.”

Good job guys 😊

--

--