Final Portfolio Project Python for Data Analysis

Megah Tauchid Ridla Wijaya
11 min readAug 17, 2023

This post is related to my previous post. You check this link: https://medium.com/@megahtauchid/final-portfolio-project-sql-for-data-analysis-b80aeb4fbfea

Hi everyone, I hope you are doing well and always surrounded by positive thoughts. In my previous post, I wrote about the portfolio project in SQL. This time, I will discuss the next data analysis project using Python. However, for easier access, I use Google Colaboratory to solve the Python-related problems below.

Previously, what is Python? Python is the most easy-to-understand programming language. Python was created by a Dutch programmer named Guido Van Rossum. Python is a programming language that is widely used by developers in the field of data science, software development, web applications, and Machine Learning. Python is free and integrated with all types of systems and can improve development speed.

Then, the advantages of using Python include:

  1. The syntax is simple and easy to understand, and easy to learn.
  2. Easy to use in the product development process
  3. Supports the Internet of Things or the Internet of Things
  4. Very flexible and can be integrated with various programming languages
  5. Improves productivity with many resources and clean object-oriented design
  6. Can be downloaded for free without buying a license because it is open source and free.

Okay, let’s now move on to the problems in this final project. Before doing the problem, make sure that the following libraries are already imported into Google Collabs.

Next, import the required data. The data used is the same as the data in the SQL project work in the previous post. If you want to see the data, you can access it at the following link.

This data contains four tables: order_detail table, payment_detail table, customer_detail table, and sku_detail table. The following is the syntax used:

Next, we need to display the original data to make sure that the data has been entered. We can use the syntax df_namatabel.head(). The number of rows that we want to display is in the parentheses, but by default, 5 rows will be displayed. It is also important to note that the index counting in Python starts from the number 0 (zero).

Next, in this work we will use sqlite.online, so we need to connect to sqlite.online first with the following syntax:

Next, we try to merge all the data in each table and display the data using the following query:

Before proceeding to question 1, we need to make sure that the data type of each column in the table is correct. We can check this by using the syntax df.dtypes. It appears that there are some data types on the column that are not correct. The columns order_date and registered_date should be using the data type datetime, while the columns before_discount, discount_amount, and after_discount should be using the data type integer.

To do this, we need to change the data type as follows:

Question 1

Dear Data Analyst

At the end of this year, the company will provide rewards for customers who win the Year-End Festival competition. The Marketing Team requires assistance in determining the estimated prizes that will be given to the competition winners in the future. These prizes will be selected from the TOP 5 Products in the Mobiles & Tablets Category during the year 2022, based on the highest quantity of (valid = 1) sales.

We kindly request your help in sending this data before the end of this month to the Marketing Team. We express our gratitude for the assistance provided.

Regards,

Marketing Team.

# Write te code below, you can write more than 1 code.
data1 = pd.DataFrame(\
#Filter data that have been paid (is_valid =1)
df[(df['is_valid']==1) &\
#Filter data 'category' = Mobiles & Tablets
(df['category']=='Mobiles & Tablets') &\
#Filter transaction date in 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
#groupby data
.groupby(by=["sku_name"])["qty_ordered"]\
#group by based on sum
.sum()\
#Limit 5
.head(5)\
#sorting Data
.sort_values(ascending=False)\
#Reset header name
.reset_index(name='qty_2022'))
data1

From the available data, it can be seen that the prize that can be given to customers at the Year-End Festival competition is IDROID_BALRX7-Gold.

We can also change it into a graph as follows:

#Graphic Plot
data1.sort_values(\
by=['qty_2022'], \
ascending=True,\
inplace=True)

#Graphic
data1.plot(x='sku_name',
y=['qty_2022'],
kind='barh',
grid= True,
xlabel = 'Quantity',
ylabel = 'Product Name',
figsize= (15,6),
rot = 0,
title = 'TOP 5 PRODUCT',
table = False,
sort_columns = False,
secondary_y = False)

Question 2

Dear Data Analyst,

Following the combined meeting of the Warehouse Team and the Marketing Team, we have found that there is still a significant amount of stock availability for products categorized under “Others” at the end of the past year, 2022.

We kindly request your assistance in cross-referencing the sales data for this category with the sales data for the year 2021, based on quantity sold. Our preliminary suspicion is that there has been a decrease in sales quantity in 2022 compared to 2021. (Please also provide data for the 15th category.)

If indeed there has been a decline in sales quantity for the “Others” category, we kindly ask for your help in providing the data for the top 20 product names that experienced the highest decrease in 2022 when compared to 2021. We intend to use this information as a topic of discussion in our upcoming meeting.

Kindly ensure that the data is sent no later than 4 days from today. We sincerely appreciate the assistance provided.

Regards,

Warehouse Team

Answer 2.1

For question 2.1., we can break it down as follows. Assume that the data is named “data2” as the sales data that is being searched for in 2021.

# Write te code below, you can write more than 1 code.
# We will see the sales data in the "Others" category for the year 2021 first.
data2 = pd.DataFrame(\
#Filter data that have been paid (is_valid =1)
df[(df['is_valid']==1) &\
#filtering transation data in 2021
((df['order_date'] >= '2021-01-01') & (df['order_date'] <= '2021-12-31'))]\
#groupby data
.groupby(by=['category'])["qty_ordered"]\
#groupby based on sum
.sum()\
#sorting Data
.sort_values(ascending=False)\
#Reset header name
.reset_index(name='qty_2021'))
data2

For question number 2.1, we can list it as follows. Assume that this data is named “data3” as the sales data that is being searched for in 2022.

#Then, we will see the sales data in the "Others" category for the year 2022
data3 = pd.DataFrame(\
#Filter data that have been paid (is_valid =1)
df[(df['is_valid']==1) &\
#filtering transation data in 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
#groupby data
.groupby(by=['category'])["qty_ordered"]\
#groupby based on sum
.sum()\
#sorting Data
.sort_values(ascending=False)\
#Reset header name
.reset_index(name='qty_2022'))
data3

Next, we need to merge the two data sets. Assume that data4 is the combination of data2+data3. To merge, we use the “merge” code.

data4 = data2.merge(data3, left_on = 'category', right_on='category')
data4

It appears that there are a few categories that have experienced a decrease in sales, with the largest being in the “Others” category. This is in line with the predictions made by the Warehouse and Marketing Teams.

Answer 2.2

Let’s continue to question 2.2 with the same steps as above

First, let’s find the sales data for the “Others” category in 2021, assuming that it is “data5”.

# We want look for sales of category "Others" in 2021
data5 = pd.DataFrame(\
#Filter data that have been paid (is_valid =1)
df[(df['is_valid']==1) &\
#Filter Category "Others" because have decrease
(df['category']=='Others')&\
#Filter transaction data in 2021
((df['order_date'] >= '2021-01-01') & (df['order_date'] <= '2021-12-31'))]\
#Groupby data
.groupby(by=['sku_name'])["qty_ordered"]\
#group by sum
.sum()\
#sorting Data
.sort_values(ascending=False)\
#Reset header name
.reset_index(name='qty_2021'))
data5

Then in 2022 (“data6”)

# We want look for sales of category "Others" in 2022
data6 = pd.DataFrame(\
#Filter data that have been paid (is_valid =1)
df[(df['is_valid']==1) &\
#Filter Category "Others" because have decrease
(df['category']=='Others')&\
#Filter transaction data in 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]\
#groupby data
.groupby(by=['sku_name'])["qty_ordered"]\
#group by based on sum
.sum()\
#sorting Data
.sort_values(ascending=False)\
#Reset header name
.reset_index(name='qty_2022'))
data6

Then merge data5+data6 into “data7” to display the results of 20 products from the “Others” category that experienced a decrease in sales in 2022, in accordance with the data requested by the Warehouse and Marketing Teams.

#Merge data
data7 = data5.merge(data6, left_on = 'sku_name', right_on = 'sku_name')
data7['qty_growth']=data7['qty_2022']-data7['qty_2021']
data7.sort_values(by=['qty_growth'],ascending=True,inplace=True)
data7 = data7.head(20)
data7

Question 3

Dear Data Analyst,

In relation to the upcoming company anniversary in 2 months, the Digital Marketing Team is planning to provide promotional information to our customers by the end of this month. We require specific customer data to facilitate this initiative. We are interested in customers who have completed the check-out process but have not yet made a payment (is_gross = 1) during the year 2022. The essential data we need includes the Customer ID and Registered Date.

We kindly request your assistance in providing this data before the end of this month to facilitate the efforts of the Digital Marketing Team. We deeply appreciate the support you provide.

Best regards,

Digital Marketing Team

Answer 3

The problem can be solved with the following code. Because the promotion will be used for customers who have checked-out but have not paid, then on the order_detail table we use several columns, namely is_gross = 1, is_valid = 0 and is_nett = 0 which contains data of customers who have not completed their transactions. (”data8”).

data8 = df[\
#Filter data is_gross=1 "haven't made a payment yet"
(df['is_gross']==1) &\
#Filter data with is_valid=0 "haven't made a payment yet"
(df['is_valid']==0) &\
#Filter data transaksi is_nett=0 "the transaction has not been completed"
(df['is_net']==0) &\
#filter data 'order_date' in 2022
((df['order_date'] >= '2022-01-01') & (df['order_date'] <= '2022-12-31'))]

data9 = data8[['customer_id','registered_date']]
data9

From the code “data8” above, we will only take two columns, namely customer_id and registered_date. This function is similar to Common Table Expression in SQL.

Because the user requested to send the file in CSV format, we need to download the file using the following code:

#Download the file
from google.colab import files
data9.to_csv('id_registered', encoding = 'utf-8-sig',index=False) #ganti [nama variabel file] dengan nama variabel yang digunakan
files.download('id_registered')

Question 4

Dear Data Analyst,

During the months of October to December 2022, we conducted campaigns every Saturday and Sunday. We intend to assess whether these campaigns had a significant impact on the increase in sales (before discounts). Kindly assist us in presenting the following data:

Average daily sales on weekends (Saturday and Sunday) vs. average daily sales on weekdays (Monday-Friday) for each month of that period. Was there a sales increase in each of those months?

Average daily sales on weekends (Saturday and Sunday) vs. average daily sales on weekdays (Monday-Friday) for the entire 3-month period.

Please provide this data by the latest next week. We express our gratitude for the assistance provided.

Regards,

Campaign Team

Answer 4.1

We need to add the day name by creating a new column in the table. The purpose is to make it easier to filter data by day.

#Create additional columns for day, month, month_num 
df['day']=df['order_date'].dt.day_name()
df['month']=df['order_date'].dt.month_name()
df['month_num']=df['order_date'].dt.month
df.head(5)

Next, we will filter the sales data on weekends from October-December 2022 (“data10”).

data10 = pd.DataFrame(df[(df['is_valid']==1)\
#filter for weekend
& (df['day'].isin(['Saturday','Sunday']))\
#filter from October-December 2022
& (df['order_date'] >= '2022-10-01') & (df['order_date'] <= '2022-12-31')]\
#groupby based on "month_num","month" dan average before_discount
.groupby(by=["month_num",'month'])["before_discount"].mean()\
#round 1 number after comma
.round()\
#sorting descending
.sort_values(ascending=False)\
#change last column name as 'average_sales_weekend'
.reset_index(name='average_sales_weekend'))
data10

Then, we will filter the sales data on weekdays from October-December 2022 (“data11”).

data11 = pd.DataFrame(df[(df['is_valid']==1)\
#filter day weekdays
& (df['day'].isin(['Monday','Tuesday','Wednesday','Thursday','Friday']))\
#filter from October-December 2022
& (df['order_date'] >= '2022-10-01') & (df['order_date'] <= '2022-12-31')]\
#groupby based on "month_num","month" dan average before_discount
.groupby(by=["month_num",'month'])["before_discount"].mean()\
#round 1 number after comma
.round()\
#sorting descending
.sort_values(ascending=False)\
#change last column name as 'average_sales_weekdays'
.reset_index(name='average_sales_weekdays'))
data11

Then, we need to combine data10+data11 = data12 to see the average sales per month based on weekdays and weekends.

data12 = data10.merge(data11, left_on = 'month', right_on = 'month')
data12.sort_values(by='month_num_x',ascending=True,inplace=True)
data12 = data12[["month","average_sales_weekend","average_sales_weekdays"]]
data12

To make it easier to understand the data, we will convert it into a bar chart.

#GRAPH
data12.plot(x='month',
y=['average_sales_weekend',"average_sales_weekdays"],
kind='bar',
grid= True,
xlabel = 'Category',
ylabel = 'Total',
figsize= (12,6),
rot = 30,
table = False,
secondary_y = False)

Answer 4.2

To answer it, we use the same steps as the previous question answer, which is to take the sales data from October-December 2022 and 2021, and then combine the data.

#weekend
data13 = df[\
#filter data have been paid
(df['is_valid']==1) &\
(df['day'].isin(['Saturday','Sunday'])) &\
((df['order_date'] >= '2022-10-01') & (df['order_date'] <= '2022-12-31'))]
#wekdays
data14 = df[\
#filter data have been paid
(df['is_valid']==1) &\
(df['day'].isin(['Monday','Tuesday','Wednesday','Thursday','Friday'])) &\
((df['order_date'] >= '2022-10-01') & (df['order_date'] <= '2022-12-31'))]
data15 ={\
'Periode' : 'Total 3 Bulan', \
'Average Weekend' : round(data13['before_discount'].mean(),2),\
'Average Weekdays' : round(data14['before_discount'].mean(),2),\
'Diff (Value)' : round(data13['before_discount'].mean() - data14['before_discount'].mean(),2),\
'Diff (%)' : pd.Series(round(((data13['before_discount'].mean() - data14['before_discount'].mean())/data13['before_discount'].mean())*100,2), dtype=str)+'%'
}
pd.DataFrame(data=data15, index=[0])

It is clear that the average sales on weekdays is higher than the average sales on weekends. This indicates that the campaigns conducted on weekdays are more capable of increasing sales at this store.

Conclusion:

Python can be used to analyze data. We can use available libraries such as pandas, numpy, or matplotlib, and others to display the data we want and to visualize it. In Python, we can also run queries like in SQL.

I hope you enjoyed reading this, and I hope it was helpful.

Thank you.

--

--