Python for Data Analysis

Tesa Sekar Ayu Islami
7 min readNov 21, 2022

--

Today, I’m going to analyze data with python. What is python? Python is a computer programming language often used to build websites and software, automate tasks, and conduct data analysis. Python is a general-purpose language, meaning it can be used to create a variety of different programs and isn’t specialized for any specific problems. This versatility, along with its beginner-friendliness, has made it one of the most-used programming languages today.

In python there are some library. A Python library is a collection of related modules. It contains bundles of code that can be used repeatedly in different programs. It makes Python Programming simpler and convenient for the programmer. As we don’t need to write the same code again and again for different programs.

Python

I will analyze data in python using Google Colab. Colab is a product from Google Research. It’s free and easy to used by beginner in programming language. Then, I will use hotel.csv dataset and the python libraries I use are numpy, pandas, matplotlib, and seaborn.

Let’s begin!

IMPORT LIBRARY AND DATA

Import the required library.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
sns.set()
  • Import pandas as pd. Pandas is useful for eases data analysis, data manipulation, and cleaning of data. We use pandas and shorten it to pd.
  • Import numpy as np. The name “Numpy” stands for “Numerical Python” . It consists of in-built mathematical functions for easy computations. We use numpy and shorten it to np.
  • Import matplotlib.pyplot as plt. This library is responsible for plotting numerical data. We use matplotlib and shorten it to plt.
  • Import seaborn as sns. Seaborn is a Python data visualization library based on matplotlib. Using sns.set() to apply seaborn theme, scaling, and color pallete.
  • Import re. A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression.

Then, import the dataset. I’ll import the dataset from a link that given by MySkill.

df_hotels = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
df_hotels.head()
Output Dataset

DATA ANALYSIS

Before starting, we will doing some check.

Provide id column as unique identifier.

df_hotels = df_hotels.reset_index().rename(columns={'index':'id'})
df_hotels.head()

Columns check

df_hotels.columns

Data type check

df_hotels.dtypes

Total NULL values check

df_hotels.isna().sum(axis=0)

Total unique values check

df_hotels.nunique(axis=0)
round(100*(df_hotels.isnull().sum()/len(df_hotels.index)),2)

Question 1

Create a function with:

  • 1 argument in the form of a dataframe to check the data type,
  • To check the number of null values,
  • To check percent null value,
  • And the number of unique values ​​for each column in a dataframe.

Solution :

def check_values(df):
data = []
for col in df.columns:
data.append([col,
df[col].dtypes,
df[col].isna().sum(axis=0),
round(100*(df[col].isnull().sum()/len(df.index)),2),
df[col].nunique()
])
return pd.DataFrame(columns=['ColumnName','DataType','null','nulPct','UniqueVal'],data=data)

Above, we try to make code in the term of dataframe to check datatype, then check the number of null values with isna(), then check the percent of null values with round(100*(df[col].isnull().sum()/len(df.index)),2), and last check the number of unique values in dataframe. The output below.

check_values(df_hotels)
Output of Question 1

Question 2

How many visitors have canceled reservations and those who have not? And from that amount make a conclusion about the proportion of each.

Solution :

First, we will count the canceled reservations.

df_hotels.is_canceled.value_counts()
Ouput of Question 2 (1)

— From the output above, 0 = not canceled and 1 = canceled. So, the total of canceled reservations is 44224.

Second, we will calculate the percentage of canceled and not canceled in the below.

df_hotels.is_canceled.value_counts(normalize=True)*100
Output of Question 2 (2)

— The result is, not canceled is about 63% and the canceled is 37% from the total. Below, we will compare canceled reservations and not canceled with vertical bar chart.

sns.countplot(data=df_hotels, x='is_canceled')
plt.title('Cancelled or not')
plt.show()
Bar Chart (Question 2)

— The blue one is not canceled reservations, and the orange one is canceled reservations. It turned out that there were quite a number of visitors who canceled their reservations, up to 37% of all visitors.

Question 3

A. In “City Hotel”, what percentage of canceled reservations?

Solution :

df_hotels[df_hotels['hotel'] == 'City Hotel'].is_canceled.value_counts(normalize=True).tail(1)*100
Output of Question 3 (A)

— Above, we calculate the percentage of canceled reservations in City Hotel. It turned out that the canceled reservations is 42%.

B. In “Resort Hotel”, what percentage of canceled reservations?

Solution :

df_hotels[df_hotels['hotel'] == 'Resort Hotel'].is_canceled.value_counts(normalize=True).tail(1)*100
Output of Question 3 (B)

— Above, we calculate the percentage of canceled reservations in Resort Hotel. It turned out that the canceled reservations is 28%.

C. At which hotel has the greater proportion of canceled reservations?

Solution :

round(100*(len(df_hotels[(df_hotels['hotel'] == 'City Hotel') & (df_hotels['is_canceled'] == 1)])/len(df_hotels)),2)
round(100*(len(df_hotels[(df_hotels['hotel'] == 'Resort Hotel') & (df_hotels['is_canceled'] == 1)])/len(df_hotels)),2)
Output of Question 3 (C)

— The greater proportion of canceled reservations is City Hotel.

Question 4

Do a filter so that it only displays visitor data that does not cancel reservations. And store the result in the ‘df_checkout’ variable.

Solution :

We need to display ‘not canceled reservations’ put in df_checkout. So, we use df_hotel where ‘is_canceled’ = 0.

df_checkout = df_hotels[df_hotels.is_canceled==0]

Question 5

A. Show the number of reservations per arrival month for each type of hotel.

Solution :

df_checkout.groupby(['hotel','arrival_date_month']).size()
Output of Question 5 (A)

B. What month do the most reservations for each hotel? Give an conclusion, is the trend the same in both hotels?

Solution :

print('City Hotel')
display(df_checkout[df_checkout.hotel == 'City Hotel']['arrival_date_month'].value_counts().head(1))
print('Resort Hotel')
display(df_checkout[df_checkout.hotel == 'Resort Hotel']['arrival_date_month'].value_counts().head(1))
Output of Question 5 (B)

— The both have the same trend, the most reservations are in August.

C. Same as point B but with the name of the month that has been mapped to the month in numbers.

Solution :

We will mapping the name of the months to the month in numbers.

import calendar

month_dict = {month: index for index, month in enumerate(calendar.month_name) if month}
month_dict

df_checkout['arrival_date_month_num'] = df_checkout['arrival_date_month'].map(month_dict)
df_checkout

Then, we compare the arrival date month of City Hotel and Resort Hotel with vertical bar chart.

sns.countplot(data=df_checkout, x='arrival_date_month_num')
sns.countplot(data=df_checkout, x='arrival_date_month_num', hue='hotel')
plt.xlabel = 'Bulan Kedatangan'
plt.ylabel = 'Jumlah Reservasi'
plt.show()
Bar Chart of Question 5 (C)

— It can be seen that City Hotel tend to receive more reservations than Resort Hotel.

Question 6

A. Create a new column called arrival_date that contains full info about the year, month, and date of arrival.

Solution :

Step 1 : Converts the arrival_date_month_num column to a string.

df_checkout.arrival_date_month_num.astype('str')

Step 2 : Adds zero padding, so that ‘1’ becomes ‘01’ → step 2 is chained with step 1.

df_checkout.arrival_date_month_num.astype('str').str.pad(2, fillchar='0')

Step 3 : Combines year-month-day.

df_checkout['arrival_date'] = \
df_checkout.arrival_date_year.astype('str')+'-'+\
df_checkout.arrival_date_month_num.astype('str').str.pad(2, fillchar='0')+'-'+\
df_checkout.arrival_date_day_of_month.astype('str').str.pad(2, fillchar='0')

B. Change the column to datetime type.

Solution :

df_checkout['arrival_date'] = pd.to_datetime(df_checkout['arrival_date'])
df_checkout['arrival_date']
Output of Question 6
df_checkout.arrival_date.dtype

Question 7

A. Total daily reservation (df_daily_reservation)

Solution :

df_reservasi_perhari = df_checkout.resample('D',
on='arrival_date').size().reset_index().rename(columns={0:'total_reservasi'})
df_reservasi_perhari
Output of Question 7 (A)

— Above is total of reservations per day. Then we will make a lineplot/line chart with sns below, and set the title to ‘Daily Reservations’.

plt.figure(figsize=(10,4))
sns.lineplot(data=df_reservasi_perhari, x='arrival_date', y='total_reservasi')
plt.title('Daily Reservations', fontsize='x-large')
plt.show()
Lineplot of Question 7 (A)

B. Average daily reservation in each week

Solution :

df_avg_reservasi_harian = \
df_checkout.resample('D', on='arrival_date').size().reset_index().rename(columns=
{0:'total_reservasi'}).resample('W', on='arrival_date')['total_reservasi'].mean().reset_index()
df_avg_reservasi_harian
Output of Question 7 (B)

— Above is average daily reservations in each week. Then we will make a lineplot/line chart with sns below, and set the title to ‘Average Daily Reservations’.

plt.figure(figsize=(10,4))
sns.lineplot(data=df_avg_reservasi_harian, x='arrival_date', y='total_reservasi')
plt.title('Average Daily Reservations', fontsize='x-large')
plt.show()
Lineplot of Question 7 (B)

C. What is the difference between the data shown by df_reservasi_perhari and df_avg_reservasi_harian?

The ‘Daily Reservations’ shows total reservations per day, but in ‘Average Daily Reservations’ shows average of daily per week. If we want to see which days have a lot of reservations, we use ‘Daily Reservations’, and if we want to see patterns per week, we can use ‘Average Daily Reservations’.

Question 8

A. What is the average ADR based on the type of hotel and type of customer?

Solution :

df_checkout.groupby(['hotel','customer_type'])['adr'].mean()
Output of Question 8 (A)

B. Which type of customer has the largest ADR in each type of hotel.

Solution :

We will make boxplot with sns below.

plt.figure(figsize=(10,8))
sns.boxplot(data=df_checkout, x='adr', y='hotel',hue='customer_type')
plt.show()
Boxplot for Question 8 (B)

— Both City Hotel and Resort Hotel have the same largest ADR in ‘Transient’ Costumer Type.

So, all questions have been answered. Thanks for reading and let’s move to last part : Data Visualization with Google Data Studio.

THANK YOU!

References :

Cousera. 2022. What Is Python Used For? A Beginner’s Guide. Accessed at : https://www.coursera.org/articles/what-is-python-used-for-a-beginners-guide-to-using-python

parthmanchanda81. 2021. Libraries in Python. Accessed at : https://www.geeksforgeeks.org/libraries-in-python/

--

--