MySkill Data Analysis Part 2 Python using Google Colab

5 min readNov 23, 2022

Colaboratory, or “Colab” for short, is a product from Google Research. Colab allows anybody to write and execute arbitrary python code through the browser, and is especially well suited to machine learning, data analysis and education.

The tools studied as data analysts at MySkill next are Google Colab. In this exercise, the hotel booking demand dataset is used. This data article describes two data sets with two types of hotels, namely resort hotel and city hotel. The two datasets share the same structure, with 31 variables describing 40,060 resort hotel and 79,330 city hotel. Each observation represents a hotel booking. Both data sets understand orders that will arrive between July 1, 2015 and August 31, 2017, including effectively arrived orders and canceled orders. You can download the dataset in this link hotel booking.

If you want to know my exercise 2 in Google Colab , you can see in this Part 2 Python.

In this article, we will explain examples of cases using the dataset that has been owned. First import libraries are carried out before answering questions from the given case study.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
sns.set()

Then import the dataset into python

df_hotels = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-11/hotels.csv')
df_hotels.head()

In this dataset there is no unique identifier column, so for this exercise we will assume that each row is unique by providing the id column as a unique identifier.

df_hotels = df_hotels.reset_index().rename(columns={'index':'id'})
df_hotels.head()

Question 1:
Create a function with 1 argument in the form of a dataframe to:

check the data type,
check the number of null value,
check of percent null value,
check the number of unique values for each column in a dataframe

The code that can be used to answer the question above is:

def check_values(df):
    data = []
    for col in df.columns:
        data.append([col, 
                     df[col].dtype, 
                     df[col].isna().sum(), 
                     round(100*(df[col].isna().sum()/len(df)),2), 
                     df[col].nunique()
                     ])
    return pd.DataFrame(columns=['dataFeatures', 'dataType', 'null', 'nullPct', 'unique'],data=data)
check_values(df_hotels)

Output

Question 2:
The number of visitors who canceled reservations and did not cancel reservations. Then make a conclusion of the proportion of each.

The code that can be used to answer the question above is:

df_hotels.reservation_status.value_counts()

Output

Based on this output, the results show that 75166 visitors did not cancel the reservation and 43017 visitors canceled the reservation. Meanwhile the proportion of reservations can be seen with the code below:

df_hotels.is_canceled.value_counts(normalize=True)

Output

There were quite a number of visitors who canceled reservations, up to 37% of all visitors.

Question 3:
a. Percent of canceled reservations at ‘City Hotel’
b. Percent of canceled reservations at ‘Resort Hotel’

The code that can be used to answer the question above is:

df_hotels[df_hotels.hotel=='City Hotel'].is_canceled.value_counts(normalize=True)

df_hotels[df_hotels.hotel=='Resort Hotel'].is_canceled.value_counts(normalize=True)

Output

Output percent percent of canceled reservations for City Hotel

Of the visitors who made reservations at ‘City Hotel’, 41.7% canceled reservations

Output percent percent of canceled reservations for City Hotel

Of the visitors who made reservations at ‘Resort Hotel’, 27.7% canceled reservations

Question 4:
Filter visitor data that does not cancel the reservation in df_checkout.

The code that can be used to answer the question above is:

df_hotels[df_hotels.is_canceled==0]
df_checkout = df_hotels[df_hotels.is_canceled==0]
df_checkout.shape

Output

The results obtained are 75,166 rows and 33 columns from visitor data that do not cancel reservations.

Question 5:
Work will use the df_checkout dataframe
a. The number of reservations per month of arrival for each type of hotel
b. The month that made the most reservations for each type of hotel
c. The month that made the most reservations for each type of hotel that has been mapped

The code that can be used to answer the question above is:

df_checkout.groupby(['hotel','arrival_date_month'])['id'].nunique()

sns.countplot(data=df_checkout, x='arrival_date_month_num',hue='hotel')
plt.xlabel('bulan kedatangan')
plt.ylabel('jumlah reservasi')
plt.show()

Output

Based on these results, the most reservations occurred in August with 5,381 reservations at City Hotels and 3,257 reservations at Resort Hotels.

Question 6:
a. A new column called arrival_date that contains full info about the year, month, and date of arrival.
b. Change the column to datetime type.

The code that can be used to answer the question above is:

df_checkout['arrival_date'] = pd.to_datetime(df_checkout.arrival_date)
df_checkout['arrival_date']

Output

Question 7:
a. Total daily reservation (df_reservasi_perhari)
b. Average daily reservation in each week (df_avg_reservasi_harian)

The code that can be used to answer the question above is:

plt.figure(figsize=(10,4))
sns.lineplot(data=df_reservasi_perhari, x='arrival_date', y='total_reservasi')
plt.title('Reservasi Harian', fontsize='x-large')
plt.show()

plt.figure(figsize=(10,4))
sns.lineplot(data=df_avg_reservasi_harian, x='arrival_date', y='total_reservasi')
plt.title('Rata-Rata Reservasi Harian', fontsize='x-large')
plt.show()

Output

Question 8:
a. Average ADR based on hotel type and customer type (customer_type)
b. The type of customer that has the largest ADR in each type of hotel

The code that can be used to answer the question above is:

df_checkout.groupby(['hotel','customer_type'])['adr'].mean()

plt.figure(figsize=(8,6))
sns.boxplot(data=df_checkout, x='adr', y='hotel',hue='customer_type')
plt.show()

Output

Berdasarkan hasil yang diperoleh bahwa rata-rata ADR terbesar terdapat pada jenis customer transient. Output tersebut menunjukkan 110.4 … pada City Hotel dan 96.00 pada Resort Hotel. Jenis customer transient merupakan …

Question 9:
Show the 10 countries with the largest number of bookings

The code that can be used to answer the question above is:

df_merged.Country.value_counts().head(10)

df_merged.Country.value_counts().head(10).sort_values(ascending=True).plot.barh()
plt.show()

Output

Question 10:
a. The number of guests staying for each reservation
b. The highest number of guests based on the dataset

The code that can be used to answer the question above is:

df_checkout[df_checkout.total_guest==df_checkout.total_guest.max()].T

Output

This article is part 2 of the Data Analysis exercise from MySkill, if you want to see the other parts you can access them here:

Part 1 : Data Analysis with PostgreSQL
Part 3 : Data Visualization using Google Data Studio

MySkill Data Analysis Part 2 Python using Google Colab

Written by Rofiahr