Exploring Data Visualization: A Tutorial Using Thailand’s Tourism Industry Data

Apr 20, 2023

Introduction

If you clicked on this article, you might be wondering why, out of the countless datasets on the internet, I picked Thailand’s tourism industry data to visualize.

If you are anything like me, you have probably looked for data to practice your analysis and visualization skills. Chances are you ran into multiple websites that use the same well-known datasets, such as Titanic or Iris, and got tired of them.

That was my situation last week until I browsed through Kaggle.com and found this dataset: Thailand Domestic Tourism Statistics (Link).

I realized the data met several of my expectations:

The .csv file contains recent data, from Jan 2019 to Feb 2023, and was only recently posted.
I can somewhat relate to this data because I live in Thailand and I am Thai (though I am not much of a traveler myself).
The Kaggle poster said he had cleaned it (spoiler: that was a lie 😅).
I thought I might be able to learn at least a thing or two about Thailand’s tourism industry from this dataset (I did).

Alright, so now that we have the data needed, let’s get started!

The Tutorial

There are several steps toward data visualization, and I will take you through them one by one.

The structure of the tutorial will be as below:

Step #0 - Install Required Software & Python Libraries

Task 1: Bar Plots 
Step #1 - Download the data
Step #2 - Load Data & Perform Basic Checks
Step #3 - Data Cleaning and Preparation
Step #4 - Data Visualization on Bar Plots

Task 2: Visualization on a Map
Step #1 - Downloading Thailand Shapefile from GADM
Step #2 - Load the Geo Data & Merge with Our Dataframe
Step #3 - Visualize on the Map!

Step #0 — Install Required Software & Python Libraries:

You will need Python and Jupyter Notebook. If you do not already have these installed, you can follow this YouTube video to install them: Link to Tutorial
Install the Python libraries pandas, numpy, matplotlib, fuzzywuzzy, and geopandas using the command below in a terminal (run it once per library, or list several names in one command):

pip install *library name*
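If you are not sure whether everything installed correctly, a quick check like the one below can help. This is only a sketch using Python's standard library; the list of names is simply the libraries mentioned above.

```python
import importlib.util

# Libraries this tutorial relies on (import names, same as the pip names here)
required = ["pandas", "numpy", "matplotlib", "fuzzywuzzy", "geopandas"]

# find_spec returns None when a top-level package cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("Missing libraries:", missing if missing else "none")
```

Anything listed as missing can then be installed with pip before moving on.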

Task 1: Bar Plots

Step #1 — Download the data:

You can download the dataset from this website — Link.
You can then place the data file into your project folder.

Note: You may see several versions of the dataset; this tutorial uses the file's original version.

Step #2 — Load Data & Perform Basic Checks:

1. Import relevant libraries and load data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from fuzzywuzzy import fuzz
import geopandas as gpd

# Load CSV file into a pandas DataFrame
df = pd.read_csv('thailand_domestic_tourism_2019_2023.csv')

2. Check how your data looks. There are 2 ways (I recommend doing both):

look at the data structure from the Kaggle link (go to “Data Card”).
check it yourself using available functions in pandas

See below for examples of some useful functions:

df.head() #outputs 5 rows of data

df['province_eng'].unique() #outputs all the unique provinces in the column

If you look closely, you can see trailing white space in some of the province name strings (e.g. ‘Lopburi ‘ instead of ‘Lopburi’). We will deal with this in the data cleaning step.

print(df['province_eng'].nunique()) #outputs 77 unique names, which makes sense as Thailand has a total of 77 provinces (len() would count every row instead).

df['variable'].unique() 
#we need to read through the Kaggle data source details to learn what each of the variables represents.
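A few more checks are worth running at this stage. The sketch below uses a tiny made-up DataFrame (the values are hypothetical, only the column names follow the dataset) so that it runs on its own; on the real data you would call the same methods on df.

```python
import pandas as pd

# Hypothetical miniature of the long-format CSV (values are made up)
sample = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-02-01", "2019-02-01"],
    "province_eng": ["Bangkok", "Lopburi ", "Bangkok", "Lopburi "],
    "variable": ["net_profit_all"] * 4,
    "value": [100.0, 5.0, 90.0, None],
})

print(sample.shape)                      # (rows, columns)
print(sample.dtypes)                     # "date" is still a plain string here
print(sample.isna().sum())               # missing values per column
print(sample["province_eng"].nunique())  # number of distinct province names

# Names that change when stripped reveal stray whitespace
names = sample["province_eng"].unique()
padded = [name for name in names if name != name.strip()]
print(padded)
```

Checks like these would have flagged the whitespace problem right away, without having to spot it by eye.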

Step #3 — Data Cleaning and Preparation:

Before performing any analysis, we will clean and prepare the data as follows:

Drop unnecessary columns from the dataframe
Drop incomplete 2023 data from the dataframe
Pivot the “variable” column to tidy up the dataframe

# Drop unnecessary columns from the df
df.drop(['province_thai', 'region_thai'], axis=1, inplace=True)

# Parse the "date" column once and extract the year and month
dates = pd.to_datetime(df['date'])
df['year'] = dates.dt.year
df['month'] = dates.dt.month

# Exclude data from 2023 since there are only 2 months of data
# (.copy() avoids pandas warnings about modifying a slice later on)
df = df[df['year'] != 2023].copy()

# Drop the original "date" column
df = df.drop(columns='date')
#rename columns
df = df.rename(columns={'province_eng': 'province', 'region_eng': 'region'})

# To make our life easier, we can pivot the table based on the 'variable' column
df_pivoted = pd.pivot_table(df, values='value', index=['month','year', 'province', 'region'],
                            columns='variable', aggfunc='first')

# reset and rename index
df_pivoted = df_pivoted.reset_index().rename_axis('index', axis=1)
df_pivoted.head()

#rename headers
df_pivoted = df_pivoted.rename(columns={'no_tourist_occupied': 'no_hotelroom_occupied', 'occupancy_rate': 'percent_occupancy'})

#remove leading and trailing whitespace from the province names
df_pivoted['province'] = df_pivoted['province'].str.strip()

#reflect the change to original dataframe
df = df_pivoted
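To see what the pivot step does in isolation, here is a minimal sketch on made-up numbers (the values are hypothetical, only the column names follow the tutorial). Each distinct entry of the "variable" column becomes its own column.

```python
import pandas as pd

# Made-up long-format rows: one row per (province, month, variable)
long_df = pd.DataFrame({
    "province": ["Bangkok", "Bangkok", "Phuket", "Phuket"],
    "year": [2019, 2019, 2019, 2019],
    "month": [1, 1, 1, 1],
    "variable": ["net_profit_thai", "net_profit_foreign"] * 2,
    "value": [30.0, 50.0, 5.0, 40.0],
})

# Long -> wide: every variable gets a column, one row per province and month
wide_df = (
    long_df.pivot_table(
        index=["province", "year", "month"],
        columns="variable",
        values="value",
        aggfunc="first",
    )
    .reset_index()
    .rename_axis(None, axis=1)  # drop the leftover "variable" label on the columns
)

print(wide_df)
```

The four long rows collapse into two wide rows, which is much easier to group and plot.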

With the code above, the table is pivoted so that each variable has its own column. Your df should now look like this:

print(df[df['province'] == 'Bangkok']) #look at data points where province is Bangkok

The data is still in a monthly format. For example, the net_profit_all value in the first row shows that Bangkok had a net income from tourism of 81,926 million baht in January 2019.

We will further process this and turn it into yearly data using the groupby and multiply functions as below:

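value_cols = ['net_profit_all', 'net_profit_thai', 'net_profit_foreign']
# Monthly average per province and year (select columns with a list; a bare tuple fails in pandas 2.x)
df_monthly = df.groupby(['province','year'])[value_cols + ['percent_occupancy']].mean()
# Copy so that scaling the yearly figures does not also change df_monthly
df_yearly = df_monthly.copy()
# Monthly mean x 12 = yearly total; occupancy stays a yearly average
df_yearly[value_cols] = df_monthly[value_cols].multiply(12)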
df_yearly = df_yearly.reset_index()
df_yearly.head(20)

Our data would now look like this for all 77 provinces with data from 2019 to 2022:
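One thing to keep in mind about the "monthly mean times 12" approach: it equals the plain yearly sum only when all 12 months are present. The sketch below, on made-up numbers with hypothetical provinces "A" and "B", shows where the two diverge.

```python
import pandas as pd

# Province A has all 12 months; province B has only 6 (made-up values)
monthly = pd.DataFrame({
    "province": ["A"] * 12 + ["B"] * 6,
    "year": [2019] * 18,
    "net_profit_all": [10.0] * 18,
})

grouped = monthly.groupby(["province", "year"])["net_profit_all"]
comparison = pd.DataFrame({
    "months": grouped.count(),        # how many months were observed
    "mean_x12": grouped.mean() * 12,  # the approach used in this tutorial
    "sum": grouped.sum(),             # plain yearly total
})

print(comparison)
```

For a complete year the two columns match. For an incomplete year, mean times 12 extrapolates to a full year while the sum does not, so it is worth checking the "months" count on the real data.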

Step #4 — Data Visualization on Bar Plots:

Next, let’s see how much each province contributes to the country's total tourism income, ranked and averaged across 2019 to 2022:

# group the data by province and calculate the mean of net_profit_all, net_profit_thai, and net_profit_foreign for each province
# (the columns are passed as a list; a bare tuple fails in pandas 2.x)
province_means = df_yearly.groupby('province')[['net_profit_all', 'net_profit_thai', 'net_profit_foreign']].mean()

#sort the data based on average total net profit from tourism during 2019 to 2022
total_net_profit = province_means.sum()['net_profit_all']
df_sorted = province_means.sort_values('net_profit_all', ascending=False).head(20)

#calculate the cumulative sums
cumulative_sums = df_sorted.cumsum()
#turn the values into percentages
cumulative_percentages = cumulative_sums / total_net_profit * 100

#configure the plots
plt.figure(figsize=(8, 5))
plt.bar(df_sorted.index, df_sorted['net_profit_all'] / total_net_profit * 100, bottom=cumulative_percentages['net_profit_all'] - df_sorted['net_profit_all'] / total_net_profit * 100, label='All', color='g')
#add y = 50 and y = 80 lines to the plots
plt.axhline(y=50, color='r', linestyle='--')
plt.axhline(y=80, color='b', linestyle='--')

#name the title, font size, and bold
plt.title('Cumulative Total Earnings (%) by Province', fontsize=16, fontweight='bold', y=1.03)

#configure details of the label and ticks
plt.xticks(rotation=45, ha='right')
plt.xlabel('Province')
plt.ylabel('Percentage of Cumulative Earning (%)')
plt.yticks(range(0, 101, 10)) # set y-tick positions at 0, 10, 20, ..., 100

# Add text annotations for province ranks
for i, (province, net_profit) in enumerate(zip(df_sorted.index, df_sorted['net_profit_all'])):
    plt.text(i, cumulative_percentages.loc[province, 'net_profit_all'] + 2, f'{i+1}', ha='center', fontsize=11)

# Add text for the total net profit
plt.text(0.95, 0.1, f'Total Net Earnings: {total_net_profit:,.0f} Millions Baht', transform=plt.gca().transAxes, fontsize=15, bbox=dict(facecolor='white', alpha=0.5), horizontalalignment='right', verticalalignment='bottom')
plt.show()

We can see that the earnings from 2 provinces alone account for more than 50% of the total earnings of 1.2 trillion baht!

Earnings drop sharply after the second province: it takes the combined earnings of 9 provinces to cover 80% of Thailand's total yearly tourism earnings! This shows how significant the earnings of the first 10 provinces are.
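Instead of reading these thresholds off the chart, you can also compute them. The sketch below uses made-up earnings for five hypothetical provinces (P1 to P5); the helper name provinces_needed is my own and not part of any library.

```python
import pandas as pd

# Made-up average yearly earnings per province
earnings = pd.Series(
    {"P1": 50.0, "P2": 20.0, "P3": 15.0, "P4": 10.0, "P5": 5.0},
    name="net_profit_all",
)

# Rank provinces and express the running total as a percentage of the grand total
ranked = earnings.sort_values(ascending=False)
cumulative_pct = ranked.cumsum() / ranked.sum() * 100

def provinces_needed(threshold):
    """Number of top-ranked provinces needed to reach `threshold` percent."""
    # Count the provinces still below the threshold, then add the one that crosses it
    return int((cumulative_pct < threshold).sum()) + 1

print(cumulative_pct)
print("Provinces needed for 50%:", provinces_needed(50))
print("Provinces needed for 80%:", provinces_needed(80))
```

On the real data, you would pass province_means['net_profit_all'] in place of the made-up series.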

We now have some picture of how the major provinces are doing. The good thing about this dataset is that the earnings from Thai and foreign tourists are separated for us.

Let’s also look at how the country's overall earnings compare with the income from Thais and foreigners alone. We will sum up the data across all provinces and plot it as below:

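# Sum the yearly figures across all provinces (columns passed as a list; a bare tuple fails in pandas 2.x)
yearly_incomes = df_yearly.groupby('year')[['net_profit_all', 'net_profit_thai', 'net_profit_foreign']].sum()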

# Create a bar chart with bars side by side of each other for each year
ax = yearly_incomes.plot(kind='bar', width=0.25)

# Set the x-axis label
ax.set_xlabel('Year')
ax.set_xticklabels(yearly_incomes.index, rotation=45)

# Set the y-axis label
ax.set_ylabel('Net Income (in millions)')

# Set the title of the plot
ax.set_title('Net Income from Tourism')

# Add a legend to the plot
ax.legend(['All', 'Thai', 'Foreign'])

# Adjust the size of the existing figure (plt.figure() here would open a new, empty figure instead)
ax.figure.set_size_inches(20, 16)

# Show the plot
plt.show()

Interestingly, provinces like Phuket seem to earn mostly from foreigners and much less from Thais.

Chiang Mai, Songkhla, Prachuap, and Chiang Rai, however, seem to show the opposite: Thai tourists contribute more to these provinces' earnings than foreigners do.
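A simple way to quantify this comparison is the share of each province's income that comes from foreigners. The sketch below uses two hypothetical provinces with made-up numbers; the column foreign_share_pct is a name I introduce here, and on the real data you would apply the same arithmetic to province_means.

```python
import pandas as pd

# Made-up average earnings for two hypothetical provinces
province_means = pd.DataFrame(
    {"net_profit_thai": [20.0, 60.0], "net_profit_foreign": [80.0, 40.0]},
    index=pd.Index(["Province X", "Province Y"], name="province"),
)

# Share of income from foreign tourists, in percent
total = province_means["net_profit_thai"] + province_means["net_profit_foreign"]
province_means["foreign_share_pct"] = province_means["net_profit_foreign"] / total * 100

print(province_means.sort_values("foreign_share_pct", ascending=False))
```

Sorting by this share puts foreign-driven provinces at the top and domestically driven ones at the bottom.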

With these plots, we have some idea of the earnings made from the tourism industry in Thailand.

Task 2: Visualization on a Map

As a bonus, we will spend some time creating a more advanced visualization by mapping the income of each province onto maps of Thailand itself.

The steps to achieve this include the following:

Step #1 — Downloading Thailand Shapefile from GADM

You can download this from GADM (Link)

You will find a total of 4 .shp files in the zip, which represent different levels of detail for the map of Thailand (0: country borders, 1: province borders, 2: districts, 3: sub-districts). For our case, the level 1 .shp file is adequate (province level). After downloading, keep the files within the same project folder.

Step #2 — Load the Geo Data & Merge with Our Dataframe

We need to merge the data using the province name as the join key.

# load the Thailand shapefile into df using geopandas
thailand_shape = gpd.read_file('gadm36_THA_shp/gadm36_THA_1.shp')
thailand_shape.head()

Check how the provinces are named.

thailand_shape['NAME_1'].unique()

However, we can see that not all of the province names are spelled exactly as they are in our df.

For example, the Thailand shapefile calls Bangkok “Bangkok Metropolis”, while in our df it is called ‘Bangkok’. The names do not match exactly, so we cannot merge the two datasets directly.

We will use an existing library, ‘fuzzywuzzy’, which lets us merge on partially matching strings.

To maximize the matching rate, I have removed all the whitespace from the province names in both datasets.

#average the data for df_yearly to take out "year" column (columns passed as a list; a bare tuple fails in pandas 2.x)
df_yearly = df_yearly.groupby('province')[['net_profit_all', 'net_profit_thai', 'net_profit_foreign','percent_occupancy']].mean().reset_index()

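#remove all white spaces from province names in both df_yearly and thailand_shape to make them easier to match
#(regex=True is needed in pandas 2.x, where str.replace treats the pattern as a literal string by default)
df_yearly['province'] = df_yearly['province'].str.replace(r'\s', '', regex=True)
thailand_shape['NAME_1'] = thailand_shape['NAME_1'].str.replace(r'\s', '', regex=True)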

# With fuzzywuzzy you can input search term into the function below and output closest match for merging
def get_closest_match(row):
    # List of possible matches from shapefile
    matches = thailand_shape['NAME_1'].values
    # Use fuzzywuzzy to find the closest match
    closest_match = max(matches, key=lambda x: fuzz.token_sort_ratio(row['province'], x))
    return closest_match

# apply this function to the dataframe so that matched province can be saved and later used for merging.
df_yearly['province_match'] = df_yearly.apply(get_closest_match, axis=1)
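Fuzzy matching always returns a best guess, even when that guess is poor, so it helps to look at the scores too. The sketch below swaps in difflib from Python's standard library instead of fuzzywuzzy so that it runs without extra installs; its scores are not identical to fuzz.token_sort_ratio, and the name lists and the 90 cut-off are assumptions for illustration only.

```python
from difflib import SequenceMatcher

# Small made-up name lists (whitespace already removed, as in the tutorial)
shapefile_names = ["BangkokMetropolis", "ChiangMai", "ChiangRai", "Phuket"]
dataset_names = ["Bangkok", "ChiangMai", "Phuket"]

def best_match(name, candidates):
    """Return the closest candidate and its similarity score (0 to 100)."""
    scored = [
        (SequenceMatcher(None, name.lower(), candidate.lower()).ratio(), candidate)
        for candidate in candidates
    ]
    score, match = max(scored)  # highest similarity wins
    return match, round(score * 100)

for name in dataset_names:
    match, score = best_match(name, shapefile_names)
    # Hypothetical cut-off: anything below 90 deserves a manual look
    flag = "" if score >= 90 else "  <- review manually"
    print(f"{name:10s} -> {match:18s} score={score}{flag}")
```

With fuzzywuzzy itself, the same idea applies: keep the score next to the match and inspect the low-scoring pairs by hand before merging.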

The dataframes (‘df_yearly’ and ‘thailand_shape’) are then merged together. Recall that df_yearly was averaged across 2019 to 2022 using the groupby function.

Note: We use the ‘province_match’ column from df_yearly to merge with the ‘NAME_1’ column from thailand_shape, as shown below.

# Merge dataframe with shapefile on province name
merged = pd.merge(thailand_shape, df_yearly, left_on='NAME_1', right_on='province_match')

#see the resulting dataframe
merged.head(20)
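After a merge like this, it is worth confirming that no province was silently dropped or matched twice. The sketch below uses plain pandas DataFrames with made-up rows as stand-ins for thailand_shape and df_yearly; the indicator option of merge adds a "_merge" column telling you where each row came from.

```python
import pandas as pd

# Stand-ins for the shapefile table and the tourism table (made-up rows)
shapes = pd.DataFrame({"NAME_1": ["BangkokMetropolis", "Phuket", "ChiangMai"]})
stats = pd.DataFrame({
    "province_match": ["BangkokMetropolis", "Phuket"],
    "net_profit_all": [100.0, 40.0],
})

# An outer merge keeps unmatched rows from both sides so they can be inspected
check = shapes.merge(
    stats, left_on="NAME_1", right_on="province_match", how="outer", indicator=True
)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["NAME_1", "province_match", "_merge"]])

# Two dataset provinces mapped to the same shapefile name would show up here
duplicated = stats["province_match"].duplicated().any()
print("Duplicate matches:", duplicated)
```

An empty unmatched table and no duplicate matches suggest every province found exactly one partner.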

Step #3 — Visualize on the Map!

Now that we have the data ready, it’s time to visualize this on the map!

# create the plot
fig, ax = plt.subplots(figsize=(10, 10))

#set the color of text on the visualization
text_color = 'white'
#change background color
fig.patch.set_facecolor('black')

# plot the data, define colormap shades here using cmap
merged.plot(column='net_profit_all', cmap='Dark2', linewidth=0.8, edgecolor='black', alpha=0.8, ax=ax)

# get the ScalarMappable object (the polygon collection drawn by merged.plot(), not an AxesImage)
sc = ax.get_children()[0]
# use vmin and vmax to control the range of colorbar
vmin= 0
vmax= merged['net_profit_all'].max()
sc.set_clim(vmin, vmax)

# add the colorbar to the plot
cbar = fig.colorbar(sc, ax=ax, shrink=0.8)
# customize the colorbar
cbar.set_label('Net Profit All (in millions)')
cbar.ax.yaxis.label.set_color(text_color)
cbar.ax.yaxis.set_tick_params(color=text_color)
cbar.outline.set_edgecolor(text_color)
#set the value of ticks for each color in the colorbar
cbar.set_ticks(np.arange(vmin, vmax+1, (vmax-vmin)/8).astype(np.int64))
cbar.update_ticks() # update the colorbar
# set the color of all tick labels to text_color
for label in cbar.ax.get_yticklabels():
    label.set_color(text_color)
# remove the axis
ax.axis('off')
# add a title
ax.set_title('Top 4 Provinces by Average Yearly Tourism Earnings (2019 to 2022)', fontdict={'fontsize': '20', 'fontweight': 'bold', 'color': text_color})

# get top n highest profits areas
top = merged.nlargest(4, 'net_profit_all')

# add labels for top n highest profits areas
for i, row in top.iterrows():
    # get the province name and net profit value
    province_name = row['NAME_1']
    net_profit_value = row['net_profit_all']
    # set the label text with province name and net profit value in brackets
    label_text = f"{province_name} ({net_profit_value/1000:,.0f} Billion Baht)"
    # add the label to the centroid of the geometry and set the color to text_color
    ax.annotate(text=label_text, xy=row['geometry'].centroid.coords[0], horizontalalignment='center', fontweight='bold', fontsize=10, color=text_color, textcoords='offset points', xytext=(0,10))

# show the plot
plt.show()
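The code above picks up the colorbar source from the first child of the axes, which depends on drawing order. As an alternative sketch, you can build the color scale explicitly with matplotlib's Normalize and ScalarMappable. The example below uses made-up values and simple bars as a stand-in for the map, and "viridis" is just one possible sequential colormap, not the one used in this tutorial.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

values = [5.0, 20.0, 54.0, 400.0]  # made-up earnings

# Explicit value-to-color mapping: 0 maps to the start of the colormap, the max to the end
norm = Normalize(vmin=0, vmax=max(values))
cmap = plt.get_cmap("viridis")
mappable = ScalarMappable(norm=norm, cmap=cmap)

fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(range(len(values)), values, color=[cmap(norm(v)) for v in values])

# The colorbar is built from the mappable directly, not from ax.get_children()
cbar = fig.colorbar(mappable, ax=ax, shrink=0.8)
cbar.set_label("Earnings (made-up units)")
```

The same norm and cmap could be passed to merged.plot() so that the map and the colorbar are guaranteed to share one scale.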

We can now see on the map which provinces are the top 4 in terms of income from tourism. It can also be seen that, other than these 4 provinces, every province makes less than 54 billion baht, which is consistent with the cumulative income percentages in the earlier plots. I chose cmap = “Dark2” because, after a painful amount of trial and error, it gave the colors that were easiest to distinguish.

Congratulations! We are done with the visualizations.

Conclusion

Overall, my takeaways from this data analysis & visualization are:

Thailand makes ~1.2 trillion baht or ~35 billion USD per year (average from 2019 to 2022) from the tourism industry. This is equivalent to approximately 8% of Thailand's GDP (~16.2 trillion baht according to NESDC for 2020–2022), which is very significant.
Although there are a total of 77 provinces in Thailand, more than 50% of total income from tourism comes from only 2 provinces in Thailand. Around 80% of income from tourism comes from only 10 provinces in the country.
Phuket stands out as a province where most (> 80%) of the income comes from foreigners. Chiang Mai is the opposite, where most of the visitors are Thais and fewer are foreigners.

We are now at the end of the tutorial. I hope you found it useful, and please feel free to leave comments if you have any questions, suggestions, or something to add to the findings from this data!

If you’d like to stay up-to-date on my latest articles and connect with me on LinkedIn, please feel free to send me a connection request below:

https://www.linkedin.com/in/krittaprot-tangkittikun-0103a9109/

And don’t forget to follow me on Medium to receive notifications when I publish new content. Thank you for reading!