Scraping Air Pollution Data from Thailand EPA

Adapted from https://news.thaipbs.or.th/content/276449

Thailand’s Environmental Protection Agency (EPA) makes air pollution data available through its website. However, obtaining bulk records by hand is tedious. This notebook explains how to scrape data from this website automatically using the Selenium and Beautiful Soup libraries; the approach can be applied to any website with a similar structure.

Scraping data from a single station

The website provides three ways of obtaining pollution data: the first tab provides an hourly pollution report for all measurement stations, the second tab provides historical data from a specific day and hour for all measurement stations, and the third tab allows a batch request for historical data from a specific station. Data from up to a month back from today is available.

I am going to show how to scrape data using the third tab, which involves the following steps: (1) select the station number from the area and province you are interested in, (2) pick the time, (3) pick the parameters, (4) ask to display the table and save the data from the displayed table as HTML, and (5) click “Next” to show the rest of the table until all the data is scraped. Let’s start!

# import modules
import time
import requests
import numpy as np   # used later to assemble the scraped rows
import pandas as pd  # used later to build the final table
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.select import Select

Use Firefox to open the website

browser = webdriver.Firefox()
url = 'http://www.aqmthai.com/public_report.php'
browser.get(url)

First, select a station ID. From inspecting the HTML, we see that the stations are listed under a select tag with id="stationId".

# select stationId
sta_name = '11t'
station = Select(browser.find_element_by_css_selector('select[id="stationId"]'))
station.select_by_value(sta_name)

Next, we select the time range we are interested in. I want all the data, which go back a month, so I am going to use the default date and pick the time from midnight to 23:00 on that day. Then I pick the parameters that I am interested in. The tricky part is that each station has a different set of parameters, so it is important to read the displayed parameters before selecting them. We then ask selenium to click the display-table button, which has the name bt_show_table. Sometimes the website responds slowly, so we wait 10 seconds after each step before continuing.

# select time
start_hr = Select(browser.find_element_by_id('startHour'))
start_hr.select_by_index(0)
start_min = Select(browser.find_element_by_id('startMin'))
start_min.select_by_index(0)
stop_hr = Select(browser.find_element_by_id('endHour'))
stop_hr.select_by_index(23)
stop_min = Select(browser.find_element_by_id('endMin'))
stop_min.select_by_index(59)
# select parameters to display
param = Select(browser.find_element_by_id('parameterSelected'))
for i in range(16):
    param.select_by_index(i)
# retrieve data
button = browser.find_element_by_name('bt_show_table')
button.click()
time.sleep(10)
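Because the parameter list differs from station to station, it can help to print the available option labels before hardcoding the range above. A minimal sketch, reading the option texts from the same parameterSelected element via the page source:

# list the parameter options available for the currently selected station
soup = BeautifulSoup(browser.page_source, 'html.parser')
select_tag = soup.find_all(attrs={'id': 'parameterSelected'})[0]
for option in select_tag.find_all('option'):
    print(option.get_text())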

Save the data as HTML, then click the “Next” button to display more data and save those pages as well. In the next section, I will show how to extract the data from the saved HTML files. Note that the scraping can also be done at this step without writing the HTML to disk.
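For example, a minimal sketch that locates the displayed table directly in the live page source, using the same table_mn_div tag we will parse from the saved files below:

# parse the live page source without saving it to disk
soup = BeautifulSoup(browser.page_source, 'html.parser')
table = soup.find_all(attrs={'id': 'table_mn_div'})[0]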

# download the web page as an html file for scraping
page = browser.page_source
with open('web/page1.html', 'w', encoding='utf-8') as f:
    f.write(page)
# click the next button to display the next page. There are 7 more pages.
nums = [str(i) for i in range(2, 9)]
for num in nums:
    next_button = browser.find_element_by_name('bt_next_page')
    next_button.click()
    time.sleep(10)
    page = browser.page_source
    with open('web/page' + num + '.html', 'w', encoding='utf-8') as f:
        f.write(page)
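If the number of pages were not known in advance, a loop like the one below might work instead. This is only a sketch, built on the assumption (which I have not verified) that the bt_next_page button disappears once the last page is shown; the cap on page_num guards against an endless loop if that assumption is wrong.

# click "Next" until the button is gone (assumed behavior) or the cap is hit
page_num = 2
while browser.find_elements_by_name('bt_next_page') and page_num <= 20:
    browser.find_elements_by_name('bt_next_page')[0].click()
    time.sleep(10)
    with open(f'web/page{page_num}.html', 'w', encoding='utf-8') as f:
        f.write(browser.page_source)
    page_num += 1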

Extracting the data from HTML files

I use BeautifulSoup to extract the data from the HTML files, and pandas to assemble the data into a single table for export as a csv file. The data table in the HTML files is under the tag with id table_mn_div.

Let’s inspect this table.

with open('web/page1.html', encoding='utf-8') as f:
    result_soup = BeautifulSoup(f.read(), 'html.parser')
table = result_soup.find_all(attrs={'id': 'table_mn_div'})[0]
table_tr = table.table.tbody.find_all('tr')[0]
print(table_tr.prettify())

The header is under the first tr tag and the data are in the following tr tags. The header texts are strings inside the children of the tr tag, so the soup_obj.stripped_strings command can easily extract all the values. Notice that the headers include both the measurement types and the station number. For the data rows, the measurement values are in the value attribute of the input tags. We can write a function that extracts the header and the data.

def page_data(result_soup):
    '''Take a BeautifulSoup object, extract the table header,
    extract the air pollution data, and return a dataframe.
    '''
    # find the <div> with id = 'table_mn_div'
    table = result_soup.find_all(attrs={'id': 'table_mn_div'})[0]
    table = table.table.tbody
    # find the header <tr> tag
    print('get header text')
    head = table.find_all('tr')[0]
    head_text = [text for text in head.stripped_strings]
    # find all data row <tr> tags
    print('get body data')
    body = table.find_all('tr')[1:]
    matrix = np.hstack(head_text)
    for row in body:
        data_s = row.find_all('input')
        # the last <input> tag is empty, so exclude it
        if len(data_s) != 0:
            row_data = [data['value'] for data in data_s]
            matrix = np.vstack((matrix, row_data))
    print('build the dataframe')
    page_df = pd.DataFrame(matrix[1:, :], columns=matrix[0, :])
    return page_df

Test the function on a single file

# test the function on a single file
with open('web/page1.html', encoding='utf-8') as f:
    result_soup = BeautifulSoup(f.read(), 'html.parser')
page_df = page_data(result_soup)
df = pd.DataFrame()
df = pd.concat([page_df, page_df])

get header text
get body data
build the dataframe

Apply this function to all the HTML files.

from glob import glob

files = glob('web/*.html')
# create an empty data frame
df = pd.DataFrame()
for file in files:
    with open(file, encoding='utf-8') as f:
        result_soup = BeautifulSoup(f.read(), 'html.parser')
    page_df = page_data(result_soup)
    df = pd.concat([df, page_df])

Check the DataFrame:

df.head().T

Looks good. Let's save it.

# save the data as a csv file 
df.to_csv('data/aqmthai.csv')
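To check the export, the csv can be read back (this assumes the data/ directory existed when to_csv was called; the first column holds the saved index):

# read the csv back to verify the export
df_check = pd.read_csv('data/aqmthai.csv', index_col=0)
print(df_check.shape)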

Scraping data from multiple measurement stations

One might be interested in the air pollution in one's own province, or in the variation among stations in the same province. Bangkok itself has five stations. Some stations are on busy streets and some are on the outskirts of the city; perhaps the air pollution on the outskirts is less severe. With a small adjustment to the code above, one can obtain the air pollution data from any stations of interest.

First, start by opening the website using selenium webdriver:

browser = webdriver.Firefox()
url = 'http://www.aqmthai.com/public_report.php'
browser.get(url)

Turn the code in the previous section into four functions:

(1) Function display_pages(sta_index) selects the station number, selects the time and parameters (using get_num_param() to find how many parameters the station has), and displays the data.

def display_pages(sta_index):
    # select station
    station = Select(browser.find_element_by_css_selector(
        'select[id="stationId"]'))
    station.select_by_index(sta_index)
    # select parameters to display
    param = Select(browser.find_element_by_id('parameterSelected'))
    time.sleep(10)
    # each station has a different number of parameter options
    options = get_num_param()
    print(options)
    # select parameters
    for i in range(options):
        param.select_by_index(i)
    # select time
    start_hr = Select(browser.find_element_by_id('startHour'))
    start_hr.select_by_index(0)
    start_min = Select(browser.find_element_by_id('startMin'))
    start_min.select_by_index(0)
    stop_hr = Select(browser.find_element_by_id('endHour'))
    stop_hr.select_by_index(23)
    stop_min = Select(browser.find_element_by_id('endMin'))
    stop_min.select_by_index(59)
    # retrieve data
    button = browser.find_element_by_name('bt_show_table')
    button.click()
    time.sleep(10)

(2) Function get_num_param() finds the number of parameters for a particular station.

def get_num_param():
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    select = soup.find_all(attrs={'id': 'parameterSelected'})[0]
    return len(select.find_all('option'))

(3) Function save_page(page, filename) saves the current page.

def save_page(page, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(page)

(4) Function go_through_page(sta_index) takes the station index, saves the displayed page, clicks through the remaining pages, and calls save_page(page, filename) to save each one as an HTML file.

def go_through_page(sta_index):
    # download the webpage as an html file for scraping
    page = browser.page_source
    # include the station index in the file name
    save_page(page, f'web2/{sta_index}page1.html')
    # click the next button to display the next page
    nums = [str(i) for i in range(2, 9)]
    for num in nums:
        next_button = browser.find_element_by_name('bt_next_page')
        next_button.click()
        time.sleep(5)
        page = browser.page_source
        save_page(page, f'web2/{sta_index}page{num}.html')

Finally, loop over the station indices, calling display_pages(sta_index) and go_through_page(sta_index) for each station. Here, I select all stations.

for sta_index in range(1, 61):
    browser.get(url)
    display_pages(sta_index)
    go_through_page(sta_index)
    print('done with station', sta_index)
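The per-station files in web2/ can then be parsed with the same page_data() function from the previous section. A minimal sketch; because different stations report different parameters, pd.concat aligns columns by name and fills the gaps with NaN:

# assemble the multi-station files into a single dataframe
df_all = pd.DataFrame()
for file in glob('web2/*.html'):
    with open(file, encoding='utf-8') as f:
        result_soup = BeautifulSoup(f.read(), 'html.parser')
    df_all = pd.concat([df_all, page_data(result_soup)])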

Done! To summarize, I showed how to scrape air pollution data from Thailand's EPA. I used selenium to select parameters and display the data, and beautifulsoup to scrape the web content. Now I can start working with the data. The next post is going to be about visualizing these data.

Twitter: @worasom