Geocoding the HDB Property Info Dataset Using OneMap API

Sze Zhong LIM
Published in Data And Beyond
12 min read · Aug 4, 2024

Recently I was working on a project that required geocoding. The basic gist of it was that I needed to geocode locational data so I could compute distances and run some predictions in another project.

What Is Geocoding?

Address geocoding, or simply geocoding, is the process of taking a text-based description of a location, such as an address or the name of a place, and returning geographic coordinates, frequently a latitude/longitude pair, to identify a location on the Earth’s surface — Wikipedia

Since I would be working with a Singapore dataset, the first thing I thought of was the OneMap API developed by the Singapore Land Authority. You may find their documentation here.

As of July 2024, the API offered a wide range of services such as routing, reverse geocoding, coordinate conversion, and even population queries (the years available are 2000, 2010, 2015, and 2020).

Screenshot of OneMap API Documentation website as of July 2024.

As a backup, I also considered the Google Maps API, or manually scraping Google directly, but the Google Maps API requires a Google Cloud account and every query costs credits, unlike the OneMap API, which is free to use.

What Did I Reference?

There are some great Medium articles that helped me navigate the terrain and get started.

The Actual Works

For this particular article, I will be using the HDB Property Information dataset available from data.gov.sg here. They are in the process of improving their interface, so I am not sure how long the link will last. If required, you may Google for it and it should pop up, as it is a relatively popular dataset.

Screenshot of the HDB Property Information Dataset from data.gov.sg on July 2024

I won’t be deep diving into the code itself, as the API calls are pretty straightforward and the Medium articles I shared above already cover the code needed to extract the information.

Instead, I will focus on the problems I faced and what I did to solve them. Starting from the HDB Property Information dataset, I combined the Block Number and Street, and threw that into the OneMap API to see what it returned.

Problems Faced when using OneMap API

Below are some of the problems I faced:

  1. Huge buildings with multiple businesses inside, resulting in multiple geolocations. Which coordinates should I use?
  2. Similar building names (slight differences in block numbers), e.g. 119, 119A, 119B, 119C, 119D. If we randomly choose one location, that is wrong. If we choose the centroid of the coordinates, that is also wrong.
  3. Similar building / road names (we search for a road name, but because a building's name contains that street name, the search returns something we do not want). E.g. a query for 1 Thomson Road returns Thomson Police Complex at 1 Mount Pleasant Road.
  4. Search not returning valid results. A search that includes the Blk No and Road Name returns nothing even though the record exists. E.g. searching for 4 WOODLANDS ST 12 returned no output even though it is within the OneMap API records, while searching for MARSILING MALL does show the record. Another example is in the sample photos below.
  5. Same Blk No / Road Name at different locations. E.g. 1 Beach Road: HDB Block 1, 1 Beach Road, 190001 and Raffles Singapore, 1 Beach Road, 189673 are actually a 10-minute drive apart.
  6. Wrong records within the OneMap API. Examples are shown in the photo samples below.

Some samples of what I mean:

Item 1 — Multiple outputs for one building due to multiple businesses. Outputs from OneMap API on search query 1 Tanjong Pagar.
Item 1 — Multiple outputs for one building due to multiple businesses. Outputs from OneMap API on search query 520 Lorong 6 Toa Payoh.
Item 2 — Similar Buildings with Slightly Different Block Numbers. Outputs from OneMap API on search query 119 KIM TIAN RD.
Item 3 — Similar Building / Street Names. The match took the "Thomson" from the Building Name and the "Road" from the Road Name. Outputs from OneMap API on search query 1 THOMSON RD.
Item 4 — Search Not Working Properly. Searching for the Blk and Road Name combination returned no output even though it was within the OneMap API records. Outputs from OneMap API on search queries 4 WOODLANDS ST 12 and MARSILING MALL.
Item 4 — Search Not Working Properly. Outputs from OneMap API on search queries 5 BANDA ST, BANDA ST and 5 BANDA.
Item 5 — Same Blk No / Road Names for 2 different locations. Outputs from OneMap API on search query 1 BEACH ROAD.
Google Map snippet showing the 10-minute drive between the 2 different 1 Beach Road locations.
Item 5 — Same Blk No / Road Names for 2 different locations. Outputs from OneMap API on search queries 2 QUEEN’S ROAD and 3 QUEEN’S ROAD.
Google Map snippet showing that 3 Queen’s Road is a landed property.
Item 6 — Wrong Records within OneMap API records. Raffles Singapore. Outputs from OneMap API on search query 189674.
Item 6 — Wrong Records within OneMap API records. Building should not be DBS SEMBAWANG WAY at WOODLANDS AVENUE 1. Outputs from OneMap API on search query 355 Sembawang Way and 750355.

Approaches / Code Used to Improve Geocoding Accuracy

For my specific use case, I needed to solve a few problems.

  1. Minimize wrong entries where a totally wrong location is used.
  2. Fill in locations that were not searchable.
  3. Decide which coordinates to use for a big building that returns multiple coordinates.

To solve all these issues, I decided that besides using the Blk No and Road Name from the HDB Property Info dataset, I would also use the postal code. This is a valid choice because the dataset only consists of buildings, which will definitely have a postal code. To resolve the issue of multiple coordinates for a big building, instead of choosing only one coordinate per location, I decided to separate the coordinates out into another dataset where one postal code (or building) can have more than one set of coordinates.

For my specific project, I needed all the coordinates, so taking the centroid would not be useful. Taking the centroid might also not be workable, as we cannot verify that the data provided would actually give us the centre of the building.
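To make this concrete, here is a minimal sketch of what I mean by a separate coordinates dataset. It assumes each row of the HDB dataframe (df_prop_info) already holds the list of OneMap results in a result column, as built in Approach 1 below; the LATITUDE and LONGITUDE field names are assumptions about the OneMap search response, not something shown in this article's code.

import pandas as pd

# Sketch (not the exact project code): explode each row's list of OneMap
# results into a coordinates dataset keyed by postal code, so one postal
# code (building) can map to several coordinate pairs.
def build_coord_dataset(df_prop_info):
    rows = []
    for _, row in df_prop_info.iterrows():
        for res in row['result']:
            rows.append({
                'postal_code': res.get('POSTAL'),
                'latitude': res.get('LATITUDE'),    # assumed field name
                'longitude': res.get('LONGITUDE'),  # assumed field name
            })
    return pd.DataFrame(rows).drop_duplicates()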

Approach 1 — Putting Blk No and Road Name into the OneMap API

Since the HDB Property Info dataset provides the Blk No and Road Name, I combined the two to create an address. I then queried the API with each address, which returns a list of results. I could then play around with the results at a later stage to pick the best match.

The functions below were used to collect the results for individual rows.

import requests
import time

def call_loop(url_given, attempts=3, delay=5):
    '''
    Calls the given URL, retrying up to `attempts` times (waiting `delay`
    seconds between tries) if the response status code is not 200.
    '''
    for attempt in range(attempts + 1):
        if attempt > 0:
            time.sleep(delay)
        response = requests.get(url_given)
        # A response status code of 200 means a good response
        if response.status_code == 200:
            return response

    print(f"For URL: {url_given}")
    print(f"Response Status Code: {response.status_code}")
    return None

def collect_results(full_add, attempts=3, delay=5):
    '''
    Collects all the results from the OneMap API into a list of dictionaries.
    '''
    url_p1 = "https://www.onemap.gov.sg/api/common/elastic/search?searchVal=+"
    url_p2 = "+&returnGeom=Y&getAddrDetails=Y"
    final_url = url_p1 + full_add + url_p2

    results = []

    response = call_loop(final_url, attempts, delay)
    if response is None:
        print("Exited collect_results function as there is no response.")
        print(f"Full Address: {full_add}")
        return results  # Returns an empty list

    # Parse the JSON response
    feedback = response.json()
    found = feedback['found']
    totalpages = feedback['totalNumPages']

    if found == 0:
        return results  # Returns an empty list

    results = feedback['results']

    # If the results span more than one page, collect the remaining pages
    if found > 1 and totalpages > 1:
        for x in range(totalpages - 1):
            page_num = x + 2
            url_with_page = final_url + f"&pageNum={page_num}"
            temp_response = call_loop(url_with_page, attempts, delay)
            if temp_response is not None:
                temp_results = temp_response.json()['results']
                results.extend(temp_results)

    return results
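For reference, this is roughly how I then applied it across the dataset. It is a sketch, assuming the HDB Property Info dataset is loaded as a dataframe named df_prop_info; blk_no and street are columns from the dataset, while full_add, result and result_length are my own working columns.

# Rough sketch: build the full address for each row, query OneMap, and
# record how many results each address returned.
df_prop_info['full_add'] = df_prop_info['blk_no'] + " " + df_prop_info['street']
df_prop_info['result'] = df_prop_info['full_add'].apply(collect_results)
df_prop_info['result_length'] = df_prop_info['result'].apply(len)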

After inputting the results into the dataset, I did a check to see how many results there were in each row, grouping by the count of result_length. For instance, result_length 35 with a count of 1 means there is 1 row in the HDB Property Info dataset whose address returned 35 results from the OneMap API. One thing to note is that there were 3 rows where no results were returned.

The count of rows for different result_length

Since all the entries in the HDB Property Info dataset are buildings, they will all have a postal code. I was also interested in grouping by the number of distinct postal codes returned per row. The result is below.

The count of rows for different numbers of postal codes
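Both checks can be reproduced with something like the sketch below. The postal_code_nos_exc_nil column is the one used later in wrapper_fuzzy_onemap; treating the string 'NIL' as a missing postal code is an assumption on my part about how the OneMap results mark it.

# Sketch of the two checks: count results per row, and count distinct
# postal codes per row while excluding missing values ('NIL' is assumed
# to be OneMap's marker for a result without a postal code).
df_prop_info['postal_code_nos_exc_nil'] = df_prop_info['result'].apply(
    lambda res: len({r.get('POSTAL') for r in res} - {None, '', 'NIL'})
)

print(df_prop_info['result_length'].value_counts().sort_index())
print(df_prop_info['postal_code_nos_exc_nil'].value_counts().sort_index())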

The one obvious question is: why are there 5 rows with no postal code, instead of 3? The answer is that the OneMap API is not restricted to buildings. It can return a location with no postal code, for instance a street or road, with just the coordinates. I did a manual check on the 5 entries / addresses with postal code issues, and below are my findings:

Findings for Empty Postal Codes Results

It seemed that the issues were due to:
1) Wrong data in the HDB Property Info dataset
2) The OneMap API search not working perfectly.

For addresses with multiple results, I used the fuzzywuzzy package to pick the most similar result. I do the fuzzy match on the blk no. and road name, and return the postal code of the best match.

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Create fuzzy match using OneMap direct search info (not on postal code)
def fuzzymatch_onemap(ori_blk: str, ori_street: str, listofoptions: list) -> str:
    '''
    ori_blk: Original Block Number
    ori_street: Original Street Name
    listofoptions: The list of results in the format returned by the OneMap API.
    Each element of the list is a dict.

    Scores each option by the mean of the fuzzy scores for the block number
    and the street name.

    Returns the postal code of the best match within the list of options given.
    '''
    lenoflistofoptions = len(listofoptions)
    blk_list = [x['BLK_NO'] for x in listofoptions]
    road_name_list = [x['ROAD_NAME'] for x in listofoptions]
    score_for_ratio = []
    for x in range(lenoflistofoptions):
        blk_fuzz = fuzz.ratio(ori_blk, blk_list[x])
        road_fuzz = fuzz.ratio(ori_street, road_name_list[x])
        fuzz_score = (blk_fuzz + road_fuzz) / 2
        score_for_ratio.append(fuzz_score)

    # Index of the max score in the list
    idx_max = max(range(len(score_for_ratio)), key=score_for_ratio.__getitem__)
    postal_code = listofoptions[idx_max]['POSTAL']

    return postal_code

def wrapper_fuzzy_onemap(df_prop_info):
    df_new = df_prop_info.copy()
    for i, row in df_new.iterrows():
        ori_blk = row['blk_no']
        ori_street = row['street']
        result_list = row['result']
        # Only rows that returned at least one postal code can be matched
        if row['postal_code_nos_exc_nil'] > 0:
            best_postal_code = fuzzymatch_onemap(ori_blk, ori_street, result_list)
            df_new.at[i, 'postal_code'] = best_postal_code

    return df_new

Before I proceed, you may want to check out a great Medium article on the fuzzywuzzy package below:

Here is also an article on Record Linkage that I wrote some time back.

I decided to use fuzzywuzzy instead of record linkage because it is simpler and computationally lighter, and mainly because I just wanted to try fuzzywuzzy out, since it's my first time using it.
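For a quick feel of the scoring, fuzz.ratio returns a similarity between 0 and 100 for two strings. The examples below are illustrative values of my own, not taken from the actual runs.

from fuzzywuzzy import fuzz

# fuzz.ratio gives a 0-100 similarity score between two strings.
# An exact match scores 100; close variants score high, and a very
# different road name scores noticeably lower, which is what pushes
# the wrong candidates down the ranking.
print(fuzz.ratio("119", "119A"))                        # close, but not a perfect match
print(fuzz.ratio("KIM TIAN RD", "KIM TIAN ROAD"))       # high score for a close variant
print(fuzz.ratio("THOMSON RD", "MOUNT PLEASANT ROAD"))  # noticeably lower score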

Approach 2 — Downloading the Postal Code Dataset from OneMap API, Then Fuzzy Matching the Dataset

In Approach 1, we relied on the search algorithm of the OneMap API. In Approach 2, we will download the records for all postal codes from the OneMap API, assuming that the OneMap data is complete. In the several tests I did, I noticed there could be issues when searching by Blk No and Road Name, but when I searched by postal code there were no issues at all.

Before we proceed, I should state that this method is very taxing and exhaustive, especially since we will be querying a long list of postal codes. From 99.co, we can see that postal codes run from 01xxxx to 82xxxx.

https://www.99.co/singapore/insider/singapore-postal-codes/

That means there are around 820,000 calls to make.

I did this mainly for academic purposes, to see whether there would be any difference in terms of performance.

import pandas as pd

# Create a list of postal codes from 010000 to 830000
postal_codes = [f"{i:06d}" for i in range(10000, 830001)]

# Create a DataFrame
df = pd.DataFrame({
    'postal_codes': postal_codes,
    'results': [''] * len(postal_codes)  # Initialize the 'results' column with empty strings
})

print(df)

Basically, after this, you can create a for loop to call the API for each postal code and insert the information into the dataset. This is a snippet of the end result.

Snippet of postal codes with results.
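The loop itself can be as simple as the sketch below, reusing collect_results from Approach 1 with the postal code as the search value. Given the number of calls, in practice you would probably want to checkpoint the dataframe to disk periodically; that part is not shown here.

# Rough sketch of the (very long-running) loop: query each postal code and
# store the returned list of results back into the dataframe.
for i, row in df.iterrows():
    df.at[i, 'results'] = collect_results(row['postal_codes'])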

I also did a check on how many postal codes returned how many results. Basically, only 15% of postal codes returned a result from the OneMap API. It could be that there are many unused postal codes, or that OneMap isn't complete, but I think it is logical to assume the remaining 85% are unused postal codes reserved for future use.

I then flattened the data to look exactly like the OneMap API data. Basically, what I have done is extract the OneMap data by postal code and use it as my own dataset.
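The flattening step could look roughly like this sketch, producing the df_postal_code frame used by the matching function below. The exact code I used differs, but the idea is the same.

# Sketch: flatten the per-postal-code result lists into one row per OneMap
# record, giving a dataframe with the OneMap columns (BLK_NO, ROAD_NAME, POSTAL, ...).
flat_rows = []
for _, row in df.iterrows():
    if isinstance(row['results'], list):  # skip rows that were never filled in
        flat_rows.extend(row['results'])

df_postal_code = pd.DataFrame(flat_rows)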

I then used a similar function to do the fuzzy match.

# Create fuzzy match using the postal code df
def fuzzymatch_postal(ori_blk: str, ori_street: str, postal_code_df) -> str:
    '''
    Checks the postal code dataset for info.
    Assumes that the Blk No is always correct.
    Does a fuzzy match on the road name to return the most likely postal code.
    The postal code will then be matched to a bunch of other coordinates.
    '''
    condition = postal_code_df['BLK_NO'] == ori_blk
    postal_code_df_filtered = postal_code_df[condition]
    temp_df = postal_code_df_filtered[['BLK_NO', 'ROAD_NAME', 'POSTAL']].drop_duplicates()
    temp_df['fuzz_score'] = None
    for i, row in temp_df.iterrows():
        road_name = row['ROAD_NAME']
        fuzz_score = fuzz.ratio(ori_street, road_name)
        temp_df.at[i, 'fuzz_score'] = fuzz_score

    # Make sure the fuzz score is numeric
    temp_df['fuzz_score'] = pd.to_numeric(temp_df['fuzz_score'], errors='coerce')

    # Get the row with the highest fuzz score
    highest_fuzz_score_row = temp_df.loc[temp_df['fuzz_score'].idxmax()]

    # Return the most probable postal code
    return highest_fuzz_score_row['POSTAL']

def wrapper_fuzzy_postal(df_prop_info, df_postal_code):
    df_new = df_prop_info.copy()
    for i, row in df_new.iterrows():
        ori_blk = row['blk_no']
        ori_street = row['street']
        best_postal_code = fuzzymatch_postal(ori_blk, ori_street, df_postal_code)
        df_new.at[i, 'postal_code'] = best_postal_code

    return df_new

Comparing Approach 1 and Approach 2

The whole purpose of doing Approach 2 was:
1) To check whether the results from Approach 1 are accurate. We need to verify that the search algorithm and the fuzzy match work well as a combination.
2) To find any discrepancies and see whether Approach 2 is much better than Approach 1.

For context, postal_code_c is from Approach 1, and postal_code_d is from Approach 2. We got only 8 differences out of the 12,877 rows of data. In a way, that is a pretty good result.
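The comparison itself is a simple column check; df_compare below is a hypothetical frame holding both postal code columns side by side, not the exact variable name from my notebook.

# Sketch of the comparison between the two approaches.
diff_mask = df_compare['postal_code_c'] != df_compare['postal_code_d']
print(diff_mask.sum())  # 8 differing rows out of 12,877
print(df_compare.loc[diff_mask, ['blk_no', 'street', 'postal_code_c', 'postal_code_d']])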

My findings by manual check:

  1. 1 Beach Rd, supposed to be 190001 (by search).
  2. 14 Haig Rd, supposed to be 430014 (by postal).
  3. 2 Queen’s Rd, supposed to be 260002 (by postal). The search returned a landed house instead.
  4. 261 Punggol Way, supposed to be 820261 (by postal). The search returned 261A Punggol Way.
  5. 3 Beach Rd, supposed to be 190003 (by search). For postal, the OneMap API record is wrong: Raffles Hotel Singapore is supposed to be at 1 Beach Road, not 3 Beach Road.
  6. 3 Queen’s Rd, supposed to be 260003 (by postal). The search returned a landed house instead.
  7. 355 Sembawang Way, supposed to be 750355 (by postal). For search, the OneMap API record is wrong: 730355 is at Woodlands Avenue 1.
  8. 612 Punggol Dr, supposed to be 820612 (by postal). The search returned 612A Punggol Drive.

Out of the 8 discrepancies, the postal code dataset (Approach 2) was correct in 6 cases, and the search query using the address (Approach 1) was correct in 2 cases.

Final Approach

Since I already had the dataset from Approach 2 using the postal code, I used it and manually amended the 2 remaining issues. However, if urgency or cost were a consideration, I would have just gone with Approach 1, which is not too bad, with a very slight amount of error (relatively speaking).

Conclusion / Own Thoughts

It took a lot of time to call the API for large amounts of data. It would be good if there were datasets available for download (maybe there are, but I am not aware of them). It would also be good if OneMap could provide data for download through a user-friendly interface, segregated by building type or other categories, e.g. transport-related data such as all bus stops, all MRT stations, all taxi stands, all bicycle parking spots, etc.

Another approach would have been to also try out Google's API and see how it works. By right it is supposed to be more accurate, since we are able to Google search the buildings. The only concern would be how to exclude buildings which have already gone en bloc but are still available as past records online.

If you feel the article has been useful to improve your insight on using the OneMap API, do give a clap and follow me. :)
