An Immersive Introduction to Geospatial Analytics

Sharon George
12 min read · Jan 6, 2023

--

Searching for optimally located apartments in Singapore

Apartment blocks near the downtown area of Singapore (Photo by Mike Enerio on Unsplash)

In 2021, my team and I won a data visualization competition with over 400 participating teams. Our submission explored a new way to search for apartments, with the accessibility of amenities as the primary search criterion.

This project introduced me to a range of geospatial analysis techniques which I will share in this article. I have also shared the datasets and Python code used for our submission on my GitHub page so you can follow along for an immersive introduction to geospatial analytics.

Contents:
1. Geospatial Datasets
2. Geocoding
3. Coordinate Reference Systems
4. Visualization
5. Analysis
6. Putting it All Together

Geospatial Datasets

Geospatial datasets integrate spatial information (where something is) and attribute information (what something is). For example, the traffic layer in Google Maps relies on a dataset which integrates the location and speed of traffic along each road.

Snapshot of Google Maps traffic layer for Dallas, Texas

Geospatial datasets come in various formats. In this article I will touch on how to work with two that I have frequently encountered: shapefiles and KML files. In both cases we will make use of the Geopandas Python library.

Shapefiles (.shp)

The shapefile format was developed by Esri, a market leader in geospatial analytics software. Shapefiles are compatible with Esri software like ArcGIS as well as other platforms like Python and Tableau.

When searching for apartments in Singapore, proximity to train stations is a key consideration, as trains are the primary mode of transport. You can download the location of train stations as a shapefile here. You will notice that the dataset comprises a set of files (.shp, .shx, .dbf, etc.), as is typical of shapefiles. Ensure these files retain the same name and are stored in the same folder, or they will not open properly.

Shapefiles can be read using the geopandas read_file function.

import geopandas as gpd
filepath = "<insert path to folder with shapefile>/MRTLRTStnPtt.shp"
mrt = gpd.read_file(filepath)
First 3 lines of mrt GeoDataframe

As shown above, the shapefile is imported as a GeoDataFrame, which is similar to a pandas DataFrame. Each row represents a spatial object, in this case a train station. One column, conventionally named ‘geometry’, stores the spatial information about the object as a Shapely geometry object. Here, each train station is represented as a point, with its coordinates shown in brackets. Other common geometry types include lines and polygons. The remaining columns store various attributes of the object, in this case the name and number of each train station.
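The entries of the ‘geometry’ column can be inspected directly as Shapely objects. A minimal sketch (the coordinates here are hypothetical, not taken from the dataset):

```python
from shapely.geometry import Point

# a point like those in the 'geometry' column (hypothetical longitude/latitude)
stn = Point(103.8198, 1.3521)

print(stn.geom_type)  # Point
print(stn.x, stn.y)   # 103.8198 1.3521
```

The same `.x` and `.y` accessors work on any point in the GeoDataFrame, which is useful later when plotting.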

Keyhole Markup Language files (.kml)

KML format was originally developed for use with Google Earth, but has since become an international standard. A key advantage of KML files is that they can store a mixture of geometry types (e.g. points and polygons) in a single file, whereas a shapefile supports only one geometry type.

KML uses a tag-based structure with nested elements. For example, the following is an extract from a KML file of the location of supermarkets across Singapore (download here). Each pair of <Placemark id=XXX>…</Placemark> tags bookend the data for a spatial object, in this case a supermarket. Nested tags are used to label the various data fields for this object.

Extract from KML file of supermarkets in Singapore. Highlights indicate tags. Colors differentiate nesting levels: blue tags are nested within yellow tags, which are nested within green tags.
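To get a feel for this nested structure, here is a minimal sketch that parses a simplified Placemark with Python’s built-in ElementTree. The fragment below is illustrative only, not taken from the actual dataset:

```python
import xml.etree.ElementTree as ET

# a simplified Placemark mimicking the nested KML structure (illustrative only)
kml = """
<Placemark id="1">
  <name>kml_1</name>
  <Point>
    <coordinates>103.8198,1.3521,0.0</coordinates>
  </Point>
</Placemark>
"""

placemark = ET.fromstring(kml)
name = placemark.find("name").text                  # nested one level down
coords = placemark.find("Point/coordinates").text   # nested two levels down
print(name, coords)  # kml_1 103.8198,1.3521,0.0
```

Note that real KML files declare an XML namespace, which ElementTree requires you to handle explicitly; Geopandas takes care of this for you.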

When reading KML files with Geopandas, only some of these tags are automatically parsed into columns.

import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw' # activate KML driver
filepath = "<insert path to folder with kml file>/supermarkets-kml.kml"
sup_mkt = gpd.read_file(filepath,driver="KML")
First 3 lines of sup_mkt GeoDataframe. ‘Description’ column requires further parsing

In this case, the spatial information has been correctly parsed into the ‘geometry’ column, similar to what we saw with shapefiles. However, the attribute information remains nested within the ‘Description’ column. We can extract attributes of interest by searching for the relevant tags:

def parse_supmkt(string):
    tags = ['LIC_NAME', 'BLK_HOUSE', 'STR_NAME', 'UNIT_NO', 'POSTCODE']

    output = []

    for tag in tags:
        # locate the value sandwiched between the tag's table-cell markers
        start = "<th>" + tag + "</th> <td>"
        end = "</td>"
        len_start = len(start)

        i_start = string.find(start)
        i_end = string.find(end, i_start + len_start)

        entry = string[(i_start + len_start):i_end]
        output.append(entry)

    return output

supmkt_1 = sup_mkt.copy()
supmkt_1['Lic_Name'] = supmkt_1['Description'].apply(lambda x: parse_supmkt(x)[0])
supmkt_1['Address'] = supmkt_1['Description'].apply(
    lambda x: parse_supmkt(x)[1] + " " + parse_supmkt(x)[2] + ", #" + parse_supmkt(x)[3])
supmkt_1['Postcode'] = supmkt_1['Description'].apply(lambda x: parse_supmkt(x)[4])
supmkt_1.drop(columns=['Name', 'Description'], inplace=True)
First 3 lines of supmkt_1 GeoDataframe. Licensee name, Address and Postcode have now been parsed.

Geocoding

Geocoding is the conversion of addresses to geographic coordinates (or vice versa). Raw datasets sometimes describe the location of points of interest based on their street address. An example is the HDB Property Information dataset which provides information on apartment blocks in Singapore.

Extract from HDB Property Information dataset. Location is described using street address (red box).

For accurate geospatial analysis such as calculating distances between points, these addresses must be converted into geographic coordinates. This conversion is usually done by passing the street addresses to an online geocoding service using their API.

Geocoding with the OneMap API

For addresses in Singapore, the OneMap Search API provided by the Singapore Land Authority is a free and accurate service.

Essentially, the coordinates of each apartment block can be obtained by passing the street address (a concatenation of the ‘blk_no’ and ‘street’ columns of our dataset) to the function below:

import requests

def getcoordinates(address):
    req = requests.get('https://developers.onemap.sg/commonapi/search?searchVal='
                       + address + '&returnGeom=Y&getAddrDetails=Y&pageNum=1')
    resultsdict = req.json()  # parse the JSON response
    # keep only those with results; [0] selects the top-ranked match
    if len(resultsdict['results']) > 0:
        return (resultsdict['results'][0]['LATITUDE'],
                resultsdict['results'][0]['LONGITUDE'],
                resultsdict['results'][0]['POSTAL'])
    else:
        return None

To convert our whole list of addresses, we wrap each call in a try-except block so that isolated addresses which cannot be geocoded do not interrupt the whole loop. As geocoding all the addresses takes a while, it is helpful to use the tqdm library to display a progress bar so you can see how far along you are.

import pandas as pd
from tqdm import tqdm  # to display a progress bar

coordinateslist = []
failed_count = 0
addresslist = resi_addresses['addr']  # 'addr' is a concatenation of 'blk_no' and 'street'
print(f"Total no. of addresses: {len(addresslist)}")

for address in tqdm(addresslist, desc="Geocoding..."):  # progress bar updates each iteration
    try:
        coordinates = getcoordinates(address)
    except Exception:  # e.g. network error or malformed response
        coordinates = None
    if coordinates is not None:
        coordinateslist.append(coordinates)
    else:
        failed_count += 1
        print(f"No. of addresses with no coordinates (cumulative): {failed_count}")
        coordinateslist.append(None)  # keep row alignment with resi_addresses

df_coordinates = pd.DataFrame(coordinateslist)
df_combined = resi_addresses.join(df_coordinates)
df_combined = df_combined.rename(columns={0: 'Latitude', 1: 'Longitude', 2: 'Postcode'}).drop(columns=['addr'])
3 lines from df_combined. ‘blk_no’ and ‘street’ are from the original HDB Property Info dataset. ‘Latitude’, ‘Longitude’ and ‘Postcode’ were retrieved through geocoding.

Geocoding with Geopy

There are many other geocoding APIs available. The Geopy Python library lets you use many of them through a standardized syntax, instead of learning each API’s unique syntax.

Nominatim is a free geocoding service available through Geopy. It is more versatile than OneMap due to its global coverage, but I chose to use OneMap for this project as it was more accurate for addresses in Singapore.

Coordinate Reference Systems

So far in this article I have referred to geographic coordinates quite loosely. It is important to note that there are many different Coordinate Reference Systems (CRS), and coordinate values must be interpreted according to the CRS they are based on. Different CRSs have different x-y grid layouts, units, datums/origins, and so on.

A simple illustration of how the same coordinate values (1,1), defined in different coordinate systems, correspond to different locations.

There are two main classes of CRS: geographic and projected. The table below compares them.

Identifying Existing CRS

Geospatial datasets like shapefiles and KML files usually come embedded with CRS information. Once imported as a GeoDataFrame, their CRS can be read easily:

Reading the CRS of the supermarkets dataset we worked with earlier

Assigning CRS

Occasionally, geospatial datasets do not come with an embedded CRS. For example, when we geocoded the HDB Property Info dataset, we did not embed the CRS information into the dataframe. To do this, we first need to create a ‘geometry’ column, to store the spatial information of each object as a Shapely geometry class.

import pandas as pd
from shapely.geometry import Point

filepath = "<insert filepath for geocoded HDB Property Info Dataset>"
resi_blks = pd.read_csv(filepath, usecols=['blk_no','street','Latitude','Longitude','Postcode'])

# Represent each apartment block as a Shapely geometry (point)
resi_blks['geometry'] = resi_blks.apply(lambda row: Point(row.Longitude, row.Latitude), axis=1)
First 3 lines of resi_blks dataframe, including the newly created ‘geometry’ column

The resulting Dataframe can then be converted into a GeoDataframe and assigned a CRS. The appropriate CRS to assign can be found in the documentation for the geocoding service used. If this GeoDataframe is saved as a shapefile, its newly assigned CRS will be embedded with the file.

import geopandas as gpd

resi_blks2 = gpd.GeoDataFrame(resi_blks, geometry='geometry', crs="EPSG:4326")  # convert to GeoDataFrame
resi_blks2.to_file("resi_blks2.shp")  # save as shapefile

Reprojecting to a New CRS

Reprojection refers to the conversion of geographic coordinates from one CRS to another. In the example below, we reproject our HDB property info GeoDataframe from EPSG 4326 (a geographic CRS) to EPSG 32648 (a projected CRS suitable for Singapore) to facilitate more accurate distance measurement.

resi_blks2 = resi_blks2.to_crs("EPSG:32648")
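Under the hood, to_crs applies a coordinate transformation. The same conversion can be sketched with pyproj (a dependency of Geopandas), using an illustrative point near the center of Singapore:

```python
from pyproj import Transformer

# EPSG 4326 (longitude/latitude in degrees) -> EPSG 32648 (UTM zone 48N, in meters)
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32648", always_xy=True)

lon, lat = 103.8198, 1.3521  # illustrative point near the center of Singapore
easting, northing = transformer.transform(lon, lat)
print(round(easting), round(northing))  # same location, now expressed in meters
```

Because the projected coordinates are in meters, distance and area computations in the following sections become straightforward.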

Visualization

For extensive visualization, such as creating dashboards, you may wish to use dedicated software like Tableau or ArcGIS. However, these are not free. In addition, it is often helpful to visualize your data within the Python environment to verify that your geoprocessing steps return sensible results. We will see examples of this in the next section.

Geopandas does have a plot method which can be applied directly to any GeoDataFrame. However, the result is usually not very meaningful on its own, as there is no base map to give spatial context.

Location of train stations plotted using geopandas.plot().

Interactive mapping with Folium

The Folium library allows you to plot your data on a base map and interact with it (e.g. zoom and pan). There are many functions supporting different customizations; I will cover only the essentials in this article.

Begin by creating a base map using Folium’s Map function. Selecting a suitable center location and starting zoom level takes a little trial and error; in this example they have been adjusted to show Singapore.

import folium

map1 = folium.Map(location=[1.3521, 103.8198],  # [latitude, longitude], in EPSG 4326
                  zoom_start=11.5)  # initial zoom level
map1  # display
map1 (base map only). By default, OpenStreetMap map tiles are used. Other options are available.

Next, we will add markers to this map at the location of each train station. You can customize the icon displayed on the marker (choose one here), and other style elements. You can also enable pop-ups which display additional information when the marker is clicked on.

for _, row in mrt.iterrows():
    location = [row['geometry'].y, row['geometry'].x]  # folium requires [latitude, longitude]
    folium.map.Marker(location=location,  # marker location
                      icon=folium.Icon(color='red',  # custom marker style
                                       icon_color='black',
                                       icon='train',
                                       prefix='fa'),  # use this prefix for Font Awesome icons
                      popup=row['STN_NAME']).add_to(map1)  # info to display in pop-up
map1  # display
map1 with markers at train stations. Top left: full map; bottom right: zoomed in view showing a pop-up.

Markers like these are useful to plot points. I will demonstrate how to plot polygons in the next section.

Analysis

The Geopandas library has a wide range of functions you can use for various types of analysis. To give you a flavor of what is possible, here are some examples:

Generating buffers

The Geopandas buffer method generates a new geometry which encompasses all points within a specified distance from the input geometry. For this project, buffers were used to define the area within a certain radius around each apartment block as a precursor to further analysis.
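As a quick sanity check on what buffer produces, a 400m buffer around a point is (approximately) a circle of radius 400m. A sketch with plain Shapely, which Geopandas uses under the hood:

```python
import math
from shapely.geometry import Point

# 400 m buffer around a hypothetical point at the origin (projected CRS, units in meters)
buffer_400 = Point(0, 0).buffer(400)

print(buffer_400.geom_type)  # Polygon
# the buffer polygon approximates a circle, so its area is close to pi * r^2
print(round(buffer_400.area), round(math.pi * 400**2))
```

The small discrepancy in area arises because Shapely approximates the circle with a polygon; the resolution argument of buffer controls how fine that approximation is.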

Recall the geocoded HDB Property Info dataset we saw previously:

First 3 lines of resi_blks2 GeoDataframe.

Use the code below to generate buffers of radii ranging from 400m to 2km around each apartment block.

# use a projected CRS, as buffer generation involves distance measurement
resi_blks2 = resi_blks2.to_crs("EPSG:32648")

buffer_distances = [400, 800, 1200, 1600, 2000]

resi_blks3 = resi_blks2.copy()

for dist in buffer_distances:
    resi_blks3['buffer_' + str(dist)] = resi_blks3['geometry'].buffer(dist)
First 3 lines of resi_blks3

To verify that our code is working as expected, we can plot the buffers for the first two blocks using Folium. This is an example of how to plot polygons using Folium.

test_buffers = resi_blks3.loc[[0, 1], :]  # extract first 2 blocks

# convert CRS to EPSG 4326 to match what Folium supports
test_buffers = test_buffers.to_crs("EPSG:4326")
for dist in buffer_distances:
    test_buffers = test_buffers.set_geometry('buffer_' + str(dist)).to_crs("EPSG:4326")

map1 = folium.Map(location=[1.3521, 103.8198], zoom_start=12)  # folium base map

for _, row in test_buffers.iterrows():
    # add markers for block locations
    location = [row['geometry'].y, row['geometry'].x]  # [latitude, longitude]
    folium.map.Marker(location=location,
                      icon=folium.Icon(color='red',
                                       icon_color='black',
                                       icon='home'),
                      popup=row['blk_no'] + " " + row['street']).add_to(map1)

    # add buffers
    for dist in buffer_distances:
        buffer_gs = gpd.GeoSeries(row['buffer_' + str(dist)])
        buffer_json = buffer_gs.to_json()
        folium.GeoJson(data=buffer_json).add_to(map1)

map1  # display map
map1 showing 400m — 2km buffers generated around two apartment blocks

No. of supermarkets within walking distance from apartment

The Geopandas within method determines whether one geometry is contained within another. We can use this to count amenities contained within the 400m buffers (5–10 min walk) around each apartment block.
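The containment logic can be sketched with plain Shapely geometries (the points and names below are hypothetical):

```python
from shapely.geometry import Point

blk_buffer = Point(0, 0).buffer(400)  # hypothetical 400 m buffer around a block
near_shop = Point(100, 200)           # a supermarket roughly 224 m away
far_shop = Point(500, 500)            # a supermarket roughly 707 m away

print(near_shop.within(blk_buffer))  # True
print(far_shop.within(blk_buffer))   # False
```

When applied to a whole GeoSeries, as in the code further below, within returns a boolean mask that can be used to filter and count matching rows.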

As an example, we will use the supermarkets dataset we saw earlier in this article:

First 3 lines of supmkt GeoDataframe

Use the code below to count the number of supermarkets within a 400m radius of each apartment.

def count_amenity(amenity_gdf, poly_buffer):
    """
    Compute the number of amenities within the specified buffer polygon
    amenity_gdf = GeoDataFrame for the amenity type of interest
    poly_buffer = buffer polygon to search within
    """
    return len(amenity_gdf.loc[amenity_gdf.within(poly_buffer)])

amenities_1 = resi_blks3.copy()
supmkt = supmkt.to_crs(amenities_1.crs)  # ensure same CRS for a meaningful overlay
amenities_1['n_supmkt_400m'] = amenities_1['buffer_400'].apply(lambda x: count_amenity(supmkt, x))
First 3 lines of amenities_1

Area of green spaces within walking distance of apartment

The map below shows the location of green spaces across Singapore (download here).

Location of green spaces across Singapore

The Geopandas overlay method outputs new geometries based on how two sets of input geometries overlap (or do not overlap).

Illustration of different overlay options using Geopandas. (Source: Geopandas Documentation)
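The intersection option can be sketched with two overlapping squares in plain Shapely (hypothetical geometries, for illustration only):

```python
from shapely.geometry import box

square_a = box(0, 0, 2, 2)  # a 2x2 square
square_b = box(1, 1, 3, 3)  # another 2x2 square, offset so the two overlap

overlap = square_a.intersection(square_b)  # the 1x1 overlapping region
print(overlap.area)  # 1.0
```

Geopandas overlay applies the same operation pairwise across two GeoDataFrames and returns the resulting geometries as a new GeoDataFrame.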

For this project, we can use the intersection option to identify green spaces located within the 400m buffers (a 5–10 min walk) around each apartment block. We can then use the Geopandas area method to compute the area of these identified green spaces. The image below illustrates this for one apartment block.

Green spaces (green polygons) within 400m (blue circle) of an apartment block (red marker)

Use the following code to run the computation for all the apartment blocks:

def total_green_area(poly_buffer):
    """Sum up the total green area (in ha) within the specified buffer polygon (poly_buffer)"""

    # convert poly_buffer to a GeoDataFrame
    polybuffer_gdf = gpd.GeoDataFrame({'geometry': [poly_buffer]}, crs=greenery.crs)

    # find the intersection polygons between the poly_buffer and greenery GeoDataFrames
    green_spaces = gpd.overlay(polybuffer_gdf, greenery, how='intersection')

    # calculate the area of each green space
    green_spaces['area'] = green_spaces['geometry'].area

    total_area = green_spaces['area'].sum()
    total_area = total_area / 10000  # convert m2 to ha

    return total_area

amenities_2 = amenities_1.copy()
amenities_2['grnarea_400'] = amenities_2['buffer_400'].apply(lambda x: total_green_area(x))
First 3 lines of amenities_2

Distance to nearest train station

The Geopandas distance method computes the distance between two input geometries. We can use this to compute the distance between apartment blocks and each train station from the dataset we saw earlier:

First 3 lines of mrt GeoDataframe
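In a projected CRS, distance is simply the straight-line (Euclidean) distance in meters. A sketch with plain Shapely, using hypothetical points:

```python
from shapely.geometry import Point

blk = Point(0, 0)          # hypothetical apartment block
station = Point(300, 400)  # hypothetical train station

print(blk.distance(station))  # 500.0 (a 3-4-5 triangle)
```

Applied to a GeoSeries, distance returns one value per geometry, which is how the minimum distance is found below.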

We could calculate the distance from every apartment to every train station and identify the nearest one, but this would be very slow. Instead, we first limit our search to the 400m buffer around each apartment. If there are no stations within this area, we widen our search to the 800m buffer, and so on incrementally up to a maximum of 2km.

import numpy as np

def dist_to_amenity(amenity_gdf, row):
    """
    Compute the distance to the nearest amenity
    amenity_gdf = GeoDataFrame for the amenity type of interest
    row = row of data with the various buffer polygons to search within
    """
    # list amenities within 400m
    mybuffer = row.buffer_400
    list_amen = amenity_gdf.loc[amenity_gdf.within(mybuffer)]

    # if none within 400m, widen the search incrementally up to 2km
    for buffer_col in ['buffer_800', 'buffer_1200', 'buffer_1600', 'buffer_2000']:
        if len(list_amen) == 0:
            mybuffer = getattr(row, buffer_col)
            list_amen = amenity_gdf.loc[amenity_gdf.within(mybuffer)]

    # if no amenity within 2km, return null for the distance
    if len(list_amen) == 0:
        return np.nan
    else:
        # compute the distance from each amenity to the apartment block
        list_amen = list_amen.copy()  # avoid SettingWithCopyWarning
        list_amen['distance'] = list_amen['geometry'].distance(row.geometry)
        return min(list_amen['distance'])  # return the minimum distance

amenities_3 = amenities_2.copy()
mrt = mrt.to_crs(amenities_3.crs)  # ensure same CRS for a meaningful overlay
amenities_3['dist_mrt'] = amenities_3.apply(lambda row: dist_to_amenity(mrt, row), axis=1)
First 3 lines of amenities_3

Putting It All Together

The objective of this project was to explore a new way to search for apartments in Singapore, with the accessibility of amenities as the primary search criterion. In this article, I demonstrated how to map the location of apartments in Singapore through geocoding, and how to extract the location of amenities from geospatial datasets. Using the analysis techniques discussed in the last section, we can then quantify and compare the accessibility of different amenities from each apartment.

To view the complete Python code and Tableau Workbook used, visit my GitHub page.
