An Immersive Introduction to Geospatial Analytics
Searching for optimally located apartments in Singapore
In 2021, my team and I won a data visualization competition with over 400 participating teams. Our submission explored a new way to search for apartments, focusing on accessibility of amenities as the primary search criterion.
This project introduced me to a range of geospatial analysis techniques which I will share in this article. I have also shared the datasets and Python code used for our submission on my GitHub page so you can follow along for an immersive introduction to geospatial analytics.
Contents:
1. Geospatial Datasets
2. Geocoding
3. Coordinate Reference Systems
4. Visualization
5. Analysis
6. Putting it All Together
Geospatial Datasets
Geospatial datasets integrate spatial information (where something is) and attribute information (what something is). For example, the traffic layer in Google Maps relies on a dataset which integrates the location and speed of traffic along each road.
Geospatial datasets come in various formats. In this article I will touch on how to work with two that I have frequently encountered: shapefiles and KML files. In both cases we will make use of the Geopandas Python library.
Shapefiles (.shp)
The shapefile format was developed by Esri, a market leader in geospatial analytics software. Shapefiles are compatible with Esri software like ArcGIS, as well as other platforms like Python and Tableau.
When searching for apartments in Singapore, proximity to train stations is a key consideration as trains are the primary mode of transport. You can download the location of train stations as a shapefile here. You will notice that the dataset comprises a set of files (.shp, .shx, .dbf etc.), as is typical of shapefiles. Ensure these files retain the same base name and are stored in the same folder, or the shapefile will not open properly.
Shapefiles can be read using the geopandas read_file function.
import geopandas as gpd

filepath = "<insert path to folder with shapefile>/MRTLRTStnPtt.shp"
mrt = gpd.read_file(filepath)  # returns a GeoDataFrame
As shown above, the shapefile is imported as a GeoDataFrame, which is similar to a Pandas DataFrame. Each row represents a spatial object, in this case a train station. One column, conventionally named ‘geometry’, stores the spatial information about the object as a Shapely geometry class. In this case, each train station is represented as a point with coordinates as shown in brackets. Other common geometries include lines and polygons. The remaining columns store various attributes about the object, in this case the name and number of each train station.
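To make this structure concrete, here is a minimal sketch of a GeoDataFrame with the same shape. The station name, number, and coordinates below are illustrative placeholders, not values from the actual dataset:

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy GeoDataFrame mimicking the structure: attribute columns plus a
# 'geometry' column holding Shapely objects (illustrative values only)
mrt_demo = gpd.GeoDataFrame({'STN_NAME': ['DEMO MRT STATION'], 'STN_NO': ['NS1']},
                            geometry=[Point(30383.8, 31172.6)])

print(type(mrt_demo.geometry.iloc[0]).__name__)  # Point
print(mrt_demo.geometry.iloc[0].x)               # 30383.8
```

Each geometry object exposes its coordinates (`.x`, `.y` for points) and the many Shapely methods we will use later in this article.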
Keyhole Markup Language files (.kml)
The KML format was originally developed for use with Google Earth, but has since become an international standard. A key advantage of KML files is that they can store a mixture of geometry types (e.g. points and polygons), whereas a shapefile supports only a single, homogeneous geometry type.
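As a quick illustration of this difference, a GeoDataFrame built from KML-derived data can mix geometry types in a single ‘geometry’ column. The features below are made up for the sketch:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# One point and one polygon in the same 'geometry' column,
# valid for KML-derived data, but not storable in a single shapefile
mixed = gpd.GeoDataFrame({'name': ['a point', 'a polygon']},
                         geometry=[Point(0, 0), Polygon([(0, 0), (1, 0), (1, 1)])])
print(mixed.geom_type.tolist())  # ['Point', 'Polygon']
```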
KML uses a tag-based structure with nested elements. For example, the following is an extract from a KML file of the location of supermarkets across Singapore (download here). Each pair of <Placemark id=XXX>…</Placemark> tags bookend the data for a spatial object, in this case a supermarket. Nested tags are used to label the various data fields for this object.
When reading KML files using Geopandas, only some of these tags are automatically parsed into columns.
import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw' # activate KML driver
filepath = "<insert path to folder with kml file>/supermarkets-kml.kml"
sup_mkt = gpd.read_file(filepath,driver="KML")
In this case, the spatial information has been correctly parsed into the ‘geometry’ column, similar to what we saw with shapefiles. However, the attribute information remains nested within the ‘Description’ column. We can extract attributes of interest by searching for the relevant tags:
def parse_supmkt(string):
    tags = ['LIC_NAME', 'BLK_HOUSE', 'STR_NAME', 'UNIT_NO', 'POSTCODE']
    output = []
    for tag in tags:
        # locate the text between each tag's opening and closing markers
        start = "<th>" + tag + "</th> <td>"
        end = "</td>"
        len_start = len(start)
        i_start = string.find(start)
        i_end = string.find(end, i_start + len_start)
        entry = string[(i_start + len_start):i_end]
        output.append(entry)
    return output
supmkt_1 = sup_mkt.copy()
parsed = supmkt_1['Description'].apply(parse_supmkt)  # parse each row's tags once
supmkt_1['Lic_Name'] = parsed.apply(lambda p: p[0])
supmkt_1['Address'] = parsed.apply(lambda p: p[1] + " " + p[2] + ", #" + p[3])
supmkt_1['Postcode'] = parsed.apply(lambda p: p[4])
supmkt_1.drop(columns=['Name', 'Description'], inplace=True)
Geocoding
Geocoding is the conversion of addresses to geographic coordinates (or vice versa). Raw datasets sometimes describe the location of points of interest based on their street address. An example is the HDB Property Information dataset which provides information on apartment blocks in Singapore.
For accurate geospatial analysis such as calculating distances between points, these addresses must be converted into geographic coordinates. This conversion is usually done by passing the street addresses to an online geocoding service using their API.
Geocoding with the OneMap API
For addresses in Singapore, the OneMap Search API provided by the Singapore Land Authority is a free and accurate service.
Essentially the coordinates of each apartment block can be obtained by passing the street address (concatenation of the ‘blk_no’ and ‘street’ columns of our dataset) to the function below:
import requests

def getcoordinates(address):
    req = requests.get('https://developers.onemap.sg/commonapi/search?searchVal='
                       + address + '&returnGeom=Y&getAddrDetails=Y&pageNum=1')
    resultsdict = req.json()  # parse the JSON response (safer than eval)
    # keep only those with results
    if len(resultsdict['results']) > 0:
        top = resultsdict['results'][0]  # [0] takes the first result returned
        return top['LATITUDE'], top['LONGITUDE'], top['POSTAL']
    else:
        return None
To convert our whole list of addresses, we wrap each call in a try-except block, so that isolated addresses which cannot be geocoded do not interrupt the whole loop. As it will take a while to geocode all the addresses, it is helpful to use the tqdm library to display a progress bar so you can see how far along you are.
import pandas as pd
from tqdm import tqdm  # to view progress bar

coordinateslist = []
failed_count = 0
addresslist = resi_addresses['addr']  # 'addr' is a concatenation of 'blk_no' and 'street'
print(f"Total no. of addresses: {len(addresslist)}")

for address in tqdm(addresslist, desc="Geocoding..."):  # update progress bar at each iteration
    try:
        coordinates = getcoordinates(address)
    except Exception:  # if the request errors out, record a failure
        coordinates = None
    if coordinates is not None:
        coordinateslist.append(coordinates)
    else:
        failed_count += 1
        print(f"No. of addresses with no coordinates (cumulative): {failed_count}")
        coordinateslist.append(None)  # keep row alignment for the join below

df_coordinates = pd.DataFrame(coordinateslist)
df_combined = resi_addresses.join(df_coordinates)
df_combined = df_combined.rename(columns={0: 'Latitude', 1: 'Longitude', 2: 'Postcode'}).drop(columns=['addr'])
Geocoding with Geopy
There are many other geocoding APIs available. The Geopy Python library lets you access many of them through a standardized syntax, instead of learning each API’s unique syntax.
Nominatim is a free geocoding service available through Geopy. It is more versatile than OneMap due to its global coverage, but I chose to use OneMap for this project as it was more accurate for addresses in Singapore.
Coordinate Reference Systems
So far in this article I have referred to geographic coordinates quite loosely. It is important to note that there are many different Coordinate Reference Systems (CRS), and coordinate values must be interpreted according to the CRS they are based on. Different CRS have different x-y grid layouts, units, datums and origins.
There are two main classes of CRS: geographic and projected. A geographic CRS (e.g. EPSG 4326) locates points on a 3D model of the Earth’s surface using latitude and longitude, measured in degrees. A projected CRS (e.g. EPSG 32648) locates points on a flattened 2D plane using x-y coordinates, typically measured in meters, which makes it more convenient for distance and area calculations.
Identifying Existing CRS
Geospatial datasets like shapefiles and KML files are usually embedded with CRS information. Once imported as a GeoDataFrame, their CRS can be read easily:
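With the real datasets you would simply call, say, `mrt.crs` after `read_file`. The sketch below uses a small constructed GeoDataFrame so that it runs standalone:

```python
import geopandas as gpd
from shapely.geometry import Point

# stand-in for a freshly imported GeoDataFrame with an embedded CRS
gdf = gpd.GeoDataFrame({'name': ['demo']},
                       geometry=[Point(103.82, 1.35)],
                       crs="EPSG:4326")
print(gdf.crs.to_epsg())  # 4326
```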
Assigning CRS
Occasionally, geospatial datasets do not come with an embedded CRS. For example, when we geocoded the HDB Property Info dataset, the resulting dataframe carried no CRS information. To assign one, we first need to create a ‘geometry’ column, to store the spatial information of each object as a Shapely geometry class.
import pandas as pd
from shapely.geometry import Point
filepath = "<insert filepath for geocoded HDB Property Info Dataset>"
resi_blks = pd.read_csv(filepath, usecols=['blk_no','street','Latitude','Longitude','Postcode'])
# Represent each apartment block as a Shapely geometry (point)
resi_blks['geometry'] = resi_blks.apply(lambda row: Point(row.Longitude, row.Latitude), axis=1)
The resulting DataFrame can then be converted into a GeoDataFrame and assigned a CRS. The appropriate CRS to assign can be found in the documentation for the geocoding service used. If this GeoDataFrame is saved as a shapefile, its newly assigned CRS will be embedded in the file.
import geopandas as gpd

resi_blks2 = gpd.GeoDataFrame(resi_blks, crs="EPSG:4326")  # convert to GeoDataFrame
resi_blks2.to_file("resi_blks2.shp")  # save as shapefile (CRS is embedded)
Reprojecting to a New CRS
Reprojection refers to the conversion of geographic coordinates from one CRS to another. In the example below, we reproject our HDB property info GeoDataframe from EPSG 4326 (a geographic CRS) to EPSG 32648 (a projected CRS suitable for Singapore) to facilitate more accurate distance measurement.
resi_blks2 = resi_blks2.to_crs("EPSG:32648")
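A quick sanity check of what reprojection does: a point near the center of Singapore, expressed in degrees under EPSG 4326, maps to coordinates in meters under EPSG 32648. The point below is illustrative:

```python
import geopandas as gpd
from shapely.geometry import Point

# The same point in two CRS: degrees (EPSG 4326) vs meters (EPSG 32648)
pt = gpd.GeoSeries([Point(103.8198, 1.3521)], crs="EPSG:4326")
pt_utm = pt.to_crs("EPSG:32648")
print(pt_utm.iloc[0].x, pt_utm.iloc[0].y)  # roughly 3.7e5, 1.5e5 (meters)
```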
Visualization
For extensive visualization, creating dashboards etc., you may wish to use more specialized software like Tableau or ArcGIS. However, these are not free. In addition, it is often helpful to visualize your data within the Python environment so you can verify that your geoprocessing steps are returning sensible results. We will see examples of this in the next section.
Geopandas does have a plot method which can be applied directly to any GeoDataFrame. However, the result is usually not meaningful on its own, as there is no base map to give spatial context.
Interactive mapping with Folium
The Folium library allows you to plot your data on a base map and enables interactive mapping (e.g. you can zoom and pan). There are many functions to support different customizations; I will only cover the essentials in this article.
Begin by creating a base map using Folium’s Map function. Selecting a suitable center location and starting zoom level will take a little trial and error. In this example they have been adjusted to show Singapore.
import folium

map1 = folium.Map(location=[1.3521, 103.8198],  # [latitude, longitude] in EPSG 4326
                  zoom_start=11.5)  # initial zoom level
map1  # display
Next, we will add markers to this map at the location of each train station. You can customize the icon displayed on the marker (choose one here), and other style elements. You can also enable pop-ups which display additional information when the marker is clicked on.
for _, row in mrt.iterrows():
    # Folium requires location as [latitude, longitude]
    location = [row['geometry'].y, row['geometry'].x]
    folium.map.Marker(location=location,  # marker location
                      icon=folium.Icon(color='red',  # custom marker style
                                       icon_color='black',
                                       icon='train',
                                       prefix='fa'),  # use this prefix for Font Awesome icons
                      popup=row['STN_NAME']).add_to(map1)  # info to display in pop-up
map1  # display
Markers like these are useful to plot points. I will demonstrate how to plot polygons in the next section.
Analysis
The Geopandas library has a wide range of functions you can use for various types of analysis. To give you a flavor of what is possible, here are some examples:
Generating buffers
The Geopandas buffer method generates a new geometry which encompasses all points within a specified distance from the input geometry. For this project, buffers were used to define the area within a certain radius around each apartment block as a precursor to further analysis.
Recall the geocoded HDB Property Info dataset we saw previously:
Use the code below to generate buffers with radii from 400m to 2km around each apartment block.
# use a projected CRS, as buffer generation involves distance measurement
resi_blks2 = resi_blks2.to_crs("EPSG:32648")
buffer_distances = [400, 800, 1200, 1600, 2000]
resi_blks3 = resi_blks2.copy()
for dist in buffer_distances:
    resi_blks3['buffer_' + str(dist)] = resi_blks3['geometry'].buffer(dist)
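As a sanity check on units, a 400m buffer in a projected CRS should be a near-circular polygon whose area is close to pi times the radius squared:

```python
import math
from shapely.geometry import Point

# A 400 m buffer around a point; its area should be just under pi * 400^2,
# since Shapely approximates the circle with straight segments
buf = Point(0, 0).buffer(400)
print(buf.area, math.pi * 400 ** 2)
```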
To verify that our code is working as expected, we can plot the buffers for the first two blocks using Folium. This is an example of how to plot polygons using Folium.
test_buffers = resi_blks3.loc[[0, 1], :]  # extract first 2 blocks

# Convert CRS to EPSG 4326 to match what Folium supports
test_buffers = test_buffers.to_crs("EPSG:4326")
for dist in buffer_distances:
    test_buffers = test_buffers.set_geometry('buffer_' + str(dist)).to_crs("EPSG:4326")

map1 = folium.Map(location=[1.3521, 103.8198], zoom_start=12)  # Folium base map
for _, row in test_buffers.iterrows():
    # add a marker at the block location ([latitude, longitude])
    folium.map.Marker(location=[row['geometry'].y, row['geometry'].x],
                      icon=folium.Icon(color='red',
                                       icon_color='black',
                                       icon='home'),
                      popup=row['blk_no'] + " " + row['street']).add_to(map1)
    # add the buffers around this block
    for dist in buffer_distances:
        buffer_gs = gpd.GeoSeries(row['buffer_' + str(dist)])
        buffer_json = buffer_gs.to_json()
        folium.GeoJson(data=buffer_json).add_to(map1)
map1  # display map
Number of supermarkets within walking distance of apartment
The Geopandas within method determines whether one geometry is contained within another. We can use this to count amenities contained within the 400m buffers (5–10 min walk) around each apartment block.
As an example, we will use the supermarkets dataset we saw earlier in this article:
Use the code below to count the number of supermarkets within a 400m radius of each apartment.
def count_amenity(amenity_gdf, poly_buffer):
    """
    Compute number of amenities within the specified buffer polygon
    amenity_gdf = GeoDataFrame for amenity type of interest
    poly_buffer = buffer polygon to search within
    """
    return len(amenity_gdf.loc[amenity_gdf.within(poly_buffer)])

amenities_1 = resi_blks3.copy()
supmkt = supmkt.to_crs(amenities_1.crs)  # ensure same CRS for a meaningful overlay
amenities_1['n_supmkt_400m'] = amenities_1['buffer_400'].apply(lambda x: count_amenity(supmkt, x))
Area of green spaces within walking distance of apartment
The map below shows the location of green spaces across Singapore (download here).
The Geopandas overlay method outputs new geometries based on how two sets of input geometries overlap (or do not overlap).
For this project, we can use the intersection option to identify green areas located within the 400m buffers (5–10min walk) around each apartment block. We can then use the Geopandas area method to compute the area of these identified green spaces. The image below illustrates this for one apartment block.
Use the following code to run the computation for all the apartment blocks:
def total_green_area(poly_buffer):
    """Sum up total green area (in ha) within the specified buffer polygon (poly_buffer)"""
    # convert poly_buffer to a GeoDataFrame
    polybuffer_gdf = gpd.GeoDataFrame({'geometry': [poly_buffer]}, crs=greenery.crs)
    # find the intersection polygons between the poly_buffer and greenery GeoDataFrames
    green_spaces = gpd.overlay(polybuffer_gdf, greenery, how='intersection')
    # calculate the area of each green space and sum
    green_spaces['area'] = green_spaces['geometry'].area
    total_area = green_spaces['area'].sum()
    return total_area / 10000  # convert m2 to ha

amenities_2 = amenities_1.copy()
amenities_2['grnarea_400'] = amenities_2['buffer_400'].apply(total_green_area)
Distance to nearest train station
The Geopandas distance method computes the distance between two input geometries. We can use this to compute the distance between apartment blocks and each train station from the dataset we saw earlier:
We could calculate the distance from every apartment to every train station and identify the nearest one, but this would be very slow. Instead, we first limit our search to the 400m buffer around each apartment. If there are no stations within this area, we widen our search to the 800m buffer, and so on incrementally up to a maximum of 2km.
import numpy as np

def dist_to_amenity(amenity_gdf, row):
    """
    Compute distance to nearest amenity
    amenity_gdf = GeoDataFrame for amenity type of interest
    row = row of data with various buffer polygons to search within
    """
    # search within the 400m buffer first; if empty, widen to 800m and so on up to 2km
    for dist in [400, 800, 1200, 1600, 2000]:
        mybuffer = row['buffer_' + str(dist)]
        list_amen = amenity_gdf.loc[amenity_gdf.within(mybuffer)]
        if len(list_amen) > 0:
            break
    # if no amenity within 2km, return null for distance
    if len(list_amen) == 0:
        return np.nan
    # compute distance from each amenity to the apartment block and return the minimum
    list_amen = list_amen.copy()  # copy to avoid mutating a slice of amenity_gdf
    list_amen['distance'] = list_amen['geometry'].distance(row.geometry)
    return list_amen['distance'].min()

amenities_3 = amenities_2.copy()
mrt = mrt.to_crs(amenities_3.crs)  # ensure same CRS for meaningful distance measurement
amenities_3['dist_mrt'] = amenities_3.apply(lambda row: dist_to_amenity(mrt, row), axis=1)
Putting It All Together
The objective of this project was to explore a new way to search for apartments in Singapore, focusing on accessibility of amenities as the primary search criterion. In this article, I demonstrated how to map the location of apartments in Singapore through geocoding, and how to extract the location of amenities from geospatial datasets. Using the analysis techniques discussed in the last section, we can then quantify and compare the accessibility of different amenities from each apartment.
To view the complete Python code and Tableau Workbook used, visit my GitHub page.