End state of processed data for use with the d3 Visualization

D3 Step-by-step Guide — (Part 2 of 4) Singapore HDB Resale Price on Planning Map — Data Processing with nodeJS

Clarence Cai
9 min read · Mar 22, 2020


This is part 2 of the 4-part guide. In this part, we process the data that we have collected in part 1.

This is a practical step-by-step guide to creating a map visualization in d3. In this 4-part guide, we set off to create a visualization that shows how the resale prices of HDB flats in each planning area have changed throughout the years. Check out Part 1 here.

Here’s a demo of what you will be creating in this tutorial. Here’s where you can find the source code.


Dataset analysis

First, the data from HDB was broken up into 5 CSVs. 2 are dated based on approval dates and 3 are based on registration dates. For our purposes, this distinction is irrelevant and I would like to merge the CSVs into one dataset. Another discrepancy I noticed is that a ‘remaining_lease’ column appears in only 2 of the CSVs.

Dataset cleansing

If we look through the data, the CSV from ‘1990–1999’ has flat models in all-caps while the rest of the CSVs have them in Sentence Case. The CSV from ‘2015–2016’ has remaining_lease as an integer (the number of years left), while the CSV for ‘2017 onwards’ has remaining_lease in a format like ’60 years 11 months’, which will need to be parsed.

Given that we do not need the lease information and flat types, I am opting to ignore them for the time being, but I will leave them in the collated dataset so that I can still clean them up later if I ever need them.

Data collation

Now, I would like to merge all 5 CSVs into 1 JSON dataset for ease of use. As the dataset is small, this is an option. If you are dealing with larger datasets (i.e. 500MB or larger), I would recommend that you either reduce the size of the dataset through early aggregation or use streaming methods to process it.

First, we should set up our folder structure.

  • process — where I set up my node project. Processed data sits here as well. I always like to separate my raw datasets and processed datasets, much like cooking, where you keep your raw and cooked food apart for fear of contamination. However, we should never use the words ‘cooking’ and ‘data’ together.
  • rawdata — where I place all my raw datasets.
  • visualization — placeholder for where we will create our visualization

Navigate into your process directory and run the following commands to set up a basic node dev environment.
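The exact commands are not shown here, but a minimal sketch looks like the following (the package list is an assumption based on the dependencies discussed below):

```
npm init -y
npm install neat-csv node-fetch bottleneck @turf/turf
```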

We can then set up basic commands that we will be using frequently. In this case, I set up start and collate that we can run in the following way: npm run start or npm run collate. Now we can create the index.js and collate.js files.
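A sketch of the scripts section of package.json; the exact entries are assumptions, but they match how the scripts are invoked in this guide:

```
{
  "scripts": {
    "start": "node index.js",
    "collate": "node collate.js"
  }
}
```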

When creating a nodeJS file for data processing, I like to set up a basic boilerplate so that I can use async-await and keep my basic execution path clean without callback-hell or promise-hell.
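A minimal sketch of such a boilerplate:

```
// index.js: a minimal async-await boilerplate
const main = async () => {
  // processing steps go here
};

main()
  .then(() => console.log('Done'))
  .catch((err) => console.error(err));
```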

We can now bring in our dependencies.

  • neat-csv — a promises wrapper around csv-parser which helps convert CSV to JSON
  • fs.promises — promises-based API of node’s fs APIs, used for read/write of files
  • path — node path API, helps to combine path elements (e.g. “dir/file.ext” on Linux/macOS and “dir\file.ext” on Windows)
  • node-fetch — provides a fetch-based API to be used for RESTful API calls
  • turf — a geospatial analysis library; we will use it to check whether a point (latitude/longitude) falls within a planning area’s polygon
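A sketch of the corresponding imports (the package names are assumptions; newer releases of node-fetch and neat-csv are ESM-only, so CommonJS-compatible versions are assumed here):

```
const neatCsv = require('neat-csv');
const fs = require('fs').promises;
const path = require('path');
const fetch = require('node-fetch');
const turf = require('@turf/turf');
```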

I have extracted the HDB resale prices dataset into a folder resale-flat-prices. We should load the content of each CSV and convert it into an array so that the arrays can be concatenated into 1 long array.
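A sketch of the loading step; the folder location and the helper name loadResaleData are assumptions for illustration:

```
const loadResaleData = async () => {
  const dir = path.join(__dirname, '..', 'rawdata', 'resale-flat-prices');
  const files = (await fs.readdir(dir)).filter((f) => f.endsWith('.csv'));

  let records = [];
  for (const file of files) {
    const csv = await fs.readFile(path.join(dir, file));
    const rows = await neatCsv(csv); // array of objects keyed by the CSV headers
    records = records.concat(rows);
  }
  return records;
};
```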

Now we have loaded a full list of all 30 years of HDB resale prices from 1990–2020.

We do some pre-processing so that strings are converted into usable typed fields. In addition, we calculate the price/sqm so that we can normalize the price of an apartment for ease of comparison.
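A sketch of this pre-processing; the column names follow the HDB CSV headers, and price_psm is a hypothetical name for the derived field:

```
const preprocess = (records) =>
  records.map((r) => ({
    ...r,
    floor_area_sqm: parseFloat(r.floor_area_sqm),
    resale_price: parseFloat(r.resale_price),
    // normalized price per square metre for comparison across flat sizes
    price_psm: parseFloat(r.resale_price) / parseFloat(r.floor_area_sqm),
  }));
```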

Size of Dataset

So, if you have run the code above, you will see that there are 812,704 records in our dataset. That makes sense; we are, after all, looking at 30 years of HDB resale transactions. Doing some basic math, this works out to be about 74 transactions per day, every day, for 30 years.

Sure, we could write a script to query the latitudes and longitudes of all of the addresses on OneMap and then map them to their respective districts. However, taking a quick look at OneMap’s API documentation reveals the following: a maximum of 250 calls per minute.

Another bit of basic math tells us that if we do it the naive way, it will take our script about 54 hours to complete (812,704 requests ÷ 250 requests per minute ≈ 3,251 minutes, or roughly 54.2 hours), given that we rate-limit our script properly while calling the API.

While that is 54 hours that we don’t need to care about as a script is dutifully doing the work for us, there’s no need to make OneMap’s servers run harder than they need to if we can optimize the queries.

Reducing Work, Beating Naivety

Let’s look at a plan on how we can reduce the number of records to query on:

For all records, let’s collect their blocks and street names so that we can query OneMap with unique addresses for their latitudes/longitudes. We can de-duplicate the addresses to reduce the number of queries we need to make.
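A sketch of the de-duplication, using a Set keyed on block + street name (the helper name is illustrative):

```
const uniqueAddresses = (records) => {
  const seen = new Set();
  for (const r of records) {
    seen.add(`${r.block} ${r.street_name}`);
  }
  return [...seen].map((address) => ({ address }));
};

// console.log(uniqueAddresses(records).length);
```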

The output shows 9,161 unique addresses ✌.

We have reduced our 812,000+ queries to 9,161. We can query these slowly, without bursting the limit specified by OneMap, and be done within an hour (at 250 calls per minute, roughly 37 minutes).

In a practical scenario, it is important to optimize and reduce the amount of API calls we need so that we reduce processing costs as well as bandwidth costs.

At this point, before we continue on, please feel free to take a break, you will need it. Go ahead, I’ll wait for you.

☕🍵🍰🍣

GPS coordinates retrieval and augmentation

Welcome back! Let’s continue by sorting out the districts for the transactions in the dataset.

We will now use the OneMap API again to augment the records with latitude and longitude. We use a new dependency, bottleneck, to ensure that we are not hammering the API by using 1 max concurrent connection and only starting a new connection every 250 milliseconds (max of 240 requests per minute, below their stated limit of 250/min).
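A sketch of the rate-limited lookup; the OneMap search endpoint and response fields reflect the API as documented at the time of writing, and the file name latlong.json is an assumption:

```
const Bottleneck = require('bottleneck');

// 1 concurrent request, a new request at most every 250 ms (~240/min)
const limiter = new Bottleneck({ maxConcurrent: 1, minTime: 250 });

const lookupAddress = async (address) => {
  const url =
    'https://developers.onemap.sg/commonapi/search?returnGeom=Y&getAddrDetails=Y&searchVal=' +
    encodeURIComponent(address);
  const res = await fetch(url);
  const json = await res.json();
  const hit = json.results && json.results[0];
  return hit
    ? { address, lat: parseFloat(hit.LATITUDE), long: parseFloat(hit.LONGITUDE) }
    : { address, lat: null, long: null };
};

const augmentLatLong = async (addresses) => {
  const results = await Promise.all(
    addresses.map((a) => limiter.schedule(() => lookupAddress(a.address)))
  );
  // cache to disk so re-runs of the script can skip the API calls entirely
  await fs.writeFile('latlong.json', JSON.stringify(results, null, 2));
  return results;
};
```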

Notice that I store the results into a JSON file. This allows us to stub out the augmentLatLong function so that we do not need to call the API multiple times if we are re-running this script (as I did many times while iterating upon it).

Fill in the Blanks

There were some addresses which could not be automatically filled in. My guess is that these are old blocks or roads that have since been replaced or demolished, which OneMap no longer keeps track of. So we’ll write some code to dump these addresses into a file so that we can deal with them separately.
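A sketch of that step (the file name missing.json is an assumption):

```
const dumpMissing = async (augmented) => {
  const missing = augmented.filter((a) => a.lat === null || a.long === null);
  await fs.writeFile('missing.json', JSON.stringify(missing, null, 2));
  console.log(`${missing.length} addresses need manual attention`);
  return missing;
};
```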

Ah, 28, not too bad. We will have to roll up our sleeves and figure this out. (Who said that dealing with data was glamorous? 😉) However, let’s put on some music and we’ll get through this. (I like to listen to the following playlist while working on data visualizations.)

We will use Google Maps to find out these GPS locations.

In order to save time, and given that we only need accuracy at the district level, some blocks that were close together on St. George’s Road were given the same location.

However, there were some roads which I could not find the location of in Google Maps. Such as: “Buangkok South Farmway 1”, “Nile Road” and “Jalan Pasar Baru”. Some digging revealed that those areas were redeveloped and roads remade and renamed. I retrieved the new addresses from HDB’s site here: https://www.hdb.gov.sg/cs/infoweb/residential/living-in-an-hdb-flat/sers-and-upgrading-programmes/completed-sers-projects.

So we fill in the exported JSON file and re-import it.

And now we combine it back with the list from OneMap’s API.
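A sketch of the merge, assuming the manually completed file is saved as missing-filled.json with the same shape as the OneMap results:

```
const mergeManualFixes = async (augmented) => {
  const manual = JSON.parse(await fs.readFile('missing-filled.json', 'utf8'));
  const manualByAddress = new Map(manual.map((m) => [m.address, m]));
  return augmented.map((a) =>
    a.lat === null ? manualByAddress.get(a.address) || a : a
  );
};
```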

Now, we have a full list of all unique addresses with latitudes and longitudes filled up.

Mapping GPS Coordinates to Singapore Planning Districts

With the latitudes and longitudes retrieved, we can now check which district they reside in. To do this, we can use Turf.js, a library that has helper methods to do tons of geography related calculations (please do not reinvent the wheel and just use a library like this).

We will be using 3 functions from Turf:

  • point — this creates a point feature from a coordinate pair, such as a latitude and longitude
  • multiPolygon — this creates a multiPolygon, an area on a geographical sphere. The reason we are using multiPolygon is because Singapore planning areas are made up of multiple polygons added to (or subtracted from) each other. (Mainly cases where an area is made up of multiple islands).
  • booleanPointInPolygon — this function allows us to check if a point is present inside a polygon/multipolygon

We will compare the geolocations with the TopoJSON file we retrieved as we will be using that to render the final visualization.
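A sketch of the point-in-district check, assuming the planning-area boundaries have been converted from TopoJSON to GeoJSON features (e.g. with topojson-client’s feature()) and that the district name lives in properties.PLN_AREA_N; both are assumptions about the boundary file:

```
const findDistrict = (districtFeatures, lat, long) => {
  const pt = turf.point([long, lat]); // GeoJSON order is [longitude, latitude]
  const match = districtFeatures.find((feature) => {
    // planning areas are multipolygons (e.g. areas made up of several islands)
    const area = turf.multiPolygon(feature.geometry.coordinates);
    return turf.booleanPointInPolygon(pt, area);
  });
  return match ? match.properties.PLN_AREA_N : null;
};
```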

Mapping Districts to Sales Transactions

With the districts (“towns”) now tagged to the addresses, we can now map them back to the list of sales transactions.

Notice below that I perform some error checking to ensure that every record is mapped to a district value. It is always important to perform these checks to ensure that your data is clean and complete.
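A sketch of the mapping, reusing the hypothetical helpers from earlier; the district field name is an assumption:

```
const tagDistricts = (records, addressDistricts) => {
  const lookup = new Map(addressDistricts.map((a) => [a.address, a.district]));
  return records.map((r) => {
    const district = lookup.get(`${r.block} ${r.street_name}`);
    if (!district) {
      // error checking: every record must resolve to a district
      throw new Error(`No district found for ${r.block} ${r.street_name}`);
    }
    return { ...r, district };
  });
};
```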

Closing up and Tying Everything Together

Finally, let’s hook up our code and produce an output file with all the districts set up.
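A sketch of how the pieces might be wired together in index.js, using the hypothetical helper names from the earlier snippets (loading of the boundary features is omitted):

```
const main = async () => {
  const records = preprocess(await loadResaleData());
  const addresses = uniqueAddresses(records);
  // the dumpMissing / manual-fill step sits between these two calls
  const withLatLong = await mergeManualFixes(await augmentLatLong(addresses));
  // districtFeatures: GeoJSON features of the planning areas (loading omitted)
  const addressDistricts = withLatLong.map((a) => ({
    ...a,
    district: findDistrict(districtFeatures, a.lat, a.long),
  }));
  const tagged = tagDistricts(records, addressDistricts);
  await fs.writeFile('output.json', JSON.stringify(tagged));
};
```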

If you run npm run start, you will end up with output.json, which will contain the sales transactions of all HDB apartments from 1990–2020 with their districts populated nicely.

A Tie and A Twist

We are not done yet! We still have 800 thousand lines of data (~260MB). What we want is a smaller dataset, consolidated by month/year, so that we can easily load it in an HTML page and feed it into d3.

Let’s set up collate.js. You can follow the same steps we used to set up index.js.

What I am going for is something like this:

  • month 1
    - town 1
    - town 2
  • month 2
    - town 1
    - town 2

This is so that I can show the distribution of the prices across time. In addition, I want to calculate the average price/sqm of each town in each month.

First, we will load what we prepared above, and collect all the months and towns in order to generate the consolidated structure. We prepare two fields, one for the number of samples (n) and another for the sum total to be collected.
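A sketch of this first step in collate.js; the field and helper names are illustrative, and the records are grouped by the district field added earlier (the “town” level in the structure above):

```
// collate.js
const fs = require('fs').promises;

const buildStructure = (records) => {
  const consolidated = {};
  for (const r of records) {
    if (!consolidated[r.month]) consolidated[r.month] = {};
    if (!consolidated[r.month][r.district]) {
      // n = number of samples, total = running sum of price/sqm
      consolidated[r.month][r.district] = { n: 0, total: 0 };
    }
  }
  return consolidated;
};
```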

Then, we will populate this consolidated object. We use an object so that it is easier to look up the element we need instead of having to loop through an array.
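A sketch of the population step, continuing from the structure above:

```
const populate = (consolidated, records) => {
  for (const r of records) {
    // keyed lookups on the object, no array searching needed
    const entry = consolidated[r.month][r.district];
    entry.n += 1;
    entry.total += r.price_psm;
  }
  return consolidated;
};
```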

Finally (I promise), we calculate the mean price/sqm and roll the objects up into arrays so that they are easier to handle in d3. We also save the data into a JSON file.
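A sketch of the final roll-up; the output file name collated.json is an assumption:

```
const rollup = (consolidated) =>
  Object.entries(consolidated).map(([month, towns]) => ({
    month,
    towns: Object.entries(towns).map(([town, { n, total }]) => ({
      town,
      n,
      meanPricePsm: total / n,
    })),
  }));

const main = async () => {
  const records = JSON.parse(await fs.readFile('output.json', 'utf8'));
  const consolidated = populate(buildStructure(records), records);
  await fs.writeFile('collated.json', JSON.stringify(rollup(consolidated)));
};

main().catch(console.error);
```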

Voila 🎉🎉🎈🎊! We are done and we now have a consolidated data file which is under 3MB. Not too shabby for 30 years of sales transactions.

Next Steps

Next we will start to create the d3 visualization that will make all the above worthwhile.
