Petaling Jaya Property Analysis: Unsupervised Machine Learning

Chee Kean, Looi (CK)
8 min read · Jan 31, 2022

Location, location, location: when it comes to real estate purchasing or investing, almost everyone chants this mantra. Indeed, a good location creates desirability, generates demand, and raises real estate prices. But has this phrase been overused? Entry price and location should go hand in hand for buyers; unfortunately, many inexperienced or compulsive homebuyers overlook the former. For instance, homebuyers sometimes allow external factors such as a developer’s attractive sales package, a bull run, or personal bias to affect their purchasing decisions. Before any real estate transaction, due diligence is crucial to uncover the risks associated with a particular property. Let’s get them right, buddy.

What’s In Petaling Jaya?

I have always been curious about the property market, especially since I became a real estate negotiator (REN) before pursuing my MSc in Edinburgh. Having dealt with buyers from Petaling Jaya (PJ) recently, I thought it would be a great start to explore this area, the one most searched by Malaysian homebuyers in 2021. Besides its strategic proximity to Kuala Lumpur (to the east), Sungai Buloh (to the north), Shah Alam and Subang Jaya (to the west), and Bandar Kinrara, Puchong and Bandar Sunway (to the south), and its suburban lifestyle, enhanced public transportation connectivity has become a major focus lately, as LRT3 will eventually add five new stations to the neighborhood in 2024.

Left: Petaling Jaya; Right: Factors of a Good Location (Source: Investopedia)
LRT3 Stations in Petaling Jaya (Source: Wikipedia)

Utilizing Secondary Real Estate Market Data as a Proxy

As we all know, Malaysia currently has a significant property overhang problem, especially in prime locations, due to oversupply and overpriced properties. Although the Home Ownership Campaign 2020–2021 officially ended recently, more new projects will be launched this year alongside the economic recovery. More and more homebuyers are looking at real estate as a hedge against inflation due to the increase in material and construction costs. It is hence very challenging to keep up with all these projects, given the rising demand and the huge variety on offer. How can I understand the bigger picture of PJ then?

Since there is limited quality data available on new launch projects, perhaps I can derive insights from the actual transacted property prices in the secondary market. This way, I can quickly grasp the trending property types, the median price (per square foot), and their distribution and clusters in the city. The size of the data depends on transaction volume and recency. It excludes the prices and details of the latest launched properties, so it only serves as a proxy to gauge the PJ market rather than fully representing it.

How is this proxy relevant when the primary market is excluded? Two reasons: (i) new launch projects often have their future value priced into their SPA listing prices; and (ii) the comparison method is an established property valuation approach for determining real estate values. Another advantage of looking into the sub-sale market, in my opinion, is that it gives one an idea of the ‘organic’ market price as compared to the ‘artificial’ developer price tags in this overhang environment. Does this ring any bells with you?

One Quick Glance at the Data

For this exercise, I performed exploratory data analysis (EDA) on 12 months of web-scraped data, from Sep 2019 to Aug 2020. Let’s see if the charts give us any interesting information!

Firstly, we notice more transacted leasehold projects than freehold ones in the city. In my opinion, relatively more investors are realizing gains on leasehold units, as their values appreciate much faster than those of freehold properties in the early years. Also, most of the transactions tracked involve high-rise units (apartments, condominiums, flats, service residences), followed by terraced houses. If you are particular about property tenure (it’s been an ongoing debate), this plot gives you a good idea of the freehold and leasehold areas in PJ.

The median price per square foot across PJ projects is approximately RM400, but don’t take this number too seriously just yet. We will need to dig deeper into it later, as valuations differ between new and aged properties depending on location. In terms of absolute price, however, the data shows that most homebuyers can afford units priced between RM470k and RM623k (with a mortgage loan or cash). In other words, units within this budget range should be relatively easier to resell in PJ if liquidity is a concern for homeowners.
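
A minimal sketch of how figures like these could be computed with pandas. The column names (`tenure`, `price`, `built_up_sqft`) and the six sample rows are made up for illustration, and the quartile band is only one plausible way to summarize an affordable price range; the article does not state how its own range was derived.

```python
import pandas as pd

# Hypothetical sample of sub-sale transactions (values are illustrative only).
transactions = pd.DataFrame({
    "project": ["A", "B", "C", "D", "E", "F"],
    "tenure": ["Leasehold", "Freehold", "Leasehold",
               "Leasehold", "Freehold", "Leasehold"],
    "price": [300_000, 520_000, 480_000, 610_000, 900_000, 250_000],
    "built_up_sqft": [850, 1000, 1200, 1100, 1500, 700],
})

# Price per square foot for every transaction.
transactions["price_psf"] = transactions["price"] / transactions["built_up_sqft"]

# Headline numbers: overall median psf, tenure split, and a middle price band.
median_psf = transactions["price_psf"].median()
tenure_counts = transactions["tenure"].value_counts()
price_band = transactions["price"].quantile([0.25, 0.75])

print(f"Median price per sq ft: RM{median_psf:.0f}")
print(tenure_counts)
print(price_band)
```

The median is used rather than the mean because a handful of big-ticket landed properties would otherwise pull the figure upward.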

I plotted the instances on an interactive (heat)map to extract more value from the historical data. From the diagram, notice that the price per square foot trends upward, towards RM1000, as we move towards the city center. Pretty intuitive, right? Next, we will apply machine learning algorithms to this cleaned data for more exciting findings. Read on.

The Search for Hidden PJ Patterns: UML

Is there any hidden pattern in the past transaction data? As someone who lives outside PJ, I am keen to find out. If you are familiar with the city, I hope my findings will complement your knowledge! My plan is to use unsupervised machine learning (UML) to cluster similar projects and, hopefully, obtain meaningful results from the grouping.

I performed feature engineering on the existing data frame before applying ML. One of the steps is clustering the coordinates with the k-Means algorithm. From the within-cluster sum of squares (WCSS) plot, an approach known as the elbow method, the optimal k value is found to be 5. As displayed above, the 5 clusters are visualized in different colors, and the centroid of each cluster is shown in black. With this information, I can one-hot encode the clusters to replace the latitude and longitude columns.
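
The coordinate clustering step can be sketched roughly as follows with scikit-learn. This is an illustration under stated assumptions, not the original notebook: the coordinates are synthetic points scattered around five made-up centres, and the column names `latitude`, `longitude` and `location_cluster` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic project coordinates around five made-up centres (illustrative only).
rng = np.random.default_rng(42)
centres = np.array([[3.10, 101.60], [3.12, 101.64], [3.08, 101.62],
                    [3.15, 101.61], [3.11, 101.58]])
coords = np.vstack([c + rng.normal(scale=0.003, size=(40, 2)) for c in centres])
df = pd.DataFrame(coords, columns=["latitude", "longitude"])

# Elbow method: record the WCSS (scikit-learn calls it inertia_) for k = 1..10.
wcss = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    model.fit(df[["latitude", "longitude"]])
    wcss[k] = model.inertia_

# Fit the chosen k (5 in the article) and label every project.
best = KMeans(n_clusters=5, n_init=10, random_state=0)
df["location_cluster"] = best.fit_predict(df[["latitude", "longitude"]])

# One-hot encode the cluster label so it can replace the raw coordinates.
location_features = pd.get_dummies(df["location_cluster"], prefix="loc", dtype=int)
print(location_features.head())
```

Plotting `wcss` against k and looking for the point where the curve flattens gives the "elbow"; the cluster centroids are available in `best.cluster_centers_`.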

Using features like property tenure (one-hot encoded), the median price (scaled), the median price per square foot (scaled), property type (one-hot encoded), and location cluster (from k-Means), I ran k-Means clustering once again to uncover hidden patterns in the entire data frame (instead of just the coordinates). This time, the optimal number of clusters is 4. To visualize the result, I plotted the clusters in reduced dimensionality with the help of principal component analysis (PCA), and the results appear to be of high quality (high intra-cluster similarity and low inter-cluster similarity).
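
A rough sketch of this second pass, assuming scikit-learn and a hypothetical data frame with the columns `tenure`, `property_type`, `location_cluster`, `median_price` and `median_psf`. The rows are randomly generated, so the resulting clusters carry no real meaning; only the sequence of steps (scale, one-hot encode, cluster, project to 2D) mirrors the description above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in for the project-level data frame (illustrative only).
rng = np.random.default_rng(0)
n = 120
projects = pd.DataFrame({
    "tenure": rng.choice(["Freehold", "Leasehold"], size=n),
    "property_type": rng.choice(["Condominium", "Flat", "Terrace", "Bungalow"], size=n),
    "location_cluster": rng.integers(0, 5, size=n).astype(str),
    "median_price": rng.uniform(200_000, 1_500_000, size=n),
    "median_psf": rng.uniform(250, 600, size=n),
})

# Scale the numeric columns and one-hot encode the categorical ones.
preprocess = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["median_price", "median_psf"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["tenure", "property_type", "location_cluster"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)
X = preprocess.fit_transform(projects)

# Cluster on all features (k = 4 in the article), then project to 2D for plotting.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
components = PCA(n_components=2).fit_transform(X)

plot_df = pd.DataFrame(components, columns=["pc1", "pc2"]).assign(cluster=labels)
print(plot_df.head())
```

Scaling matters here because k-Means relies on Euclidean distance: without it, a price column measured in hundreds of thousands of ringgit would dominate the 0/1 one-hot columns.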

So, what do all these clusters mean? I visualized each of the independent variables per cluster in the subplots above. My findings are summarized below:

  • Cluster 1 represents most of the freehold properties within PJ. That’s interesting! Most of the properties are of the condominium and terrace house types. Their median price per square foot is approximately RM 521.50 on average, with a typical transaction price around RM 762k.
  • Cluster 2, on the other hand, consists of leasehold properties only, primarily condominiums and service apartments. Their median price per square foot is slightly lower than the freehold average, at around RM 512.30, with a typical absolute price of around RM 629k. So it’s true then: freehold projects tend to command higher prices as they come with less stringent limitations, making them more desirable to the locals.
  • Cluster 3 groups together low- to medium-end high-rise properties such as flats and apartments. They are mostly found in location cluster 0 (north of PJ) and location cluster 2 (west of PJ). Within this cluster, the median price per square foot falls into a more affordable range at roughly RM 276, and units are usually transacted at around RM 251k.
  • Cluster 4 is made up of luxury bungalows and semi-detached houses. These properties are pretty much big-ticket purchases for homebuyers. They command the highest median price per square foot in PJ, at RM 529, with an absolute price beyond RM 1 million! Well, landed properties in such a prime location: of course, of course.
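
Cluster profiles like the ones above could be tabulated with a pandas group-by. The sketch below uses a tiny made-up data frame; the column names and every value in it are hypothetical and are not the figures from my data.

```python
import pandas as pd

# Made-up project rows that already carry a cluster label (illustrative only).
clustered = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 3, 3, 4, 4],
    "tenure": ["Freehold", "Freehold", "Leasehold", "Leasehold",
               "Leasehold", "Freehold", "Freehold", "Leasehold"],
    "property_type": ["Condominium", "Terrace", "Condominium", "Service Residence",
                      "Flat", "Apartment", "Bungalow", "Semi-D"],
    "median_price": [700_000, 800_000, 600_000, 650_000,
                     240_000, 260_000, 1_200_000, 1_600_000],
    "median_psf": [500, 540, 510, 490, 270, 290, 520, 560],
})

# Per-cluster summary: size, average median psf, and typical transaction price.
profile = clustered.groupby("cluster").agg(
    projects=("median_psf", "size"),
    avg_median_psf=("median_psf", "mean"),
    typical_price=("median_price", "median"),
)

# Share of each tenure within every cluster (each row sums to 1).
tenure_mix = pd.crosstab(clustered["cluster"], clustered["tenure"], normalize="index")

print(profile)
print(tenure_mix)
```

Reading `profile` next to `tenure_mix` is what turns anonymous cluster numbers into descriptions such as "leasehold high-rise" or "luxury landed".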

Wrapping Up & Final Thoughts

Using an unsupervised machine learning method, I produced a visualization (see below) of the four newly identified clusters. On top of the heatmap indicating median price per square foot, I can now better justify the numbers labeled for each project. Why is this important? Whether the purpose is own-stay or investment, I will be able to estimate a property’s price by looking at how each kind of property (by property type and tenure, in this scenario) is distributed across Petaling Jaya, alongside its historical median price per square foot, the demand for the location, and the competition. This will help in identifying fair-value or undervalued deals in the primary, secondary, and auction markets in the future.

I look forward to applying the same data-driven analysis to other prime locations along my learning journey in this market. Moving forward, I also wish to include more details in the system, such as nearby amenities, historical price trends, and transaction volume, to refine my insights further.
Alright — enough writing for the day, I guess. I am going off to celebrate Lunar New Year 2022, ciao!

Problem Value: N/A.
Time Spent on Problem: 18 hours, across 3 days.
