Kuala Lumpur — The Property Price Trends // A Hands-On Guide

Abbas Ali
8 min read · Feb 5, 2020


The Property Distribution in Kuala Lumpur. (Image courtesy of The Straits Times)

Kuala Lumpur is a bustling city and the federal capital of Malaysia. The city is the center point of Malaysia’s economic and business development, and its urban landscape features many office skyscrapers amidst low-rise shop-office lots. In this post, though, let’s look at residential properties: their types, areas, and prices.

Getting the Data

In my quest to find relevant data, I stumbled upon Kaggle and found a dataset of property listings in Kuala Lumpur … here is the link below;

What does the data tell us?

  • Let’s find the upscale areas in Kuala Lumpur according to pricing!
  • Where in the city do people build big houses?
  • The most expensive areas in Kuala Lumpur to live in!

But first of all, we need to look at the data and prepare it for the analysis. We’ll call this data ‘PropertyData.’

Data Preparation

Let’s look at the data by downloading the CSV from the Kaggle link above and importing it.

import pandas as pd

df = pd.read_csv('./kl_property_data.csv')
df.head()
The top 5 rows of PropertyData

The first look at PropertyData is terrifying; some columns require cleaning and formatting, and we will have to handle the missing values as well.

So, no time to waste; let’s start cleaning the data step by step.

Data Cleaning

Let’s look at our data, one column at a time;

Location:

Since we know that all the data belongs to the single city of Kuala Lumpur, the words ‘Kuala Lumpur’ are entirely redundant and would only clutter the labels in our visualizations.

A unique value count of the ‘Location’ column would show us the following:

df.Location.value_counts()
Unique Count of Location

There are inconsistencies in capitalization, e.g., TAMAN MELATI, KLCC, and SEMARAK are all in capitals while the rest are not; let’s convert them all to lowercase.

df['Location'] = df['Location'].str.lower()
All strings in the Location column are now lowercase

Now remove ‘kuala lumpur’ from the Location column; the string replace docs are here.

df['Location'] = df['Location'].str.replace(r", kuala lumpur$", "", regex=True)
‘kuala lumpur’ removed from the Location column

Rooms:

A glance at ‘Rooms’ reveals several issues:

  • rooms are listed as ‘2+1’ instead of the total number of rooms
  • some values are descriptive, e.g., ‘Studio’
  • “more than” a given number is indicated with a ‘+’ sign, e.g., ‘6+’
  • ‘Above’ is used to indicate more rooms than the stated number

Let’s create a function to handle the Rooms data cleaning

Room Data Cleaning Function
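
Since the function itself appears only as a screenshot, here is a minimal sketch of what such a cleaning function might look like, assuming the raw values include forms like ‘2+1’, ‘6+’, ‘Studio’, and numbers followed by ‘Above’ (the helper name clean_rooms is mine, not the author’s):

import numpy as np

def clean_rooms(value):
    # Normalize a raw Rooms value into an integer room count (sketch only).
    if pd.isnull(value):
        return np.nan
    value = str(value).strip().lower()
    if 'studio' in value:                 # treat a studio as one room
        return 1
    value = value.replace('above', '')    # e.g. '20 above' -> '20'
    value = value.strip().rstrip('+')     # e.g. '6+' -> '6'
    try:
        # e.g. '2+1' -> 2 + 1 = 3
        return sum(int(part) for part in value.split('+') if part)
    except ValueError:
        return np.nan

df['Rooms'] = df['Rooms'].apply(clean_rooms)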

The above function leaves our ‘Rooms’ neatly specified as integers.

Application of above function to clean the Rooms

The Car Parks and Bathrooms columns look okay and do not require much cleaning, so let’s move on to the next columns.

Property Type: At first, the Property Type values seem quite upsetting; however, if you look closely, they are almost all the same base types with extra details, e.g.,

  • Condominium => Condominium, Condominium (Duplex), Condominium (EndLot), Condominium (SOHO)
  • The same goes for the other types, e.g., Serviced Residence, Bungalow, Terrace / Link House, etc.

The trick to cleaning the data is to remove the extra details. The following are the steps I took to do this;

  • Create a ‘Master Property Type List.’ I called it ‘cleanTypes.’
  • For each row of ‘Property Type,’ match it against the values in the ‘Master Property Type List.’
  • If it matches, return the corresponding ‘Master Property Type List’ value.

Below are the code and the screenshot

define cleanTypes and apply the function to ‘Property Type.’
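
The exact code lives in the screenshot, so here is a hedged sketch of the approach; the master list below only contains the base types mentioned in this post and would need to be extended to cover the full dataset:

# 'cleanTypes' holds the base property types; extend as needed for the dataset.
cleanTypes = ['Condominium', 'Serviced Residence', 'Bungalow',
              'Terrace / Link House']

def clean_property_type(value):
    # Map a detailed type, e.g. 'Condominium (Duplex)', to its base type.
    for base_type in cleanTypes:
        if str(value).startswith(base_type):
            return base_type
    return value  # leave anything unmatched untouched

df['Property Type'] = df['Property Type'].apply(clean_property_type)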

Now our Property Type list looks quite lovely :)

Our Property Types look okay!

Size:

The Size column tells us two different things:

  • the type of the structure, i.e., Built-up or Land area,
  • the size of the structure, in sq. ft.

I’ll dissect the ‘Size’ column into two different columns, ‘Built_Type’ and ‘Built_Size.’

Split the ‘Size’ column into two sub-columns
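
As the split itself is shown only in a screenshot, here is one hedged way to do it, assuming the Size values follow a pattern like ‘Built-up : 1,335 sq. ft.’ (the exact separator is an assumption):

# Split 'Size' into the structure type and the raw size string (sketch only).
df[['Built_Type', 'Built_Size']] = df['Size'].str.split(':', n=1, expand=True)
df['Built_Type'] = df['Built_Type'].str.strip()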

We can see the two new columns by running the df.head() command, as in the screenshot below

Built_Type and Built_Size are created as new columns

However, another challenge remains: ‘Built_Size’ is still a string, so we need to strip the text characters, e.g., ‘sq. ft.’, and convert the column to numeric.

Clean & Convert Built_Size into numeric format
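
The conversion code is also in a screenshot; a minimal sketch, assuming we simply strip everything that is not a digit or a decimal point (which would not handle sizes written as dimensions such as ‘22x80’), could look like this:

# Strip non-numeric characters (e.g. 'sq. ft.' and commas) and coerce to numbers.
df['Built_Size'] = pd.to_numeric(
    df['Built_Size'].str.replace(r'[^0-9.]', '', regex=True),
    errors='coerce')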

The above function leaves us with numeric values in the Built_Size column, as shown below;

Built Sizes are converted

Price:

Handling the Price values is relatively simple; we need to remove the commas and the ‘RM’ sign and convert the values to integers. Below is the code; we replace the ‘RM’ sign and the commas with empty strings.

The rows with null values in the Price column are pretty safe to remove, since they amount to only 0.46% of the total rows.

(df['Price'].isnull().sum() / df.shape[0]) * 100
0.4602564816361375  # 0.46% of Price values are null

Drop the rows where ‘Price’ is null

# Drop NaN rows in the Price column
df = df.dropna(subset=['Price'])

# Check the remaining null counts per column
df.isnull().sum()

Location             0
Price                0
Rooms             1655
Bathrooms         1962
Car Parks        17441
Property Type        0
Size              1024
Furnishing        6856
dtype: int64

Now we can proceed with:

- cleaning up the commas,
- cleaning up the ‘RM’ label in the ‘Price’ column, and
- changing the format to numeric.
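
The author’s code for these steps appears as a screenshot; a minimal sketch of the same idea, assuming the prices look like ‘RM 1,250,000’, would be:

# Remove 'RM' and commas, then convert Price to an integer (sketch only).
df['Price'] = (df['Price']
               .str.replace('RM', '', regex=False)
               .str.replace(',', '', regex=False)
               .str.strip()
               .astype(float)
               .astype(int))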

Now the Price is in the correct format.

Price column cleaned and formatted as numeric

Data Imputing

More often than not, you will come across data that contains nulls, NaNs, or otherwise missing values.

There are several options at hand for handling such situations:

  • We could remove the rows containing missing data; this technique should be used with caution, although on some occasions, when the missing values are far too many, it does make sense.
  • We could impute with the mean, median, or most frequent value.
  • Or sometimes it is as simple as replacing the values with something that makes sense; we’ll see this next while imputing the ‘Furnishing’ column.

Why are we going to delete the data?

What we need is a subset of the data to analyze the relationships between its different variables.

To get this subset, we can delete the rows containing NaN from our ‘PropertyData.’ This is not perfect, but it serves the purpose.

# Drop the rows that are missing any of these four columns
df.dropna(subset=['Rooms', 'Built_Size', 'Car Parks', 'Bathrooms'], inplace=True)

Furnishing:

Let’s look at the image below, where we list the unique values in the Furnishing column;
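
One way to list these values, including the missing ones, is a simple value count; the dropna=False flag is what surfaces the NaN entries:

df['Furnishing'].value_counts(dropna=False)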

NaN and Unknown are telling the same story

As you can see above, the NaN and Unknown values are telling the same story: the value is not available. So it is logical to replace NaN values with Unknown.

df['Furnishing'] = df['Furnishing'].fillna('Unknown')

Eye-balling the data anomalies

We can eyeball the data using different methods: sorting, mean, median, count, sum, grouping, etc. Let’s see what we find!

df[df.Location == 'taman duta']

Run the above command and you will find a property in Taman Duta priced at RM 1.98 billion. That can’t be right, and such a value would throw our analysis off, so let’s remove this row;

# remove the row with the erroneous price
df = df[df.Price != 1980000000]

Also, there is an entry for Singapore. I guess we can safely remove this entry from the Kuala Lumpur city data :)

# remove singapore entry from kuala lumpur data
df = df[df.Location != 'singapore']

What did we learn?

Upscale areas in Kuala Lumpur!

The top three upscale areas are:

  • Federal Hill
  • Country Heights Damansara, &
  • Taman Duta

Damansara, Bukit Kiara, Kenny Hills, and Sungai Penchala are also notable high-end areas to live in.

To Plot Property Distribution by Price
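
The plotting code is shown only as a screenshot; a hedged sketch of one way to produce a similar chart (the top-20 cut-off and the use of a mean price per location are my assumptions) is:

import matplotlib.pyplot as plt

# Average price per location, highest first, limited to 20 bars for readability.
avg_price = (df.groupby('Location')['Price']
               .mean()
               .sort_values(ascending=False)
               .head(20))
avg_price.plot(kind='barh', figsize=(10, 8))
plt.xlabel('Average Price (RM)')
plt.title('Average Property Price by Location')
plt.gca().invert_yaxis()   # most expensive location at the top
plt.tight_layout()
plt.show()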

The Property Price Distribution

Where in the city do people build big houses?

To find this, we’ll consider the Built_Size column, which indicates the size of the property in sq. ft.
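
A similar hedged sketch for the size comparison, reusing the assumptions from the price chart above:

# Average built-up size per location, highest first.
avg_size = (df.groupby('Location')['Built_Size']
              .mean()
              .sort_values(ascending=False)
              .head(20))
avg_size.plot(kind='barh', figsize=(10, 8))
plt.xlabel('Average Built-up Size (sq. ft.)')
plt.title('Average Property Size by Location')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()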

Property Size Distribution using Built_Size column

Well, no surprise about Damansara: it is one of the most sought-after areas because of its large and expensive houses. Apart from Damansara, we can see Federal Hill, Taman Duta, and Country Heights among the areas with large and expensive properties.

Most expensive location in the city

We have already seen the price distribution; however, our analysis shows that higher prices often come with larger properties, e.g., Damansara, Federal Hill, Country Heights, etc. The prices in these areas are high, but the properties are also bigger!

In our quest to find the truly expensive locations, let’s go beyond just the Price and consider the price per sq. ft. for each location.

Plot Price / Sqft with the Locations
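
Again, the exact code is in the screenshot; a minimal sketch of the idea, where the ‘Price_psf’ column name is my own and not the author’s:

# Price per square foot, averaged per location.
df['Price_psf'] = df['Price'] / df['Built_Size']
psf = (df.groupby('Location')['Price_psf']
         .mean()
         .sort_values(ascending=False)
         .head(20))
psf.plot(kind='barh', figsize=(10, 8))
plt.xlabel('Average Price per sq. ft. (RM)')
plt.title('Price per sq. ft. by Location')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()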

Sq. ft. price distribution for Kuala Lumpur

KLCC? No surprise there; the area around the famous Kuala Lumpur twin towers tops the city as the most expensive place to be, closely followed by KL Eco City and KL Sentral.

Did you notice Damansara Heights up there among the most expensive places in terms of price per sq. ft.?

Conclusion

Finally, we can conclude that there are a few upscale areas, and most of them owe their high prices to larger property sizes. However, KLCC (the twin towers) and the city center areas are quite expensive even with smaller spaces.

The post is quite long already, but there are still many improvements we could make to this dataset. Most importantly, we could impute the missing values.

Next Steps for you…

Hey, we have achieved a few small analytic milestones; however, there is a lot more that can be analyzed in this dataset, e.g.:

  • Find the most expensive Rooms in the city
  • Find the locations where you have to stay in the smallest rooms :(
  • Where could you find the most expensive car parks?

and a lot more … Let me know in the comments how you improved the dataset!
