Exploratory Data Analysis & Data Pre-Processing: Hyderabad Housing Prices Dataset (Aug 2021)

Published in

Geek Culture

8 min readSep 14, 2021

This post features data pre-processing, feature extraction, feature engineering and exploratory data analysis on Housing Prices dataset of Hyderabad, India’s fourth largest city (by population). I scraped this data from web and the process is detailed in my last post. Visit my github for full code & dataset.

Concatenation of Files:

My previous post explains how I scraped the data — it was collected in two types of csv files -first type & second type- which will be merged for analysis. I collected the data 100 pages at a time (~ 2400 listings) for both types of files (at once).

There are six files of first type (i.e. data from 600 pages) which are concatenated as shown below.

First, all the files are loaded as pandas dataframes.

And concatenated. We also drop duplicate rows and reset the index.

We repeat the same procedure for second type of files.

We now have two concatenated files of first and second type: ‘concat_first.csv’ & ‘concat_second.csv’.

Inner Merge:

Let us ‘inner’ merge the above two concatenated files based on a common column ‘Prop ID’ and save it.

Data Pre-Processing, Feature Extraction & Feature Engineering:

In this section, we will perform data pre-processing, feature extraction, feature engineering on our merged file.

Loading the file & rearranging the order of columns for convenience:

Let’s have a look at columns ‘builder’ & ‘builder2’ which are scraped separately from two different containers as explained in my previous post.

Column ‘builder’ has 293 values while ‘builder2’ has 1854 values. So if a value is missing in ‘builder’ we can substitute the corresponding value from ‘builder2’ (provided it isn’t missing). This can be done by using numpy’s where method. Thus we add a new column ‘builder_final’ which has 2018 values with 797 of them being unique (which is higher than that of ‘builder2’).

Now on to the ‘other’ column where the data we collected was in unstructured format.

If we explore few rows of ‘other’ column, we can observe there is information pertaining to super area, property type, ready to occupy, furnished status, car parking etc. Our goal is to extract the information which is useful to us and add a column for that.

First we shall extract information on ‘ready to move’ and ‘possession’ in two different columns which we create and then create a new column based on them.

Information on area can be extracted from ‘link’ column.

Date string should be converted to the correct format.

A bit of pre-processing on locality data:

We convert our extracted ‘area_sqft’ column to float and also add a new column ‘price_persqft’.

Dropping the columns we don’t need, renaming and rearranging them for our convenience.

After this step, we encounter a problem with ‘studio apartments’ property type- their area is not extracted properly and returns NaN after preprocessing steps above. We need slightly different string slicing to extract this data.

First we shall drop rows containing ‘studio apartments’ from our dataframe.

Next we load the original merged file as new dataframe and filter the data pertaining to ‘studio apartments’ . Area can be extracted by using following string slicing.

Other pre-processing steps are same as that of above dataframe ‘df’ (Code not included here.).

Now both dataframes — df and df_studio are concatenated.

Exploratory Data Analysis:

In this section, we shall explore and visualise our data. Since there are many missing values in the feature columns, sample sizes are provided below the charts. Please remember that there are 10,907 listings in total for context.

The charts below are simple and are self-explanatory. Explanations are provided wherever it is necessary.

Note that I provided only key snippets of plotly code here. Full code can be found on my github.

New vs Resale Listings:

Sample Size: 10,906

Localities:

Sample Size: 10,162
Unique locality values: 1085

Builders:

Sample Size: 2018
Unique builder values: 797

**Builders: Bigger the font, more the no. of listings**

Projects:

Sample Size: 2520
Unique project values: 1342

Time of Possession:

Sample Size: 1217

Sample Size: 8829

Furnished Status:

Sample Size: 10422

Property Type:

Sample Size: 10,907 (no null values)

Floors:

Sample Size: 9212

Users:

Sample Size: 10907 (no null values)

Price/sqft:

Sample Size: 10,374

Price:

Sample Size: 10,413

Price/sqft vs Furnished Status:

Sample Size: 10,374

Price/sqft vs Month:

Sample Size: 10,374

Most & Least Expensive Locations:

Here we are considering only those locations which have more than 10 listings for better accuracy. To add count of locations, transform() function is used as below.