Regression Problem Case Study | Housing in Buenos Aires (I): Data Preparation

Sawsan Yusuf
16 min read · Oct 12, 2023


Photo by Benjamin Rascoe on Unsplash

Overview

In this project, we will focus on data from the Argentine real estate market. Our goal is to build a predictive model that estimates apartment prices in Buenos Aires, Argentina, focusing on apartments that cost less than $400,000 USD.

The Dataset: Argentina Real Estate

Dataset Source: Real Estate Listings Argentina — dataset by properati

Properati is an online and mobile property platform for the real estate market. It offers buyers, tenants, and real estate investors a fast, ad-free website designed to help them communicate effectively with sellers who have what they are looking for.

The dataset we will use was scraped from the real estate website Properati.com, and it is split across five files in CSV format.

Introduction

Before we begin preparing our data, there are a few important points I would like to mention. Firstly, this project will be presented in multiple articles, with the first one focusing on the data preparation process. Subsequent articles will cover data exploration and model-building processes. Secondly, I would like to introduce you to the libraries we will be using in this initial stage of our project.

# Import libraries
import warnings
from glob import glob

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

warnings.simplefilter(action="ignore", category=FutureWarning)

We will be utilizing several libraries for our data analysis and visualization. Matplotlib and Seaborn are powerful tools that allow us to generate a wide range of static, interactive, and animated visualizations. Pandas is a versatile library that provides data structures and functions for efficient data manipulation and analysis.

Additionally, we will be using two other libraries: glob and warnings. glob is a useful library for searching for files that follow a specific pattern. We will demonstrate how to use this library later on. The warnings library is utilized in Python to handle warning messages that may arise during code execution. These messages alert you of potential issues or non-critical problems that could impact your program’s behavior or performance. The warnings library provides a way to manage how these messages are displayed or handled.
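To make the filter concrete, here is a minimal sketch (not from the project code; `legacy_helper` is a hypothetical function used only for illustration) showing a FutureWarning being silenced:

import warnings

def legacy_helper():
    # A hypothetical function that warns about behavior slated to change,
    # the way pandas often does with FutureWarning
    warnings.warn("this behavior will change in a future release", FutureWarning)
    return 42

warnings.simplefilter(action="ignore", category=FutureWarning)
legacy_helper()  # runs silently; without the filter, the warning would be printed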

1. Prepare Data

1.1. Import

We have a dataset comprising five CSV files. If we had only one or two files, cleaning them individually would not be a problem. With multiple files, however, it would be impractical to clean them one by one. To automate the data import and cleaning process, we can write a function. This ensures that all the data undergoes the same process, promoting reproducibility, an essential aspect of science and data science.

As such, we will name the function “wrangle” since it will enable us to wrangle our data. The function will accept a file path as an argument and return a DataFrame.

def wrangle(filepath):
    # Import CSV
    df = pd.read_csv(filepath)

    return df

Once the function has been stored in memory, we can use it to generate a DataFrame `df` from the first CSV file. To do this, we simply call the function by its name and provide the file location as an argument. It’s worth noting that we also include a print statement to display the contents of the DataFrame. Additionally, we can use the `head()` method to display the first five rows of the DataFrame.

df = wrangle("buenos-aires-real-estate-1.csv")
print("df shape:", df.shape)
df.head()

df shape: (8606, 16)

We have 8,606 rows (observations) and 16 columns (features) in this file.

1.2. Subset data

Let’s not forget that we have to create a model that predicts the price of apartments in Buenos Aires, Argentina, which are priced below USD 400,000. Upon analyzing the first five entries of our DataFrame presented in the above figure, we noticed that some properties do not meet this criterion. Therefore, our initial task is to eliminate such observations from our dataset.

Since we are using a function to import and clean our data, we need to modify it. We must enhance our wrangle function so that the DataFrame it returns only contains apartments in Buenos Aires ("Capital Federal") that cost less than USD 400,000. Afterward, we will rebuild the DataFrame from the data file. To accomplish this, we will create boolean masks with pandas and use boolean indexing to subset the DataFrame.

This will create a new DataFrame (subset) containing only the rows that satisfy the conditions specified in the masks. In our case, since we have three conditions, we need three masks, as follows:

# Subset to properties in "Capital Federal"
mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
# Subset to "apartment"
mask_apt = df["property_type"] == "apartment"
# Subset to "price_aprox_usd" < 400,000
mask_price = df["price_aprox_usd"] < 400_000

So, after modifying our wrangle function by adding this code chunk, we reload the dataset file.

def wrangle(filepath):
    # Import CSV
    df = pd.read_csv(filepath)

    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to "apartment"
    mask_apt = df["property_type"] == "apartment"
    # Subset to "price_aprox_usd" < 400,000
    mask_price = df["price_aprox_usd"] < 400_000

    df = df[mask_ba & mask_apt & mask_price]
    return df

df = wrangle("buenos-aires-real-estate-1.csv")
print("df shape:", df.shape)
df.head()

df shape: (1781, 16)

Now we have 1,781 rows (observations), so we have properly subsetted our data.
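As a quick sanity check (a small sketch, assuming the `df` returned above), we can confirm that every remaining row satisfies all three conditions:

# Every remaining row should be a "Capital Federal" apartment under USD 400,000
assert df["place_with_parent_names"].str.contains("Capital Federal").all()
assert (df["property_type"] == "apartment").all()
assert (df["price_aprox_usd"] < 400_000).all()
print("Max price:", df["price_aprox_usd"].max())  # should be below 400,000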

1.3. Concatenate data files

We utilized the wrangle function to import a CSV file automatically, simplifying the process. However, we still have four more CSV files to import. It would be tedious to import them individually, even with the wrangle function. Therefore, we will use a tool called glob to import multiple CSV files simultaneously.

All data for this project is stored in five CSV files, all of which follow the same naming pattern. They begin with "buenos-aires-real-estate" and end with ".csv"; the only difference between them is a single numeric character.

  • buenos-aires-real-estate-1.csv
  • buenos-aires-real-estate-2.csv
  • buenos-aires-real-estate-3.csv
  • buenos-aires-real-estate-4.csv
  • buenos-aires-real-estate-5.csv

We want to write code that matches this naming pattern regardless of which numeric character varies. To achieve this, we will use the glob function and create a string that represents the pattern we are searching for.

# Create a list that contains the filenames for all real estate CSV files
files = glob("buenos-aires-real-estate-*.csv")
files
['buenos-aires-real-estate-5.csv',
'buenos-aires-real-estate-2.csv',
'buenos-aires-real-estate-3.csv',
'buenos-aires-real-estate-4.csv',
'buenos-aires-real-estate-1.csv']

We now have a list of all the files that match the pattern and therefore need to be imported into DataFrames. To accomplish this, I suggest using a for loop.

Our approach begins with an empty list named 'frames', where we will store the DataFrames as they are read in from the CSV files. The for loop will iterate through all the file names and read each one into a DataFrame using our wrangle function.

# Create a list named `frames` to store the DataFrames
frames = []

for file in files:
    df = wrangle(file)
    frames.append(df)

len(frames)

>>> 5

We now have a list named "frames" that consists of five DataFrames. Our objective is to merge them into a single DataFrame so we can conduct exploratory data analysis. To achieve this, we will use the pd.concat() function from pandas and pass the "frames" list as its argument.

We will also set the ignore_index parameter to True, which gives us clean index labels that start at zero and run to the last row of the combined DataFrame. This is necessary to avoid the irregular indexing left over from the subsetting process.

df = pd.concat(frames, ignore_index=True)
print("df shape:", df.shape)
df.info()
df.head()

df shape: (8774, 16)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8774 entries, 0 to 8773
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   operation                   8774 non-null   object
 1   property_type               8774 non-null   object
 2   place_with_parent_names     8774 non-null   object
 3   lat-lon                     8432 non-null   object
 4   price                       8774 non-null   float64
 5   currency                    8774 non-null   object
 6   price_aprox_local_currency  8774 non-null   float64
 7   price_aprox_usd             8774 non-null   float64
 8   surface_total_in_m2         6140 non-null   float64
 9   surface_covered_in_m2       8036 non-null   float64
 10  price_usd_per_m2            5854 non-null   float64
 11  price_per_m2                7915 non-null   float64
 12  floor                       2454 non-null   float64
 13  rooms                       6799 non-null   float64
 14  expenses                    2288 non-null   object
 15  properati_url               8774 non-null   object
dtypes: float64(9), object(7)
memory usage: 1.1+ MB

This looks beautiful. We now have a DataFrame with 8,774 entries, and the index runs cleanly from 0 to the last entry. With everything in one place, it is time to start our exploratory data analysis.

2. Explore

As we embark on our exploration, three key areas require our attention. The first concerns null values. We must determine the threshold of null values that necessitates the dropping of a feature or its imputation. The second area of focus is high and low cardinality categorical variables. Lastly, we must address multicollinearity.

2.1. Location Data

Based on the location data in our lat-lon column, we can observe two issues. First, latitude and longitude values should not share a single column. Second, the data type of this column is object when it should be numeric. In other words, we need to split this column into two separate columns, one for latitude and one for longitude, and convert each of them to a numeric type.

To split the two values into their respective columns, we will use the .str.split() method with the expand argument set to True. This converts the resulting lists into two separate columns. After that, we will recast the data with the .astype() method and then drop the old "lat-lon" column.

# Split "lat-lon" column where the data type is float
df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
# Drop the `"lat-lon"` column
df.drop(columns="lat-lon", inplace=True)

2.2. Missing Values

Let’s begin by examining null values. Thus, we will revisit the information from the info() method.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8774 entries, 0 to 8773
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   operation                   8774 non-null   object
 1   property_type               8774 non-null   object
 2   place_with_parent_names     8774 non-null   object
 3   price                       8774 non-null   float64
 4   currency                    8774 non-null   object
 5   price_aprox_local_currency  8774 non-null   float64
 6   price_aprox_usd             8774 non-null   float64
 7   surface_total_in_m2         6140 non-null   float64
 8   surface_covered_in_m2       8036 non-null   float64
 9   price_usd_per_m2            5854 non-null   float64
 10  price_per_m2                7915 non-null   float64
 11  floor                       2454 non-null   float64
 12  rooms                       6799 non-null   float64
 13  expenses                    2288 non-null   object
 14  properati_url               8774 non-null   object
 15  lat                         8432 non-null   float64
 16  lon                         8432 non-null   float64
dtypes: float64(11), object(6)
memory usage: 1.1+ MB

Some columns do not contain any missing values; for instance, the 'operation' column has 8,774 non-null values. However, features like 'floor' and 'expenses' have numerous missing values, since they each contain only around two thousand non-null values.

This information is useful for identifying null values. To refine the analysis, I will call the .isnull() method, sum the null values, and divide by the length of df. This normalization yields the fraction of missing values per column instead of raw counts.

df.isnull().sum() / len(df)

operation                     0.000000
property_type                 0.000000
place_with_parent_names       0.000000
price                         0.000000
currency                      0.000000
price_aprox_local_currency    0.000000
price_aprox_usd               0.000000
surface_total_in_m2           0.300205
surface_covered_in_m2         0.084112
price_usd_per_m2              0.332801
price_per_m2                  0.097903
floor                         0.720310
rooms                         0.225097
expenses                      0.739230
properati_url                 0.000000
lat                           0.038979
lon                           0.038979
dtype: float64

As previously mentioned, we have identified some significant issues, namely the floor column where 72% of values are missing, and the expenses column where 73% of observations have a null value. What is the implication of these findings?

In general, if a column has more than 50% missing values, it is recommended to remove it from consideration. Therefore, we must eliminate the floor and expenses columns as they do not provide the necessary information to make our model useful.
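If you want to apply this rule of thumb programmatically rather than hard-coding column names, a hedged sketch of a reusable helper might look like the following (this is not part of the article's wrangle function, and the 0.5 threshold is an assumption, not a hard rule):

# Drop every column whose share of missing values exceeds a chosen threshold
def drop_mostly_null(df, threshold=0.5):
    null_share = df.isnull().sum() / len(df)
    to_drop = null_share[null_share > threshold].index.tolist()
    return df.drop(columns=to_drop)

# On our data, drop_mostly_null(df) would remove "floor" and "expenses".

In the article itself, we simply drop the two columns directly: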

df.drop(columns=["floor", "expenses"], inplace=True)

2.3. Cardinality

Next, we will explore low and high-cardinality categorical variables. To begin, we will examine the categorical features in our dataframe and determine the number of unique values in each using the nunique() method.

df.select_dtypes("object").nunique()

operation                     1
property_type                 1
place_with_parent_names      58
currency                      2
properati_url              8774
dtype: int64

When we examine the data, we notice some intriguing findings. First, if we look at operation and property_type, we see that each has only one unique value, meaning every listing is a sale and every property is an apartment. These are low-cardinality categorical features, sometimes referred to as constants, and we should eliminate them from our DataFrame since they don't provide any meaningful information to our model.

It’s crucial to remember that our model aims to differentiate between apartments to forecast which is more costly and which is more affordable. If every apartment has the same value for a feature, our model won’t have any useful information to make a distinction. Thus, we should drop these low-cardinality categorical features. Moreover, we can also include currency in that list. If we examine the currency column, we see that most values are USD and the other value is the local currency.

df["currency"].value_counts()

USD    8264
ARS     510
Name: currency, dtype: int64

What about high cardinality? If we look at properati_url, we can observe that there are 8,774 unique values in this column. This means each row has its own unique URL, which makes sense: every property in this dataset has its own listing page, where you click through, view the photos, and decide whether you want to buy it.

Why are these features not suitable for a linear model? Our model is designed to identify trends, such as certain neighborhoods being associated with higher property prices. However, if a feature has unique values for each observation, there is no discernible trend to identify.

Just like how a low cardinality feature provides little to no useful information to a model, high cardinality features also don’t offer any valuable insights to a model. However, the difference is that with low cardinality features, we’re essentially providing no information, whereas, with high cardinality features, we’re overwhelming the model with too much information.
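For larger datasets, it can help to flag these columns programmatically. Here is a small sketch (the cutoffs are my own assumptions, and in-between columns such as currency still need a judgment call, as we made above with value_counts()):

# Flag object columns that are constant (low cardinality) or unique per row (high cardinality)
cardinality = df.select_dtypes("object").nunique()
low_card = cardinality[cardinality == 1].index.tolist()         # e.g. operation, property_type
high_card = cardinality[cardinality == len(df)].index.tolist()  # e.g. properati_url
print("Low cardinality:", low_card)
print("High cardinality:", high_card)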

Therefore, we need to drop operation, property_type, currency, and properati_url, as they have been identified as having either low or high cardinality. We should now proceed to remove these four columns.

# Drop low- and high-cardinality categorical variables
df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

2.4. Leakage

The next topic we want to cover is leakage and before we discuss it in the context of our dataset let’s do a toy example so we can build a little bit of intuition.

Let’s consider football or soccer as an example. I am currently creating a model to predict whether the Barcelona football club will win during a game. The target of my model has two options: True if they win, and False if they don’t.

What I can tell you is that my model is fantastic as it only requires one feature to predict whether Barcelona will win with 100% accuracy. The feature I require is called “more points.” I just need to know whether Barcelona scored more points than their opponent in the game, and from that, I can magically predict the outcome of the game.

Does the model seem useful to you? It appears to be a great feature as every time Barcelona scores more points, they also win the game. However, this is an example of a leaky feature, and we must consider its implications when deploying the model.

To predict whether Barcelona will win a game, it is necessary to predict before the game begins or at least before it ends. If the model is intended to work in real-world scenarios, it cannot have access to this information. It is impossible to know who will score more points before the game starts.

In machine learning, leaky features are those that give the model access to information it would not have in the real world. This can cause the model to cheat and gain an unfair advantage. To address this issue, let’s focus on the columns and sort them in alphabetical order.

sorted(df.columns)

['lat',
 'lon',
 'place_with_parent_names',
 'price',
 'price_aprox_local_currency',
 'price_aprox_usd',
 'price_per_m2',
 'price_usd_per_m2',
 'rooms',
 'surface_covered_in_m2',
 'surface_total_in_m2']

Now, our target is price_aprox_usd, and the question we need to ask ourselves is: are there features here that our model could use to easily predict price_aprox_usd? Features to which it would not have access if it were trying to predict an apartment's price before the sale was made?

Let’s examine the answer to our question. First, there is price, which is usually expressed in USD but sometimes in the local currency, which limits its usefulness. We also have price_aprox_local_currency, which gives the exact price once a fixed exchange rate is applied. Finally, we have price_per_m2 and price_usd_per_m2.

All of these features are calculated from the sales price of an apartment, and our model needs to predict the sales price before the sale happens. In short, all of these features need to go.
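One quick way to back this up numerically (a heuristic sketch only; domain knowledge, not correlation, is the real test for leakage) is to check how strongly each numeric column correlates with the target before we drop anything:

# Columns derived from the sale price should correlate almost perfectly with the target
leak_check = (
    df.select_dtypes("number")
    .corr()["price_aprox_usd"]
    .sort_values(ascending=False)
)
print(leak_check)
# Price-derived columns such as "price_aprox_local_currency" sit at or near 1.0,
# a strong hint that they leak the target.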

# Drop leaky columns
df.drop(
    columns=[
        "price",
        "price_aprox_local_currency",
        "price_per_m2",
        "price_usd_per_m2",
    ],
    inplace=True,
)

2.5. Multicollinearity

The last issue we want to tackle is multicollinearity. Before we talk about multicollinearity, let’s back up for a second and remember what correlation is.

When two variables are related, we call it correlation. For instance, if we notice a positive correlation between area and price, that means as the area grows, the price also tends to increase. This correlation is beneficial because the area is one of our features, and the price is our target, making it easier to predict our target when we have helpful features.

But when does multicollinearity become a factor? Multicollinearity occurs when two variables are correlated, but instead of being separated as a feature and a target, they are both included in our feature matrix. Therefore, we end up with two features that have a high correlation with each other.

Multicollinearity can be a problem, but it’s not always bad. It depends on your intended use of the model. If your main concern is performance metrics such as mean absolute error, then correlated features are not an issue. However, if you prioritize model explainability and interpretability, multicollinearity can cause problems. When two features are highly correlated, the coefficients cannot be fully trusted as it’s unclear whether they are attributable to individual features or their correlations with each other.
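To build intuition for why the coefficients become unreliable, here is a small synthetic sketch (my own illustration, not using the housing data), in which one feature is almost a copy of the other:

import numpy as np

rng = np.random.default_rng(0)
for trial in range(3):
    x1 = rng.normal(size=500)
    x2 = x1 + rng.normal(scale=0.01, size=500)    # nearly collinear with x1
    y = 3 * x1 + rng.normal(scale=1.0, size=500)  # only x1 truly drives y
    X = np.column_stack([x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"trial {trial}: coef_x1={coef[0]:.2f}, coef_x2={coef[1]:.2f}")
# The individual coefficients swing from trial to trial,
# even though their sum stays near the true value of 3.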

Our task is to determine if any of the features in our feature matrix are correlated with each other. In the past, we only needed to examine the correlation between one feature and the target, so we could simply look at a number.

However, now we need to examine the correlations among all the features in our feature matrix, which is much more complex than a single number. Therefore, it makes sense to visualize the correlations between all these features using a heatmap instead of relying on raw numbers alone.

corr = df.select_dtypes("number").drop(columns= "price_aprox_usd").corr()
sns.heatmap(corr)

It is important to note that we first look for the areas where there is no correlation in the data. These areas are marked with purple and black on the heatmap. We see that in these areas there is no correlation between several combinations of latitude, longitude, total surface, covered surface, and number of rooms. However, we do observe correlations in another part of the heatmap.
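If you prefer reading exact numbers rather than colors, an optional tweak (my own suggestion, not shown in the original heatmap) is to annotate the cells:

# Annotate each cell with its correlation coefficient, rounded to two decimals
sns.heatmap(corr, annot=True, fmt=".2f")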

In particular, we notice stronger correlations between the number of rooms and the size of the apartment. This makes sense, as a larger apartment tends to have more rooms. We also discover that three features, surface_total_in_m2, surface_covered_in_m2, and rooms, are highly correlated with each other. To avoid multicollinearity, we should keep only one of them. The question is which ones to drop. Let's do a little investigating and see if we can get an idea from info().

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8774 entries, 0 to 8773
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   place_with_parent_names  8774 non-null   object
 1   price_aprox_usd          8774 non-null   float64
 2   surface_total_in_m2      6140 non-null   float64
 3   surface_covered_in_m2    8036 non-null   float64
 4   rooms                    6799 non-null   float64
 5   lat                      8432 non-null   float64
 6   lon                      8432 non-null   float64
dtypes: float64(6), object(1)
memory usage: 480.0+ KB

Looking at the missing values across the three columns, the 'surface_total_in_m2' and 'rooms' features have a large number of missing values, while 'surface_covered_in_m2' has far fewer. Therefore, we will drop 'surface_total_in_m2' and 'rooms' and keep only 'surface_covered_in_m2', since it has the most data available.
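A quick check (a small sketch, assuming the current df) makes the comparison explicit before we drop anything:

# Compare the missing-value counts of the three correlated columns
print(df[["surface_total_in_m2", "surface_covered_in_m2", "rooms"]].isnull().sum())
# surface_covered_in_m2 has the fewest missing values, so it is the column we keep.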

#Drop columns with multicollinearity
df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

Let's see how that correlation matrix changes:

corr = df.select_dtypes("number").drop(columns= "price_aprox_usd").corr()
sns.heatmap(corr)

We can see that there are far fewer features now, and the new correlation matrix shows no red or orange, meaning no strongly correlated feature pairs remain.

Conclusion

After addressing multicollinearity, we have completed our data exploration. This is the final version of our wrangle function, which we will use in the upcoming articles.

def wrangle(filepath):
    # Import CSV
    df = pd.read_csv(filepath)

    # Subset data: apartments in "Capital Federal", less than USD 400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Drop features with high null counts
    df.drop(columns=["floor", "expenses"], inplace=True)

    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

    # Drop leaky columns
    df.drop(
        columns=[
            "price",
            "price_aprox_local_currency",
            "price_per_m2",
            "price_usd_per_m2",
        ],
        inplace=True,
    )

    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    return df
files = glob("buenos-aires-real-estate-*.csv")

frames = []
for file in files:
    df = wrangle(file)
    frames.append(df)

df = pd.concat(frames, ignore_index=True)
df.info()
df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8774 entries, 0 to 8773
Data columns (total 5 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   place_with_parent_names  8774 non-null   object
 1   price_aprox_usd          8774 non-null   float64
 2   surface_covered_in_m2    8036 non-null   float64
 3   lat                      8432 non-null   float64
 4   lon                      8432 non-null   float64
dtypes: float64(4), object(1)
memory usage: 342.9+ KB

In this article, we focused on preparing data for analysis. In the upcoming articles, we will explore the data in more depth and build models to gain insights. We learned how to use our wrangle function, together with glob and a for loop, to import and clean multiple CSV files.

During the exploration process, we analyzed null values, high and low cardinality categorical features, leakage, and multicollinearity. In the next articles, we will examine the relationship between the features of an apartment in Buenos Aires and its price. We will use this information to build our machine-learning models.

Thank you for reading. See you in the next article!
