Using logistic regression to predict whether a property can be sold

Can my house be sold?

Kinder Sham
Analytics Vidhya
7 min read · Jun 12, 2020


Photo by Pedro Lastra on Unsplash

Logistic regression was one of the first topics I learned in data science. When the dependent variable (Y) is binary, logistic regression is a suitable method for regression analysis. It is a form of predictive analysis used to describe data and explain the relationship between a binary dependent variable and one or more independent variables.

When can logistic regression be used?
For example: do drugs, alcohol, unemployment, broken marriages, long-term health problems and physical disabilities increase suicide rates (yes vs. no)?

In this article, I will explain how to use logistic regression to predict whether a house can be sold. Before starting, we will walk through the data science life cycle to understand what should happen.

Photo by Srinivas Rao on Quora

Steps for solving a business problem with the data science life cycle:
1. Business Understanding
2. Data Collection
3. Data Preparation
4. Exploratory Data Analysis
5. Modelling
6. Model Evaluation
7. Model Deployment

#1 ~ Business Understanding

Any project, regardless of its size, requires an understanding of the business, which is the basis for effectively solving business problems.

What is the problem I am trying to solve?

At this stage, I need to define the problem, the project goals and the solution from a business perspective. This is the first step in solving a problem with data science methods. In this article, imagine I am a manager at a real estate company. I need to assess the sales potential of our properties, and that potential is essentially captured in the data of past real estate transactions. So, I want to predict whether a property will be sold or not.

Therefore, this is a supervised learning problem ~ I will use a classification method.

#2 ~ Data Collection

Data is the most critical part of any machine learning project. So I need to consider the following:

  • Do I have the data?
  • Where does the data come from?
  • Do we trust the data source?
  • Do I have the domain knowledge?

The data comes from a Udemy online course.

#3 ~ Data Preparation & #4 ~ Exploratory Data Analysis

Now we will load the house dataset into a Jupyter notebook.
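A minimal sketch of the loading step with pandas (the file name House_Price.csv is an assumption; point it at your own copy of the dataset):

```python
import pandas as pd

# Load the house dataset (file name assumed; adjust the path as needed)
df = pd.read_csv("House_Price.csv")
df.head()
```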

Now, let's take a look at the dataset. It contains 506 rows and 19 columns. The dependent variable is Sold, and the remaining columns, from price to parks, are the independent variables. Three of them are categorical, namely airport, waterbody and bus_ter; the others are numerical. We can also see that there are missing values in n_hos_beds. Before running logistic regression, or any other machine learning model, we need to deal with those missing values.
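A quick way to confirm the shape and spot the missing values, using standard pandas calls:

```python
# Rows and columns: should print (506, 19)
print(df.shape)

# Missing values per column: n_hos_beds should show a non-zero count
print(df.isnull().sum())
```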

From the summary statistics, we can see all our numeric variables listed, with values such as count, mean, standard deviation, minimum, maximum, and the 25th, 50th and 75th percentiles.
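These statistics are what pandas' describe() reports:

```python
# Summary statistics for the numeric columns:
# count, mean, std, min, 25%, 50%, 75%, max
df.describe()
```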

Now, we will check whether the values of these variables are what we expect. Missing values and outliers, values that do not follow the variable's pattern, can greatly affect the accuracy of the model. Using a box plot and a joint plot, we can see that there are outliers in the n_hot_rooms and rainfall columns.
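A sketch of those plots with seaborn (column names follow the dataset described above):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of n_hot_rooms: points far above the upper whisker are outliers
sns.boxplot(x=df["n_hot_rooms"])
plt.show()

# Joint plot of rainfall against price: a few points sit far below the rest
sns.jointplot(x="rainfall", y="price", data=df)
plt.show()
```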

We now have three observations from our numerical data. First, there are missing values in the n_hos_beds variable. Second, there are outliers at the higher end of the n_hot_rooms variable. Third, there are outliers at the lower end of the rainfall variable. Now let's look at the categorical variables by plotting count bars for each of them, as sketched below.
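A sketch of the count bars, assuming the three categorical columns are named airport, waterbody and bus_ter:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One count bar per categorical variable
for col in ["airport", "waterbody", "bus_ter"]:
    sns.countplot(x=col, data=df)
    plt.show()
```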

Among the categorical variables, you can see that the bus_ter variable only has one value, Yes. So in a sense it is not a variable at all; it is a constant. Therefore, we can delete it from our data, because it provides no additional information and has no impact on the results.

Now we can start treating the problematic data. The first issue is the missing values in the n_hos_beds variable. Missing values are generally handled either by deleting the affected rows or by filling them in. Since only a few values are missing, we will fill them with the mean.
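A one-line sketch of the mean imputation:

```python
# Fill missing n_hos_beds values with the column mean
df["n_hos_beds"] = df["n_hos_beds"].fillna(df["n_hos_beds"].mean())
```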

The next things to deal with are the outliers at the upper end of the n_hot_rooms variable and at the lower end of the rainfall variable. We will use a capping and flooring approach, replacing the upper outliers with 3 × P99 and the lower outliers with 0.3 × P1 (where P99 and P1 are the 99th and 1st percentiles).
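A sketch of the capping and flooring step with numpy percentiles:

```python
import numpy as np

# Cap upper outliers in n_hot_rooms at 3 times the 99th percentile
upper = 3 * np.percentile(df["n_hot_rooms"], 99)
df.loc[df["n_hot_rooms"] > upper, "n_hot_rooms"] = upper

# Floor lower outliers in rainfall at 0.3 times the 1st percentile
lower = 0.3 * np.percentile(df["rainfall"], 1)
df.loc[df["rainfall"] < lower, "rainfall"] = lower
```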

Finally, we will perform some data transformations. First, we drop the bus_ter variable, since, as noted above, it carries no information. Then we create an avg_dist variable from dist1, dist2, dist3 and dist4 to get the average distance to the employment hub. After that, we use Scikit-learn's LabelEncoder to transform the airport variable and one-hot encoding to transform the waterbody variable, dropping the original waterbody column in the process.
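A sketch of these transformations. I use pandas' get_dummies in place of a separate One-Hot Encoder step, since it removes the original waterbody column by itself; dropping dist1 to dist4 after creating avg_dist is my assumption:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Drop the constant bus_ter variable
df = df.drop("bus_ter", axis=1)

# Average distance to the employment hub
df["avg_dist"] = (df["dist1"] + df["dist2"] + df["dist3"] + df["dist4"]) / 4
# Assumed: the individual distance columns are no longer needed
df = df.drop(["dist1", "dist2", "dist3", "dist4"], axis=1)

# Label-encode the binary airport variable (NO/YES -> 0/1)
df["airport"] = LabelEncoder().fit_transform(df["airport"])

# One-hot encode waterbody; get_dummies removes the original column itself
df = pd.get_dummies(df, columns=["waterbody"])
```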

#5 ~ Modelling

It is important to standardize the training and test data, because most machine learning models converge much faster when the features are on the same scale. Standardization rescales each feature to have a mean of 0 and a standard deviation of 1. To calculate the mean and standard deviation of each feature and apply the transformation to every observation, we use sklearn's StandardScaler:
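A sketch of the scaling step (Sold is assumed to be the target column). Strictly speaking, the scaler should be fitted on the training data only; here I follow the article's order and scale before splitting:

```python
from sklearn.preprocessing import StandardScaler

# Separate the features (X) from the target (y)
X = df.drop("Sold", axis=1)
y = df["Sold"]

# Rescale every feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```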

The last thing to do before training our model is to split the dataset. I split the data randomly, 80/20, into a training set and a test set. We need to do this so we can estimate the predictive performance of our model on the test set (unseen data).
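The split itself, using sklearn's train_test_split (the random_state value is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

# 80/20 random split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)
```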

We then train the model and check its coefficients and intercept.
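A minimal sketch of the training step with sklearn's LogisticRegression, using default hyperparameters:

```python
from sklearn.linear_model import LogisticRegression

# Fit the logistic regression model on the training set
clf = LogisticRegression()
clf.fit(X_train, y_train)

# One coefficient per feature, plus the intercept
print(clf.coef_)
print(clf.intercept_)
```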

#6 ~ Model Evaluation

In this step, we will evaluate the performance and accuracy of the machine learning model. A commonly used tool for classification problems is the confusion matrix, which compares predicted and actual classes and from which many other important metrics are derived. Based on the predictions, we can conclude that the model predicts whether a property can be sold correctly for 66.6% of the test data.
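A sketch of the evaluation, with accuracy computed alongside the confusion matrix:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Predict the test set and build the confusion matrix
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# Fraction of test properties classified correctly (66.6% in this article)
print(accuracy_score(y_test, y_pred))
```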

#7 ~ Model Deployment

This is a relatively simple project; usually we would also iterate, perform feature selection, or compare against other algorithms. After deployment, we also need to collect the model's results and receive feedback about its performance and its impact on the production environment. By analyzing this information, data scientists can improve the model's accuracy, and thereby its practicality. Once a satisfactory model is developed, it is implemented in the production environment.

Thanks for reading! If you enjoyed the post, I would appreciate your support: applaud via the clap (👏🏼) button below, or share this article so others can find it.

I hope you now have a basic understanding of the data science life cycle and of how to think at each stage to guide you through a successful data science project. In the end, I hope you have learned how to use logistic regression techniques. You can also find the full project on the GitHub repository.
