Linear Regression vs. Logistic Regression Using Python
In the field of data science, professionals often face the challenge of turning the problems hidden within data into practical, actionable insights. One way to do this is with regression analysis. In this article, I would like to walk you through the main regression techniques we can use to discover relationships in data, and how to approach them using Python.
There are two types of regression techniques based on the outcome variable. The first is linear regression, and the second is logistic regression:
Linear regression is a technique that estimates the linear relationship between a continuous dependent variable Y and one or more independent variables (X). For example, we could model the relationship between the price of a product and the number of sales. When we have two or more independent variables and we want to estimate the linear relationship between those variables (X) and one continuous dependent variable (Y), we use what is known as multiple linear regression. Just remember, continuous variables are variables that can take on any real value between their minimum and maximum values.
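In its simplest form, the model is a straight line, Y = β0 + β1X + ε, where β0 is the intercept, β1 is the slope associated with the predictor X, and ε is the random error term; multiple linear regression simply adds one slope term per predictor.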
On the other hand, we have logistic regression. Let’s think about this: how likely is it that you renew your subscription as a Medium member? This question deals with discrete events, or categorical data. Categorical data is a type of qualitative data that can be grouped into categories instead of being measured numerically. Problems like these can be modelled with a logistic regression based on one or more independent variables X, where the dependent variable Y can take two or more possible discrete values.
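As a minimal sketch of what this looks like in practice (the renewal data below is made up purely for illustration), statsmodels can fit a binomial logistic regression through its formula API:
import pandas as pd
from statsmodels.formula.api import logit
# Hypothetical example: did a member renew (1) or not (0),
# given how many months they have been subscribed?
renewals = pd.DataFrame({
    "renewed":           [0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
    "months_subscribed": [1, 3, 4, 5, 2, 8, 12, 10, 18, 24],
})
log_model = logit("renewed ~ months_subscribed", data=renewals).fit()
print(log_model.summary())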
Now, a question might arise at this point: which modelling technique should you use, and how can you be sure it is the right choice?
First, apart from considering the business context and needs, you will have to check the assumptions that must hold to justify the use of a particular technique. We won’t go too deep into these assumptions here, but I’d like to mention them so you can review them when the time comes:
Assumptions for linear regression:
1. Linearity: Each predictor variable (Xi) is linearly related to the outcome variable (Y).
2. Normality: The errors are normally distributed.
3. Homoscedasticity: The variance of the errors is constant or similar across the model.
4. Independent observations: Each observation in the dataset is independent.
One more assumption to add for multiple linear regression:
No multicollinearity: No two independent variables, Xi and Xj, can be highly correlated with each other.
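A quick way to check this assumption in Python is the variance inflation factor (VIF) from statsmodels; as a rough rule of thumb, VIF values above about 5-10 flag problematic multicollinearity. A minimal sketch with made-up predictor columns:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Hypothetical predictors - swap in your own X columns
X = pd.DataFrame({
    "price":    [10, 12, 9, 15, 11, 14],
    "ad_spend": [100, 150, 90, 200, 120, 180],
})
X_const = sm.add_constant(X)  # VIF expects an intercept column
for i, col in enumerate(X_const.columns):
    print(col, variance_inflation_factor(X_const.values, i))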
What about the logistic regression model? Some of the assumptions behind binomial logistic regression are a little more complex to explain. They involve concepts like the odds (the odds of a given probability p equal p divided by 1 - p) and maximum likelihood estimation (MLE). And just like linear regression, we assume little to no multicollinearity: the X variables should not be highly correlated with each other.
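For example, a probability of 0.8 corresponds to odds of 0.8 / 0.2 = 4 (or “4 to 1”), and logistic regression models the natural log of those odds as a linear function of the predictors:
import numpy as np
p = 0.8
odds = p / (1 - p)       # 4.0
log_odds = np.log(odds)  # ~1.386, the quantity the model treats as linear in X
print(odds, log_odds)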
Second, let’s consider the data you are working with and the problem you want to address. A few things to consider regarding this matter are:
1. What is the business problem? What is the question or what do you want to predict?
2. Which variable in your data do you want to be the dependent variable Y?
3. Is this variable continuous or discrete?
For instance, let’s imagine you want to predict a patient’s response to a given medical treatment.
> How much does each element of the treatment influence the response? If the goal is to lower the cholesterol level, then cholesterol level could be our dependent variable (outcome). Since cholesterol level is a continuous variable, you could use linear regression to address this task.
Third, use Python to check whether the underlying assumptions actually hold.
For this example, we will use the penguins dataset from Seaborn and ordinary least squares (OLS), a method that estimates the parameters of a linear regression model by minimizing the sum of squared residuals.
# First, import packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
penguins = sns.load_dataset("penguins")
# Keep Adelie and Gentoo penguins, drop missing values
penguins_sub = penguins[penguins["species"] != "Chinstrap"]
penguins_final = penguins_sub.dropna()
penguins_final.reset_index(inplace=True, drop=True)
# Create pairwise scatterplots of the dataset
sns.pairplot(penguins_final)
plt.show()
Based on the previous scatterplots, we can observe a few linear relationships; let’s explore the one between body_mass_g and bill_length_mm.
# Import ols function
from statsmodels.formula.api import ols
# Subset Data
ols_data = penguins_final[["bill_length_mm", "body_mass_g"]]
# Write out formula
ols_formula = "body_mass_g ~ bill_length_mm"
# Build the OLS model and fit it to the data
OLS = ols(formula=ols_formula, data=ols_data)
model = OLS.fit()
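With the model fitted, printing the summary gives the estimated intercept and slope along with the R-squared and p-values, which is worth a look before moving on:
# Inspect estimated coefficients, R-squared, and p-values
print(model.summary())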
Now, with the model built, let’s check the assumptions. But first, isolate the bill_length_mm column, calculate the predicted values using model.predict(X), and save the model residuals.
# Subset X variable
X = ols_data["bill_length_mm"]
# Get predictions from model
fitted_values = model.predict(X)
# Calculate residuals
residuals = model.resid
Checking the assumptions
1. Linearity: create a scatterplot of X against Y with the regression line.
sns.regplot(x="bill_length_mm", y="body_mass_g", data=ols_data)
plt.show()
2. Normality: create a histogram of the residuals to check this assumption.
fig = sns.histplot(residuals)
fig.set_xlabel("Residual Value")
fig.set_title("Histogram of Residuals")
plt.show()
3. Homoscedasticity: create a scatterplot of the fitted values against the residuals.
fig = sns.scatterplot(x=fitted_values, y=residuals)
fig.axhline(0)
fig.set_xlabel("Fitted Values")
fig.set_ylabel("Residuals")
plt.show()
4. Independent observations: this assumption is more about data collection. For this dataset, there is no reason to think the measurements of one penguin are related to those of another, so we can treat the observations as independent.
Lastly, looking at the previous graphs, the scatterplot shows a linear trend, the residuals look roughly normally distributed, and they show no clear pattern around zero. We can therefore conclude that the assumptions are reasonably satisfied and that linear regression was an appropriate choice.
In summary, whenever you encounter business challenges involving the relationships between specific variables, regression analysis gives you a statistically grounded way to address them, and checking the assumptions first lets you have confidence in the reliability of the results.