Logistic Regression

RIO
RIO
Nov 5 · 4 min read

Firstly, I am going to explain regression. Regression is one statistical methods which is used for predicting and factorizing the target phenomenon.

Regression

Regression analysis provides an equation which illustrates the relationship between variables. Therefore, dependent variable and independent variables have to be prepared for the regression analysis.

Dependent variable is a variable that you want to predict or analyse. This variables cannot be changed intentionally because this number is only the result of equation. Dependent variable is also called target variable. On the other hand, independent variable is a variable which is related to dependent variable and is used for providing the dependent variable.

Let me give you an example. If you want to assume the salary in the future, you can set the dependent variable the amount of salary in the future(Let the amount of salary in the future Y). And in the next step, you should try to find appropriate dependent variables. In this case, we will use the age of the person as an independent variable.(Let the age of the person X). And then through the regression analysis, we can find this kind of equation(diagram below)

In this case, an independent variable(age) can be used to illustrate the dependent variable(salary).

Now we could find the relationship between the age and the salary from regression analysis.

And this type of regression is called single regression analysis.

We can choose more than two independent variables, then the regression will be called multiple regression analysis. This type of regression is more popular in the business and is easy to apply the real business situation.

For instance, when a beer shop wants to predict the number of beers sold in summer(Let this number Y), they can set temperature(T), the number of pedestrian walking in front of the shop(P), and the number of events held near the shop(E) as independent variables. Then use regression analysis to this situation, the analysis provides an equation like below.

Y = 0.5*T + 1000*P + 50*E

How can we see this equation? It’s easy. The degree of important is shown as each coefficients. If the coefficient of particular independent variable is large, the independent variable makes huge influence to the dependent variable(target number). So, in this example, the owner of this beer shop can assume the number of pedestrian walking in front of the shop is the most important factor.

Logistic Regression

Logistic regression is also regression. But main difference is in the dependent variable. In normal regression which is shown above, the dependent variable is continuous number. On the other hand, in logistic regression, the dependent variable is categorical. Categorical means the number represents the characteristics of person, or gender(man 0, woman 1), but the number has no mathematical meanings.

Famous instance is the use of distinction spam mail. We have to make a rule to use categorical number. Rule is like this. If the mail is spam, the result is 0.If the mail is not spam, the result is 1. Secondary, we choose dependent variables like normal regression. In this case, we use the count of particular words appeared in the mail. (Let the count of lucky: a, nice: b, hello: c) Then use logistic regression.

Y(spam or not) = 0.4a + 0.03b +0.2c

Then this equation provides the answer ranged from 0 to 1. In the logistic regression, threshold is prepared. When threshold is 0.5, the result above 0.5 is classified as spam, and the result below 0.5 is classified as not spam mail. This is simple example of logistic regression.

This logistic regression can be used to determine what kind of consumer will buy the product. We can set purchase(0) or not(1) as an independent variable. Then the importance of each consumer’s characteristic will be understood by the each coefficients.

Multicollinearity

When we use regression analysis, we have to care about multicollinearity. Multicollinearity happens when we choose an independent variable which is highly correlated with one or more of the other independent variables in a multiple regression equation. With multicollinearity, the regression does not have significant meanings, not accurately illustrates the dependent variable.

Cause and Effect

Moreover, the analyst must pay attention to relationship between an independent variable and dependent variables, especially for cause and effect relationship. For example, let temperature Y and the number of beer sold X. And if the regression analysis provides Y = 0.1X + 10, how do you think this relationship? It seems to have increasing of the number of beer sold make higher temperature. But is that possible? In this case, we cannot say this relationship exists. It should be reversed. The higher temperature increases the number of beer sold. Without attention to this problem, we may misunderstand the result of regression analysis.

RIO

Written by

RIO

blockchain, marketing, data-analysis, programming

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade