Predictive Analysis in Python

Published in My Data Science Journey, Mar 22, 2019

I am a newbie to machine learning, and I will be attempting to work through predictive analysis in Python to practice how to build a logistic regression model with meaningful variables.

What is predictive analysis?

Predictive analysis uses historical data to make predictions about what might happen in the future. The data is gathered in a basetable, which consists of three important components: the population, the candidate predictors, and the target. The population is the group of people or objects you wish to make predictions for. The candidate predictors describe the people or objects in the population and hold information that could be used to predict the event. Finally, the target holds information about the event to predict: it is one if the event occurs and zero otherwise.

In this predictive analysis, we are going to consider a non-profit organization that has a donor database of people who donated in the past. This organization is considering sending a letter to its donors to ask them to donate to a specific project.

One option here is to send the letter to all the candidate donors, but this could be very expensive.
Predictive analysis allows us to identify the donors who are most likely to donate, so the letter can be sent only to them.

Logistic Regression

Logistic regression is a predictive analysis technique that predicts whether something is true (1) or not (0). It is widely used for classifying data and for explaining the relationship between a binary target variable and its predictors.

If we plot the target as a function of age for all donors and fit a regression line through the points, the line has the form a*x+b, where a is a positive number. Here a is called the coefficient of age, and b is called the intercept. If we plot the target as a function of the time since the last donation for each donor, we can see that those who donated recently are more likely to donate; in this case, the coefficient of recency is negative.

We could make a prediction using one variable, or build a more complicated model by adding other variables. We can build a logistic regression model using the linear_model module from scikit-learn.
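The original code for this step is not reproduced here, so the following is only a minimal sketch of what such a fit could look like. It assumes a pandas DataFrame named basetable with the columns age, gender_F, time_since_last_gift, and target; the data is synthetic and made up for illustration, so its numbers will not match the output shown in this article.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the donor basetable (hypothetical data, not the article's).
rng = np.random.default_rng(0)
n = 5000
basetable = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "gender_F": rng.integers(0, 2, size=n),
    "time_since_last_gift": rng.integers(30, 2000, size=n),
})

# Simulate a binary target whose odds rise with age and fall with recency.
z = (-2.5
     + 0.007 * basetable["age"].to_numpy()
     + 0.1 * basetable["gender_F"].to_numpy()
     - 0.001 * basetable["time_since_last_gift"].to_numpy())
basetable["target"] = rng.binomial(1, 1 / (1 + np.exp(-z)))

predictors = ["age", "gender_F", "time_since_last_gift"]
X = basetable[predictors]
y = basetable["target"]

# Fit the model; max_iter is raised because the predictors are not scaled.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)

# Print one coefficient per predictor, followed by the intercept.
for coef, name in zip(logreg.coef_[0], predictors):
    print(coef, name)
print(logreg.intercept_)
```

The attributes coef_ and intercept_ hold the fitted coefficients and the intercept, which is the kind of information printed in the result that follows.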

#result of printing
0.007178355658921441 age
0.11430414536794431 gender_F
-0.0013087501133203447 time_since_last_gift
[-2.54149728]

So, our logistic regression model looks as follows:

-2.54 + 0.0072 * age + 0.1143 * gender_F - 0.0013 * time_since_last_gift

For example, take a 70-year-old female donor who made her last donation 120 days ago.

= -2.54 + 0.0072 * 70 + 0.1143 * 1 - 0.0013 * 120
= -2.0777
#Then we apply the logistic (sigmoid) function:
1/(1+e^-(-2.0777)) = 0.111

The logistic function maps any real number to a value between 0 and 1, so its output can be read as a probability. For the example above, a 70-year-old female donor who made her last donation 120 days ago, the calculated probability of making a donation is about 11%.
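As a quick check, the same arithmetic can be done in plain Python. This sketch uses the printed coefficients rounded to four decimals and the intercept rounded to -2.54; the variable names are my own and are only for illustration.

```python
import math

# Rounded values taken from the printed model output.
intercept = -2.54
coefficients = {"age": 0.0072, "gender_F": 0.1143, "time_since_last_gift": -0.0013}

# The example donor: 70 years old, female, last donation 120 days ago.
donor = {"age": 70, "gender_F": 1, "time_since_last_gift": 120}

# Linear part of the model: intercept plus coefficient times value for each predictor.
linear_score = intercept + sum(coefficients[name] * donor[name] for name in coefficients)

# The logistic (sigmoid) function turns the linear score into a probability.
probability = 1 / (1 + math.exp(-linear_score))

print(round(linear_score, 4), round(probability, 3))  # -2.0777 0.111
```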

The good news is that we don’t have to calculate the predicted probabilities manually in Python. We can use the predict_proba method on the logreg object to calculate them. predict_proba returns a two-dimensional array with one row per donor. The first number in each row is the probability that the donor will not donate (target 0), and the second number is the probability that the donor will donate (target 1).

[[0.93427169 0.06572831]
 [0.9454883  0.0545117 ]
 [0.9185279  0.0814721 ]
 [0.95269877 0.04730123]
 [0.94745512 0.05254488]]
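An array of this shape could be produced along the following lines. This is a sketch on randomly generated data (three made-up numeric predictors and a made-up target), not on the donor basetable, so the values will differ from those above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 500 observations, 3 numeric predictors, binary target.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)

logreg = LogisticRegression().fit(X, y)

# One row per observation; columns follow logreg.classes_, here [0, 1].
probabilities = logreg.predict_proba(X[:5])
print(probabilities)

# The second column is the probability of target 1.
donate_probability = probabilities[:, 1]
print(donate_probability)
```

Each row sums to 1, so in practice only the second column is needed when ranking donors.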

Calculating AUC

The AUC value assesses how well a model can order observations from a low probability of being the target to a high probability of being the target. An AUC of 0.5 corresponds to random ordering and 1 to perfect ordering. In Python, the roc_auc_score function from sklearn.metrics can be used to calculate the AUC of the model. It takes the true values of the target and the predictions as arguments.

#result of the AUC calculation using the variables age, gender_F, and time_since_last_gift
0.63
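To see what roc_auc_score measures, here is a tiny example with four hand-made observations (the values are made up for illustration). Three of the four possible (target 1, target 0) pairs are ordered correctly, which gives an AUC of 0.75.

```python
from sklearn.metrics import roc_auc_score

# True targets and predicted probabilities for four hypothetical donors.
y_true = [0, 0, 1, 1]
predictions = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (donor, non-donor) pairs in which the donor
# received the higher predicted probability: here 3 out of 4.
auc = roc_auc_score(y_true, predictions)
print(auc)  # 0.75
```

For a fitted model, the second argument would typically be the second column of predict_proba.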

Variable Selection

We have more variables that we could include in our model, but we have to choose the set of variables wisely.

Let’s define a function named auc_score that builds a model using a given set of variables as predictors and calculates its AUC.

#result of auc score using the max_gift, min_gift, and mean_gift
0.7125
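A possible shape for auc_score is sketched below. The argument order (variables, target, basetable) follows the call used later in this article; the body, the choice to score on the same data the model was fitted on, and the synthetic basetable are my assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_score(variables, target, basetable):
    """Fit a logistic regression on `variables` and return its AUC on the same data."""
    X = basetable[variables]
    y = basetable[target]
    logreg = LogisticRegression(max_iter=1000)
    logreg.fit(X, y)
    predictions = logreg.predict_proba(X)[:, 1]
    return roc_auc_score(y, predictions)

# Hypothetical basetable: max_gift drives the target, noise does not.
rng = np.random.default_rng(3)
n = 1000
basetable = pd.DataFrame({
    "max_gift": rng.normal(100, 30, size=n),
    "min_gift": rng.normal(20, 5, size=n),
    "noise": rng.normal(size=n),
})
signal = (basetable["max_gift"] - 100) / 30
basetable["target"] = (signal + rng.normal(size=n) > 1).astype(int)

auc_informative = auc_score(["max_gift"], "target", basetable)
auc_noise = auc_score(["noise"], "target", basetable)
print(auc_informative, auc_noise)
```

Scoring on the training data keeps the sketch short; a held-out test set would give a more honest estimate.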

Next, we are going to define the next_best_variable function, which finds the variable that should be added to the variable list in the next step.
We then write a for loop that calls it repeatedly, adding the best remaining variable to our model at each step.

Variable added in step 1 is max_gift
Variable added in step 2 is number_gift
Variable added in step 3 is time_since_last_gift
Variable added in step 4 is mean_gift
Variable added in step 5 is age
['max_gift', 'number_gift', 'time_since_last_gift', 'mean_gift', 'age']

Let’s find out the AUC score for our current variable list:

current_list_auc = auc_score(current_variable, 'target', basetable)
print(current_list_auc)
0.768756710130262
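The forward selection described above could be sketched as follows. The signature of next_best_variable, the loop, and the synthetic basetable (with made-up columns strong, weak, and noise) are assumptions for illustration; auc_score is repeated in compact form so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def auc_score(variables, target, basetable):
    """AUC of a logistic regression fitted on `variables` (scored in-sample)."""
    X, y = basetable[variables], basetable[target]
    logreg = LogisticRegression(max_iter=1000).fit(X, y)
    return roc_auc_score(y, logreg.predict_proba(X)[:, 1])

def next_best_variable(current_variables, candidate_variables, target, basetable):
    """Return the candidate that gives the highest AUC when added to the current list."""
    best_auc, best_variable = -1.0, None
    for variable in candidate_variables:
        auc = auc_score(current_variables + [variable], target, basetable)
        if auc > best_auc:
            best_auc, best_variable = auc, variable
    return best_variable

# Hypothetical basetable with one strong, one weak, and one useless predictor.
rng = np.random.default_rng(4)
n = 1000
basetable = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
signal = 1.5 * basetable["strong"] + 0.8 * basetable["weak"]
basetable["target"] = (signal + rng.normal(size=n) > 1).astype(int)

candidate_variables = ["noise", "weak", "strong"]
current_variables = []

# Forward stepwise selection: add the best remaining variable at each step.
for step in range(2):
    best = next_best_variable(current_variables, candidate_variables, "target", basetable)
    current_variables.append(best)
    candidate_variables.remove(best)
    print(f"Variable added in step {step + 1} is {best}")

print(current_variables)
```

The number of steps is fixed at two here; on real data it would be chosen by watching when the AUC stops improving.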

Credit goes to the DataCamp course Foundations of Predictive Analytics in Python. In this course, you will learn how to build a logistic regression model with meaningful variables (covered here). You will also learn how to use this model to make predictions and how to present it and its performance to business stakeholders. I highly recommend the course.

Final Thought

You may find this study in my GitHub account as part of the Datacamp repository.

I wrote this article to improve my data analysis and machine learning skills, and I am still a learner. Please share any additional information or comments on this article.
