#66DaysOfData — Days 18 to 22: Building a Predictive Model With Our Risk Formula

Jack Raifer Baruch · Published in The Startup · 5 min read · Jan 28, 2021

Previously, we created an Attrition Risk formula. Now it is time to build a model with it.

We will keep it simple to start with and build from there in future articles, so it does not become overwhelming. As always, we start by importing the libraries we will need; in this case, NumPy and Pandas will suffice:
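In code, that is just the usual two lines:

```python
import numpy as np
import pandas as pd
```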

Since this is a formula-based model, it will need some of the static variables we discovered previously, so let us create them:
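The block looks something like the sketch below; the numbers in the means list and the two extra constants are placeholders, not the actual values derived in the previous articles:

```python
# Per-trait multipliers in OCEAN order (Openness, Conscientiousness,
# Extraversion, Agreeableness, Neuroticism). Placeholder values only.
means = [0.12, -0.25, 0.08, -0.15, 0.30]

# Extra constants the risk formula needs. Names and values are illustrative.
baseline_risk = 0.10
scaling_factor = 1.0
```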

Now that our variables are all set, let us keep things simple and create a function called PredAttRisk (Predict Attrition Risk). This function takes a single parameter: a list of lists, so we can run multiple queries at one time (more on this a bit later). Each inner list should contain 6 items: first an integer with the id number of the entry (or person), and then 5 floating point numbers from 0 to 1 with the data for each of the OCEAN traits in that specific order (Openness, Conscientiousness, Extraversion, Agreeableness and finally Neuroticism). We will call this list of lists results, and it will be the sole parameter for our function. Here is the whole function, and I will explain the rest below:
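Here is a sketch of what PredAttRisk can look like, assuming a simple linear weighted sum built on the means list and constants defined above; the exact formula and coefficients are the ones derived in the previous article:

```python
def PredAttRisk(results):
    """Predict attrition risk for a list of lists, where each inner list is
    [id, Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism]."""
    for entry in results:
        # Weighted sum of the five OCEAN scores (entry[1:] skips the id),
        # each combined with its multiplier from the means list.
        risk = baseline_risk + scaling_factor * sum(
            trait * weight for trait, weight in zip(entry[1:], means)
        )
        # Round to 2 decimal places so the result stays readable.
        risk = round(risk, 2)
        # Risk is normalized (0 to 1), so multiply by 100 to print a percentage.
        print("Entry {}: {:.2f}% attrition risk".format(entry[0], risk * 100))
```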

As you can see, after defining the function and its input, we run a for loop over the list of lists results. For each entry, we apply the risk formula we created last time: we use the index of each part of the list containing our OCEAN information, combine it with its specific multiplier from our means list above, and then plug in the rest of the variables to complete the formula.

Then we add an extra line to round our result to 2 decimal places (if we don't do this, our results get unwieldy).

Finally, we print the result in a nice way using the .format method. Just a note: in biostatistics, risk is calculated on a normalized basis (between 0 and 1), so to make it more readable, we multiply the result by 100.

Let us create a test variable with only one list for our input:
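For example (the OCEAN scores here are made up purely for illustration; the numbers quoted further down came from the real formula and data):

```python
# One entry: id followed by the five OCEAN scores.
test1 = [[1, 0.8, 0.65, 0.45, 0.7, 0.3]]
```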

Do note the double square brackets: since our input is a list of lists, this is a list with a single list in it. Now we run our function with test1 as input and see what happens:
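That is simply:

```python
PredAttRisk(test1)
```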

And TA-DA, it works. This first entry is predicted to have an 8% risk of attrition. Now that we know it works for one entry, let us try it with several. For that, we will create a list of lists with 5 lists:
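Again, the scores below are placeholders:

```python
# Five entries, each with an id and five OCEAN scores.
test2 = [
    [1, 0.8, 0.65, 0.45, 0.7, 0.3],
    [2, 0.55, 0.9, 0.6, 0.4, 0.5],
    [3, 0.3, 0.4, 0.8, 0.6, 0.7],
    [4, 0.7, 0.5, 0.5, 0.5, 0.2],
    [5, 0.45, 0.75, 0.35, 0.85, 0.6],
]
```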

And drum roll please:
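Running the function on the new list:

```python
PredAttRisk(test2)
```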

It seems to be working well. But we can do one more test: using data straight from a data frame (like the ones we found a while back with OCEAN data) and running the function on it to make sure everything holds up. For this, we first reload the data from the CSV file into the data frame df1:
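Something like this, where the file name is a placeholder for whichever CSV holds the OCEAN data:

```python
# Reload the OCEAN data; the file name here is illustrative.
df1 = pd.read_csv("ocean_data.csv")
```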

Then we create test3, a list of lists built directly from our data frame. So as not to get overwhelmed with the results, since right now they get printed to the console, we will use the .head() method to limit the input to the first 15 rows:
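One way to build it, assuming the columns of df1 are the id followed by the five OCEAN scores in that order:

```python
# Take the first 15 rows and turn the data frame into a list of lists.
# Assumes the column layout is: id, O, C, E, A, N.
test3 = df1.head(15).values.tolist()
```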

And finally, we run our function on the test3 variable and cross our fingers (it did work, but not without first having to fix some typos, you know, the usual programming stuff):
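The call itself is just:

```python
PredAttRisk(test3)
```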

And there you have it: we have created a basic risk prediction model that outputs the risk of attrition / turnover based on OCEAN personality results. But, as always, you need to think about how this would be used in real life. Most professionals would consider a 16% risk to be exceptionally low, when in fact it is remarkably high, given that our average is 10.9%. This means that next time we need to make some tweaks to our model so that it is more readable for an end user.

See you then…

And remember, you can access all the datasets, notebooks and everything related to this project on my GitHub repo.

Next Time — Tweaking our model towards usability.

Jack Raifer Baruch

Follow me on Twitter: @JackRaifer

Follow me on LinkedIn: jackraifer

About the Road to Data Science — #66DaysOfData Series

The Road to Data Science series began after I went through the first round of Ken Jee's #66DaysOfData challenge back in 2020. Since we are starting the second round of the challenge, I thought it would be a good idea to add short articles every day where I comment on my progress.

I will be sharing all the notebooks, articles and data I can on GitHub: https://github.com/jackraifer/66DaysOfData-Road-to-Data-Science

Please do understand that I might have to withhold some information, including code, data, visualizations and/or models, because of confidentiality concerns. But I will try to share as much as possible.

Want to follow the #66DaysOfDataChallenge?

Just follow Ken Jee on Twitter @KenJee_DS and join the #66DaysOfData challenge.

You can also reach out to me at any time through LinkedIn or Twitter.
