Applied Machine Learning: The Syndicated Revolver Part 2

Thierry Damiba
Dec 19, 2018

In Part 2 we’ll be importing our data, doing some exploratory analysis, and building a quick model. You can find Part 1 here and Part 1.5 here. If you prefer pure code, you can check out the notebook on GitHub or Google Colab.

Let’s get started by importing our data.
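A sketch of the import step with `pd.read_excel`. The post’s actual workbook isn’t available, so this builds a synthetic two-sheet stand-in first; the filename, sheet names, and columns are all assumptions.

```python
import pandas as pd

# Synthetic stand-in for the real workbook (names are assumptions).
summary = pd.DataFrame({"note": ["cover page"]})
data = pd.DataFrame({"lender": ["Bank A", "Bank B"],
                     "decision": ["Commit", "Decline"]})

with pd.ExcelWriter("revolver.xlsx") as writer:
    summary.to_excel(writer, sheet_name="Summary", index=False)
    data.to_excel(writer, sheet_name="Data", index=False)

# read_excel defaults to the first sheet -- here, the unhelpful cover page.
df = pd.read_excel("revolver.xlsx")
print(df.head())
```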

That doesn’t look very useful. It turns out the Excel file actually has a second sheet, so let’s add a parameter to specify which sheet should be imported.

That looks like what we’re looking for, but we’re getting a warning about a deprecated keyword. Let’s change that in case someone has to use this in the future.
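The fix in sketch form, again against a synthetic two-sheet workbook (filename and columns are assumptions). Older pandas accepted `sheetname=`, which started raising a `FutureWarning` after it was deprecated; `sheet_name=` is the current keyword.

```python
import pandas as pd

# Synthetic stand-in workbook (names are assumptions, not the post's file).
with pd.ExcelWriter("revolver.xlsx") as writer:
    pd.DataFrame({"note": ["cover page"]}).to_excel(
        writer, sheet_name="Summary", index=False)
    pd.DataFrame({"lender": ["Bank A", "Bank B"],
                  "decision": ["Commit", "Decline"]}).to_excel(
        writer, sheet_name="Data", index=False)

# sheet_name= replaces the deprecated sheetname= keyword.
df = pd.read_excel("revolver.xlsx", sheet_name="Data")
print(df.head())
```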

Now that we’ve got our data imported, let’s get a feel for the data.

We’ll take a look at the shape, the object types, and the unique values.
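Those three checks in one sketch, run on a small illustrative frame (the column names are assumptions; the real data has 11 features):

```python
import pandas as pd

# Illustrative stand-in for the deal data.
df = pd.DataFrame({
    "amount": [100.0, 250.0, 75.0],
    "tenor": [3.0, 5.0, 3.0],
    "decision": ["Commit", "Decline", "Commit"],
})

print(df.shape)                  # (rows, columns)
print(df.dtypes)                 # float64 vs. object tells us what needs cleaning
print(df.nunique())              # unique values per feature
print(df["decision"].unique())   # the actual category labels
```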

Looks like we’re going to need to do some cleaning to get the data into a presentable format. 6 features are floats, so we’ll be able to get their summary statistics with describe. Floats are easy to work with, so no problem here.
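For the float columns, `describe` gives count, mean, standard deviation, min, quartiles, and max in one call (the columns here are placeholders):

```python
import pandas as pd

# Placeholder float features.
df = pd.DataFrame({"amount": [100.0, 250.0, 75.0],
                   "tenor": [3.0, 5.0, 3.0]})

# Summary statistics for every numeric column.
stats = df.describe()
print(stats)
```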

We’re left with 5 features that are objects. For our purposes, we need these features to be numerical. We can change our categorical values to floats for some quick analysis. Ideally we would encode these variables using something like one-hot encoding, but we just want a quick look at the data.

We can use mapping to replace the categorical values with floats.
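A minimal sketch of the mapping step, assuming a hypothetical `decision` column with Commit/Decline labels (the post’s actual column names and labels aren’t shown):

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"decision": ["Commit", "Decline", "Commit"]})

# map() swaps each category for a float; any unlisted value becomes NaN.
df["decision"] = df["decision"].map({"Commit": 1.0, "Decline": 0.0})
print(df["decision"].tolist())  # [1.0, 0.0, 1.0]
```

One caveat with this shortcut: mapping categories to 1.0 and 0.0 imposes an ordering that one-hot encoding would avoid, which is fine for a quick look but worth revisiting before serious modeling.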

Now that all of our data is numerical, let’s take a look at some plots.

First, we’ll plot a histogram of the features. Second, we’ll plot the Pearson correlation of the features.

Third, we’ll use Seaborn to create a scatter plot showing invite percent vs commit.
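A matplotlib sketch of all three plots on placeholder data (column names and values are assumptions); seaborn’s `sns.scatterplot` would be the drop-in for the third chart:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric frame standing in for the cleaned deal data.
df = pd.DataFrame({
    "invite_pct": [0.2, 0.5, 0.8, 0.9, 0.4, 0.7],
    "amount":     [50.0, 120.0, 200.0, 260.0, 90.0, 180.0],
    "commit":     [0.0, 0.0, 1.0, 1.0, 0.0, 1.0],
})

df.hist(figsize=(6, 4))                      # 1) histogram per feature
corr = df.corr(method="pearson")             # 2) Pearson correlation matrix
plt.matshow(corr)
plt.figure()
plt.scatter(df["invite_pct"], df["commit"])  # 3) invite percent vs. commit
# seaborn equivalent: sns.scatterplot(x="invite_pct", y="commit", data=df)
plt.savefig("plots.png")
print(corr.loc["invite_pct", "commit"])
```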

[Figure: Histogram]

[Figure: Count plot]

[Figure: Group-by horizontal bar chart]

Now that our data is in a good state, we can build some models to make predictions. Let’s see if we can hit an F1 score of 80% using models provided by scikit-learn. In future parts of this blog we’ll build a custom model and fine-tune our parameters, but these models will be straight out of the box.
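An out-of-the-box sketch on synthetic data (the real features aren’t available, so `make_classification` stands in); XGBoost’s `XGBClassifier` would slot into the same loop if the library is installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic commit/decline data in place of the real deal features.
X, y = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default hyperparameters only -- no tuning yet.
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(scores)
```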

Now that we’ve built the models, let’s plot them to see which one has the best F1 score.
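A quick bar-chart sketch of the comparison. Only the XGBoost figure (83%) comes from the post; the other scores here are made-up placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical F1 scores; only xgboost's 0.83 is from the post.
scores = {"logistic": 0.76, "forest": 0.80, "xgboost": 0.83}

best = max(scores, key=scores.get)
plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel("F1 score")
plt.title(f"Best model: {best}")
plt.savefig("f1_scores.png")
print(best)
```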

XGBoost leads the pack at 83%. In a few minutes, we built a model that can predict commit or decline surprisingly well. Thanks for reading; we’ll continue to ask questions of the data. Look out for Part 3, where we’ll improve our model by tuning custom parameters.
