Building a Churn Model with Keras, Flask, Heroku, and Postgres — Deploying a Usable Model to Production Pt. 1

Robert R.F. DeFilippi
Jul 13, 2018


A common problem in data science is having work sit on your local machine, only to be used by yourself and not those who requested the model. Not because they won’t find it useful, but because they will not be able to access it in its current state.

The trained model is not helpful if it only lives on your local machine, and requires your input every time someone looks at it.

How do you get it into the hands of someone so they can use it without you?

How can you get it into production so you can move on to the next thing?

There are a few ways, and I’ve found the easiest is to create a web app with Flask and deploy it to Heroku for stakeholders to use. Once you are comfortable with this process you’ll be able to update your models along with others, creating a fast iteration and deployment process.

The Final Product

Note: I originally had this as one article, but the more I included the longer it became. To make things easier, I’ve split it into two different articles.

When this is finally done, the product should:

  • Be accessed internally by stakeholders on a private server
  • Contain minimal instructions and clear error messages
  • Require no maintenance besides reweighting and updating the model
  • Show users a percentage of how likely their client is to churn
  • Provide clear results to inform business decisions

Here is a little preview of the final product:

The final product in action!

In this first part of this series I’ll go over:

  • Churn methodology;
  • Getting your data with PostgreSQL;
  • Modelling techniques with Python, and;
  • Creating and evaluating a Keras model.

In the second article I’ll go over:

  • Making a front-end application using Flask, and;
  • Deploying the model to Heroku.

How do You Measure Churn?

Different businesses, and different stakeholders within the same business, look at and measure churn differently. What is most important is to dig into what factors you think might be influencing churn for your business, start to form a hypothesis, and then model to determine what actually causes churn.

How many products do they have? How long have they been a customer? Where are they located? What industry are they in?

Reforge has a great article on how churn varies by the amount of monthly spend, but again this is only one metric to consider.

Consider these points when building your churn model, and determining what to look for within your data:

  • There is no one right way to look at churn, as all businesses think about their customer lifecycle differently.
  • Tie your model’s predictive power to revenue; retention costs vs lost revenue.
  • Do different stakeholders think about churn differently? Does a customer service rep think about retention the same way an account manager does?
  • Will churn predictions lead to better segmentation of customers? Can you get the results into your CRM?

You’ll need to start asking yourself these questions, as you start gathering information to form your churn hypothesis. You might have the data you’re looking for, but in most cases you won’t. Which means you’ll need to start tracking more events in your database or do some serious feature engineering.

In the model I built, a few of the features are:

  • City, industry, and country demographics;
  • Type of service plan;
  • Amount of additional service fees;
  • Online vs. in-store purchases;
  • Revenue and purchases per month of activity;
  • Total discounts / total customer spend;
  • Number of customer support tickets;
  • Customer support ranking;
  • Unique items purchased;
  • Orders with late shipping, and;
  • Last sign-in date on website.

There are more features than these in my model, but they are just a few to think about. There is a good Kaggle competition about churn which can also give you some ideas about what features to look for and eventually add.

Jupyter Project Structure

Now let’s set up the project folder, so we can stay organized. There are three main parts to the Jupyter part of the project.

In / I have the main notebook churn_project.ipynb, in /config I have the database connection params config.json, and in /sql are the queries to be referenced later.

$ tree -v --charset utf-8
.
├── churn_project.ipynb
├── config
│   └── config.json
└── sql
    ├── 1_query.sql
    ├── 2_query.sql
    └── more_queries_here.sql

Starting with SQL

Because writing SQL in Python/Jupyter is a hassle, I suggest using an SQL editor such as Postico to connect to your DB and write all your queries before you use them in the construction of your model’s core dataframe. Also save the connection information to connect to your DB through the Jupyter notebook.

Writing your queries here will save you a lot of time

I’m using PostgreSQL, so examples will be referenced in that language.

I won’t be including the queries I wrote for this project, however here are a few tips I can offer when you’re writing your queries to help you save time and frustration.

PostgreSQL Tip 1: Common Table Expressions

Because you’ll be trying to pull inferences out of your data, you’ll probably get bogged down in subquery hell with nesting after nesting of your SQL code.

This is not only annoying but unnecessary. With common table expressions (CTEs), you’re able to chunk your queries into sections and then reference them sequentially as you work through your project.

They are incredibly helpful and will save you time.

-- Two CTEs
WITH cte_1 AS (
    -- First section
    -- Put SQL here
), cte_2 AS (
    -- Second section
    -- Put SQL here
)
-- End both CTEs

-- Put together the CTEs
SELECT *
FROM cte_1
JOIN cte_2 USING (client_id)  -- join on whatever key links your CTEs
WHERE cte_2.foo = 'bar'

CTEs can reference each other; however, you cannot reference a CTE until it has been defined.

E.g. you can reference cte_1 in cte_2, but not cte_2 in cte_1.

You can see in the final SELECT I’ve referenced cte_1 and cte_2 successfully, because they were both previously defined in the statement.

PostgreSQL Tip 2: Built-in Time Zones

If you’re dealing with data from different time zones and you want a single source of truth, use the built-in pg_timezone_names to move all the times to UTC.

With the below code you’re able to clean your column easily, without importing your own timezone table. By placing all the date/time references at UTC you’re able to compare dates correctly.

-- Convert time_zone_data from query_table to UTC to have a common
-- time for all the data to reference
query_table.time_zone_data AT TIME ZONE 'utc' AT TIME ZONE (SELECT name FROM pg_timezone_names WHERE name = iso_time_zone)

PostgreSQL Tip 3: SUM + CASE + WINDOW Functions

In PostgreSQL you’re not able to nest aggregate functions, so SUM(COUNT(table.foo)) will throw an error.

You’re able to get around this restriction by combining an aggregate function with a CASE expression and/or a window function. In my case, I used them to find a sequential sum of unique events for a client. This is especially helpful if you’re trying to engineer features from sequential time series sales / ordering data.

E.g. imagine you track every time a client purchases a unique item, and you want to know how many unique items they have purchased as of a given date?

client_id  item_id  date      unique_event
1          a        1-1-2018  1  <-- First time client 1 bought a
1          a        1-2-2018  0
1          b        1-3-2018  1  <-- First time client 1 bought b
1          c        1-4-2018  1
1          a        1-5-2018  0
1          d        1-5-2018  1
1          a        1-6-2018  0
2          b        1-4-2017  1  <-- New client

Your query would look like this:

-- I'll explain this below
SUM(CASE column_name.unique_event
        WHEN 1 THEN 1
        ELSE 0
    END) OVER (PARTITION BY column_name.client_id
               ORDER BY column_name.date  -- order by date so the sum runs sequentially
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    AS "unique_event_count"

And your new result would be:

client_id  item_id  date      unique_event  unique_event_count
1          a        1-1-2018  1             1
1          a        1-2-2018  0             1
1          b        1-3-2018  1             2  <-- +1
1          c        1-4-2018  1             3  <-- +1
1          a        1-5-2018  0             3
1          d        1-5-2018  1             4  <-- +1
1          a        1-6-2018  0             4
2          b        1-4-2017  1             1  <-- New client

The SUM function will create a running total, but requires CASE to return +1 every time it comes across a unique purchase. The key to the formula is the OVER (PARTITION BY ...) — our window function — coupled with ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

By combining these two methods, we’re able to use the aggregate function without using a GROUP BY in the query. Instead, the grouping happens in the window function — which you can think of as a pseudo subquery — so the sum is evaluated line by line within our group. Therefore, our data is not rolled up like it would be with a usual aggregate function.

If you want to see further examples of this, check out this article here.

In the query above I SUM every line from the beginning of the group — which is partitioned by client id — until the window function finds a new client id, and then the function starts over again. This returns a new value every time the client has purchased a new product, or has performed a unique_event they had previously not performed.

Save Your Files and Configs

Save your queries as .sql and put them in /sql.

As well, set up your config.json file so it has the following values to connect to your DB. I’ll be referencing this file within Jupyter, to save the query results as dataframes.

# /config/config.json
{
  "database": "database_name",
  "schema": "schema_type",
  "user": "user_name",
  "host": "host_name",
  "port": "port_no",
  "passw": "password"
}

Move on to Jupyter

Start a new notebook in / and test the connection with the following code.

# /churn_project.ipynb
import json
import pandas as pd
import psycopg2

# Open config file
with open('./config/config.json') as f:
    config = json.load(f)

# Get connection params as a string
postgres_config_string = "host=%s dbname=%s user=%s password=%s" % (config.get('host'), config.get('database'), config.get('user'), config.get('passw'))

# Connect to db and save query as a dataframe
# If a connection cannot be made an exception will be raised here
con = psycopg2.connect(postgres_config_string)

If everything went well with the connection, start saving each query as its own dataframe.

# Save query as a dataframe
query_1 = open('./sql/1_query.sql', 'r')
df_1 = pd.read_sql(query_1.read(), con=con)
# Continue with additional queries ...

Now let’s start to put together the queried dataframes into a final dataset we can use in our model.

Quickly, I want to highlight the scratchpad extension for Jupyter. It has saved me a lot of time, and provides an area for working through problems before you put them into the notebook.

Example of the Scratchpad

Further Feature Engineering

I could have done the merging directly in the SQL editor, but I wanted to save some of it for Python. If you want to do it all with SQL, try out the CTEs shown above and combine them all at the end.

I used the pandas method .merge() on all the queries we saved as dataframes, and here are few things to consider:

  • Choose what columns you want to keep in the merge, as some of them might become redundant;
  • Create a new merged dataframe so you’re able to reference your previous dataframes;
  • Use on rather than left_on or right_on for simpler joining, and;
  • Make sure your index columns between dataframes have the same name, or else you’ll have to deal with duplicate columns later.
# Here is an example of the above
# Creating new_data_frame from df_1 and df_2 with only
# columns 1, 2, and 3 from a Left Join (the join key must be in both selections)
new_data_frame = pd.merge(df_1[['use_this_index', 'col_1']],
                          df_2[['use_this_index', 'col_2', 'col_3']],
                          on='use_this_index', how='left')

Intro to Binning and Labelling

For the final model built below, all of the inputs are real numbers or integers, and the columns with strings or categories were labelled and encoded with integers so they could be used as inputs. Additionally, I wanted to bin some of the larger integer values, to rein in the larger values and their variability into something more manageable.

Now to start binning and labeling some of our variables.

For features such as total spend, which are unbounded positive real numbers, I could leave them as they were, but my hypothesis was that customers fall into distinct categories based on the amount spent.

E.g. if total spend is less than $10,000, label this as 0, and customers from $10,000 to $20,000 would be labelled as 1. However, these bins would be different for your data.

This is easily done with pd.cut() where you can bin and label values into discrete intervals within the same method.

print df['not_binned_values']

1        55000
2        12000
3        99000
4         5000
5         1000
...

# Bin df['not_binned_values'] into discrete categories
bins = [0, 10000, 20000, 50000, 100000, 500000]
labels = [0, 1, 2, 3, 4]
df['binned_values'] = pd.cut(df['not_binned_values'], bins=bins, labels=labels, include_lowest=True)

print df['binned_values']

1        3
2        1
3        3
4        0
5        0
...

Similar to binning above, I need to take categorical data — such as service plan type or city — in my data and convert it to 0, 1, 2, etc. so it can be processed by our model.

First, make sure the column is of type category by calling df['foo'].astype('category'). Next, I use .cat.categories on the column to get an indexed list of the categories, which we can enumerate to integers mapped in a dictionary.

We then assign the integer codes to the column we’re factorizing, keeping the dictionary to map codes back to categories later.

# Specify this is categorical data
df['city'] = df['city'].astype('category')
# Get the values to associate with each category
city_codes = dict(enumerate(df['city'].cat.categories))
# Assign the integer codes to the categories
df['city'] = df['city'].cat.codes
print df['city']

1        12
2         8
3         4
4         1
5         2
...

Finally, to convert True or False values imported from SQL — should there be any — into 1s and 0s, just multiply the column by 1. Also, make sure the column is of boolean type. Now you’re able to input boolean values as integers into the model.

print df['true_false_column']

1         True
2        False
3         True
4         True
5        False
...

# Convert a column with True and False values to 1s and 0s
df['0_1_column'] = df['true_false_column'].astype('bool') * 1
print df['0_1_column']

1        1
2        0
3        1
4        1
5        0
...

Transforming and Rescaling Our Data

Now that I have all the bins needed, I can consider rescaling the data, in case standardized or normalized data creates a better model.

There are a few examples here on when to transform your data.

from sklearn.preprocessing import StandardScaler, normalize

# Standardize the dataframe
# Values are transformed to have a mean of 0, and sd of 1
scaler = StandardScaler().fit(df)
std_df = scaler.transform(df)

# Normalize the dataframe
# Values are transformed to be between 0 and 1
norm_df = normalize(df, norm='l2')

Big values accumulating in your network are not good. Just as I binned everything above, I want to keep our input to the model nice and clean.

For example, when I normalized my data I saw about a 5% improvement in the accuracy of my final model, however when I standardized it there was a substantial decrease in accuracy.

Key take away here is you don’t know how some models will react to transforming the data, so try a few out and see how the results change.

Checking for Multicollinearity

Before I move on to modelling with the final dataframe, I want to check for multicollinearity within the churn features I’ve chosen. This is when two or more of the predictors in a model are moderately or highly correlated. It will throw a wrench into our model, as our weights will be unstable and will vary greatly from one sample to the next.

Even though I’m building a neural network, the estimation of the optimal weights for each node in the network could be subject to the harmful effects of collinearity, resulting in a model with poor predictive ability.

Consider creating a regression model where you want to estimate housing prices (y), and two of the inputs are the number of bathrooms (x_1) and the number of toilets (x_2). Obviously, these two variables will ‘move together’ — unless there are bathrooms in houses that don’t have toilets? — because they are highly correlated. Therefore the model will have a hard time understanding which of x_1 or x_2 is responsible for a change in y, so the coefficient estimates for both of the inputs will be incorrect.

I can check the almost-final dataframe for multicollinearity using the Variance Inflation Factor (VIF) on each feature we’re considering. The VIF shows how much larger the variance of a coefficient’s estimator is when x_1 and x_2 are correlated, compared to when they are uncorrelated.

Each input is given a VIF score, and high VIFs are a sign of multicollinearity. Here are some quick guidelines for determining which features to keep and which ones to remove.

VIF = 1      --> Not correlated
1 < VIF < 5 --> Moderately correlated
VIF >= 5 --> Highly correlated

Here is the code to look through your dataframe and determine the features which display multicollinearity. The add_constant function is used to correctly calculate the VIF, as variance_inflation_factor expects the presence of a constant in the matrix of explanatory variables.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def calculate_vif_(df, thresh=5):
    '''
    Calculates the VIF for each feature in a pandas dataframe
    A constant must be added for variance_inflation_factor or the
    results will be incorrect
    :param df: the pandas dataframe
    :param thresh: the max VIF value before the feature is removed
                   from the dataframe
    :return: dataframe with features removed
    '''
    const = add_constant(df)
    vif_df = pd.Series([variance_inflation_factor(const.values, i)
                        for i in range(const.shape[1])],
                       index=const.columns).to_frame()

    vif_df = vif_df.sort_values(by=0,
                                ascending=False).rename(columns={0: 'VIF'})
    vif_df = vif_df.drop('const')
    vif_df = vif_df[vif_df['VIF'] > thresh]

    print 'Features above VIF threshold:\n'
    print vif_df[vif_df['VIF'] > thresh]

    col_to_drop = list(vif_df.index)

    for i in col_to_drop:
        print 'Dropping: {}'.format(i)
        df = df.drop(columns=i)

    return df

# Check your dataframe for multicollinearity
# The values are just examples
calculate_vif_(df)

Features above VIF threshold:

                 VIF
feature_1   6.298834
feature_2   5.423374
Dropping: feature_1
Dropping: feature_2

Given the results above, feature_1 and feature_2 should be removed from the dataframe.

The Final Result Before We Model

Feature engineering is now complete and the final dataframe should be ready to train the model.

Below is just an example of what the df should look like. Two columns you need to pull out of the dataframe are the client ids (c_id), so you’re able to retrieve specific clients later on in the Flask app, and whether or not they have churned (churn?), which will be our y later on. Keep the rest of the features as the independent features in the df.

If a client has churned y=1 and if they are still active y=0.

print df

c_id   churn?   feature_3  feature_4  feature_5   feature_6 ...
1      1        1          0          106256.54   9090
2      0        1          0          502.22      240
3      0        0          3          800.99      120.2
4      0        1          4          12.45       1456
5      0        0          2          432.02      222.22
6      1        1          1          726.90      123.99

# Save variables for use later
y = df['churn?']
ids = df['c_id']
df = df.drop(columns=['churn?', 'c_id'])

Building The Keras Model

For the model, I chose a sequential Keras model, with one input layer and a single output: the percent probability of churn. The model is defined by a linear stack of layers — defined on creation — with various activation functions for each layer.

A visualization of a Keras model

Let’s go over some of the major points of building a sequential Keras model.

Units: The number of neurons for each layer of the model.

Input / Output Dim: A positive integer, which is the dimensionality of the input/output space. E.g. if the churn model has 21 features, the input dimension would be 21 for the first layer, and the final output layer would have a dimension of 1, which is the percent probability of churn.

Activation: A function applied to the “weighted sum” of the inputs to a neuron, to decide if it should be “fired” or not. Depending on the function used, you’ll have different results for your neural net.

Because I’m doing a classification problem, I want to use a sigmoid function for the final layer, but not for every layer. If I did, the model would suffer from vanishing gradients. Not good.

Because the derivative of the sigmoid function is always smaller than one, adding more and more layers with this activation function produces gradients that shrink toward zero quite quickly. This is also a problem with tanh activation functions.

This means that if we use sigmoid for every activation, each layer squashes a large input region into a small output range, so by the second, third, and fourth layers, changes in the first layer’s inputs have little effect on the final output.
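
To make this concrete, here is a quick back-of-the-envelope sketch (separate from the churn model itself) of why stacked sigmoids starve early layers of gradient: the sigmoid derivative never exceeds 0.25, so the signal passed back through n sigmoid layers is scaled by at most 0.25^n.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at x = 0), so even in the best case the
# gradient passed back through n sigmoid layers is scaled by at most 0.25**n
for n_layers in [1, 2, 5, 10]:
    print('{} sigmoid layers -> gradient scaled by at most {}'.format(
        n_layers, sigmoid_derivative(0.0) ** n_layers))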

It’s best to find activation functions that fit the type of problem trying to be solved, and experiment during the hyperparameter tuning phase.

Dropout: To prevent overfitting, we randomly ignore a fraction of a layer’s units every time we move through the network during training. This can be applied to visible or hidden layers. Good practice is to use a small dropout value of 20% — 50%, as too low a value will not have a substantial effect on results.
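
In Keras this is just an extra Dropout layer between the Dense layers. A minimal sketch, where the 21 input features and the 30% rate are only examples you would tune for your own data:

from keras.models import Sequential
from keras.layers import Dense, Dropout

# Example only: a hidden layer followed by 30% dropout
model = Sequential()
model.add(Dense(units=50, activation='relu', input_dim=21))  # 21 features is just an example
model.add(Dropout(0.3))  # randomly drops 30% of this layer's outputs during training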

Loss / Cost Function: This function represents the cost (loss) the model pays for an inaccurate prediction. For linear regression the classic is mean squared error (MSE), and for our classification model we’ll use binary cross-entropy, also known as log loss.

The cross-entropy loss is fairly standard for binary classification, as it provides fast learning when y_hat (predicted values) differs significantly from our y (labels or true values). As well, this function especially penalizes the network when highly confident guesses — both for success and failure — are wrong.
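
To see that penalty in action, here is the log loss for a single prediction computed by hand (a standalone illustration, not part of the model code); notice how the cost blows up as a confident prediction moves away from the true label of 1:

import numpy as np

def binary_cross_entropy(y_true, y_hat):
    # Loss for one prediction: -(y*log(y_hat) + (1 - y)*log(1 - y_hat))
    return -(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# The true label is 1 (the client churned)
for y_hat in [0.9, 0.6, 0.1, 0.01]:
    print('predicted {} -> loss {}'.format(y_hat, round(binary_cross_entropy(1, y_hat), 4)))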

If you’re interested in seeing how some loss functions don’t quite look the way you think they would, check out the site lossfunction. You won’t feel so bad when yours comes out a bit … wonky. I’ll visualize this model’s loss function below.

Optimizer: These algorithms will minimize — or maximize, depending on the problem — the above cost function, given a set of inputs. Consider what occurs when gradient descent is used when training a neural network: the minimum found along the gradient will determine the weight update for each neuron during back-propagation.

The choice of an optimizer will determine how updates occur on the weights, and ultimately how we classify our inputs to produce an output percentage for our model.
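
Keras lets you pass an optimizer either as a string or as a configured object, so swapping one for another during tuning is a one-line change. A small sketch, where the learning rates are only illustrative values:

from keras import optimizers

# Two optimizers to compare; the hyperparameter values are only examples
rmsprop = optimizers.RMSprop(lr=0.001)
adam = optimizers.Adam(lr=0.001)

# Either object, or simply the strings 'rmsprop' / 'adam', can be passed
# to model.compile(optimizer=...)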

Metrics: How do we know the model is correct, and how do we score it? Keras provides a few out-of-the-box metrics such as mean absolute error, accuracy, binary accuracy, categorical accuracy, etc. While similar to a loss function, the results are not used when training the model.

Testing the Model

Because I’m trying to classify whether clients are at risk of churning or not, I’ve used a sigmoid activation function on the output layer. I’ve also used a high dropout rate because I don’t want to overfit the model, along with randomly splitting the dataset into a train and test set.

I also wanted to implement a ROC as the output metric, as well as just looking at the accuracy. There is a great post on Stack Overflow here that provides the sample code for creating ROC logs at each epoch of testing.

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize, StandardScaler

# Input features are 'df', and outputs classified as 0 and 1 are 'y'

# Standardize data for testing
scaler = StandardScaler().fit(df)
std_df = scaler.transform(df)

# Normalize data for testing
norm_df = normalize(df, norm='l2')  # Why l1 and l2?

# Split data
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.4, random_state=1075)

# Define keras model for easier swapping of layers
# and tuning hyper parameters
def keras_model(loss, optimizer, metrics):
    '''
    Keras model for classification. This is just an example,
    as each model should be tuned for your specific use case.
    '''
    model = Sequential()

    # Add and subtract layers, activation functions, and dropout
    model.add(Dense(units=50, activation='relu',
                    input_dim=len(df.columns)))
    # Add more layers here ...
    model.add(Dense(1, activation='sigmoid'))

    # Compile the model
    model.compile(loss=loss,
                  optimizer=optimizer,
                  metrics=metrics)
    return model

By defining the Keras model as a function with loss, optimizer, and metrics as parameters, you’re able to quickly tune different parts of the model, making it deeper or wider on the fly. Save each model as a different function so you’re able to quickly compare outputs.

E.g. if you add more layers — making it a deeper model — replicate the function and name it keras_model_extra_layers(), and save the new model.

# Create multiple models
def keras_model_three_layers(loss, optimizer, metrics):
    ...

def keras_model_extra_wide(loss, optimizer, metrics):
    ...

def keras_model_added_foo(loss, optimizer, metrics):
    ...

Let’s move on to training the model, and determining how good it is.

from keras.callbacks import Callback
from sklearn.metrics import roc_auc_score

class roc_callback(Callback):
    '''
    Create ROC curve for Keras model during the training of each
    epoch
    '''
    def __init__(self, training_data, validation_data):
        self.x = training_data[0]
        self.y = training_data[1]
        self.x_val = validation_data[0]
        self.y_val = validation_data[1]

    def on_train_begin(self, logs={}):
        return

    def on_train_end(self, logs={}):
        return

    def on_epoch_begin(self, epoch, logs={}):
        return

    def on_epoch_end(self, epoch, logs={}):
        y_pred = self.model.predict(self.x)
        roc = roc_auc_score(self.y, y_pred)
        y_pred_val = self.model.predict(self.x_val)
        roc_val = roc_auc_score(self.y_val, y_pred_val)
        print '\rroc-auc: {} - roc-auc_val: {} \n'.format(str(round(roc, 4)), str(round(roc_val, 4)))
        return

    def on_batch_begin(self, batch, logs={}):
        return

    def on_batch_end(self, batch, logs={}):
        return

# Create our model to evaluate
model = keras_model('binary_crossentropy', 'rmsprop', ['accuracy'])

# Fit the model
model.fit(X_train, y_train,
          validation_data=(X_test, y_test),
          epochs=100, batch_size=1500, verbose=1,
          callbacks=[roc_callback(training_data=(X_train, y_train),
                                  validation_data=(X_test, y_test))])

# You should see output similar to this
Train on 2212 samples, validate on 1475 samples
Epoch 1/100
2212/2212 [==============================] - 1s 577us/step -
loss: 0.6875 - acc: 0.5502 - val_loss: 0.6631 - val_acc: 0.6637
roc-auc: 0.8536 - roc-auc_val: 0.8494

Before we move on to the ROC, let’s look at the accuracy and loss of the model. To plot these, re-run the model but this time save the output.

# Save the model's output
history = model.fit(X_train, y_train, validation_data ... )

The history object stores all of the accuracy and loss metrics at each epoch for us to graph. I’ve defined accuracy_loss_graph() which takes the history object as a parameter, and plots the graphs.

What can you learn from each of them?

%matplotlib inline
import matplotlib.pyplot as plt

def accuracy_loss_graph(history):
    '''
    Create an accuracy and loss graph for the train and test
    data sets
    '''
    plt.style.use('ggplot')  # Make it look nice
    fig, axs = plt.subplots(1, 2)
    fig.suptitle('Accuracy and Model Loss')
    fig.set_size_inches(18, 8)

    # Accuracy
    plt.subplot(1, 2, 1)
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')

    # Loss
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')

    plt.show()

accuracy_loss_graph(history)
What is this telling us? Accuracy (left) and loss (right)

For the left graph, we can see the difference between our training and validation accuracy. The closer our test line is to the training line, the less overfitting there is. The farther they diverge, the more overfitting is apparent in the model. If we started to see more overfitting, we would want to increase regularization — a stronger L2 weight penalty, more dropout, etc. — or put in more data for the model to crunch.
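
The regularizers import from the model code above is where that L2 penalty would come in. A rough sketch of a more heavily regularized hidden layer added to the Sequential model from earlier, where the 0.01 penalty and 0.5 dropout rate are just starting points to tune:

from keras.layers import Dense, Dropout
from keras import regularizers

# Example only: an L2 weight penalty plus heavier dropout on a hidden layer
model.add(Dense(units=50, activation='relu',
                kernel_regularizer=regularizers.l2(0.01)))
model.add(Dropout(0.5))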

Another case which would need to be corrected is when the validation accuracy perfectly tracks the training accuracy. This indicates the model needs more features to work with, so go back to feature engineering and increase the input parameters of your model.

What you want to see is convergence between the two lines.

What about the right loss graph? The gap between the training and validation loss is a sign of slight overfitting, and this is even with dropout added to the model. However, because the gap between the train and test lines is not massive, I will keep this model.

However, if this was a larger project or the gap between training and test was substantially different, I would try and tune the hyper-parameters to get a better fit.

Here is the final epoch output for the Keras model. At a validation accuracy of 85.29%, it looks like we did well with the model. Nice.

Epoch 100/100
2212/2212 [==============================] - 0s 7us/step - loss: 0.4148 - acc: 0.8431 - val_loss: 0.3797 - val_acc: 0.8529
roc-auc: 0.8949 - roc-auc_val: 0.8940

But wait, is accuracy even the right measure here?

Accuracy can be thought of as how ‘correct’ the model is.

However, accuracy can be misleading if it is the only metric considered during model selection. It might seem desirable to select a model with high accuracy, as 85% might seem like a great result! However, a model with lower accuracy might have greater predictive power.

Why is that?

Consider a problem with a large class imbalance — where 95% of the results are 1s and 5% are 0s — your model might be able to predict the 1s but not do well at the 0s.

This is called the accuracy paradox, where “… predictive models with a given (lower) level of accuracy may have greater predictive power than models with higher accuracy.”
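
A tiny made-up example shows the paradox: with a 95/5 split, a ‘model’ that blindly predicts 1 for every client scores 95% accuracy while having no predictive power at all.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 clients labelled 1, 5 labelled 0
y_true = np.array([1] * 95 + [0] * 5)
# A useless "model" that always predicts 1
y_pred = np.ones_like(y_true)

print(accuracy_score(y_true, y_pred))             # 0.95, looks great
print(recall_score(y_true, y_pred, pos_label=0))  # 0.0, never finds the 0s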

Because of this, metrics like precision and recall may be better at evaluating the model.

Let’s take a look at the previous scores, with validation accuracy at 85.29% and the validation ROC score at 89.40%.

Are those good?

Take the ROC curve below. If I had a skewed dataset rife with class imbalance, I would see quite different values for the ROC AUC and the upcoming Precision-Recall (PR) curve AUC.

We can use these results above as baselines when looking at our PR graph.

# Visualize the ROC curve
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

# Get metrics to graph
y_pred_keras = model.predict(X_test).ravel()
fpr_keras, tpr_keras, thresholds_keras = roc_curve(y_test, y_pred_keras)
auc_keras = auc(fpr_keras, tpr_keras)
def create_roc_graph():
    '''
    Create an ROC graph for different estimators
    '''
    plt.figure(figsize=(16, 10))

    # Set baseline
    plt.plot([0, 1], [0, 1], 'k--')

    # Map different classifiers
    plt.plot(fpr_keras, tpr_keras,
             label='Keras (area = {:.3f})'.format(auc_keras))

    # Label graph
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('Receiver Operator Characteristic (ROC) Curve')
    plt.legend(loc='best')
    plt.show()

create_roc_graph()
What is this telling us?

A ROC curve on its own is not a good visual illustration for highly imbalanced data, because the X axis of the ROC curve, which is the False Positive Rate ( False Positives / Total Real Negatives ), does not drop drastically when the Total Real Negatives is huge. Therefore if we had an imbalanced data set, we would not know just from this graph.

Precision ( True Positives / (True Positives + False Positives) ), on the other hand, is sensitive to an increase in False Positives and is not affected by a large number of real negatives, since they never enter its denominator. This difference would show up in the PR curve.
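
Here is a small numerical sketch of that difference, using made-up counts: with a huge pool of real negatives, ten times more false positives barely moves the false positive rate, while precision collapses.

# Made-up counts for a classifier on data with a huge number of real negatives
tp, tn = 900, 1000000

for fp in [100, 1000]:
    fpr = float(fp) / (fp + tn)        # x-axis of the ROC curve
    precision = float(tp) / (tp + fp)  # y-axis of the PR curve
    print('FP = {}: FPR = {}, precision = {}'.format(fp, round(fpr, 6), round(precision, 3)))

# FPR only moves from 0.0001 to 0.000999, but precision drops from 0.9 to 0.474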

Moving on to the PR Curve

Now that we have a baseline to work with, I’ll construct the PR curve.

The PR curve will help us answer the question “What is the probability that this is a real classification of an unknown object, when my classifier says it is?”. Since I’m modelling churn this seems like a good metric to understand.

First, I’ll look at a confusion matrix as well as the sklearn classification_report, which will give us a better idea of our model’s precision and recall scores for each class.

from sklearn.metrics import classification_report, confusion_matrix

target_names = ['0', '1']  # Whether a client has churned or not

print 'Confusion Matrix'
cm = confusion_matrix(y_test, rmsprop_model.predict_classes(X_test))
tn, fp, fn, tp = cm.ravel()
print cm
print 'True Negative: {}\nFalse Positive {}\nFalse Negative {}\nTrue Positive {}\n'.format(tn, fp, fn, tp)

print 'Classification Report'
print classification_report(y_test,
                            rmsprop_model.predict_classes(X_test),
                            target_names=target_names)

# You should see output similar to this
Confusion Matrix
[[374 122]
 [ 95 884]]
True Negative: 374
False Positive 122
False Negative 95
True Positive 884

Classification Report
             precision    recall  f1-score   support

          0       0.80      0.75      0.78       496 *
          1       0.88      0.90      0.89       979 **

avg / total       0.85      0.85      0.85      1475

# * 374 + 122 = 496
# ** 95 + 884 = 979

Let’s take a look at what all these values mean:

Precision: When our model predicts yes, how often is it correct? This is shown by the equation P(y=1 | y_pred=1)

Recall: When the actual value is yes, how often does the model predict yes? This is shown by the equation P(y_pred=1 | y=1)

F1-score: The harmonic mean of precision and recall, where 1 is perfect precision and recall and 0 is the worst.
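
As a quick sanity check, the class 1 row of the report can be recovered from the confusion matrix values above:

# Recompute the class 1 row of the report from the confusion matrix above
tp, fp, fn = 884.0, 122.0, 95.0

precision_1 = tp / (tp + fp)                                  # ~0.88
recall_1 = tp / (tp + fn)                                     # ~0.90
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # ~0.89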

We can see the averages of our scores are all 85%, however our model is better at classifying 1s — clients who have churned — than 0s. So while our two classes are imbalanced based on the number of observations in the dataset, they are not overly imbalanced. I can see from the classification report that we’re not falling into the accuracy paradox.

We can further confirm this with a PR curve.

# Try precision-recall curve
from sklearn.metrics import precision_recall_curve

# Get values for PR Curve
y_pred = rmsprop_model.predict_proba(X_test).ravel()
precision_keras, recall_keras, thresholds_keras = precision_recall_curve(y_test, y_pred)
area_keras = auc(recall_keras, precision_keras)
print "Area Under Curve: %0.2f" % area_keras

def create_pr_graph(area_keras):
    '''
    Plot the precision-recall curve for a classifier
    '''
    plt.figure(figsize=(15, 10))
    plt.plot(recall_keras, precision_keras, label='Precision-Recall for Keras Model')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.05])
    plt.title('Precision-Recall: AUC=%0.2f' % area_keras)
    plt.legend(loc='best')
    plt.show()

create_pr_graph(area_keras)
Confirming our hypothesis

With the PR curve providing a high AUC of 93%, I can be assured our model is not falling into the accuracy paradox and will predict well.

So Which One To Use? ROC? PR?

Use ROC when detecting either class is equally important, and you want to give equal weight to both classes’ prediction ability. However, it becomes an untrustworthy metric when the data is imbalanced or skewed.

A classifier which optimizes the ROC AUC may not also optimize the area under the PR curve. With a high imbalance of observations, consider focusing on maximizing the PR area.

I also wanted to look into how different models would perform vs. the Keras model.

I made two different Keras models, one with an ADAM optimizer and one with an RMSPROP optimizer, plus a vanilla logistic regression and a vanilla random forest classifier.

Actually, it was quite disappointing, as the random forest classifier ended up performing better than both versions of the Keras model I made. These things happen, so the lesson here is don’t get too attached to your model, as something else might come along and outperform it.

However, because the second part of the article is about using a Keras model in production, I will continue to use it for the rest of the project. The comparisons between models are for illustrative purposes and to better understand how to select a model.

And now for the PR curve to see how our original model stacks up.

Looks like the Keras models have again lost to the Random Forest. But there is another metric to look at.

Scoring The Model(s)

However, as this is a classification model, how good our model is comes down to the correctness of its outcomes. Therefore, I need to choose a metric that maximises some utility function. Let’s look at a Brier score.

With scoring, look at the number of times that a specific probability was provided and compare it with the actual proportion of times the event occurred. If the actual percentage was substantially different from the stated probability, we have a poor model.
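
Under the hood the Brier score is just the mean squared difference between the predicted probability and the actual outcome. A minimal sketch with made-up predictions:

import numpy as np

def brier_score(y_true, y_prob):
    # Mean squared difference between predicted probabilities and outcomes
    return np.mean((np.asarray(y_prob) - np.asarray(y_true)) ** 2)

y_true = [1, 0, 1, 1, 0]
print(brier_score(y_true, [0.9, 0.1, 0.8, 0.7, 0.2]))  # ~0.038, confident and mostly right
print(brier_score(y_true, [0.5, 0.5, 0.5, 0.5, 0.5]))  # 0.25, hedging on every client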

The best possible Brier score is 0, while the worst score is 1. Scores which hover around the middle — between .4 and .6 — are hard to interpret. What are the four models above scoring at?

from sklearn.metrics import brier_score_loss

print 'Keras RMSPROP brier score: {}'.format(round(brier_score_loss(y_test, y_pred_rms, pos_label=1), 4))
print 'Keras ADAM brier score: {}'.format(round(brier_score_loss(y_test, y_pred_adam, pos_label=1), 4))
print 'Logistic Regression brier score: {}'.format(round(brier_score_loss(y_test, y_pred_logr, pos_label=1), 4))
print 'Random Forest brier score: {}'.format(round(brier_score_loss(y_test, y_pred_rf, pos_label=1), 4))

# Results
Keras RMSPROP brier score: 0.1146
Keras ADAM brier score: 0.1154
Logistic Regression brier score: 0.1315
Random Forest brier score: 0.0996

Looks like our random forest is again scoring better than the Keras model(s).

Actually Predicting Something …

Now that I have a model I’m mostly comfortable with, I need to actually predict something. This is the key part of the project after all. In this case, I want to see the percentage chance of churning for the client who is in row 2222.

At the beginning of the article I saved the client ids — ids = df['c_id'] — so we can find which client is associated with a certain row.

# Create a dictionary of rows and the client id on each row
# Keys are row indices, values are client ids
client_dict = dict(enumerate(ids))

# Which client is on row 2222
print client_dict.get(2222)
>> 7153

Use a nested list within .iloc to return the values of a specific row, and see what the result would be if those features were used to determine churn.

# Return the values of a specific row to feed into our model
model.predict(df.iloc[[2222]].values)
>> array([[0.7888]], dtype=float32)

So the client 7153 has a 78.88% chance of churn.

Looks like everything is working right now, the model is trained and evaluated, so I can move on to making the web app in Flask so the model can be used by others.

Save the model as churn_model.h5 and … finally … this part of the project is done.

# Save the model in your root directory
model.save('churn_model.h5')

If you want to look at additional pre-trained Keras models for other projects, check out this repo here. Sometimes the data is not available for what you need to train, so might as well use what is public to save time and frustration.

Final Thoughts

That was a lot to go over, but I hope you have a better understanding of how to approach a data science project from the start: what to think about, how to get your data, and the initial steps in modelling. I wanted to make sure I went over some of the minor details and evaluation steps which could get missed during the setup and modelling process.

As always, I hope you’ve learned something new. The final repo for the project will be posted in the second part, and part 2 will be posted soon.

Additional Reading

https://www.quora.com/How-do-I-develop-model-for-customer-churn
