Using ML to Predict the Lifecycles of New Customers

Dom Vanden Dries
Future Vision
5 min read · May 7, 2019


Intro & Motivation

When analyzing processes like churn and customer lifecycles, it’s so tempting to just export your existing customer base, whip up a few features like average_order_value, order_frequency and n_total_orders, throw it all in a model and call it a day. Your model probably performs pretty well, too! But using ‘present-tense’ data to predict a ‘past-tense’ outcome doesn’t make much sense. Plus, it stinks of survivorship bias and data leakage!

What if you were able to estimate a brand-new customer’s lifetime just by looking at their first order? Is this even possible with so little data? This is known as the “cold start problem” — to solve it, we’ll need to figure out how to best leverage our severely limited data. This tutorial will walk you through estimating the lifetimes of new customers using a Gradient Boosting Regressor.

Shameless plug: click here to check out the interactive webapp version of my model — enjoy the ’90s HTML and CSS.

A few notes before we begin

Big thanks to RockTape for letting me use their data for this project. RockTape is a sports-medicine business that works mainly with Chiropractors, Physical Therapists and other medical professionals.

RockTape uses BigCommerce to host their web store. All of the data I’m working with was simply exported as CSVs.

It’s important to note that your customer and subscriber exports will only reflect the most up-to-date information for your customers. We don’t have access to historical data, unfortunately, so we make the assumption that there have been no historical changes that would affect our results (such as a customer_group changing, or a customer unsubscribing from our newsletter).

Let’s Get To It!

Load in all your data as Pandas dataframes. Parse your dates, check your data types, all that good stuff. Set the index of your customer dataframe to be your unique customer IDs.

I’m working with data from three “sources” within BigCommerce — a customer export, an order export, and a marketing export.
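Here’s a minimal loading sketch. The filenames are placeholders for whatever your exports are called; the Customer ID, Date Joined and Order Date columns are the ones the loop below relies on:

import pandas as pd

# Filenames are placeholders for your actual BigCommerce exports
custy_df = pd.read_csv('customer_export.csv', parse_dates=['Date Joined'])
order_df = pd.read_csv('order_export.csv', parse_dates=['Order Date'])
marketing_df = pd.read_csv('marketing_export.csv')

# Index the customer dataframe by its unique customer IDs
custy_df = custy_df.set_index('Customer ID')

# Sanity-check your dtypes before going further
print(custy_df.dtypes)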

Take a look at my GitHub Repo for the complete source code of my function! I’ll do a high-level overview in case you’re trying to adapt this process to a different eCom platform:

1. Initialize an empty Pandas dataframe with the features you’d like to consider, and reindex it to match your customer dataframe’s index. Think carefully about which features to use. I’m using: avg_price_item, order_value, ship_total, subscriber_newsletter, used_coupon, affiliation and customer_group, as well as time_as_customer (our target). Take care to ensure your features respect the cold-start setup: don’t include features like n_total_orders or order_frequency (leakage!). A sketch of steps 1 and 4 follows this list.

2. Set up a for loop to loop through all your customers. Skip customers with no purchase history.

3. Use boolean indexing to pull up all orders for a customer, and grab your time_as_customer value. Now, mask this result so we’re only considering the initial purchase:

for customer in custy_df.index.values:
    mask = order_df[order_df['Customer ID'] == customer]
    if len(mask) == 0:  # skip customers with no purchases
        continue
    # days from join date to most recent order = our target, time_as_customer
    time_as_customer = (mask['Order Date'].max() - custy_df['Date Joined'][customer]).days
    # keep only the initial purchase (assumes order_df is sorted oldest-first)
    mask_zero = mask.head(1)
    # ... continue extracting features using mask_zero ...

4. Using your mask, compile the rest of your features. Write them into your Pandas dataframe.
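Here’s a rough sketch of steps 1 and 4. The feature names come from the list above, but the order-export column names ('Subtotal', 'Shipping Cost') are guesses on my part, so check them against your own data:

# Step 1: empty feature dataframe, reindexed to the customer index
feature_cols = ['avg_price_item', 'order_value', 'ship_total',
                'subscriber_newsletter', 'used_coupon', 'affiliation',
                'customer_group', 'time_as_customer']
feature_df = pd.DataFrame(columns=feature_cols).reindex(custy_df.index)

# Step 4, inside the loop: write this customer's row from their first order
feature_df.loc[customer, 'order_value'] = mask_zero['Subtotal'].values[0]      # column name is a guess
feature_df.loc[customer, 'ship_total'] = mask_zero['Shipping Cost'].values[0]  # column name is a guess
feature_df.loc[customer, 'time_as_customer'] = time_as_customer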

Sweet! Raw feature dataframe assembled: one row per customer, one column per feature.

A few small transformations need to be done before we can throw it in the model. They’re in the Transform class.

1. Log-transform your price features (for me, this needs to happen on avg_price_item, order_value, and ship_total)

2. Binarize any categorical features you may have (I binarized my affiliation and customer_group features)
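Here’s a minimal sketch of both transformations. np.log1p is my stand-in for whatever the Transform class actually does (it handles zero values gracefully), and pd.get_dummies does the binarizing:

import numpy as np

# 1. Log-transform the price features (log1p avoids -inf on zeros)
for col in ['avg_price_item', 'order_value', 'ship_total']:
    feature_df[col] = np.log1p(feature_df[col].astype(float))

# 2. One-hot encode (binarize) the categorical features
feature_df = pd.get_dummies(feature_df, columns=['affiliation', 'customer_group'])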

Nice, this is the final version of our dataframe!

Because we are good, data-driven analysts, we’re going to do a train-test split. I implemented my own Splitter class (see GitHub for reference); a plain sklearn equivalent is sketched below. Our target is time_as_customer.
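If you’d rather not write your own Splitter, sklearn’s train_test_split does the job (the 80/20 split and random_state here are my own choices, not necessarily what my class does):

from sklearn.model_selection import train_test_split

# Separate the target from the features
y = feature_df['time_as_customer'].astype(float)
X = feature_df.drop(columns=['time_as_customer'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)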

Time 2 Model

I’m using the SKLearn implementation of the GradientBoostingRegressor with the following hyperparameters:

n_estimators = 150
learning_rate = 0.05
max_depth = 10
max_features = 'sqrt'

For more information about hyperparameters, please see the SKLearn GBR documentation.
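Wiring those hyperparameters up is straightforward:

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(n_estimators=150,
                                  learning_rate=0.05,
                                  max_depth=10,
                                  max_features='sqrt')
model.fit(X_train, y_train)
predictions = model.predict(X_test)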

The GBR works well here because it is one of the most robust machine learning techniques we have access to. GBR starts out with a “dumb” model, then fits on the residuals (errors) to steadily improve. We’re working with pretty limited data and asking for a specific number back, so we likely have some gigantic residuals. Letting the GBR sort them out seems to work pretty well! (Note: I also tried a Random Forest and a Linear Regression; the GBR had the greatest reduction in mean squared error.)
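You can actually watch that residual-fitting behavior: staged_predict yields the ensemble’s predictions after each boosting stage, so the test error should fall steadily as trees are added. A quick sketch:

from sklearn.metrics import mean_squared_error

# Test-set MSE after each of the 150 boosting stages
staged_mse = [mean_squared_error(y_test, y_pred)
              for y_pred in model.staged_predict(X_test)]
print(f'stage 1 MSE: {staged_mse[0]:.1f}, stage 150 MSE: {staged_mse[-1]:.1f}')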

Results and Conclusions

We score our model against the baseline. In this case, my baseline is equivalent to guessing the average lifetime for every single customer. For me, the model performs about 15–17% better than the baseline, which is by no means perfect, but a pretty marked improvement when it comes to identifying customers who have “low” or “high” lifetimes!
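If you want to reproduce that comparison, here’s one way to compute it, using a predict-the-mean baseline and mean squared error (the exact scoring inside my model.score function may differ):

import numpy as np
from sklearn.metrics import mean_squared_error

# Baseline: guess the average lifetime for every single customer
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline_pred)
model_mse = mean_squared_error(y_test, model.predict(X_test))

print(f'improvement over baseline: {1 - model_mse / baseline_mse:.1%}')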

If you’re using my implementation, the model.score function will output a partial dependence plot of your top features. PD plots are useful because you can clearly see how each feature relates to the target (is it linear? Is it positively correlated?).
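If you’re not using my implementation, recent versions of sklearn can draw the same kind of plot (in the 0.20-era sklearn this post was written against, the equivalent lived in sklearn.ensemble.partial_dependence):

from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Partial dependence of lifetime on a couple of first-order features
PartialDependenceDisplay.from_estimator(
    model, X_train, features=['order_value', 'used_coupon'])
plt.show()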

My data shows that Chiropractors tend to have higher lifetimes than the other customer groups. Using a coupon on your first order extends customer lifetime. Buying pricey items seems to decrease it. Interestingly, free shipping doesn’t correlate with a longer lifetime either!

I hosted a simplified version of my model as an interactive webapp! Please take a look; I pay AWS good money to host this bad boy :) Follow my friend Elliott’s tutorial if you’d like to learn how to use Flask to make your own!

