Predictive Repurchase Model Approach with Azure ML Studio

Fatih Buyukbas
Nov 28, 2020 · 5 min read
Photo: Unsplash.com

In this post, I will explain an approach for scoring a customer’s intent to repurchase consumer goods, an automobile, etc., using Azure ML Studio.

First, we should check our data quality and decide on the time intervals we will use in our predictive model. We can use the last 5 years’ sales and customer data for the data-gathering task: 4 years of data for all customers, covering their transactions, invoice amounts, campaign responses, and their RFM (Recency, Frequency, Monetary) data points. We can then flag the customers who repurchased in the last year as our target customers.
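A minimal sketch of this target-flag construction in pandas, assuming a hypothetical transaction table (the column names and years are illustrative, not from the article):

```python
import pandas as pd

# Hypothetical transaction data; column names and years are illustrative.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "invoice_amount": [500.0, 300.0, 1200.0, 700.0, 250.0, 400.0],
    "year": [2016, 2019, 2017, 2016, 2018, 2019],
})

# Features come from the first 4 years (here, through 2018);
# the target flag comes from the last year (here, 2019).
features = (tx[tx["year"] <= 2018]
            .groupby("customer_id")
            .agg(total_amount=("invoice_amount", "sum"),
                 tx_count=("invoice_amount", "count")))
repurchasers = set(tx.loc[tx["year"] == 2019, "customer_id"])
features["repurchased"] = [int(c in repurchasers) for c in features.index]
print(features)
```

The key design point is the temporal split: features are computed strictly before the target window, so the model never sees the outcome period it is asked to predict.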

Modeling Process — https://www.dataiku.com/stories/models/

There are some important data points you should calculate and design. For example, invoice amount per invoice type, brand ownership, count of goods per type, and customer touchpoint data can all be used in your modeling. Call center NPS scores and lost-sales opportunity counts are also important inputs, and a customer’s digital footprint is a very valuable data point for increasing model accuracy. I also added product counts per product type for each customer. I prefer to gather the necessary data in numeric format, so some data points should be converted to numbers; for example, I set ‘1’ for Business Account and ‘2’ for Person Account for use in the model. For this kind of work, it is important to be able to collect data from different systems and create a holistic view of the customer.
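The numeric encoding described above can be sketched in pandas as follows; the table and column names are hypothetical, but the mapping mirrors the article’s choice of 1 = Business Account, 2 = Person Account:

```python
import pandas as pd

# Hypothetical customer table; only the account-type mapping is from the article.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "account_type": ["Business Account", "Person Account", "Person Account"],
})

# Encode the categorical field numerically, as the article does.
customers["account_type_code"] = customers["account_type"].map(
    {"Business Account": 1, "Person Account": 2})
print(customers["account_type_code"].tolist())  # [1, 2, 2]
```

For a two-valued field this integer encoding is harmless; with more categories, one-hot encoding is often safer, since integer codes impose an ordering the algorithm may interpret as meaningful.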

I think another important and tricky part is knowing the business and market dynamics of your company’s sector. You should have a sense of which data can make a difference in your model’s success. You should also get enough information from the sales, marketing, and even finance departments to understand what drives a loyal customer and what sustains long-lasting customer relations. This will make it easier to interpret your insights and your model’s results.

After collecting the data, I ran some tests to make sure it was not incorrect. I then performed the EDA (Exploratory Data Analysis) part with Python. I used the pandas-profiling package for EDA and found it very useful. After installing it, you can run ProfileReport(df) and you will get all the details about your data: general dataset info, variable types, warnings for each data point, and detailed information for each variable such as min, max, mean, distinct count, missing-data percentage, and zeros percentage.
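If pandas-profiling is not available, the same per-variable summary it reports can be pulled with plain pandas; a minimal sketch on an illustrative stand-in data frame:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the gathered customer data (not the article's data).
df = pd.DataFrame({
    "invoice_amount": [500.0, 0.0, 1200.0, np.nan, 250.0],
    "tx_count": [1, 0, 3, 2, 1],
})

# The per-variable stats the profile report shows, via plain pandas:
summary = pd.DataFrame({
    "min": df.min(), "max": df.max(), "mean": df.mean(),
    "distinct": df.nunique(),
    "missing_pct": df.isna().mean() * 100,
    "zeros_pct": (df == 0).mean() * 100,
})
print(summary)
```

This covers min, max, mean, distinct count, missing percentage, and zeros percentage; the profiling package adds warnings, type inference, and correlation matrices on top.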

Python EDA

The report also gives correlations between data points, using both Spearman and Pearson metrics. We can see the data points most correlated with our target, the flag field that shows which customers purchased again last year. I flagged the 20 most correlated data points to try in the modeling phase.
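This correlation-based shortlisting can be sketched directly in pandas; the data here is synthetic and the feature names are hypothetical (the article keeps the top 20, here the top 3 of 5):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 5 candidate features plus the repurchase flag.
df = pd.DataFrame(rng.normal(size=(200, 5)),
                  columns=[f"feat_{i}" for i in range(5)])
df["repurchased"] = (df["feat_0"] + 0.5 * df["feat_1"]
                     + rng.normal(size=200) > 0).astype(int)

# Rank features by absolute Spearman correlation with the target,
# then keep the top k for the modeling phase.
corr = df.corr(method="spearman")["repurchased"].drop("repurchased")
top = corr.abs().sort_values(ascending=False).head(3)
print(top.index.tolist())
```

Note that correlation only captures monotonic individual relationships; a feature with low marginal correlation can still matter in combination, so this shortlist is a starting point rather than a final feature set.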

I uploaded the data to Azure ML Studio to perform the modeling phase. It is an easy-to-use, drag-and-drop tool for creating predictive models. At this point we have gathered the data, performed EDA, and come to know our data better. Since we created our target field as a flag (1 or 0), our modeling problem is a two-class problem, and I will use all the two-class algorithms in Azure ML Studio to get the best results.

Azure ML Studio

After uploading the data, I realized that my positive examples were too few to create a good model; repurchasing customers are a small minority of all customers. That is why I used an oversampling method, which in ML Studio is provided by the SMOTE module. In this module, you can set the percentage by which to increase the minority class.
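To make the mechanics concrete, here is a minimal NumPy sketch of what SMOTE does under the hood: for each new sample it picks a minority example, one of its nearest minority neighbors, and interpolates between them. This is an illustration of the idea, not the ML Studio module or the imbalanced-learn implementation:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class (self-distance excluded).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]              # k nearest per sample
    base = rng.integers(0, n, size=n_new)                 # random minority sample
    nb = neighbors[base, rng.integers(0, k, size=n_new)]  # random neighbor of it
    gap = rng.random((n_new, 1))                          # interpolation factor
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy minority class: 6 repurchasers in a 2-D feature space.
X_min = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.2],
                  [1.1, 2.1], [1.3, 2.0], [0.8, 1.8]])
X_syn = smote_oversample(X_min, n_new=12, seed=0)
print(X_syn.shape)  # (12, 2)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region instead of simply duplicating existing rows, which is what makes SMOTE preferable to naive oversampling.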

SMOTE

After that, we are ready to set up the models and compare the results. I tried all of the two-class algorithms (SVM, Neural Network, Decision Forest, Boosted Decision Tree, etc.) to find the most accurate results.
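In Azure ML Studio this comparison is done by wiring up modules; a rough scikit-learn analogue of the same bake-off, on synthetic imbalanced data, would look like this (the data and model settings are illustrative, not the article’s):

```python
# A rough scikit-learn analogue of trying several two-class algorithms
# and comparing them by F1; data here is synthetic and imbalanced.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "SVM": SVC(),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Decision Forest": RandomForestClassifier(random_state=0),
    "Boosted Decision Tree": GradientBoostingClassifier(random_state=0),
}
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
for name, f1 in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: F1 = {f1:.3f}")
```

As in the article, the models are ranked by F1 rather than accuracy, since accuracy is inflated on imbalanced data.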

SVM Results

Above you can see an example result score for the SVM model. For our problem, it is important to check the F1 score. Accuracy generally stays at the 0.95–0.98 level, which is expected here because our target field contains mostly zeros; a model that predicts no repurchase for everyone would already look accurate. Comparing F1 scores across the result pages, the best result came from the two-class neural network algorithm, with an F1 score of 0.73.

In a statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. (Wikipedia)
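The harmonic-mean definition above is a one-liner; the precision and recall values below are illustrative numbers chosen to land near the article’s reported score:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Illustrative values: precision 0.8, recall 0.67 give F1 of about 0.73.
print(round(f1(0.8, 0.67), 2))  # 0.73
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1 by trading recall away entirely for precision, or vice versa.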

As a result, we can export the customer scores as a CSV file and plan special actions and campaigns for these customers. Through this work, you can increase lead and invoice conversion for your campaigns. If you are building a real-time analytics structure, you can also expose your model to your business systems as a web service.

Written by Fatih Buyukbas
Data Analytics | Customer Analytics | Data Science | Martech | Digital Transformation

Published in Analytics Vidhya, a community of Analytics and Data Science professionals building the next-gen data science ecosystem: https://www.analyticsvidhya.com