Lead Scoring for Customer Segmentation: What It Is and How We Did It

Sales teams, and businesses in general, are under constant pressure to bring in as many leads as possible, as quickly as possible, so they have to manage their time and contacts carefully to hit their numbers. However, whether due to lack of time or the sheer volume of leads, this doesn't always happen: quite often, time is wasted on leads that have little to no potential of becoming customers. That's where Lead Scoring, and Business Value in general, comes into play. Lead Scoring helps sales reps discern and focus on the leads that are most likely to show an interest in the brand and eventually become clients.

What is Lead Scoring?

Lead scoring is a methodology used in sales and marketing to rank leads, as well as potential and existing customers, in order to determine whether they're ready to buy a product, interact with a brand, or engage with the business in general. The score is based on criteria such as the interest a customer has shown in the business, their behavioral patterns, and their engagement with the business.

There are different ways of ranking customers, like using the Stars System (with 0 being the lowest and 5 the highest), or assigning them to a class/group, like “High”, “Mid”, or “Low”, always according to their score.


Where and why is it used?

Lead scoring is more than a marketing methodology, and different teams can benefit from it if it is used correctly. First and foremost, it enables the Sales Team to focus attention and resources on the leads that are truly in a ready-to-buy situation but need a little push. Secondly, it allows the Marketing Team to identify the campaigns and channels that work best and bring in potential customers who are more likely to buy or engage. Last but not least, in eCommerce this means more purchases, which results in more revenue and a more consistent customer base.

Why We Did It

The reason we took up this project was to build a tool that helps our clients differentiate their audience/customer base according to their behavior and level of engagement. That way, clients will have a better understanding of their customers, will be able to segment their audience further, and can personalize their experience even more. For instance, using Lead Scoring, a client can identify low engagers and motivate them, or reward high engagers as a token of appreciation for their loyalty.

Current State of Affairs in the Lead Scoring World

Lead Scoring can be broken down into categories, or types, according to what we choose to score and how we choose to score it. There are also many different approaches, methods, and algorithms that can be used, some widely known, others less so. Whichever method or algorithm one chooses, it can be used as is or adapted to the business's needs.

Types of Lead Scoring

What we Score

In terms of what we score, there is Implicit Lead Scoring. This generally involves scoring leads based on user behavior and metrics, e.g., their actions, the pages they visit, and the interests they show. The other type is Explicit Lead Scoring, which involves trying to match leads to customers who have already bought. To do that, we compare demographic data and other information we have available about them.

How we Score

When it comes to how we score, there are the methods of Traditional and Predictive Lead Scoring. In Traditional Lead Scoring, each lead is given a score using predetermined criteria composed of the aforementioned implicit and explicit data. In Predictive Lead Scoring, an algorithm is used to predict behavior based on historical data. Their main difference is that in Traditional Lead Scoring we have to assign a value to each criterion manually, while in Predictive Lead Scoring the score is updated based on the history and behavior of the prospect.

RFM / RFE in Theory

RFM stands for "Recency, Frequency, Monetary" and is a marketing tool used to identify a company's best customers. Each customer is scored numerically in these three categories; "good" customers are expected to score high in all three, giving them a high overall score/rank.

Recency

Recency: This metric shows how recently a customer was active or made a purchase. The less time that has elapsed since the customer's last activity, the higher that customer's score.

Frequency

Frequency: This category shows how often a customer purchases or engages in a particular period of time. Customers who score high in this category tend to be more loyal, as they engage more frequently.

Monetary

Monetary: Also known as “Monetary Value”, this metric measures how much a customer has purchased or spent with the company over a specific period of time. Usually, the more someone spends, the better score they get in this category.


Methodology

The RFM factors are used to reasonably predict whether a customer will purchase again or not. Here’s how we achieve that: at first, we rank each customer by each of these factors individually — let’s say in the range of 1 to 5. So, we end up with each customer having a triad of values — with 5 being the best score and 1 being the worst for each factor — that show how said customer fares in each one. After that, we calculate the average of these values to find the final RFM score of the customer. One can also use a weighted average, depending on the nature of the business and the factor considered to be more important. Having the final score at our disposal, we can decide how to segment the customer base to discern the high-quality ones from those less likely to interact with the brand.
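
To make the calculation concrete, here is a minimal pandas sketch of the idea: each factor is binned into a 1-to-5 rank, and the final score is a (possibly weighted) average of the three ranks. The column names, sample values, and equal weights are purely illustrative.

```python
import pandas as pd

# Illustrative customer-level aggregates; column names and values are hypothetical.
rfm = pd.DataFrame({
    "customer_id":  ["c1", "c2", "c3", "c4", "c5"],
    "recency_days": [4, 40, 12, 90, 2],      # days since last purchase/action
    "frequency":    [36, 3, 12, 1, 50],      # number of purchases/actions in the period
    "monetary":     [280, 20, 90, 10, 400],  # total spend in the period
})

# Rank each factor into 5 bins (1 = worst, 5 = best).
# Recency is inverted: the fewer days elapsed, the better the score.
rfm["R"] = pd.qcut(rfm["recency_days"], 5, labels=[5, 4, 3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]).astype(int)

# Final score: a plain average, or a weighted one if a factor matters more to the business.
weights = {"R": 1.0, "F": 1.0, "M": 1.0}
rfm["rfm_score"] = sum(rfm[k] * w for k, w in weights.items()) / sum(weights.values())

print(rfm[["customer_id", "R", "F", "M", "rfm_score"]])
```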

RFE

For Non-eCommerce clients, where there is no way to measure Monetary Value, there is RFE. RFE is an alternative to RFM that replaces the "Monetary" component with an "Engagement" component, meaning the customer's level of activity, e.g. number of clicks, page views, etc. Otherwise, the calculation and scoring process stays the same as in "traditional" RFM.

Our Approach to RFM

We decided to take our own approach to the RFM method, but we had to work around some issues along the way. The main problem was that RFM relies on standard variables to classify a population, such as Monetary Value. However, we didn't have purchase data on this occasion, so we had to substitute the main variables RFM uses with the data we had at our disposal and classify the customers using these "custom" factors.

Recency

For the first variable, Recency, we tracked the date of each customer's last action. Then, we calculated the time between that date and the last day of the training period. That way, each customer is paired with a number indicating how many days have passed since their last action, e.g. (Customer1, 4). The smaller the number, the more recent the action, and thus the better the Recency score should be.

Frequency

For Frequency, we counted the number of times each customer took an action during the training period. Now we have another pair of customer and number, (Customer1, 36), which indicates the number of actions this customer took during the training period. This is pretty straightforward: the bigger the number, the more actions the customer took, so the higher the Frequency score should be.

Monetary (Score)

As we mentioned before, we don't have purchase data with which to score customers on Monetary Value; however, we have an equivalent. What we consider a conversion here is a certain action rather than a purchase, so we don't need to calculate the revenue a customer brings to the business. Instead, we can use the score we assign the customers, as it is an indicator of the actions said customer took. So, the third pair of customer and number, (Customer1, 28), indicates the score the customer achieved according to the actions they took during the training period.

Classification per Variable

If we group these pairs by customer, we get a triad of numbers for each customer, which indicates the customer's score on all three variables (Customer1, 4, 36, 28). Because these numbers have different ranges, we have to transform them in some way. For each variable, we take the distribution of the scores and classify them by the quartile they belong to. In the case of Recency, the scores that fall in Q1 are the best ones because, as we said, the more recent the action, the better. For the other two variables, the best scores fall in Q4. We map the best quartile of each variable to zero (0) and the worst to three (3). This way, all variables take values between 0 and 3.

Final Results

After this mapping, our "example customer" looks something like (Customer1, 0, 1, 1), which we turn into a pair of customer and number, like (Customer1, 011), with three zeros (000) indicating the best-scoring customers and a triad of threes (333) indicating the ones who engaged the least. Now, taking the distribution of these codes, we can classify our customers into four classes. We map the quartiles as we did earlier, with Q1 containing the best-scoring customers and Q4 the customers with the lowest activity during the training period. In the end, we map the classes to the range of 0 to 3, with 0 being the worst and 3 the best. This method helps us differentiate between customers who would otherwise have similar scores and might overlap or be mistakenly placed in the wrong class.
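
Putting the last few steps together, here is a small pandas sketch of how such a per-variable quartile mapping and the final 0-to-3 class could be computed. The sample data and column names are invented for illustration; only the logic follows the description above.

```python
import pandas as pd

# Hypothetical per-customer aggregates from the training period.
df = pd.DataFrame({
    "customer_id":  ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"],
    "recency_days": [4, 30, 10, 55, 2, 21, 40, 7],
    "actions":      [36, 5, 12, 1, 50, 8, 3, 20],
    "score":        [28, 6, 15, 2, 60, 9, 4, 25],
})

# Map each variable to a 0-3 label by quartile, 0 being the best quartile.
# Recency: lower values (more recent) are better, so Q1 maps to 0.
df["r_q"] = pd.qcut(df["recency_days"].rank(method="first"), 4, labels=[0, 1, 2, 3]).astype(int)
# Frequency and score: higher values are better, so Q4 maps to 0.
df["f_q"] = pd.qcut(df["actions"].rank(method="first"), 4, labels=[3, 2, 1, 0]).astype(int)
df["m_q"] = pd.qcut(df["score"].rank(method="first"), 4, labels=[3, 2, 1, 0]).astype(int)

# Concatenate the three labels into a single code, e.g. (0, 1, 1) -> 11 and (3, 3, 3) -> 333.
df["rfm_code"] = df["r_q"] * 100 + df["f_q"] * 10 + df["m_q"]

# Classify the codes themselves by quartile to get the final class, remapped so that
# 3 marks the best-scoring customers and 0 the least engaged ones.
df["final_class"] = pd.qcut(df["rfm_code"].rank(method="first"), 4, labels=[3, 2, 1, 0]).astype(int)

print(df[["customer_id", "r_q", "f_q", "m_q", "rfm_code", "final_class"]])
```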

Our Custom Model

We split our client base into two categories: eCommerce and Non-eCommerce. We made this decision based on the fact that there is a major difference in how each type of client measures customer engagement. The eCommerce ones acknowledge customer engagement mainly through their purchases. The rest measure engagement through traffic and certain actions, e.g., site visits, form completions, file downloads, etc.

This tells us that each category has a different endgame, a different state that is considered a conversion. Clients in the eCommerce category want to find the customers who are more likely to make a purchase, while Non-eCommerce clients want to identify customers who are more likely to interact with their product. This interaction could be reading an article, opening and clicking a newsletter, or visiting a specific page on their site.

eCommerce Clients

Because eCommerce clients want their customers to purchase, we need to predict the probability of a specific customer doing so. This can easily be achieved through the use of our company’s Recommender Systems. A Recommender System calculates the probability of a customer purchasing certain products based on the customer’s purchase history and preferences. If the probability of purchasing a product is over a certain threshold, the system recommends the product to the customer for purchase. If we take the weighted average of a customer’s probability for each product, we get the probability of this customer purchasing in general. This metric can be used as a ranking tool in and of itself, thus helping us classify customers according to who is more likely to purchase a product.
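
As a rough illustration of that last step, the overall metric could be a weighted average of the per-product probabilities the recommender produces; the numbers and weights below are made up and are not the output of our actual Recommender Systems.

```python
# Hypothetical per-product purchase probabilities for one customer,
# with illustrative weights (e.g. product relevance or recommendation rank).
probabilities = [0.82, 0.40, 0.15, 0.05]
weights = [4, 3, 2, 1]

# Weighted average as a single "likelihood to purchase" ranking metric.
purchase_likelihood = sum(p * w for p, w in zip(probabilities, weights)) / sum(weights)
print(round(purchase_likelihood, 3))
```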

Since there is already a way to segment the customers of eCommerce clients, the approach described below focuses on predicting high-quality customers for clients in the Non-eCommerce category.

Non-eCommerce Clients

The Data

At first, we decided to build a relatively simple and basic model that evaluates customers according to their campaign opens and clicks. We got all of the customer actions of each campaign and put them in a file, including information on who performed each action, when it was performed, and the type of action (open or click).

Training Period and Scoring

For our training period we used a time interval of 2 months, training on the campaign data and statistics of randomly selected clients and assigning 1 point for each open and 5 points for each click the clients' customers made. Clicking a campaign is less likely than opening one, which happens more frequently, so we assumed that a ratio of 1 click for every 5 opens, treating a click as five times rarer and therefore five times stronger than an open, was a good starting point.

Aggregating and Classification

After we formatted the data a little to add the respective score next to each action, we split the data into two sets: training and testing. In the training set, we grouped the data by customer and got each customer's total score across all the actions they made. Then, we normalized these totals to get a final score in the range between 0 and 1 and classified the customers into classes. There were three Engagement Classes, depending on whether the customer's final score was over .66, between .33 and .66, or under .33.
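
A minimal sketch of this scoring and classification pipeline might look as follows; the action log, the min-max normalization, and the sample values are assumptions made for illustration, while the point values and class thresholds come from the description above.

```python
import pandas as pd

# Hypothetical campaign-action log: one row per action during the training period.
actions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3", "c4"],
    "action":      ["open", "click", "open", "open", "open", "click", "open"],
})

# 1 point for an open, 5 points for a click.
points = {"open": 1, "click": 5}
actions["points"] = actions["action"].map(points)

# Total score per customer, then normalized to the [0, 1] range (min-max here).
totals = actions.groupby("customer_id")["points"].sum()
final_score = (totals - totals.min()) / (totals.max() - totals.min())

# Three Engagement Classes split at 0.33 and 0.66.
engagement_class = pd.cut(final_score, bins=[-0.01, 0.33, 0.66, 1.0],
                          labels=["Low", "Mid", "High"])
print(pd.DataFrame({"total": totals, "final_score": final_score, "class": engagement_class}))
```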

Testing

Having these first calculations at our disposal, we wanted to test them on the testing set, which contains the same customers as the training period. The difference is that the customers' actions in the testing period were taken from a time interval of one month, following the end of our training period. Using the same scoring and classification method as in training, we checked whether the customers who had been assigned to a specific class would end up in that same class in the testing set. We paid special attention to the high-scoring class, as it is the one that indicates the high-quality customers.

Observations

Looking at these first results, we observed some problems. First of all, we didn't have enough customers in the High Engagement classes, so we couldn't conclude whether the model was performing well or not. Also, the Low Engagement classes included a considerable percentage of the customer base we trained the model on, which raised the suspicion of overfitting. We immediately understood that we had to identify these issues and find solutions or, at least, countermeasures.

Problems Encountered

Of course, as in every initial approach, not everything goes as expected or planned. From the very beginning, we encountered some problems that we had to solve or, at least, find an alternative way to deal with.

Data Skewness

One of the problems we encountered was ending up with classes whose populations were drastically different because of the skewed distribution that characterizes the data. Skewness means that most data points gather at one end of the distribution, i.e., they have very low or very high values. Ideally, we want our data to be as close to the Normal Distribution as possible. However, this is rarely the case, and ours was no exception. There are ways to overcome skewed distributions and the issues they cause, most of which boil down to transforming the data, but that wasn't of much use with the kind of data we had to work with.


Class Underpopulation

Campaigns are characterized by skewness because not every customer opens a campaign that was sent to them, and even fewer click on it. This results in most customers scoring low in our algorithm and gathering at the lower end of the score distribution. In our case, the high engagement classes ended up underpopulated because the majority of data points sat at the lower end of the distribution, producing odd results in those classes and in general.


Class Overlapping

The same skewness of the distribution also causes some of the low engagement classes to overlap, due to the high number of instances and their minimal scores. Because the low-scoring customers are so close together in terms of score, it is unclear which low engagement class they should end up in. As a result, we ended up with sets of classes that were short one or two classes, making it impossible to get results useful for finding patterns or correlations or for making correct decisions.

Outliers and other Scoring Issues

The previous issue also showed us that our scoring method had flaws and needed to change. In the case of low-engaging customers, their large numbers and very low scores caused the problem mentioned above, where they weren't always placed in the correct class. Moreover, we didn't set limits on scoring, which meant that some customers achieved very high scores because of their high activity, without that activity necessarily being of high value. For example, a customer who opened many campaigns but never clicked was assigned a higher score than a customer who had clicked a few campaigns, even though the latter opened far fewer of them.

The lack of limits also allowed some customers to score too high simply because of the sheer number of actions they had made, turning them into outliers in our data, distorting the results, and losing possibly valuable information.


Tweaking

After encountering the problems above and observing the first results, we decided to shake things up and try some well-known, and some not so popular, techniques that could improve our model and our results. As one can imagine, not everything we tried produced the expected results, and what didn't was not taken into account.

Enhancements

Training Data Refinement

First of all, we extended the time interval of the training period to 4 months, to have more information available. We also removed customers who chose to unsubscribe from receiving campaigns, because they would not interact with the clients' campaigns anymore and, as a result, would not provide any more information or add value to our model.

Scoring Revision

Following that, we changed our scoring method. We decided to score based on the unique opens and clicks per campaign. So, if a customer received 10 campaigns during the training period, they would be scored for no more than 10 actions. This way, we set limits on our scoring method and avoided large deviations between customer scores, solving the problem with the outliers. This revision also mitigates the case where a customer who only opens campaigns but is highly active scores higher than a customer who clicks campaigns but has low activity, even though the latter's actions are of higher value; that issue is fully resolved in combination with the enhancement described in the next section.
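
In practice this kind of capping can be as simple as deduplicating the action log per customer, campaign, and action type before any points are assigned; the snippet below is a hypothetical illustration of that idea.

```python
import pandas as pd

# Hypothetical action log that records the campaign each action belongs to.
actions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c1", "c2", "c2"],
    "campaign_id": ["camp1", "camp1", "camp2", "camp1", "camp1"],
    "action":      ["open", "open", "click", "open", "click"],
})

# Keep only unique opens and clicks per customer and campaign, so repeatedly
# opening the same campaign no longer inflates the score.
unique_actions = actions.drop_duplicates(subset=["customer_id", "campaign_id", "action"])
print(unique_actions)
```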

Time Decay

We also decided to apply a time decay method to the customers' scores, to give more weight and value to the most recent customer activities. What time decay does is reduce the score of an action after a certain time interval has passed since said action.

For example, let's assume we have two customers who both opened a campaign during the training period, but did so 10 days apart. Opening a campaign is worth 1 point, so both customers start with 1 point for said action. We then count the days that have passed since the action was made, and every time a predetermined period passes (e.g., every seven days), we cut the score of the action in half. When we reach the present day, we have the true score of each customer for this particular action. Naturally, the customer who acted more recently ends up with the higher score.

Before we applied the time decay method to our scoring, these two customers would have been assigned the same score for these actions. Time decay is based on the notion that a customer who has recently engaged with a brand is more likely to engage again. In our case, a customer who has interacted with the most recent campaigns sent to them is more likely to interact with the next one they receive.

Moreover, the same logic applies to the campaigns themselves: the older the campaign a customer interacts with, the smaller its impact on the final score.

Of course, the total score of a customer is an aggregate of these final scores per action, but these depend on how much time has passed since the action and at which interval we choose to cut the score. Therefore, we experimented with different periods and tried reducing the scores after seven (7), ten (10), and fifteen (15) days had elapsed since each activity or action.
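
A small sketch of the step-wise halving described above (the dates and the 7-day half-life are just examples; 7, 10, and 15 days were the values we experimented with):

```python
from datetime import date

def decayed_score(base_points: float, action_date: date, today: date,
                  half_life_days: int = 7) -> float:
    """Halve an action's points for every full half-life elapsed since the action."""
    periods_elapsed = (today - action_date).days // half_life_days
    return base_points / (2 ** periods_elapsed)

# Two customers opened the same campaign (1 point each), ten days apart.
today = date(2021, 6, 30)
print(decayed_score(1, date(2021, 6, 28), today))  # 1.0  (recent open keeps its value)
print(decayed_score(1, date(2021, 6, 18), today))  # 0.5  (older open has been halved)
```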

Quantile Classification

Now, in terms of the distribution and classification issues we had to solve, we decided to classify the final scores according to the quartiles of their distribution. This means that we take all the scores, sort them, find the quartile values that split the distribution into four equal groups, and divide the scores into classes using these values as boundaries. This way, we get classes that don't overlap with each other and have similar populations. In the end, we have five (5) classes: one for each of the four quartile groups, plus one for the customers with no activity at all.
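
A sketch of this quartile-based classification, with inactive customers kept in their own class, could look like this (the scores and class labels are illustrative):

```python
import pandas as pd

# Hypothetical final scores; zero means no activity at all during the period.
scores = pd.Series([0.0, 0.0, 0.4, 1.2, 2.5, 3.1, 4.8, 6.0, 7.7, 9.9],
                   index=[f"c{i}" for i in range(10)])

# Customers with no activity form their own class (labelled 0 here).
active = scores[scores > 0]

# Split the active customers into four classes at the quartiles of their distribution.
quartile_class = pd.qcut(active.rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)

classes = pd.Series(0, index=scores.index)
classes.loc[quartile_class.index] = quartile_class
print(classes.sort_values())
```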

We chose this method to classify the customers because some clients have customers who are more active and engaged than other clients' customers, which would otherwise cause an imbalance problem. For example, the customers of a client who sends 3 campaigns a month can land in the High Engagement classes with just 3 opens or clicks. This is not the case for a client who sends 3 campaigns per week: customers who interacted with those campaigns only 3 times will probably land in the Low Engagement classes, while the ones in the High Engagement classes will have over 10 interactions.


Prediction Method

Lastly, we decided to change what we predict in the testing period and how we do it. Instead of predicting whether the customers would end up in the same class as in the training period, we tried to predict whether they would make a certain action in the next campaign they received, regardless of when they received it. On top of that, we decided to use five (5) training windows: we took the training and testing periods and rolled them forward by seven (7) days, four (4) times in total. That way, we got five comparable sets of results, each with different time intervals for the training and testing periods. In the end, we calculated the weighted average of the results of the five runs to get the final result of the model with the current settings and parameters.
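
The rolling windows themselves are straightforward to generate; below is a sketch under the assumption of a 4-month training period followed by a one-month testing period, rolled forward by 7 days.

```python
from datetime import date, timedelta

def rolling_windows(first_train_start: date, train_days: int = 120, test_days: int = 30,
                    step_days: int = 7, n_windows: int = 5):
    """Yield (train_start, train_end, test_end) tuples, each rolled forward by step_days."""
    for i in range(n_windows):
        train_start = first_train_start + timedelta(days=i * step_days)
        train_end = train_start + timedelta(days=train_days)
        test_end = train_end + timedelta(days=test_days)
        yield train_start, train_end, test_end

for window in rolling_windows(date(2021, 1, 1)):
    print(window)
```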

Didn’t Make the Cut

Some things didn't work out as we would have liked, or at least didn't give us any more information, even though we had pretty high hopes for them, especially the ones we describe in more detail below.

Industry-Score Correlation

At first, we tried to identify patterns or any kind of correlation between a client's Industry and their customers' scores. We ran the model for a carefully selected set of clients across some of the Industries we had available in our system, then grouped these clients according to the Industry they belong to and checked whether a particular Industry scores better in general or compared to the rest.

Custom Fields Research

What is a Custom Field

Custom Fields are, as the name implies, fields in a client’s database that are not part of a template but are custom made and sometimes unique per client, although there are some that are widely used among businesses. The client creates Custom Fields and uses them to serve their individual needs for information about their customers. Usually, Custom Fields are populated with values when the customers fill in forms or complete their user profile. Therefore, it is common for many of those Custom Fields to have a lot of NULL values.

Choosing Custom Fields

To find the best Custom Fields for each client, we needed to implement a feature selection method. Feature selection means selecting the features of your data that contribute the most to the prediction output. Because each client has different Custom Fields, both in number and in name, we needed to identify the best ones to use in each case. That's why we used a couple of techniques and algorithms to help us with this task.

  • NULL Values

Because of the unique characteristics that Custom Fields have, we had to single out the ones that would actually give us information about a customer's profile and/or behavior. Therefore, after many tests and trials, we decided to consider only those Custom Fields with less than 25% of their values equal to NULL.

  • Cardinality

Moreover, we checked Custom Fields in terms of cardinality. Cardinality is the diversity of values a Custom Field has compared to the size of the population. For example, the Custom Field "Telephone Number" has high cardinality because every customer has a different telephone number. This means we cannot group customers by this Custom Field and expect to see a pattern. For that reason, we decided to keep only the Custom Fields whose number of distinct values is under 20% of the population.
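
A sketch of this Custom Field filtering, assuming the two thresholds are applied as keep-rules on a pandas DataFrame of field values (the sample fields are invented):

```python
import pandas as pd

def select_custom_fields(df: pd.DataFrame, max_null_ratio: float = 0.25,
                         max_cardinality_ratio: float = 0.20) -> list:
    """Keep Custom Fields with few NULLs and relatively few distinct values."""
    kept = []
    for column in df.columns:
        null_ratio = df[column].isna().mean()
        cardinality_ratio = df[column].nunique(dropna=True) / len(df)
        if null_ratio < max_null_ratio and cardinality_ratio < max_cardinality_ratio:
            kept.append(column)
    return kept

# Hypothetical Custom Field table: phone numbers are too diverse, comments too sparse.
fields = pd.DataFrame({
    "telephone": [f"69{i:08d}" for i in range(100)],   # high cardinality: dropped
    "comments":  [None] * 60 + ["great"] * 40,         # too many NULLs: dropped
    "gender":    ["F", "M"] * 49 + [None, None],       # low cardinality, few NULLs: kept
})
print(select_custom_fields(fields))  # ['gender']
```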

Clustering

After keeping the Custom Fields that meet the requirements mentioned above and could give us valuable information, we use clustering algorithms to look for clusters or patterns among the customers. One approach was to try to find a fixed number of clusters, while the other was to explore whether there was any natural density in the Custom Field values that would help us identify clusters.

  • k-NN

One algorithm we used for the case of a fixed number of groups was k-NN (k-Nearest Neighbors), where we give a number as input and the algorithm tries to group the data into that many clusters. Unfortunately, this didn't work, because there is no natural number of clusters in our data, and we couldn't find the right number to give as input to the algorithm.

  • DBSCAN

Another algorithm, used to examine the density between the Custom Fields, was DBSCAN (Density-Based Spatial Clustering of Applications with Noise). DBSCAN is a density-based clustering algorithm that does not need the number of clusters as input; it groups together data points that are closely packed and have many nearby neighbors, marking the points whose nearest neighbors are too far away as outliers. However, DBSCAN didn't produce any fruitful results either, as there was no correlation between the Custom Fields, and it therefore could not classify the customers, or classified too few of them.
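
For reference, a minimal scikit-learn sketch of this kind of density-based clustering on one-hot encoded Custom Field values might look like the following; the field values, eps, and min_samples are illustrative, not the settings we used.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import OneHotEncoder

# Hypothetical Custom Field values for a handful of customers.
custom_fields = np.array([
    ["F", "student"], ["F", "student"], ["M", "student"],
    ["M", "engineer"], ["F", "engineer"], ["M", "retired"],
])
features = OneHotEncoder().fit_transform(custom_fields).toarray()

# eps and min_samples define what counts as a dense neighbourhood; points with too
# few close neighbours are labelled -1, i.e. noise/outliers.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(features)
print(labels)
```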

Country/Region as a Factor

In the case of the Region/Country of a customer, we tried to see if actions made in a certain Region or Country would result in better scores or give us a correlation. We checked if customers that performed an action from a certain Region or Country scored higher than others who did so in another region. Also, we tried to add Region and Country as a factor in the aforementioned Custom Fields to enhance the calculation we made there and try and extract more information. However, this attempt, too, was not fruitful.

Combining Models

After running tests for many clients in both the RFM and our Custom models and comparing their results, we observed that the RFM model was better in some classes while, in others, the Custom Model was giving better results. So, we decided to build a Mixed Model that calculates both models’ results but ends up forming the classes depending on the model that scored better for said class.

Statistical Analysis

Another method we tried was to perform a statistical analysis of the clients and their campaigns to find a pattern in one of those metrics. Some of the metrics were the Open, Click-Through, Click-to-Open, and Bounce rates, the number of Unsubscribed customers, the number of Complaints, the Segmentation percentage, and the number of Campaigns the client sent. However, these statistics didn't give us any insights regarding the behavior of the customers or the clients' performance.

Honorable Mentions

A few honorable mentions: We tried using the exact number of campaigns sent per customer (but didn't have that data available for the time period we were interested in) and keeping only the customers who had made more than one (1) action in the 4-month training period (which didn't produce better results, as we reduced the data and lost information). We also examined whether we could find look-alikes among the customers, meaning building a customer profile based on other customers we already knew had converted. Unfortunately, these attempts didn't give the results we would have liked.

Testing Framework

The need to determine which of the aforementioned models performed better pushed us to test each model with many combinations of parameters and settings. This testing would give us insight into which parameters make these models perform even better, and which method best serves our needs given our clients' data and profiles.

First Attempt

Our first approach was to assign a score to the customers depending on their actions in a certain time interval and then classify them into classes according to their score. Next, we used the same scoring and classification methods as in training, trying to see if we would get similar results per class. In a perfect world, the results of the testing period would resemble the results of the training period, and all would be well. But as we mentioned earlier, that wasn't the case, and together with the problems we encountered, it led us to change our approach entirely.

A New Approach

The new approach didn't change the way we train on our data. We still score the customers according to certain actions and classify them depending on their score (or the distribution of their scores, after the improvements we applied). However, we needed to change the way we thought about predicting a customer's behavior. Therefore, we decided to predict not the class the customer will be in after a certain time interval passes, but whether a customer will interact with the business or product in a certain way, given that said customer belongs to a particular class.

Training Data

At first, we use our Model to classify the customers of each client in Classes, depending on the Engagement Score they achieved during a certain time interval T. The Engagement Score is calculated according to how much the customers interacted with the campaigns they were sent during said time period T. These interactions could be based on whether the customers opened a campaign they were sent, clicked it, or did both. Having this opportunity, we tested our models with all possible types of interactions.

Expected Performance

After we generate the Classes, we calculate the Engagement Probability of each Class, meaning the probability that a customer belonging to a specific Class will make a certain action. That's the Expected Performance of each Class, and it is calculated by dividing the scores of the customers in each Class by the total number of campaigns those customers received.

Engagement Metric

Before we calculate the Test Performance of our algorithm, which checks whether the customers in each Class engaged with the next campaign they received, we need to choose which interaction counts as engagement. The choices were the same as in the training period: opening a campaign, clicking it, or doing both. Of course, we decided to try all of them to get a clear picture of the possible results.

Test Performance

We calculate the Test Performance for a Class by dividing the unique clicks of the customers in the Class by the number of those customers. Because every customer is tested on whether they engaged with the next campaign they received or not, each customer's score is either 0 or 1. There is no point in applying the time decay method here, as we only care whether the action happened, not when it happened.

Error Rate

By comparing the Expected Performance and the Test Performance, we can see how well our algorithm predicted the customers’ actions. We take the absolute difference between Expected Performance and Test Performance for each Class and then calculate the weighted average of these results, which gives us the Error Rate of the Model. In the end, we take the average of the five Error Rates that were calculated after the five runs we programmed, getting the final performance rate of our Model with the selected settings and parameters.
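
To tie the last few definitions together, here is a sketch of how the Expected Performance, Test Performance, and Error Rate of a single run could be computed; the per-customer data are invented, and the class-size weighting of the error is our reading of the weighted average mentioned above.

```python
import pandas as pd

# Hypothetical results for one run: engagement class, training score, campaigns
# received during training, and whether the customer clicked the next campaign (0/1).
df = pd.DataFrame({
    "customer_id":  [f"c{i}" for i in range(8)],
    "class":        [4, 4, 3, 3, 2, 2, 1, 1],
    "score":        [9.0, 7.5, 5.0, 4.0, 2.0, 1.5, 0.5, 0.2],
    "campaigns":    [10, 12, 10, 8, 9, 10, 7, 6],
    "clicked_next": [1, 1, 1, 0, 0, 1, 0, 0],
})

per_class = df.groupby("class").agg(
    total_score=("score", "sum"),
    total_campaigns=("campaigns", "sum"),
    customers=("customer_id", "count"),
    clicks_next=("clicked_next", "sum"),
)

# Expected Performance: class score relative to the campaigns its customers received.
per_class["expected"] = per_class["total_score"] / per_class["total_campaigns"]
# Test Performance: share of the class that engaged with the next campaign.
per_class["test"] = per_class["clicks_next"] / per_class["customers"]

# Error Rate for this run: absolute differences, weighted here by class size.
per_class["abs_diff"] = (per_class["expected"] - per_class["test"]).abs()
error_rate = (per_class["abs_diff"] * per_class["customers"]).sum() / per_class["customers"].sum()
print(per_class[["expected", "test", "abs_diff"]])
print("error rate:", round(error_rate, 3))
```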

Expectation Through Probability

The above methodology is based on the assumption that the higher the class a customer belongs to, the more probable it is that this customer will make a specific action the next time they receive a campaign. For a customer to belong in a high class means that they achieved a pretty high score. But to achieve a high score, a customer has to have high activity, especially in the most recent campaigns.

In the marketing world, if a customer has engaged recently with a brand or product, it is quite probable that they will engage again with the brand or product in the near future. Therefore, the higher the class a customer belongs to, the higher the probability of making a certain action. In short, we reasonably predict that a customer will make a certain action based on an expected behavior we have observed previously, monitoring and measuring it with a chosen metric.

The metrics we had at our disposal were the opens and clicks of the sent campaigns. Generally, a customer clicks a campaign less often than they open one. This makes clicks the more "valuable" action, especially considering that a click is often treated as a conversion by many clients. In some instances, we also used both metrics in our scoring method to see how they would affect the training and testing performance results.

Lastly, customers' interest varies depending on the client and their product. For example, the campaigns of a News Publisher will attract different interest and activity than those of a Swimwear Retailer. This means that customers will lose interest almost immediately in some clients' products, while in other cases the appeal will be nearly constant. This method of testing helps us cope with that as well.

Results and Conclusion

After all was said and done, we had to lay everything on the table and choose the model that gave us better results along with the best combination of parameters and settings. Of course, all the testing, trials, and errors we made along the way contributed to us reaching the final form of our implementation.

Results

After enhancing our Custom Model with the improvements mentioned earlier, building the Mixed Model, and putting them alongside our approach to the RFM model, we tested them all again and again, trying a wide variety of combinations of parameters and settings.

Training Interval

First of all, we had to choose the time interval of our training period. Between two (2) and four (4) months, we decided to go with the latter, as it gave us more information about the customers' behavior and activity while not slowing us down performance-wise.

To Decay or not to Decay

Also, regarding the time decay method we applied in our models, we had to check whether it was actually giving them the expected boost. After running the models with and without the time decay method, it was clear as day that time decay was an integral part of our models' performance. As we mentioned earlier, time decay is based on the premise that a customer who has recently engaged with a brand is more likely to engage with the brand again. In our case, a customer with this kind of activity and behavior should be assigned a better score than a customer who hasn't interacted lately. That way, not only does our algorithm perform better, but we also end up with a fairer customer score.

Days of Decay

After we decided to include the time decay method in our implementation, a different question arose: we had to choose the time interval after which the score would be reduced, i.e., after how many days the score would be cut in half. We decided to test three different periods: seven (7), ten (10), and fifteen (15) days. After using all three values in the models and testing them with all the different settings and parameters, we observed that we got better results using fifteen (15) days than with the other two.

Opens vs. Clicks vs. Both

One of the most important things we had to decide on was the metric we would use when training and scoring the customers. We trained with customers who only opened a campaign, with those who only clicked it, and with customers who made either of those actions. Keep in mind that when we talk about opens and clicks, we mean the unique opens and unique clicks a customer made across all the campaigns they received. As far as performance is concerned, training with customers who either opened or clicked a campaign didn't give us the results we had hoped for, given the added value their scores carried.

Opens vs. Clicks, The Final

Regarding the other two metrics, we had to test them in tandem with the metric we would choose for the next action. This means that when we trained with opens, we tried to predict whether the customer would open the next campaign, while when we trained with clicks, we tried to predict whether the customer would click it. In the end, training and predicting with clicks proved to be the better-performing combination. However, because training and predicting with opens came pretty close in the final results, we decided to implement a solution that lets our clients choose: classify their customers using clicks, or train the model with unique opens and predict the next open, i.e., the second-best set of metrics in terms of performance.

The Chosen One

The question remains, though: which model performed better after optimizing the parameters and settings? That would be our Custom Model. It went toe to toe with our RFM approach but produced better results, especially in the High Engagement classes, which made it come out on top among all the models we tested. The outstanding performance in the High Engagement classes really matters, because this is where the high-quality customers are. After testing on over 25 clients, with all combinations of the aforementioned parameters and settings, our Custom Model achieved an Error Rate of under 4%, which means it predicts with over 96% accuracy.

After all the trials and errors, studying and implementing, we managed to build a model that reasonably predicts whether a customer is about to interact with a product or business soon, with high accuracy, and classifies that customer according to the probability of that prediction. This gives our clients the ability to segment their customers and act differently depending on this classification. This is made possible by our Custom Model, which scores the customers according to their activity over the last 4 months, decays the scores every time a 15-day interval passes, and predicts whether the customer will click the next campaign they receive with over 96% accuracy.
