Lead Scoring using Machine Learning

Babak Abbaschian
12 min read · Jul 2, 2023

If your sales organization has a long list of leads waiting for a call, you’re lucky! But only if you also have a long list of sales reps to call all those leads. If not, I’m sorry, but you’re pissing off many potential customers!
And often, even if you have the army to make the calls, many of the leads are not qualified. After your reps spend valuable time on them, they have to move on to the next lead, which is now colder than it was the day before.
If these problems resonate with you, you have to score your leads!

A sales rep looking into the leads!

Lead scoring is a process in sales and marketing to prioritize leads based on their likelihood of converting into customers. It assigns a numerical score to each lead based on specific criteria and behaviors, helping reps to focus on leads more likely to convert.

The lead scoring process is typically based on three major feature groups:

  1. Lead/Customer Profile: The customer profile usually tells a lot about the likelihood of a lead converting. In a B2B case, think about the size of the company, its annual revenue, its industry, the job title of the contact, and how close the contact is to the final decision maker for deals and contracts. So it is crucial to cluster our customers into different groups/profiles and create a map of our ideal customer profiles or client personas. After building the map, we need to systematically measure how closely each new lead matches those profiles.
  2. Lead Activity: The other criterion to gauge the quality of a lead is the amount of activity, from the number of fields filled in a lead form to the number of communications, link clicks, and downloads. Simply put, the more activity and engagement from the lead, the higher the level of interest and chance of conversion.
  3. Lead ROI: Another measure in deciding how to prioritize leads is how much revenue they will generate for us. For example, a deal that brings in half a million dollars is definitely different from one that brings in 20 thousand dollars, and so is the time needed to nurture and convert each. While the other two criteria are about the customer only, this measure ties the customer to our business objectives and sales strategies. In some situations, depending on the ROI time, we may go with the former lead, but other times, we might go after the latter.

Each category mentioned above is a group of indicators that, when aggregated, paint the complete picture of a lead’s quality and its score. Traditionally, the weights of these indicators and the interactions between them were built up over time through trial and error, market and customer analysis, and the art of sales. Nowadays, machine learning automates many of these processes, but the steps are generally the same: we have to analyze the customer, the product movement, and the market over time, and A/B test many schemas to arrive at an optimized formula.

In the end, a higher lead score indicates a lead that is more likely to convert, so sales teams can prioritize their efforts and focus on leads with higher scores. This optimizes sales productivity and leads to more efficient and effective sales processes.

Lead scoring can be done by explicitly assigning weights to different features, lead attributes, and activities. But generally, it is automated using customer relationship management (CRM) systems or marketing automation platforms. The automations allow for real-time scoring updates based on lead behavior and triggers, ensuring that lead scores remain current and responsive to lead engagement changes.
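To make the explicit-weights approach concrete, here is a minimal sketch of a points-based score. The attribute names and point values are hypothetical and chosen only for illustration; a real scorecard would be tuned by the sales and marketing teams.

```python
# Hypothetical point values; not taken from any real scorecard.
WEIGHTS = {
    "requested_demo": 30,
    "attended_webinar": 15,
    "filled_contact_form": 10,
    "is_decision_maker": 25,
}


def rule_based_score(lead: dict) -> int:
    """Sum the points for every attribute or activity the lead exhibits, capped at 100."""
    total = sum(points for key, points in WEIGHTS.items() if lead.get(key))
    return min(total, 100)


lead = {"requested_demo": True, "is_decision_maker": True, "attended_webinar": False}
print(rule_based_score(lead))  # 55
```

A CRM automation would simply re-run a function like this whenever a lead triggers a new activity, which is what keeps the score current.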

From an algorithmic and operational standpoint, we can divide the lead scoring techniques into the following categories:

  1. Lead Attribute Scoring: This technique involves assigning scores based on the leads’ explicit information. It includes factors such as job title, company size, industry, and specific actions taken by the lead, such as filling out a contact form, requesting a demo, or attending a webinar.
  2. Lead Profile Scoring: Demographic scoring focuses on lead characteristics such as job title, company size, industry, location, and other demographic information. Certain demographic factors may be more relevant and influential in determining the likelihood of a lead becoming a customer. In the case of a lead being a business, we must also consider factors related to the lead’s organization or company, such as industry, annual revenue, number of employees, geographic location, and technology stack. These factors help evaluate a lead’s fit and potential value based on the target market and ideal customer profile.
  3. Lead Activity Scoring: We assign higher scores to the leads with higher activity in this technique. This method considers factors such as website visits, page views, email opens, click-through rates, content downloads, social media engagement, and other digital interactions that indicate the lead’s level of interest and engagement.
  4. Rep Assessment Scoring: This is a qualitative measure based on the rep’s experience, usually formulated with a binary risk scoring schema. It involves subjective assessments and evaluations of leads based on factors such as lead source quality, the sales rep’s feedback, industry knowledge, or specific qualitative criteria set by the organization. It incorporates human judgment and expertise to complement the other scoring techniques.

Often, a combination of these methods is used to create the final picture of lead qualification. Building the weights and formulas for rules/points-based and scorecard models used to require financial analysts and a lot of trial and error. Once the formula was built, companies could integrate it with their CRM systems and calculate lead scores on the fly. However, this system still required a team of financial analysts to constantly review the results and update the formulas. It was a costly operation that only a few industry segments, e.g., the lending industry, could justify.

One of the early examples of automation and machine learning in lead scoring came from that same financial industry, at GE Capital. The major CRM providers, whose products are the primary lead management systems, also started working on it. For example, Salesforce started to market its regression-based lead scoring system as AI-based lead scoring. Quickly, everyone else started to look into machine learning as the promised cheapest solution for lead scoring, since it replaces manual human calculation with repeatable computer programs that score leads in a more automated and data-driven manner.

One of the beauties of using machine learning in lead scoring is that it can find hidden, complex seasonality factors in the data. For example, if customer engagement has two harmonic components with different periods that combine into a complex pattern of ups and downs in sales, many modern classifiers can detect it.

The other beauty of machine learning is its ability to capture nonlinear correlations between decision factors, which manual methods would not discover without complex statistical operations.
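A small numeric sketch shows why a simple linear check can miss such a relationship. The feature (a company-size offset from a "sweet spot") and the propensity values are made up for illustration: propensity is fully determined by the feature, yet the linear correlation is zero until the feature is transformed.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature: company size relative to a sweet spot (0 = ideal fit).
size_offset = [-3, -2, -1, 0, 1, 2, 3]
# Hypothetical propensity that falls off on both sides of the sweet spot.
propensity = [9 - x * x for x in size_offset]

linear_r = pearson(size_offset, propensity)
transformed_r = pearson([x * x for x in size_offset], propensity)
print(linear_r, transformed_r)
```

Here the raw feature shows no linear correlation with the outcome, while the squared feature is perfectly (negatively) correlated with it. Tree-based and other nonlinear models can pick up this kind of pattern without anyone hand-crafting the transformation.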

Hidden seasonality due to combination of two features with different periods.
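The hidden seasonality idea can also be sketched numerically. The periods below (a 7-day and a 30-day cycle) and the amplitudes are assumptions chosen for illustration: each component is simple on its own, but their sum only repeats every 210 days, which is hard to spot by eye.

```python
import math


def engagement(day, period_a=7, period_b=30):
    """Sum of two sinusoidal cycles with different (hypothetical) periods."""
    weekly = math.sin(2 * math.pi * day / period_a)
    monthly = 0.5 * math.sin(2 * math.pi * day / period_b)
    return weekly + monthly


# The combined signal does not repeat after one week...
one_week_gap = abs(engagement(3) - engagement(3 + 7))
# ...but it does repeat after 210 days, the least common multiple of 7 and 30.
full_cycle_gap = abs(engagement(3) - engagement(3 + 210))
print(one_week_gap, full_cycle_gap)
```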

To apply machine learning to your lead scoring problem, you have to take the following steps:

  1. Data Analysis and Feature Engineering: Machine learning algorithms can analyze large volumes of data to identify patterns and relationships between lead attributes and conversion outcomes. By analyzing historical lead data and customer conversion data using machine learning algorithms, you can identify the correlation between your features and the desired outcome of a lead getting converted. This helps minimize the number of features you want to feed your model and sometimes uncovers important features for seasonality.
  2. Building and training the model: Once the relevant features are identified, we can select an algorithm such as logistic regression, decision trees, random forests, gradient boosting, SVMs, or even neural networks to build the model. Then we can train and test the model using historical data.
  3. Predicting is scoring: Using our trained model, we can predict the likelihood of a lead getting converted. That likelihood can be used directly as the score or applied to other transformations based on our scoring schema to generate the final score.
  4. Continuous Learning: This is an important step. The performance of the model you have trained on historical data can quickly deteriorate as your historical data becomes prehistorical. Our customers, market conditions, and all the features building the model are constantly changing. We must constantly monitor our models and ensure they are retrained whenever a fresh batch of lead outcomes becomes available.
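The steps above can be sketched end to end with a tiny logistic regression written in plain Python. This is only an illustration under stated assumptions: the feature names (`email_clicks`, `demo_requested`), the toy history, the helper names `train_logistic` and `lead_score`, and the 0-100 scaling are all hypothetical, and in practice you would more likely use an existing library implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Step 2: fit weights and bias with plain batch gradient descent."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def lead_score(x, weights, bias):
    """Step 3: map the predicted conversion probability to a 0-100 score."""
    return round(100 * sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias))


# Toy history (made up): [email_clicks, demo_requested] -> converted (1) or not (0).
history = [[0, 0], [1, 0], [2, 0], [3, 1], [4, 1], [5, 1], [1, 1], [4, 0]]
converted = [0, 0, 0, 1, 1, 1, 0, 1]

weights, bias = train_logistic(history, converted)
hot = lead_score([5, 1], weights, bias)
cold = lead_score([0, 0], weights, bias)
print(hot, cold)
```

For step 4, retraining amounts to calling `train_logistic` again once a fresher `history` with known outcomes is available, and comparing the new model's scores against recent conversions before replacing the old one.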

Let’s review some published research on real-world examples of Machine Learning in Lead Scoring.

Organizations can automate the process by leveraging machine learning for lead scoring, improving accuracy, and scaling their lead management efforts. It enables more efficient allocation of sales and marketing resources, identifies high-value leads, and increases the overall effectiveness of lead nurturing and conversion strategies.

As I mentioned before, GE was one of the pioneers in using machine learning for lead management and lead generation. The “if-then” nature of decision trees translates directly to the binary rules that were once used for lead scoring, e.g., if the customer has emailed more than one time, the score goes up by x. So it is natural to see decision trees as early drivers of lead scoring.
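To illustrate how a tree can learn such an if-then rule from data rather than from intuition, here is a minimal sketch of a one-split "decision stump". The toy data is made up, and the split is chosen by counting misclassifications for simplicity; real decision tree implementations typically use impurity measures such as Gini or entropy instead.

```python
def best_threshold(values, labels):
    """Return the threshold t for which the rule `value > t` misclassifies the fewest leads."""
    best_t, best_errors = None, len(labels) + 1
    for t in sorted(set(values)):
        errors = sum((v > t) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors


# Toy data (made up): emails sent by each lead, and whether the lead converted.
emails = [0, 1, 1, 2, 3, 4, 5]
converted = [0, 0, 0, 1, 1, 1, 1]

threshold, errors = best_threshold(emails, converted)
print(f"if emails > {threshold}: raise the score ({errors} training errors)")
# prints "if emails > 1: raise the score (0 training errors)"
```

A full decision tree repeats this search recursively over many features, which is how it ends up with combinations of conditions instead of a single rule.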

In their 2013 paper, Aggour and Hoogs present a successful example of using a decision tree-based model in lead scoring. They describe the implementation of a system called Lead Triggers, which automates the collection and analysis of GE company information to identify actionable sales leads for sales representatives. It uses a two-class decision tree to detect the combinations of financial metrics and values over time that are present in positive cases and absent in negative cases. The system has three core components: information fusion, knowledge discovery, and information visualization. It extracts data from various sources, fuses it into meaningful information, and mines that information for sales leads based on expert-defined and statistically derived triggers. A web-based interface gives sales reps access to company information and leads in one location. Using Lead Triggers has significantly improved sales reps’ performance, increasing their productivity by 30–50%. In 2010 alone, Lead Triggers provided leads on opportunities worth over $44 billion in new deal commitments for GE Capital Americas. The system has transformed how sales reps gather intelligence, improved their productivity, and brought consistency and effectiveness across the sales force.

One of the most used methods in lead scoring is the random forest. As decision trees can capture nonlinear relationships, forests made of many decision trees can take this to the next level. In research published in 2020, Başarslan and Argun propose a customer acquisition schema in the banking industry based on lead scoring. The authors compare models using several classification algorithms to estimate potential bank customers based on a dataset obtained through telemarketing. Various classification algorithms such as Decision Tree, Naive Bayes, K-nearest neighbors, Logistic Regression, Random Forest, and Adaptive Boosting are used to create these models. The dataset is divided into training and test sets using K-fold Cross Validation and Holdout methods to ensure consistent model performance. Evaluation metrics such as Accuracy, Precision, Sensitivity, and F-measure are used to assess model performance. The results indicate that the Random Forest algorithm performs best in Accuracy and F-measure, Naive Bayes excels in Precision, and the AdaBoostM1 algorithm performs well in Sensitivity.
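For readers less familiar with these evaluation metrics, here is a short sketch of how Accuracy, Precision, Sensitivity (recall), and F-measure are computed from predictions. The labels are made up for illustration and have nothing to do with the paper's dataset.

```python
def classification_metrics(actual, predicted):
    """Compute four common binary-classification metrics from 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # converted, predicted to convert
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # not converted, predicted not
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false alarm
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # missed conversion
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


# Made-up outcomes for eight leads: 1 = converted, 0 = did not convert.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Which metric matters most depends on the sales process: precision matters when rep time is scarce, while sensitivity matters when missing a good lead is costly.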

Nowadays, such an extensive analysis and model comparison can be done quite easily using AutoML frameworks, which lets you focus more on business needs than on model selection.

Another family of approaches that seems natural for calculating a “score” is regression-based methods. Since we are predicting a category here (converted or not), logistic regression is our next method.

D’Haen and Van den Poel proposed a lead scoring method called “B2B prospect prediction” to assist sales representatives in acquiring customers in a B2B environment. The proposed framework consists of three phases. In Phase 1, predictions are generated using the nearest neighbor method based on information from the current customers. Phase 2 incorporates information on companies that did not become customers, utilizing logistic regression, decision trees, and neural networks to optimize the model. In Phase 3, the results from Phases 1 and 2 are combined into a weighted list of prospects. The study’s conclusions highlight the algorithm’s ability to streamline the customer acquisition process in a B2B environment. The output is a ranked list of prospects, enabling sales representatives to focus on higher-quality leads more likely to convert into customers. The authors suggest that the tool could be used to negotiate with data vendors to selectively purchase prospects. Furthermore, the iterative view of the customer acquisition process presented in the study highlights the need for documentation and analysis to continuously improve customer acquisition strategies.
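The idea of combining two phases into one weighted, ranked list can be sketched as follows. This is not the authors' implementation: the prospect names, the two score dictionaries, the `combine_rankings` helper, and the equal default weight are all hypothetical, and only show what blending a similarity score with a model-based propensity score might look like.

```python
def combine_rankings(similarity, propensity, weight=0.5):
    """Blend two per-prospect scores in [0, 1] and return prospects sorted best-first."""
    blended = {
        name: weight * similarity[name] + (1 - weight) * propensity[name]
        for name in similarity
    }
    return sorted(blended, key=blended.get, reverse=True)


# Hypothetical Phase 1 output: similarity to current customers.
similarity = {"acme": 0.9, "globex": 0.4, "initech": 0.7}
# Hypothetical Phase 2 output: model-estimated probability of converting.
propensity = {"acme": 0.5, "globex": 0.8, "initech": 0.9}

ranked = combine_rankings(similarity, propensity)
print(ranked)  # ['initech', 'acme', 'globex']
```

Changing `weight` shifts the ranking toward one phase or the other, so in practice it would be tuned against observed conversions.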

Talking about Machine Learning classification and no mention of SVMs?

In 2019, Eitle and Buxmann published a review comparing lead scoring implementations using Support Vector Machine, CatBoost, Random Forest, XGBoost, and Decision Trees. This research paper presents a model designed to assist sales representatives in the software industry in managing their complex sales pipelines. By incorporating machine learning and business analytics into lead and opportunity management, the model reduces arbitrariness and provides data-driven qualification support. The study develops three models trained and tested using real-life data extracted from a company’s CRM. According to the results of this work, CatBoost and Random Forest algorithms outperform other supervised classifiers like SVM, XGBoost, and Decision Tree. The study also reveals the challenges in predicting sales outcomes at the early lead stage compared to separately analyzing the lead and opportunity phases. Additionally, the paper introduces an explanation functionality to explain the decision-making process for individual predictions.

After covering several supervised learning methods, we should also mention the use of unsupervised learning for lead scoring. This brings us to a survey published in February 2023 by Wu et al. This paper discusses the importance of lead scoring in lead management and mentions the lack of a comprehensive literature review and classification framework dedicated to this topic.

The paper categorizes lead scoring models into traditional and predictive approaches, with the latter utilizing data mining and machine learning techniques. The study conducts a systematic literature review and analyzes 44 relevant studies to identify metrics for measuring the impact of lead-scoring models on sales performance. Additionally, the study identifies several challenges in implementing lead scoring models, including the time-consuming nature of manual scoring, the limited effectiveness of simplistic models, resistance to change, and reliance on low-quality or outdated data.

To overcome these challenges, the study recommends leveraging AI-based lead scoring models, employing more sophisticated models that consider both seller and buyer perspectives, creating dynamic/flexible scoring models, and utilizing industry-specific and up-to-date data sources. The paper concludes by emphasizing the importance of lead scoring in improving inside sales performance and the need for more advanced and adaptive models. It highlights the growing interest in predictive lead-scoring models driven by advancements in computational capabilities, the availability of large sales datasets, and the adoption of remote selling.

Summary of predictive Lead scoring models presented in Table 5:

Table 5 shows that in most studies, multiple models have been applied to the problem.

This last paper is an excellent example of the breadth of machine learning algorithms used in lead scoring. Depending on the size of your training data and the number of features you select, you can use almost any classification algorithm, even deep learning, if you have enough data.

In conclusion, lead scoring is a critical process in sales and marketing that prioritizes leads based on their likelihood of converting into customers. Traditional methods of lead scoring involved manual calculations and trial-and-error approaches, but automation has become more prevalent with the advent of machine learning. Machine learning algorithms analyze data, identify patterns, and build models to predict lead conversion likelihood. Machine learning provides opportunities to uncover hidden patterns and nonlinear correlations, making lead scoring more effective and data-driven. Techniques such as decision trees, random forests, logistic regression, boosting, and support vector machines are frequently used for lead scoring. Continuous learning and monitoring of models are essential to maintain accuracy and adapt to changing customer and market dynamics.

As we discussed, by leveraging machine learning for lead scoring, organizations can automate processes, improve accuracy, allocate resources efficiently, and enhance lead nurturing and conversion strategies. Adopting AI-based lead scoring models, utilizing sophisticated and adaptive models, and integrating industry-specific and up-to-date data sources are recommended for future advancements in lead scoring.

Thank you for taking the time to read this article. I’d be more than happy to hear from you.

Written by Babak Abbaschian

Leader, technologist, and data scientist with 15+ years experience in AI/ML, and data. Known for strategic leadership, innovative solutions, and research.