Enhancing Efficiency through Lead Prioritisation in Online Moving Services

Imagine your CEO walks into the office, visibly concerned. He shares a real headache — not knowing which untouched leads in the pipeline could turn into profitable ventures.

Varun Tyagi
Operations Research Bit
12 min read · Feb 16, 2024


Image generated using DALL-E 3

Introduction

It’s a challenge keeping him up at night, especially considering the leads are costly, and every untouched lead represents a drain on the company’s resources. Now, as the data leader, you don’t shy away from a problem; you see it as an opportunity. So, you decide to dig into the data, carefully sizing up each lead. Your approach? Craft a scoring system that reflects their potential for conversion. It’s a practical move, highlighting the essence of being a data leader — addressing problems and steering towards success, all while ensuring cost-effectiveness and efficient resource utilisation.

In the ever-evolving landscape of online services, efficiency is paramount for success. For an online moving company, the ability to prioritise leads effectively can make a substantial impact on business outcomes. In this blog, we delve into the intricacies of lead prioritisation, exploring a Python code that utilises synthetic data and machine learning techniques to optimise the identification of high-conversion leads. Let’s break down the code step by step to understand how each section contributes to this vital process.

The Significance of Lead Prioritisation

When a moving company is inundated with leads from online inquiries, phone calls, purchased lead lists, self-generated leads, referrals, social media, and its own website, identifying the most promising opportunities becomes paramount. Lead prioritisation enables businesses to focus their resources on leads with the highest likelihood of conversion. This not only saves time and effort but also maximises revenue by targeting the most valuable prospects.

Importing Libraries

In any data-driven project, the first step is to import the necessary libraries. This section lays the foundation for the subsequent code, bringing in powerful tools for data manipulation, machine learning, and visualisation. The numpy and pandas libraries empower efficient data handling, while scikit-learn provides machine learning capabilities. Faker is used to generate synthetic data, and visualisation is enhanced through matplotlib and seaborn. Techniques to handle class imbalance are facilitated by imblearn.

import numpy as np
import pandas as pd
import random
from faker import Faker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.utils.multiclass import unique_labels
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import RandomOverSampler
import string

Setting Seed for Reproducibility

Ensuring reproducibility is crucial in data science experiments. Setting seeds for random number generation fosters consistency across different runs of the code. These lines establish a baseline for reproducibility, aiding in the validation and comparison of results.

np.random.seed(0)
random.seed(0)
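
One caveat: Faker maintains its own internal random generator, which the two seeds above do not touch, so the synthetic leads generated below will still vary between runs. If full reproducibility of the data matters, a minimal addition, assuming a recent version of the faker package, is to seed Faker as well:

# Faker uses its own random instance; seeding it makes the generated leads repeatable
Faker.seed(0)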

Function to Generate Lead ID

As every company assigns each lead a unique identifier, generating lead IDs is fundamental for tracking and analysis. The generate_lead_id() function accomplishes this by creating a 12-character alphanumeric lead ID, leveraging the string and random modules to produce diverse and unique values.

def generate_lead_id():
    characters = string.ascii_letters + string.digits
    return ''.join(random.choice(characters) for _ in range(12))

Generating Fake Lead Data

As I cannot share the actual company data, we will use the Faker library to generate a synthetic dataset, which is a crucial step for training and evaluating the machine learning model. The code generates 1,000 fake leads, each comprising various attributes. Each lead is represented as a list, making the dataset versatile and comprehensive.

fake = Faker()
num_samples = 1000

lead_data = []
for _ in range(num_samples):
    lead_data.append([
        generate_lead_id(),  # 12-character alphanumeric lead ID
        fake.date_between(start_date='-30d', end_date='today'),  # Date of the lead
        fake.date_between(start_date='today', end_date='+30d'),  # Date of moving
        fake.random_int(min=400, max=2000),  # Surface area of the apartment (in square meters)
        fake.random_int(min=5, max=100),  # Volume of the furniture (in cubic meters)
        fake.random_element(elements=('North', 'South', 'East', 'West')),  # Region of the lead
        fake.random_element(elements=('Germany', 'France', 'Sweden')),  # Country
        fake.random_element(elements=('Social Media', 'Referrals', 'Website', 'Phone',
                                      'Immoscout24', 'Immowelt', 'Demenagement', 'Ebay',
                                      'Wunderflats', 'Wg-gesucht', 'HousingAnywhere')),  # Source of the lead
        fake.random_element(elements=('Low', 'Medium', 'High')),  # Urgency of the move
        fake.random_element(elements=(0, 1))  # Binary conversion rate (0 or 1)
    ])

Creating a DataFrame from Lead Data

To manipulate and analyse the generated data, it is structured into a Pandas DataFrame with appropriately named columns. The DataFrame provides a tabular representation of the synthetic leads, facilitating exploratory data analysis

lead_df = pd.DataFrame(lead_data, columns=[
    'Lead ID', 'Date of the lead', 'Date of moving',
    'Surface area of the apartment', 'Volume of the furniture',
    'Region of the lead', 'Country', 'Source of the lead', 'Urgency of the move', 'Conversion Rate'
])

Data Preprocessing: Unveiling Patterns for Optimal Analysis

In data science, ensuring that data is in the right format and structure is paramount for meaningful analysis. The code initiates this crucial step by converting date columns to datetime objects, extracting relevant information, and pruning unnecessary features. By converting dates to datetime objects, the code enables the model to interpret temporal patterns more effectively. The creation of new features, such as Day of the week and Days until moving, further enhances the dataset's richness. Dropping Lead ID and the original date features reduces redundancy and streamlines the dataset for analysis.

# Convert date columns to datetime objects
lead_df['Date of the lead'] = pd.to_datetime(lead_df['Date of the lead'])
lead_df['Date of moving'] = pd.to_datetime(lead_df['Date of moving'])

# Extract relevant information from the date features
lead_df['Day of the week'] = lead_df['Date of the lead'].dt.dayofweek
lead_df['Days until moving'] = (lead_df['Date of moving'] - lead_df['Date of the lead']).dt.days

# Drop the original date features and 'Lead ID'
lead_df = lead_df.drop(['Date of the lead', 'Date of moving', 'Lead ID'], axis=1)

One-Hot Encoding Categorical Features: Bridging the Gap for Model Compatibility

Machine learning models often require numerical input, necessitating the transformation of categorical variables into a suitable format. The code achieves this through one-hot encoding, converting categorical features into binary vectors. This transformation ensures that the model can effectively interpret and leverage categorical data, paving the way for accurate predictions.

lead_df_encoded = pd.get_dummies(lead_df, columns=['Region of the lead', 'Country', 'Source of the lead', 'Urgency of the move'])
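
To see what this transformation produces, a quick optional check (not part of the original pipeline) lists the dummy columns generated for the Country feature; the same pattern applies to the other categorical columns:

# Inspect the binary indicator columns created for 'Country'
country_dummies = [col for col in lead_df_encoded.columns if col.startswith('Country_')]
print(country_dummies)  # expected: ['Country_France', 'Country_Germany', 'Country_Sweden']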

Splitting the Dataset into Training and Testing Sets: Nurturing Model Generalisation

Dividing the dataset into training and testing sets is a fundamental practice in machine learning. The code accomplishes this using the train_test_split function from scikit-learn. The separation into training and testing subsets enables the model to learn patterns from one subset and evaluate its performance on an independent dataset, gauging its ability to generalize to new, unseen data.

X = lead_df_encoded.drop('Conversion Rate', axis=1).values
y = lead_df_encoded['Conversion Rate'].astype(int).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Handling Class Imbalance: Striking Balance for Model Robustness

One of the challenges in machine learning is dealing with imbalanced datasets, where one class significantly outnumbers the other. In the context of lead prioritisation, this could mean having a disproportionately low number of converted leads. The code addresses this issue by employing the RandomOverSampler from the imblearn library. The oversampling technique generates synthetic samples for the minority class, ensuring a more balanced representation in the training data. This approach contributes to a more robust model that can effectively learn patterns from both classes.

oversampler = RandomOverSampler(sampling_strategy='minority', random_state=42)
X_train_resampled, y_train_resampled = oversampler.fit_resample(X_train, y_train)
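
A quick sanity check, again not part of the original pipeline, confirms that the resampled training set now contains an equal number of converted and unconverted leads:

from collections import Counter

# Class counts before and after oversampling; the minority class is duplicated
# until both classes are equally represented
print("Before resampling:", Counter(y_train))
print("After resampling: ", Counter(y_train_resampled))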

Standardising Features: Aligning Variables for Optimal Model Training

Standardisation is a crucial preprocessing step that ensures all features are on a similar scale, preventing certain variables from dominating others. The code employs the StandardScaler to standardise the features. This process involves removing the mean and scaling to unit variance, creating a standardised dataset that enhances the stability and convergence of the machine learning model.

scaler = StandardScaler()
X_train_resampled_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

Building and Training a Neural Network Model: Unleashing the Power of Machine Learning

Now that the data is preprocessed, it’s time to construct and train a machine learning model. The code utilises the MLPClassifier from scikit-learn to build a neural network. This neural network has two hidden layers with eight and four nodes, respectively. The max_iter parameter defines the maximum number of iterations for optimisation, and warm_start enables incremental training.

model = MLPClassifier(hidden_layer_sizes=(8, 4), max_iter=1000, random_state=42, warm_start=True)
model.fit(X_train_resampled_scaled, y_train_resampled)

Making Predictions and Evaluating Model Accuracy: Gauging Model Performance

The trained model is put to the test by predicting conversions on the test set, and its accuracy is evaluated. The accuracy_score function compares the predicted labels with the actual labels in the test set, providing a percentage that represents the model's accuracy. This metric gives insights into the model's overall performance.

predictions = model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

Evaluating Model Performance by Country: A Granular Perspective

Understanding how well a model performs across different countries is crucial for tailoring marketing and lead prioritisation strategies. The code iterates through each unique country in the dataset, assessing the model’s accuracy and presenting a confusion matrix for deeper insights. This section offers a nuanced view of the model’s performance, allowing stakeholders to tailor their approach based on the unique characteristics of each country. Note that the overall accuracy of the model might differ from the individual accuracies at the country level. The reasons are as follows:

1. Class Imbalance

As mentioned above, there may be an imbalance in the distribution of the target variable (Conversion Rate), where one class (converted or not converted) significantly outnumbers the other. This class imbalance can impact the overall model accuracy.

2. Varying Class Distributions by Country

When evaluating accuracy by individual countries, the class distribution for the target variable may differ from the overall dataset. In some countries, there might be a more balanced distribution or a higher prevalence of one class over the other.

3. Model Sensitivity to Class Imbalance

Machine learning models, including neural networks, can be sensitive to imbalanced datasets. If the model is trained on a dataset where one class is much more frequent than the other, it may develop a bias toward the majority class. This bias can affect its ability to accurately predict the minority class.

4. Impact of Resampling Techniques

The use of resampling techniques, such as oversampling the minority class, may improve the model's performance on the minority class in specific countries. However, these techniques do not necessarily lead to a uniform improvement across all countries. Therefore, even after applying the RandomOverSampler to balance the training data, the effect on the country-level accuracies was only moderate.

5. Regional Disparities

Differences in lead characteristics, user behaviors, or market conditions across countries may lead to variations in the predictive performance of the model. The model might perform better in countries where the data distribution aligns more closely with the training data.

6. Generalisation Challenges

The model may face challenges in generalizing well to diverse patterns exhibited by different countries. If the model is overfitting to certain patterns present in the training data but not representative of the overall dataset, it may lead to a discrepancy in accuracy.

7. Evaluation Metric Choice

Accuracy, while a common metric, might not be the most suitable for imbalanced datasets. Other metrics such as precision, recall, or the F1 score can provide a more nuanced understanding of the model's performance, especially when dealing with class imbalance (see the per-country report sketch after the evaluation code below).

8. Model Complexity

The chosen neural network architecture and hyperparameters can influence how well the model adapts to different patterns in the data. The complexity of the model may impact its ability to generalize across diverse country-specific patterns.

In short, the discrepancy between the overall model accuracy and the individual accuracies by country highlights the importance of understanding the nuances of the dataset, the impact of class imbalance, and the need for thorough evaluation metrics.

countries = lead_df['Country'].unique()
for country in countries:
    # Extract data for the specific country
    country_mask = lead_df['Country'] == country
    X_country = lead_df_encoded[country_mask].drop('Conversion Rate', axis=1).values
    y_country = lead_df_encoded[country_mask]['Conversion Rate'].astype(int).values

    # Scale features for the specific country
    X_country_scaled = scaler.transform(X_country)
    predictions_country = model.predict(X_country_scaled)

    # Print accuracy for the specific country
    accuracy_country = accuracy_score(y_country, predictions_country)
    print(f"Accuracy for {country}: {accuracy_country * 100:.2f}%")

    # Display confusion matrix for the specific country
    cm = confusion_matrix(y_country, predictions_country)
    labels = unique_labels(y_country, predictions_country)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)
    plt.title(f'Confusion Matrix for {country}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()
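
As point 7 above suggests, accuracy alone can be misleading when classes are imbalanced. The sketch below is an optional extension rather than part of the original walkthrough: it reuses the same per-country loop to print precision, recall, and F1 score via scikit-learn's classification_report.

from sklearn.metrics import classification_report

for country in countries:
    country_mask = lead_df['Country'] == country
    X_country = lead_df_encoded[country_mask].drop('Conversion Rate', axis=1).values
    y_country = lead_df_encoded[country_mask]['Conversion Rate'].astype(int).values
    predictions_country = model.predict(scaler.transform(X_country))

    # Precision, recall and F1 per class give a more nuanced view than accuracy alone
    print(f"\nClassification report for {country}:")
    print(classification_report(y_country, predictions_country, zero_division=0))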

Generating Another Set of Fake Lead Data for Prediction: Simulating Real-world Scenarios

To test the model’s predictive capabilities on new, unseen data, the code generates another set of synthetic leads. This simulates real-world scenarios where the model encounters fresh leads for which it hasn’t been trained. These synthetic leads mirror the structure of the original dataset, ensuring consistency in the types of information the model will encounter.

fake_lead_data = []
for _ in range(100):
    fake_lead_data.append([
        generate_lead_id(),  # 12-character alphanumeric lead ID
        fake.date_between(start_date='-30d', end_date='today'),  # Date of the lead
        fake.date_between(start_date='today', end_date='+30d'),  # Date of moving
        fake.random_int(min=400, max=2000),  # Surface area of the apartment (in square meters)
        fake.random_int(min=5, max=100),  # Volume of the furniture (in cubic meters)
        fake.random_element(elements=('North', 'South', 'East', 'West')),  # Region of the lead
        fake.random_element(elements=('Germany', 'France', 'Sweden')),  # Country
        fake.random_element(elements=('Social Media', 'Referrals', 'Website', 'Phone',
                                      'Immoscout24', 'Immowelt', 'Demenagement', 'Ebay',
                                      'Wunderflats', 'Wg-gesucht', 'HousingAnywhere')),  # Source of the lead
        fake.random_element(elements=('Low', 'Medium', 'High')),  # Urgency of the move
    ])

Preprocessing New Fake Lead Data: Maintaining Data Consistency

The newly generated fake lead data undergoes the same preprocessing steps as the original dataset, so the model sees data in a consistent, standardised format when making predictions. One extra safeguard worth adding is aligning the one-hot encoded columns with the training feature set: a category that happens to be absent from this smaller sample would otherwise shift or drop columns and break the scaler and the model.

# Convert fake_lead_data to DataFrame
fake_lead_df = pd.DataFrame(fake_lead_data, columns=[
    'Lead ID', 'Date of the lead', 'Date of moving',
    'Surface area of the apartment', 'Volume of the furniture',
    'Region of the lead', 'Country', 'Source of the lead', 'Urgency of the move'
])

# Convert date columns to datetime objects
fake_lead_df['Date of the lead'] = pd.to_datetime(fake_lead_df['Date of the lead'])
fake_lead_df['Date of moving'] = pd.to_datetime(fake_lead_df['Date of moving'])

# Extract relevant information from the date features
fake_lead_df['Day of the week'] = fake_lead_df['Date of the lead'].dt.dayofweek
fake_lead_df['Days until moving'] = (fake_lead_df['Date of moving'] - fake_lead_df['Date of the lead']).dt.days

# Drop 'Date of moving' and 'Lead ID' (keep 'Date of the lead' for the results table below)
fake_lead_df = fake_lead_df.drop(['Date of moving', 'Lead ID'], axis=1)

# Convert categorical features to one-hot encoding
fake_lead_df_encoded = pd.get_dummies(fake_lead_df, columns=['Region of the lead', 'Country', 'Source of the lead', 'Urgency of the move'])

# Drop the 'Date of the lead' column before scaling
fake_lead_df_encoded = fake_lead_df_encoded.drop(['Date of the lead'], axis=1)

# Align the encoded columns with the training feature set so the column order matches
# and any category missing from this smaller sample is filled with zeros
train_feature_columns = lead_df_encoded.drop('Conversion Rate', axis=1).columns
fake_lead_df_encoded = fake_lead_df_encoded.reindex(columns=train_feature_columns, fill_value=0)

# Standardise features by removing the mean and scaling to unit variance
fake_lead_X_scaled = scaler.transform(fake_lead_df_encoded.values)

Making Predictions on the New Set of Fake Leads: Unveiling Predictive Power

With the model now trained and the new set of synthetic leads prepared, it’s time to assess the model’s predictive prowess on unseen data. The code utilises the trained model to predict the probability of conversion for each lead in the new set. This step generates a probability score for each lead, representing the likelihood of conversion. These scores offer valuable insights for prioritising and focusing efforts on leads with a higher likelihood of conversion.

fake_lead_probabilities = model.predict_proba(fake_lead_X_scaled)[:, 1]

Displaying Results: Informed Decision-Making

The final section of the code is dedicated to presenting and organising the results in a coherent manner. The code creates a DataFrame, fake_lead_results, containing lead ID, country, lead creation date, and the calculated probability of conversion for each synthetic lead. This DataFrame is then sorted by country and probability of conversion, providing a clear and organized view of the synthetic leads with the highest potential for conversion. This information empowers the online moving company to make informed decisions on resource allocation and prioritise leads that are more likely to convert.

fake_lead_results = pd.DataFrame({
    'Lead ID': [lead[0] for lead in fake_lead_data],
    'Country': fake_lead_df['Country'],
    'Lead Creation Date': fake_lead_df['Date of the lead'],
    'Probability of Conversion': fake_lead_probabilities
})

# Sort the results DataFrame by "Probability of Conversion" in descending order for each country
fake_lead_results_sorted = fake_lead_results.sort_values(by=['Country', 'Probability of Conversion'], ascending=[True, False])

print(fake_lead_results_sorted)
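
In practice, a sales team rarely works the full ranked list at once. One simple way to turn the sorted results into an actionable call list is to keep only the highest-scoring leads per country, for example the top ten; this is a usage sketch, and the cut-off of ten is an arbitrary assumption:

# Keep the 10 leads with the highest predicted conversion probability in each country
top_leads_per_country = fake_lead_results_sorted.groupby('Country').head(10)
print(top_leads_per_country)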

Conclusion: Elevating Lead Prioritisation with Data Science

In this exploration of lead prioritisation for online moving services, we’ve journeyed through the various steps involved in building and evaluating a machine learning model. From generating synthetic data to training the model and predicting outcomes, each section plays a crucial role in optimising the prioritisation of leads.

The ability to predict conversion probabilities empowers businesses to allocate resources efficiently, focus on high-potential leads, and ultimately enhance overall operational efficiency. As the online moving industry continues to evolve, integrating data science methodologies can be a game-changer for staying ahead in a competitive landscape.

By embracing these techniques, businesses can not only streamline their lead prioritisation processes but also gain deeper insights into customer behavior, regional trends, and factors influencing conversion. As technology continues to advance, the fusion of data science and business strategy becomes increasingly vital for organisations seeking a competitive edge in the digital era.

Lead prioritisation isn’t just a data science task; it’s a strategic imperative for businesses looking to optimise their operations and drive sustainable growth. Through the lens of this Python code, we’ve glimpsed the potential of leveraging synthetic data and machine learning to revolutionise the way online moving services identify and prioritise their leads. As industries continue to evolve, the marriage of data-driven insights and strategic decision-making will undoubtedly be the driving force behind success in the digital landscape.

Code

Lead Prioritisation
