Pharmaceutical Sales Rep Call Optimization using Graph Theory, K-Means Clustering and Prediction of Prescription Conversion using XGBoost

Shikhar Kanaskar & Nikhil Chopra · Dec 16, 2022

Business Context and Principles Involved

After years of consideration, research, and development, once a product line has been developed and approved by the relevant government institutions to be brought to market, every company banks on its sales representatives to make the product a success by generating sufficient revenue and profit, and a pharmaceutical company is no different. When the FDA (Food and Drug Administration) finally approves a drug for sale, the pharmaceutical company looks for an aggressive strategy to maximize its revenues and profits. For this, it requires an optimized GTM (go-to-market) strategy that its sales team can execute to bring in maximum profits.

A sales representative of a pharmaceutical company usually receives a list of targets to call on each day. Since there are far fewer sales representatives than physicians (i.e., customers), a sales representative usually has multiple options when deciding which customers to pick. However, optimized selection of customers/physicians is not the only challenge sales representatives face; there are other essential ingredients in the sales cycle that require equal focus:

  • Customer Segmentation: A sales rep must be provided with a sophisticated segmentation of customers. This is essential to understand things such as the level of effort or persuasion required for a customer, the types of products a particular segment is more likely to buy, etc.
  • Customer Targeting: Often, a company wants to focus its effort on customers who are more likely to buy its products. For this, a targeting exercise is conducted to shortlist those customers who have a higher probability of buying the product. The likelihood measure can depend on various factors such as historic sales, average patient volume, etc.
  • Product Detailing: A sales representative also requires an understanding of how much time to allocate to different product categories in order to optimize revenue and profit. For example, when visiting a customer in segment A, he/she would want to focus more of the sales pitch on products that customers in segment A are more likely to buy.
  • Determining the Optimal Path: To optimize variables such as time and cost of travel (a pharmaceutical sales rep usually travels in a company-provided car, and we all know how expensive gas is now! ☹), and to maximize revenue and profit, the sales rep must be provided with an optimal path for covering the customer list. This includes not only which customers to cover, but also the order in which to cover them.

Overview of the Project Objectives

Both members of this team (Nikhil and Shikhar) have worked in analytics and data science for pharmaceutical companies. We will utilize our market knowledge and the concepts learned in the CFA class to define an optimized sales strategy for pharmaceutical companies. In this project, we take some assumptions and benchmarks from our prior domain knowledge of pharmaceutical marketing.

The list of algorithms and concepts we will leverage to solve the problem is given below:

Customer Segmentation: The K-Means machine learning algorithm will be leveraged to segment the physicians

Customer Targeting: We will leverage the XGBoost classification algorithm to predict the probability of success of a sales call based on supervised data

Product Detailing: We will use Linear Programming to identify how a sales representative should allocate time during a sales call to detail the company's different products

Determining the Optimal Path

a) Using graph theory (the Bellman-Ford algorithm) to identify the optimal path

b) Utilizing the has_path algorithm to check whether a path exists between two nodes in a networkx graph

c) Using the XGBoost model to improve the optimal path (removing those nodes/customers that are unlikely to yield any sales and improving the path accordingly)

Introduction to the Dataset

We will be using data from the NPI Registry, a government-maintained portal that stores the primary addresses of all physicians. In this project, we have used data for physicians in the city of Chicago. The list of physicians and their addresses can later be updated based on the client's requirements.

Given below is a snippet of the website and the kind of data it generates:

Note: The rest of the data related to sales, number of patient visits, Rx, etc. has been taken from an anonymous source. Such datasets are usually available only for a fee; pharmaceutical companies buy them from data vendors such as IQVIA, Flatiron, Komodo Health, Symphony, etc. For our project, we will use dummy data to illustrate how this model can work in the future when adapted to real data.
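For readers who want to pull a similar physician list programmatically, the NPI Registry also exposes a public JSON API (NPPES). Below is a minimal sketch of such a query; the endpoint and parameters follow the public v2.1 API, but the exact response fields should be verified against the registry's documentation, and the helper function itself is ours, not part of the original project:

import requests

def fetch_chicago_physicians(limit=20):
    # Illustrative helper (not from the original notebook): query the public
    # NPPES NPI Registry API for providers in Chicago
    url = "https://npiregistry.cms.hhs.gov/api/"
    params = {"version": "2.1", "city": "Chicago", "state": "IL", "limit": limit}
    results = requests.get(url, params=params).json().get("results", [])
    # Keep only the fields this project uses: NPI number, name, and address
    return [
        {
            "npi": r.get("number"),
            "name": r.get("basic", {}).get("organization_name")
                    or r.get("basic", {}).get("last_name"),
            "address": r.get("addresses", [{}])[0].get("address_1"),
        }
        for r in results
    ]

# Example: fetch_chicago_physicians(5)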


List of columns and datatypes

import pandas as pd
import numpy as np
import time
import requests
import urllib.parse

pd.set_option('display.max_columns', None)

npi_info= pd.read_csv('npi_info.csv')

npi_info.head()

Data dictionary

NPI ID: this is the unique identifier or ID of each physician (also called an HCP)

Name: this is the name of the HCP/physician

Address_Try_1, Address_Try_2, Address_Try_3: since we will use web scraping to derive the latitude and longitude corresponding to each address, we have broken the address into 3 columns to simplify the process (if we can't scrape the lat and long for try 1, try 2 is a more simplified version of the same address, and try 3 is the most simplified)

Phone: this is the contact number of the HCP

Primary Taxonomy: this represents the speciality of the particular physician such as internal medicine, dentist, pathology, etc.

Average Patients per day: the average number of patients visiting a particular physician each day

Product1_Rx, Product2_Rx, Product3_Rx: these columns represent the number of prescriptions a physician generates for the different products pertaining to the pharmaceutical company

KOL: this is a flag that represents if a particular physician is a key opinion leader (1) or not (0); a KOL holds a lot of influence in his/her particular taxonomy

Physician’s Age: represents the age of the physician and helps determine the years of experience

Organization Control: a number representing the level of control a physician has over the organization; the higher the number, the higher the control

# of visits per year: the number of times a pharmaceutical rep visits the physician each year

Target: a flag representing whether a physician would sell our product (1) or not (0)

Project Execution

(1) Web scraping the latitudes and longitudes

The NPI Registry only provides us with the address in text form. We currently do not have the latitudes and longitudes, which we will need to find the optimal path and for segmentation in the later part of the project. Therefore, we web scrape the latitudes and longitudes from an open-source website (OpenStreetMap's Nominatim geocoder).

We use the following code to execute the scraping:

#Writing functions to get latitudes and longitudes from OpenStreetMap's Nominatim geocoder
def get_lat(address):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'
    response = requests.get(url).json()
    if len(response) == 0:
        return "NA"
    else:
        return response[0]["lat"]

def get_long(address):
    url = 'https://nominatim.openstreetmap.org/search/' + urllib.parse.quote(address) + '?format=json'
    response = requests.get(url).json()
    if len(response) == 0:
        return "NA"
    else:
        return response[0]["lon"]

#Loop for lat: fall back to the progressively simpler versions of the address
npi_info['Lat'] = np.nan
for i in range(npi_info.shape[0]):
    lat_temp = get_lat(npi_info['Address_Try_1'].iloc[i])
    if lat_temp == "NA":
        lat_temp = get_lat(npi_info['Address_Try_2'].iloc[i])
    if lat_temp == "NA":
        lat_temp = get_lat(npi_info['Address_Try_3'].iloc[i])
    npi_info.loc[npi_info.index[i], 'Lat'] = lat_temp
    time.sleep(1)  #Nominatim's usage policy allows at most ~1 request per second

#Loop for long
npi_info['Long'] = np.nan
for i in range(npi_info.shape[0]):
    long_temp = get_long(npi_info['Address_Try_1'].iloc[i])
    if long_temp == "NA":
        long_temp = get_long(npi_info['Address_Try_2'].iloc[i])
    if long_temp == "NA":
        long_temp = get_long(npi_info['Address_Try_3'].iloc[i])
    npi_info.loc[npi_info.index[i], 'Long'] = long_temp
    time.sleep(1)

#Removing those customers for which we could not find the latitudes and longitudes;
#in real life, the user will need to manually look up the coordinates the scraper could not find
npi_info = npi_info[npi_info["Lat"] != "NA"]
npi_info[["Lat", "Long"]] = npi_info[["Lat", "Long"]].apply(np.longdouble)

#Pre-creating the pairwise distance columns that are filled in the next step
for i in range(npi_info.shape[0]):
    npi_info["NPI_dist_" + str(i)] = np.nan

(2) Creating a distance matrix to compute the distance between each customer

For the graph theory application later, we will need a distance matrix. Each address/customer will serve as a node in the graph, and we will need the distances for the edges of the graph.

We leverage the geopy library and import distance to compute the distance between two pairs of latitudes and longitudes.

from geopy import distance

def dist(Olat, Olon, Dlat, Dlon):
    start = (Olat, Olon)
    end = (Dlat, Dlon)
    #Great-circle ("as the crow flies") distance in miles
    return distance.great_circle(start, end).miles

#Filling the distance matrix: column NPI_dist_x holds the distance from customer x to every other customer
for x in range(npi_info.shape[0]):
    for i in range(npi_info.shape[0]):
        slat = npi_info["Lat"].iloc[x]
        slong = npi_info["Long"].iloc[x]
        elat = npi_info["Lat"].iloc[i]
        elong = npi_info["Long"].iloc[i]
        npi_info.loc[npi_info.index[i], "NPI_dist_" + str(x)] = dist(slat, slong, elat, elong)

(3) Predicting the probability of success for a visit using XGBoost Classification

We can conduct a market research assessment on a set of 100–200 physicians to understand the effectiveness of a sales call. Based on this effectiveness, a target value of 1 or 0 gives us a supervised label for whether a sales rep's visit would yield prescriptions of our client's product in the future. We therefore use this supervised target from market research to train a prediction model that takes a customer's data and predicts whether a sales call will lead to prescriptions of the product.

from sklearn.preprocessing import MinMaxScaler

# Define a min-max scaler and transform the features to [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(npi_info[['Avg Patients / day','Total Rx','Product1_Rx','Product2_Rx','Product3_Rx', "KOL", "Physician's Age", "Org_Control", "# of Visits/year"]])
print(scaled)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(scaled, npi_info['Target'], test_size=0.33, random_state=42)

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
from sklearn.metrics import confusion_matrix
import seaborn as sns

y_pred = model.predict(X_test)
sns.heatmap(confusion_matrix(y_test, y_pred), annot_kws={
    'fontsize': 16,
    'fontweight': 'bold',
    'fontfamily': 'serif'
}, annot=True)

from sklearn.metrics import f1_score

f1_score(y_test, y_pred)

Model parameters and assessment

  • We use a test size of 33%; the remaining data is used to train the model
  • The following confusion matrix is obtained
  • We obtain an accuracy of ~88.37% from the XGBoost classifier
  • Based on this XGBoost model, we can predict with confidence whether a sales call will lead to further prescriptions. Going forward, the client/pharma company can use this model for customers for whom the target variable (success/not) is unavailable, and feed the prediction into the optimal path analysis to be deliberate about which customers to visit and which not to
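Since the business goal is a probability of success rather than a hard 0/1 label, one natural extension (not in the original notebook) is to rank customers by the classifier's predicted probability. A minimal sketch, reusing the model and scaled objects trained above; the column name 'NPI ID' follows the data dictionary and may differ in the actual CSV:

# predict_proba returns [P(no success), P(success)] per row; column 1 is "success"
success_prob = model.predict_proba(scaled)[:, 1]

npi_ranked = npi_info.copy()
npi_ranked['Success_Prob'] = success_prob

# Highest-propensity customers first; a rep could cap the daily call list here
npi_ranked.sort_values('Success_Prob', ascending=False)[['NPI ID', 'Success_Prob']].head(10)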

(4) Using K-Means to segment the customers into different segments

Here, we segment the HCPs into different clusters to identify their traits. We have various features of these HCPs available, such as their total Rx (prescriptions in a month), prescriptions for our products 1, 2, 3, etc. We leverage the K-Means clustering algorithm to identify the optimal number of clusters and bucket the HCPs accordingly.

The elbow method suggests 5 clusters as the optimal choice.

#Simulating a dummy 'Sales' column for illustration
npi_info['Sales'] = np.nan
for i in range(npi_info.shape[0]):
    npi_info.loc[npi_info.index[i], 'Sales'] = np.random.randint(100, 10000)
npi_info.head()

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

scaler = MinMaxScaler()
X = scaler.fit_transform(npi_info[['Avg Patients / day','Total Rx','Product1_Rx','Product2_Rx', "KOL", "Physician's Age", "Org_Control", "# of Visits/year"]])

distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 30)

for k in K:
    # Building and fitting the model
    kmeanModel = KMeans(n_clusters=k, random_state=0)
    kmeanModel.fit(X)

    # Distortion: mean distance of each point to its nearest cluster center
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortions[-1]
    mapping2[k] = kmeanModel.inertia_

fig = plt.figure(figsize=(15,8))


plt.plot(K, distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()

km=KMeans(n_clusters=5,random_state=0)
y_predicted = km.fit_predict(X)
y_predicted

from sklearn import decomposition

# Reduce the scaled features to 2 principal components for plotting
pca = decomposition.PCA(n_components=2)

# Fit the PCA and transform the data
X_std_pca = pca.fit_transform(X)

PCA_3_Axis = pd.DataFrame(X_std_pca)
PCA_3_Axis.head()

fig = plt.figure(figsize=(15,10))

plt.scatter(PCA_3_Axis[0], PCA_3_Axis[1], c = y_predicted, s=150)

plt.xlabel('PC1')
plt.ylabel('PC2')

plt.show()

Our clusters look like this:

summaries_1=npi_info.copy()
summaries_1['segment']=y_predicted
summaries_1.groupby(['segment']).mean().iloc[:,1:10]

(5) Using Linear Programming to identify the optimal time to spend on the product mix of the company during a sales call

The pharmaceutical company has three products, and each product generates a different revenue per call. We assume the revenue received per hour of detailing and use Linear Programming to find the optimal amount of time to spend on each product. This judgement is usually made by the representatives themselves, but we will use linear programming to maximize revenue and find the split of time to spend on each product during a call.

The formulation below was used to compute the solution:

from pulp import *

LP = LpProblem(name="Max_Efficiency", sense=LpMaximize)

# Decision variables: fraction of call time spent detailing each product
A = LpVariable("A", 0)
B = LpVariable("B", 0)
C = LpVariable("C", 0)

# Assumed revenue per hour of detailing each product
revenue_per_hr_A = 7000
revenue_per_hr_B = 6500
revenue_per_hr_C = 6000

# Constraints
LP += (A + B + C) <= 1  # the time shares must fit within one call

# Note: a revenue floor such as
#   LP += (revenue_per_hr_A*A + revenue_per_hr_B*B + revenue_per_hr_C*C) >= 10000
# would make this problem infeasible (the maximum attainable is 6,700), so we omit it

LP += A <= 0.5  # caps on the share of time per product
LP += B <= 0.4
LP += C <= 0.4

# Objective function: maximize expected revenue per call
LP += (revenue_per_hr_A * A + revenue_per_hr_B * B + revenue_per_hr_C * C)

# Solve
LP.solve()
print("Status:", LpStatus[LP.status])

for var in LP.variables():
    print("% Spend of time per call: ", var.name, " =", var.varValue)

(6) Using graph theory to identify the most optimal path — trade-off between the cost of a sales call and the ROI of the call

Here we leverage two graph theory algorithms: (1) the Bellman-Ford algorithm, to calculate the optimal path with a trade-off between rep cost and revenue, and (2) the has_path check, to verify whether a path already exists between two nodes.

Let’s visualize the customers as nodes in a graph diagram, using weights as the rep cost and edges as the distance between the addresses.

import folium

# Base map centered on Chicago
n = folium.Map(location=[41.9, -87.69], tiles="OpenStreetMap", max_zoom=14, min_zoom=10)

# Add markers one by one on the map
for i in range(npi_info.shape[0]):
    folium.Marker(
        location=[npi_info['Lat'].iloc[i], npi_info['Long'].iloc[i]],
        popup=i,
        icon=folium.DivIcon(html=f"""<div style="font-family: courier new; font-size:large; color: blue">{'•'}</div>""")
    ).add_to(n)

# Show the map again
n
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np

# Columns 19 onward hold the NPI_dist_* pairwise distances computed earlier
distance_matrix = np.array(npi_info.iloc[:, 19:-1])
distance_matrix.shape
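Before building the graph, a quick sanity check is worthwhile (illustrative only; it assumes the iloc slice above captured exactly the NPI_dist_* columns):

# The pairwise distance matrix should be square, symmetric, and zero on the diagonal
D = distance_matrix.astype(float)
assert D.shape[0] == D.shape[1]
assert np.allclose(D, D.T)
assert np.allclose(np.diag(D), 0)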

Visualizing the customers as nodes in a graph diagram, using weights as the rep cost and edges as the distance between the addresses

# Build the graph from the distance matrix
# (note: from_numpy_matrix was removed in networkx 3.0; use from_numpy_array there)
G = nx.from_numpy_matrix(distance_matrix)

pos = nx.spring_layout(G)
fig = plt.figure(1, figsize=(200, 200), dpi=10)
nx.draw(G, pos, with_labels=True, node_size=8000, node_color='r')

Assumption: the rep starts his/her day from customer 0 / Node 0, for simplicity.

We define the following constants and helper functions to aid our calculation of the shortest path.

#defining constants
nodes = npi_info.shape[0]
wage = 200  #dollars/hour
avg_speed = 40  #mph
fuel_cost = 5  #dollars per gallon
distance_scale = 10  #multiplier for probabilities to get distance between nodes
vehicle_mileage = 25  #mpg
avg_workday = 8  #hours
rep_node = 0  #start node of the rep
target_nodes = 5  #daily number of targets
profit = 0
tolerance = 0.8

#defining helper functions
def calc_rep_cost(distance, time_in_session):
    #total cost of a visit: travel plus session time at the rep's wage, plus fuel
    total_time = distance/avg_speed + time_in_session
    net_spends = total_time*wage + distance/vehicle_mileage * fuel_cost
    return net_spends

def avg_profit(roi, net_spends):
    return roi - net_spends

def get_weight(st, ed, edge=None):
    #edge weight used by Bellman-Ford: the rep's cost of traveling st -> ed
    rep_cost = calc_rep_cost(distance_matrix[st][ed], 1)
    return rep_cost

#returns tuple of type (value, node)
def get_max_node(d):
    return max(zip(d.values(), d.keys()))

#defining the roi of each HCP
roi = list(npi_info['Product1_Rx']*100)
roi_main = roi[:]

#Success visit function which predicts the success of a visit based on the XGBoost classifier
def success_visit(node_id):
    return model.predict(scaled)[node_id]  #predicting the success using the trained XGBoost model

The path generator works by computing the shortest path between the rep's node and every other node. Each node's ROI is then reduced by the cost of reaching it, giving an actual ROI net of the rep's cost, and the node with the maximum net ROI is chosen as the next visit. The XGBoost classification model is then used to predict whether the visit will be successful; if yes, the node is traveled to, else a new node is chosen.

Bellman-Ford (nx.single_source_bellman_ford): this shortest-path algorithm is used to generate the paths between the rep's node and all the other nodes.

has_path (nx.has_path): this function is used to check whether a path exists between two nodes in a networkx graph.
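For intuition, here is a tiny self-contained example of both calls on a toy 4-node weighted graph (the weights are arbitrary):

import networkx as nx

# Toy weighted graph: 4 nodes, arbitrary edge weights
H = nx.Graph()
H.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 1.5), (0, 3, 4.0)])

# Shortest-path costs and paths from node 0 to every reachable node
costs, paths = nx.single_source_bellman_ford(H, source=0)
print(costs)  # {0: 0, 1: 2.0, 2: 3.5, 3: 4.0}
print(paths)  # {0: [0], 1: [0, 1], 2: [0, 1, 2], 3: [0, 3]}

# has_path: is node 2 reachable from node 0?
print(nx.has_path(H, 0, 2))  # True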

#Function to generate the path of a rep
def generate_path(G, roi, rep_node=0, target_nodes=5, use_prediction=True):
    targets = {}
    roi[rep_node] = 0
    visited = set()
    visited.add(0)
    profit = 0
    while len(targets) < target_nodes:
        #shortest-path cost from the rep's current node to every other node
        weights, path = nx.single_source_bellman_ford(G, rep_node, None, get_weight)
        max_roi = {}

        for i in range(nodes):
            if not nx.has_path(G, rep_node, i):
                continue
            if i in visited:
                continue
            max_roi[i] = roi[i] - weights[i]  #ROI net of the travel cost

        max_profit, max_node = get_max_node(max_roi)
        max_fallback_profit, max_fallback_node = max_profit, max_node

        while len(max_roi) != 0:
            if success_visit(max_node) or not use_prediction:
                profit += max_profit
                targets[max_node] = max_profit
                roi[max_node] = 0
                rep_node = max_node
                visited.add(max_node)
                break
            else:
                #predicted to fail: drop this node and try the next best
                del max_roi[max_node]
                if len(max_roi) != 0:
                    max_profit, max_node = get_max_node(max_roi)

        if len(max_roi) == 0:
            #all candidates predicted to fail: fall back to the original best node
            profit += max_fallback_profit
            targets[max_fallback_node] = max_fallback_profit
            roi[max_fallback_node] = 0
            rep_node = max_fallback_node

    return (targets, profit)

print(generate_path(G, roi[:], use_prediction=False))
print(generate_path(G, roi[:], use_prediction=True))
#Generating the path of a rep without using prediction
targets, profit = generate_path(G, roi[:], use_prediction = False)

path = list(targets.keys())
path.insert(0, 0)
path_edges = list(zip(path,path[1:]))

fig = plt.figure(1, figsize = (200, 200), dpi = 10)
nx.draw_networkx(G, pos, with_labels = True, node_size = 6500, node_color = 'r')
nx.draw_networkx(G.subgraph(0), pos, node_size = 15000, node_color = 'g')
nx.draw_networkx_edges(G,pos,edgelist=path_edges,edge_color='r',width=20)
plt.axis('equal')
plt.show()

print("Optimal path without using XGBoost predictor --> ", list(generate_path(G, roi[:], use_prediction = False)[0].keys()))

As before, the generator computes shortest paths and picks the highest net-ROI node, but now, with use_prediction = True, the XGBoost predictor is consulted first: nodes predicted not to convert are skipped and the next-best candidate is chosen instead.

#Generating the path of a rep using prediction
targets, profit = generate_path(G, roi[:], use_prediction = True)

path = list(targets.keys())
path.insert(0, 0)
path_edges = list(zip(path,path[1:]))

fig = plt.figure(1, figsize = (200, 200), dpi = 10)
nx.draw_networkx(G, pos, with_labels = True, node_size = 6500, node_color = 'r')
nx.draw_networkx(G.subgraph(0), pos, node_size = 15000, node_color = 'g')
nx.draw_networkx_edges(G,pos,edgelist=path_edges,edge_color='r',width=20)
plt.axis('equal')
plt.show()
print("Optimal path using XGBoost predictor (Improved) --> ", list(generate_path(G, roi[:], use_prediction = True)[0].keys()))

Business Insights

1. Segmentation: Segments 1 and 3 mainly consist of KOLs (Key Opinion Leaders), so they would be primary targets for the sales reps. Further, they have high patient counts per day, so they are expected to be more profitable.

2. Accuracy of prediction: The prediction model predicts the probability of success of a sales call with 81% accuracy. Based on this, sales representatives can decide whether or not to visit a physician.

3. Linear Programming: Sales representatives should spend around 50% of their call time explaining product A, 40% explaining product B, and the remaining 10% explaining product C.

4. Graph: Based on the current data, for the first day of sales visits, if the representative decides to visit 5 physicians, the best 5 physicians and the most optimal path to cover them would be HCP # 13 → 57 → 31 → 161 → 24.

5. Optimizing the graph using XGBoost: If we optimize this best-5 list and its path using the predictive XGBoost function, the new rendered path would be 57 → 24 → 83 → 7 → 65.

Conclusions:

1. We used web scraping to extract the latitudes and longitudes ourselves for the list of physician/customer addresses obtained from the open-source National Provider Identifier (NPI) data.

2. We can help the sales force of a pharmaceutical company segment their target customers into 5 segments using K-Means.

3. The proposed XGBoost model predicts the probability of success of a visit for future/upcoming data. This prediction is then used in the graph theory step to find the optimal path.

4. We can provide guidance, via Linear Programming, on the effort split across products A, B and C that the sales representative should make on a sales call.

5. We used the Bellman-Ford algorithm to identify the shortest paths a sales representative can take in a day, in order to maximize revenue and minimize travel cost and time.

6. We used the has_path function from the networkx library to check whether a path exists between two nodes in a graph.
