Designing a purchase propensity model for your website

Abhranil Pal · Published in The Deep Hub · Jul 29, 2024 · 10 min read

There are more than 5.44 billion internet users worldwide as of April 2024, amounting to 67.1% of the global population, according to DataReportal figures published on Statista. With such a huge share of our customers interacting with our ecommerce platform, it was only a matter of time before we started using the immense wealth of data generated there to up our prediction game and make our organization some money.

(Image credit: Daksera)

Introduction:

I work as an ecommerce and marketing analytics manager at an Indian consumer durables company. We have a presence offline as well as online. The idea for this exercise started one day over coffee, when the head of data science, who is also my boss, asked me: "We have so much browse data for a user, which we don't have in our stores. But what do we actually do with this data?"

Problem statements:

  1. We have only a 1% conversion rate on our website.
  2. With so many anonymous (non-logged-in) users visiting the website, there is no anchoring mechanism (like a customer ID) to build a 360-degree view with which we can optimize and personalize the user journey. In our case only 12.5% of sessions lead to logins, so we have only a fraction of customer data that we can leverage for lifecycle management initiatives.

Background analysis and research (BAR):

  1. In a parallel study we found that, for our website, the weightage of purchase drivers is as follows:

Driver weightage

This tells us that, of all the factors a customer considers while buying, price and product information (including reviews and ratings) weigh much higher than delivery options. Can we leverage these two 'aha factors' to solve problems 1 and 2?

2. In a k-means clustering exercise we had done earlier, we found one cluster (C1) which contains 3% of our population and outranks the conversion rate of every other cluster by 4%. Further, there is one other cluster (C2) which is closest to C1 in the feature space (we will come to the features in a bit); however, its conversion rate is 1%. Two features that make C2 distinct (apart from conversion rate) are:

a. Time spent on the site is 8–10 mins (compared to 4–6 mins for C1).
b. The number of products seen and searches performed are marginally higher (4.2 vs 3.5 and 2.8 vs 2.5). Also, these visitors mostly come from organic channels.

Intuitively we can say this cluster represents the 'Consideration' class. However, we will let our model establish this later; for now it is a hypothesis.

3. 36% of sessions in which a conversion happens are the first session for a visitor ID, which means that for the other 64% of such sessions we will have data from a previous session.

Objective

Can we increase the conversion rate of this 'Consideration cluster' using the incentives of attractive pricing and information accessibility? Given that this cluster consists of around 25% of our monthly user base of 10 lakh, if we can increase conversions by even 1–1.5% we would get around 500–1000 additional orders. Then, given our AOV is around 5000 INR, conservatively we are looking at incremental revenue of 25–50 lakh in a month. Considering COGS and other charges, we should be able to generate a bottom line of 6–8 lakh (7–8k USD).
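(For readers outside India: 1 lakh = 100,000 INR.) The arithmetic above is simple enough to sanity-check in a few lines; the figures below are the rough numbers quoted in this section, not exact internal data:

```python
# Back-of-the-envelope projection using the rough figures quoted above
extra_orders = (500, 1000)   # assumed incremental orders/month from the
                             # 'Consideration' cluster (~25% of a 10 lakh base)
aov_inr = 5_000              # conservative average order value

rev_low, rev_high = (n * aov_inr for n in extra_orders)
print(f"Incremental revenue: {rev_low/1e5:.0f}-{rev_high/1e5:.0f} lakh INR/month")
# ~25-50 lakh revenue; after COGS and other charges this nets out to the
# 6-8 lakh (7-8k USD) monthly bottom line assumed above.
```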

Functional architecture:

Once a visitor has spent 6 minutes on the site (referenced from BAR 2.a; we experimented with this number, testing values between 4 and 6 minutes), our model will ingest all the current and previous session data (if a previous session exists) and classify whether the session has the potential to convert. If yes:

A login form will pop up to nudge the customer to log in, so that we get the customer ID, which we will use for retargeting if the customer does not purchase in the session. If the customer logs in, they immediately get a cashback of 2% or 200 INR, whichever is less, plus a QR code to visit a nearby store and get a personalized consultation on products.

Functional architecture diagram
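A minimal sketch of this decision flow is below. The function and attribute names (build_features, session.seconds_on_site, session.cart_value_inr) and the 0.6 score cutoff are hypothetical placeholders, not our production code:

```python
DWELL_TRIGGER_SECS = 6 * 60   # trigger point from BAR 2.a (we tested 4-6 mins)
SCORE_CUTOFF = 0.6            # hypothetical probability cutoff

def maybe_nudge(session, previous_session, model):
    """Score a live session once the dwell-time trigger fires."""
    if session.seconds_on_site < DWELL_TRIGGER_SECS:
        return None  # too early to score
    # build_features is a hypothetical featurizer combining current and
    # previous-session signals (searches, products viewed, channel, ...)
    features = build_features(session, previous_session)
    p_convert = model.predict_proba([features])[0][1]
    if p_convert >= SCORE_CUTOFF:
        # Nudge: login pop-up offering cashback (min of 2% or 200 INR)
        # plus a QR code for an in-store consultation
        return {"action": "show_login_popup",
                "cashback_inr": min(0.02 * session.cart_value_inr, 200)}
    return None
```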

Feature exploration/discovery:

For the experiment that I performed, I broke down the features into three categories:

  1. Features already associated with the user when they arrive at the website.
  2. Features that get associated as they navigate the website.
  3. Features associated with what they did in the previous (n-1) session, if that exists (ref. pt. 3 in BAR).

I ended up exploring these features:

Exploratory list of features

To be noted: these are all features we can mine from session-based IDs (like a marketing or web analytics cloud ID). In other words, we are not going to associate our customer IDs at all, thus making this analysis independent of the need for customers to log in.

Also, at this stage we had to brainstorm exhaustively with the marketing and business teams to understand, intuitively, what could be some powerful features associated with a visitor journey that can predict purchase intent.

Feature engineering

Once we had a laundry list of features (the longer the better), we started narrowing it down to the 8–10 features at most on which we would ultimately build our model. Typically we cannot have a model with 15–20 input variables, as this will lead to sparsity and curse-of-dimensionality issues. Also, we decided that two years of data would make a lot of sense in terms of giving a holistic view of the customer journey across scenarios. We performed the following feature engineering and EDA tasks at this point.

a. Univariate analysis: This helped us understand how each feature is described: medians, quartiles, distributions, etc. It also gave a clear picture of outliers in our data (which had to be removed), biases that may exist (and whether those need to be balanced for better training), and the type of binning or encoding we would want if we needed to convert a categorical variable to a continuous one (in case we planned to run a model that works better with continuous variables, like SVM or logistic regression). One thing we found helpful here was having a visual cue for all of this information, so we plotted a lot of histograms, box plots, and scatter plots, performed Weight of Evidence analysis, etc.
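As an illustration of the Weight of Evidence piece, a WoE table for one categorical feature takes only a few lines; the sessions frame and the traffic_channel/converted columns are hypothetical names:

```python
import numpy as np
import pandas as pd

def weight_of_evidence(df, feature, target="converted"):
    """WoE per category: log(% of events / % of non-events)."""
    grp = df.groupby(feature)[target].agg(events="sum", total="count")
    grp["non_events"] = grp["total"] - grp["events"]
    eps = 0.5  # smoothing so sparse categories don't produce log(0)
    pct_events = (grp["events"] + eps) / (grp["events"].sum() + eps)
    pct_non_events = (grp["non_events"] + eps) / (grp["non_events"].sum() + eps)
    grp["woe"] = np.log(pct_events / pct_non_events)
    return grp.sort_values("woe")

print(weight_of_evidence(sessions, "traffic_channel"))
```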

b. Bivariate analysis: We tried to understand the relationships of our features in pairs. We got insights about the multicollinearity that may exist between these pairs, which can lead to overfitting and inflated standard errors. We did a Variance Inflation Factor (VIF) analysis to check multicollinearity as well. We also performed chi-squared tests of independence to understand the association between the features and the output variable. To be noted here: correlation does not imply causation.
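Both checks are standard one-liners with statsmodels and scipy; a sketch, assuming X is the numeric feature frame and sessions carries the raw categorical columns:

```python
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per feature; values above ~5-10 are the usual multicollinearity flags
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
).sort_values(ascending=False)
print(vif)

# Chi-squared test of independence between a categorical feature and conversion
contingency = pd.crosstab(sessions["traffic_channel"], sessions["converted"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2={chi2:.1f}, p={p_value:.4f}")
```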

c. Multivariate analysis: We performed Principal Component Analysis to yet again gauge feature importances and visualize the distribution of data in a lower-dimensional space. We also performed DBSCAN clustering (we had already done clustering on sample data). This was to validate whether our clusters changed a lot when computed on the last two years of data, and whether we could establish some newer clusters. While some new clusters did emerge, the theme of the data remained the same, with the 'Consideration class' cluster still holding true to its centroid values.
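In scikit-learn terms, that multivariate pass looks roughly like this (eps and min_samples are placeholders; in practice we tuned them rather than using these values):

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)        # DBSCAN is scale-sensitive
X_2d = PCA(n_components=2).fit_transform(X_scaled)  # low-dim view for plotting

labels = DBSCAN(eps=0.8, min_samples=50).fit_predict(X_scaled)
print(pd.Series(labels).value_counts())  # -1 marks noise points
# Scatter-plotting X_2d colored by `labels` gives the lower-dimension view
# used to eyeball whether the 'Consideration class' centroid held up.
```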

With all of this we arrived at a set of transformed, cleaned features on which we would train our model. It's also important to note that at this point we had to explain our modelled features to our business users. We had a 'framework of model walkthrough', which was a way to yet again validate whether our features were synchronized with business intuitions and to keep the stakeholders vested. This was a brainstorming call with very constructive debates. Some of the features we had to revisit as per feedback, while for most features we were able to get buy-in, either because they made sense or because the data was backing us up.
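On the narrowing itself, one quick filter (not necessarily the exact method we used) is to rank candidate features by mutual information with the conversion label and keep the top 8–10:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# X: candidate session features, y: converted (0/1); names are illustrative
mi = mutual_info_classif(X, y, discrete_features="auto", random_state=42)
ranking = pd.Series(mi, index=X.columns).sort_values(ascending=False)
shortlist = ranking.head(10).index.tolist()  # the 8-10 features to model on
print(ranking.head(10))
```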

Building the model:

This was largely an iterative exercise and honestly had a bit of overlap with the feature engineering steps. Some of the more visually intuitive models gave us hints on how we could refine some of the features. We invoked most of the existing models and ran them on our data, performed hyperparameter tuning, checked model scores like precision, recall, F1-score, and ROC-AUC, and repeated these steps with different models until we reached a winning score. Since our output variable (conversion) was imbalanced (with 0s outnumbering 1s), we wanted to focus on ROC-AUC as our north-star KPI.
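In spirit, the loop looked something like the sketch below (heavily compressed; the real exercise swept a much larger hyperparameter space per model):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "xgboost": XGBClassifier(n_estimators=300, learning_rate=0.1,
                             random_state=42),
}
for name, clf in candidates.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean ROC-AUC = {auc:.3f}")
```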

After performing all of this we had an XGBoost-based model which gave a decent performance, with a 0.78 ROC-AUC score. It is important to note here that we had to take business POVs on the false positive/false negative trade-off (whether we want to accept more false positives in order to detect more true positives).
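That trade-off is ultimately a choice of probability cutoff. A sketch of how it can be framed for a business audience, sweeping the threshold and printing precision and recall side by side:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

probs = model.predict_proba(X_test)[:, 1]
for t in np.arange(0.3, 0.8, 0.1):
    preds = (probs >= t).astype(int)
    print(f"threshold={t:.1f}  "
          f"precision={precision_score(y_test, preds):.2f}  "
          f"recall={recall_score(y_test, preds):.2f}")
# Lowering the threshold catches more true positives at the cost of more
# false positives, i.e. more pop-ups shown to visitors who won't buy.
```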

Solution architecture:

We had to showcase our model to the business stakeholders and our CXOs. This was a call wherein we presented to the wider audience all the insights we had derived while working with the model, how we wanted this model to actually solve a business problem, and most importantly what we can and CANNOT do with this model. This was then taken up as an item in the product backlog, and the objective was to deploy our pickle file on Azure services, which would be where we hosted the model. We were to use the Adobe Analytics real-time API to stream our data into Azure endpoints.

We had to work with our SDEs, ML engineers, and data engineers to design the data APIs, design the data tables and schemas, perform the data transformations, create stored procedures, and handle other prerequisites so the model could be deployed as a cloud-hosted web service. The three candidate platforms were AWS, GCP, and Azure. Since we were already using Azure as our cloud partner, our solution architecture looked roughly like the diagram below. You may have to do a cost-benefit analysis of all three platforms, keeping in mind that you are dealing with live streaming data which could run into TBs. In our case, we were streaming around 1 GB of data daily.

Solution architecture
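For context, an Azure ML web service wraps the pickled model in a scoring script with an init/run pair, roughly like the sketch below (simplified; the file name propensity.pkl and the request schema are illustrative, and the real script also handled validation and logging):

```python
import json
import os
import pickle

import pandas as pd

model = None

def init():
    """Called once when the web service starts: load the pickled model."""
    global model
    # AZUREML_MODEL_DIR points at the registered model's folder on the endpoint
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "propensity.pkl")
    with open(model_path, "rb") as f:
        model = pickle.load(f)

def run(raw_data):
    """Called per request: score session features streamed in via the API."""
    sessions = pd.DataFrame(json.loads(raw_data)["sessions"])
    probs = model.predict_proba(sessions)[:, 1]
    return {"purchase_propensity": probs.tolist()}
```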

Cost benefit analysis

Once we had the solution architecture in place, we had to do a cost-benefit analysis. We took incremental revenue of 25 lakh a month (yielding a bottom line of around 6–8 lakh, i.e. 7–8k USD) as our target/expected value realisation. One point to consider here was how to attribute a conversion to the model. In our case we agreed that if we had flagged a session as a probable purchase and the visitor had interacted with the resultant nudge (login) and then purchased (either logged in and purchased in the same session, or exited and then purchased via the retargeted email), we would attribute it to our model. On the cost side, we broke down the total cost of the model into infrastructure cost (all the cloud services we were using) and resource (manpower) cost. While most of our infra cost was more or less recurring, resourcing cost was a mix of fixed (development) and recurring (maintenance and monitoring) cost. Sharing again a very rough cost estimate calculation template, assuming you deploy the model on Azure.

Azure services cost

Accordingly, you will get a report of your cost-benefit analysis. Sharing a rough template of the cost and profit generated. As we can see, we are projected to break even on the project in around 9 months.
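The break-even math behind that projection is straightforward; a template with purely illustrative numbers (in INR lakhs), not our actual contract rates:

```python
# Illustrative break-even template; all figures in INR lakhs and hypothetical
fixed_dev_cost = 40.0        # one-time development (resourcing) cost
monthly_infra = 2.0          # recurring cloud services cost
monthly_maintenance = 1.5    # recurring monitoring/maintenance resourcing
monthly_profit = 8.0         # attributed incremental bottom line

net_per_month = monthly_profit - monthly_infra - monthly_maintenance
print(f"Break-even in ~{fixed_dev_cost / net_per_month:.0f} months")  # ~9 here
```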

Deployment and Monitoring

Once we had designed the solution architecture and gotten approval on the CBA, we deployed the model, first to dev, then pre-prod, and then production. In parallel, the data pipelines between our Adobe service and the storage cloud were deployed. At this stage we were mostly helping out by writing the test cases and acceptance criteria for the UX and data teams. There were a few issues in how our model behaved on real-time data. For instance, there were data latency issues, and the login pop-up sometimes failed to show because the visitor had already clicked on something else. These required some software patching on our backend services. Otherwise, the service worked as expected.

For monitoring we leveraged Azure's dataset monitors to detect data drift, and metrics tracking to detect changes in F1 score, recall, etc. So far, we haven't had any data drift as such.
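Alongside the built-in monitors, a simple population stability index (PSI) check is easy to run on any continuous feature; the 0.1/0.25 cutoffs below are common rules of thumb rather than Azure settings, and time_on_site is an illustrative column name:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between training and live distributions."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch out-of-range live values
    e_pct = np.histogram(expected, edges)[0] / len(expected)
    a_pct = np.histogram(actual, edges)[0] / len(actual)
    e_pct, a_pct = np.clip(e_pct, 1e-6, None), np.clip(a_pct, 1e-6, None)
    return np.sum((a_pct - e_pct) * np.log(a_pct / e_pct))

drift = psi(train["time_on_site"], live["time_on_site"])
# Rule of thumb: < 0.1 stable, 0.1-0.25 worth investigating, > 0.25 drifted
```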

Apart from this, we also track how many potential conversions our model predicted, how many of those visitors logged in, and how many actually converted. We tag the customer IDs generated from the cases where prediction and login both happen and track their journey for the next 14 days. If they convert, either in-store or online, we attribute the revenue to the model.
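The 14-day attribution join itself is simple; a pandas sketch assuming predictions and orders frames keyed on the customer ID captured at login (table and column names are hypothetical):

```python
import pandas as pd

# predictions: customer_id, predicted_at; orders: customer_id, ordered_at, revenue
joined = predictions.merge(orders, on="customer_id")
lag = joined["ordered_at"] - joined["predicted_at"]
in_window = joined[(lag >= pd.Timedelta(0)) & (lag <= pd.Timedelta(days=14))]
model_revenue = in_window["revenue"].sum()  # revenue credited to the model
print(f"Attributed revenue (14-day window): {model_revenue}")
```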

Conclusion

So yeah, that was my experience as an analytics manager deploying a purchase propensity model. This whole exercise took us about 6 months to productionize (including all the research), and we broke even on this model in about 8 months. The model is currently live on our website and generates net profit in the range of 4–7 lakh (6–8k USD) per month. Thanks for reading. Please share a clap if you liked it.
