Stories by Abhinav Sharma on Medium

Classification machine learning using Adobe Analytics data

Abhinav Sharma — Thu, 17 Aug 2023 12:09:02 GMT

Applying machine learning for advanced analysis on clickstream data.

Problem Statement

We take an airline as an example. But this use case is relevant to any e-commerce site where there is a classification problem.

An airline retail site wants to optimize the booking journeys in real-time for customers based on their travel intent. They offer hotel, cabs, and experiences (ancillary) cross-sell during flight sales and want to increase this cross-sell conversion. If they have user propensities to ancillary purchases, they can optimize the flight summary page or even the confirmation page to include relevant offers around cross-sell.

Tool Kit

The airline has clickstream tracking via Adobe Analytics and content management via Adobe Target. There is a CRM CDP, but it is not connected to online tracking via a unified ID. This limits our visibility into historical customer booking behavior.
There is a robust data layer that captures all values ultimately passed into Adobe Analytics.

Solution Approach

We will try to solve this problem in 4 steps:

We will build a classification model on the booking details of all customers. We get booking details from the Adobe Analytics data warehouse.
Ideally, this model output should be directly actionable but since that’s not possible here (lack of centralized customer data warehouse), we will instead analyze feature importance from the model above using SHAP values.
We will next build a cluster model (unsupervised learning) from the SHAP values to determine feature values critical in purchase propensity.
These values can then be utilized directly in Adobe Target to build customized experiences for the cohort with higher propensity.

This solution is relevant to any e-commerce use case where there is online data collection (form, widgets, etc.) leading to customer transactions.

Step 1: Classification using XGBoost (or any other classification model)

We run a classification model to observe whether the model scores better than the baseline evaluation metric. This ensures that the features are relevant for prediction and that there is a trend hidden in their interactions.
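As a minimal sketch of the baseline check, assuming a binary target where 1 marks an ancillary purchase, the majority-class accuracy gives the number a model has to beat. The labels and predictions below are made-up illustration data, not output from the airline dataset, and accuracy is only one possible evaluation metric.

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical labels (1 = ancillary purchase) and model predictions.
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]

baseline = majority_baseline_accuracy(y_true)
model_acc = accuracy(y_true, y_pred)
beats_baseline = model_acc > baseline
```

If the model does not clear the baseline, the features probably carry little signal for this target and the later steps are unlikely to be informative.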

Any classification model would work but you will generally find that an ensemble model would be most efficient.

An ensemble model like XGBoost also returns feature importance scores.

Step 2: Feature importance using SHAP values

SHAP feature importance is an alternative to permutation feature importance. Permutation feature importance is based on the decrease in model performance when a feature's values are randomly shuffled. SHAP is based on the contribution of a feature value to the prediction in different coalitions.

Because SHAP values measure each feature's impact on individual predictions rather than on overall model performance, they are well suited to assessing features that work in coalition, which is how features behave in real-life scenarios.
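To make "contribution in different coalitions" concrete, here is a small sketch that computes exact Shapley values for a toy scoring function. It is an illustration only: the function `toy_model`, the feature names, and the single background point are assumptions, and the SHAP library uses more efficient algorithms and a background dataset rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, instance, background):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their background value.
    """
    features = list(instance)
    n = len(features)

    def value(coalition):
        x = {f: (instance[f] if f in coalition else background[f]) for f in features}
        return predict(x)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi

# Hypothetical toy scoring function with an interaction term.
def toy_model(x):
    return 0.1 * x["adults"] - 0.02 * x["duration"] + 0.01 * x["adults"] * x["duration"]

instance = {"adults": 2, "duration": 5}
background = {"adults": 1, "duration": 10}
phi = shapley_values(toy_model, instance, background)
```

The values in `phi` add up to the difference between the prediction for the instance and the prediction for the background point, which is the property that makes per-feature contributions comparable across searches.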

SHAP values, once generated, can be visualized through a series of plots, which further helps interpretation.

SHAP values to explain the predicted ancillary purchase of an individual flight search. The baseline, the average model output, is -2.314. This search has a lower predicted output of -4.06. (Both values are negative, so they are presumably on the model's log-odds scale rather than probabilities.) Purchase-increasing effects such as a travel duration of 5 days are offset by decreasing effects such as only one adult passenger.

SHAP Summary Plot —

The summary plot combines feature importance with feature effects. Each point on the summary plot is a Shapley value for a feature and an instance. The position on the y-axis is determined by the feature and on the x-axis by the Shapley value. The color represents the value of the feature from low to high.

SHAP summary plot. A low number of days for travel duration increases the predicted propensity to purchase, and a large number of days decreases the propensity. Your regular reminder: All effects describe the behavior of the model and are not necessarily causal in the real world.

Step 3: Clustering SHAP values using kMeans

Clustering bookings by their SHAP values leads to isolating groups with high propensity to purchase and we can observe the values of respective features for those groups.

Cluster dataset with mean values of all the features for each cluster. The cluster with the highest mean for the target variable represents the cohort with the highest propensity.

We observe from here that segment 3 has the highest propensity for ancillary purchase (a_purchase mean = 0.45). The top 3 important features have the following mean values:

Travel duration (9.26) ~ 9 days
Adult passengers (1.71) ~ 2 adults
Time to travel (122.14) ~ 122 days

We can also use quantile thresholds to understand the range of values in the segments and based on that recommend rules for Adobe Target activation.
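A quick sketch of the quantile idea, using Python's standard library. The travel durations are hypothetical values standing in for one feature of the high-propensity cluster, and using the interquartile range as the rule is just one possible choice.

```python
from statistics import quantiles

# Hypothetical travel durations (days) for bookings in the high-propensity cluster.
travel_duration = [3, 5, 6, 7, 8, 9, 9, 10, 12, 14, 21]

# Three cut points that split the data into four equal-probability groups.
q1, median, q3 = quantiles(travel_duration, n=4)

# One possible rule: target the interquartile range of the cluster.
rule = {"feature": "travel_duration", "min": q1, "max": q3}
```

Comparing such ranges across clusters shows where the high-propensity segment actually separates from the rest, which a single mean can hide.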

Step 4: Adobe Target (or any other audience) activation

Observing both quantiles and mean, we can recommend the following thresholds for Adobe Target experience activation (remember this is dummy data so values are not expected to make sense) -

Bookings/Searches with a travel duration of less than 10 days
Number of adult passengers more than 2
Time to travel of more than 30 days

This is a very good set of criteria that isolates an audience/cohort with the best possible chances of purchasing ancillary products. Adobe Target can create custom experiences for these audiences while keeping a general experience for the rest of the visits. For example, an extra page highlighting ancillary products or destination-themed content can be tested for this cohort in the flight selling flow without risking the usual experience for general users. Based on observation, experiences can be scaled to a wider audience.
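The three thresholds translate into a simple membership check. The sketch below only illustrates the logic; in practice the audience would be configured in Adobe Target itself, and the field names used here are hypothetical stand-ins for whatever the data layer exposes.

```python
def in_high_propensity_audience(search):
    """Apply the three illustrative thresholds to one search or booking."""
    return (
        search["travel_duration_days"] < 10
        and search["adult_passengers"] > 2
        and search["days_to_travel"] > 30
    )

# Hypothetical searches to show which ones qualify.
searches = [
    {"travel_duration_days": 7, "adult_passengers": 3, "days_to_travel": 45},
    {"travel_duration_days": 14, "adult_passengers": 3, "days_to_travel": 45},
    {"travel_duration_days": 7, "adult_passengers": 1, "days_to_travel": 90},
]
flags = [in_high_propensity_audience(s) for s in searches]
```

Only the first search meets all three conditions, so only that visit would see the custom experience.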

Code repo on my GitHub here.

A/B Test MythBusters

Abhinav Sharma — Thu, 17 Aug 2023 09:59:00 GMT

A/b testing is a decision-making support and research methodology which is an integral part of the product design lifecycle. Product managers should use testing outcomes to determine not only the look and feel of their products but their viability as well.

The following examples explain the importance of a/b testing at different stages of the product lifecycle -

Introduction — New features are launched for a controlled set of actual users (eg. 1% of users) and the performance data is compared against the existing cohort to determine if there is any quantifiable improvement. This also gives the opportunity to determine any technical issues or bugs in the product. Most organizations already do this as an enhanced form of beta testing.

Example: A fictitious online retailer called AMart has launched a new feature allowing users to save their shopping lists, which can then be easily passed on to the trolley.

Growth — Add-ons or redesigns are religiously put through a/b testing pipelines to get data-led insights (rather than hunches) around their performance.

Example: PMs at AMart need to improve add-to-trolley actions from the Shopping list feature. They notice that the ‘Add to Trolley’ button is only provided at the end of the page and that users with longer lists may find it difficult to locate. They want to test having this CTA at both the top and the end of the page.

Maturity — For established products, the focus shifts to retention, and hence any re-branding or upgrades again go via a/b testing pipelines to validate efficiency. This is also where a/b testing evolves into targeting and personalization.

Example: Users with shopping lists are perceived to be wanting easy shopping regularly. AMart PMs hypothesize that an auto-generated weekly/ monthly shopping list based on their shopping history would further make their shopping easier. This hypothesis obviously needs validation via testing but also needs further intelligence around targeting specific cohorts.

A/b testing is the most accurate way to quantitatively measure the impact of a change in a product. However, it can only highlight what changed and by how much. Any insights around the ‘why’ will need traditional qualitative research methods such as user interviews and surveys.

Myth 2: We have a sample size calculator; we don’t need a data scientist.

In my experience, statistics is the biggest casualty in the whole a/b testing process. For a successful and productive testing program, having a data scientist on board is a must. A good data scientist will be a champion of internal business KPIs. They will know the variance of these KPIs and will understand how seasonality and other external factors (eg. campaigns) affect them. Hence, they will be in the best position to give the most accurate pre- and post-assessment of the tests. This avoids a lot of hit-and-trial costs for the organization.

A company using agile methodology usually has a feature development team that comprises product managers, software developers, designers, and data scientists. They all have a role to play in the a/b testing process (highlighted in the figure above). The data scientist engages right at the outset, giving guidance on the measurement and feasibility of KPIs. They then set up the test and help avoid common mistakes such as peeking at results during the run period, which leads to false positives.

Myth 3: Statistical significance is the only statistic that is required.

On several occasions I have seen teams treat statistical significance as the ultimate goal of their tests. The whole exercise of sample size calculation is centered around ensuring that the tests have a fair chance of reaching statistical significance. While this is partially true, testers rarely tune the other variables in the sample size formula.

The most important of them is the Minimum detectable effect or MDE. Most often, I have seen test engineers going with default values for MDE or using hunches to go with values like 10%, 5%, etc.

What is minimum detectable effect (MDE)?

In traditional hypothesis testing, the MDE is essentially the sensitivity of your test. In other words, it is the smallest relative change in conversion rate you are interested in detecting. For example, if your baseline conversion rate is 20%, and you set an MDE of 10%, your test would detect any changes that move your conversion rate outside the absolute range of 18% to 22% (a 10% relative effect is a 2% absolute change in conversion rate in this example).

excerpt taken from Optimizely Sample Size Calculator.
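The arithmetic in the excerpt can be written out as a short sketch. The function name is my own; the inputs are the 20% baseline and 10% relative MDE from the example.

```python
def mde_detection_range(baseline_rate, relative_mde):
    """Return the absolute (lower, upper) conversion rates implied by a relative MDE."""
    absolute_change = baseline_rate * relative_mde
    return baseline_rate - absolute_change, baseline_rate + absolute_change

# Baseline conversion rate of 20% with a relative MDE of 10%.
lower, upper = mde_detection_range(0.20, 0.10)
```

Here `lower` and `upper` come out at roughly 0.18 and 0.22, matching the 18% to 22% range quoted above.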

MDE should be more personal than we often treat it to be.

Assuming a normal distribution of measurements of any metric, we expect most values to lie within one standard deviation of the mean. Hence, we should treat results from any two samples as different only when they differ by at least one standard deviation. I rarely see this insight properly reflected in sample sizing.

A good estimate starts with the historical trend of your population metric. After filtering out anomalies, use the trend to get the mean and standard deviation of the metric. The standard deviation expressed relative to the mean (S.D. divided by mean) is what you should consider as your minimum detectable effect. It is okay to use a value larger than this, depending on the bandwidth and flexibility of your test conditions. But a figure totally disproportionate to it unnecessarily stretches your experiment toward unreasonable statistical tests. This ultimately translates to cost in terms of wasted time and resources.
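As a rough sketch of this estimate, assuming a list of daily values for the metric with anomalies already filtered out (the numbers below are invented for illustration):

```python
from statistics import mean, stdev

def mde_from_history(daily_values):
    """Relative MDE suggested by history: standard deviation divided by mean."""
    return stdev(daily_values) / mean(daily_values)

# Hypothetical daily conversion rates with anomalies already removed.
daily_conversion = [0.050, 0.052, 0.048, 0.051, 0.049, 0.053, 0.047]
relative_mde = mde_from_history(daily_conversion)
```

For this made-up series the ratio is a little over 4%, so a default MDE of 10% would be far coarser than the metric's natural variation, while 1% would demand a much larger sample than the noise level justifies.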

Myth 4: There is a standard sample size formula that fits any KPI.

The traditional proportion-based sample size estimator churns out the total number of visits/users required for each variation in the test. This is useful if we want to measure KPIs around actions of the users like purchase and clickthrough etc. The ‘action’ in this case is binary in nature and the metric is treated as binomial.

However, recently, and particularly with the rise of mobile apps, aggregate KPIs are also in high demand. These are continuous metrics that are aggregated at a time or user level. Examples include Daily Active Users (DAU) and Revenue per Visitor (RPV).

The sample size is estimated using variations of a one- or two-tailed Z-test for both categories of metrics. The major difference is that for continuous variables, the sample size is the number of days rather than the actual volume. This is achieved by replacing probabilities with the standard deviation in the formula for continuous metrics.

Example of proportion-based sample size calculator (for binomial metrics)

Example of mean-based sample size calculator (for continuous metrics)
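The two calculators can be sketched with the textbook two-sided Z-test formulas. This is a generic version under assumed defaults (5% significance, 80% power), not necessarily the exact formulas behind the calculators pictured above; the function names and example inputs are my own.

```python
from math import ceil
from statistics import NormalDist

def _z_total(alpha, power, two_tailed=True):
    """Sum of the critical values for significance and power."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_tailed else z(1 - alpha)
    return z_alpha + z(power)

def sample_size_proportion(p_base, relative_mde, alpha=0.05, power=0.8):
    """Visitors per variation for a binomial metric (two-proportion Z-test)."""
    p_var = p_base * (1 + relative_mde)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil(_z_total(alpha, power) ** 2 * variance / (p_var - p_base) ** 2)

def sample_size_mean(std_dev, absolute_mde, alpha=0.05, power=0.8):
    """Observations per variation for a continuous metric (two-sample Z-test)."""
    return ceil(2 * (_z_total(alpha, power) * std_dev / absolute_mde) ** 2)

# Binomial metric: 20% baseline conversion, 10% relative MDE.
n_prop = sample_size_proportion(0.20, 0.10)
# Continuous metric: hypothetical standard deviation and absolute MDE.
n_mean = sample_size_mean(std_dev=12.0, absolute_mde=1.5)
```

For a daily aggregate such as DAU, each observation in `sample_size_mean` is one day, which is why the result reads as a number of days rather than a visitor count. In both formulas, halving the MDE roughly quadruples the required sample.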

Myth 5: Lift from an a/b test will remain forever.

It is unsafe to assume that the changes you make based on a/b test lift, will last forever. This is especially true for revenue or conversion-related metrics.

For example, let’s say you observe a statistically significant lift of 10% from the variation at the end of the test. You roll out the variation. For how long should you continue to expect a 10% lift from the change?

This is a complex question to answer and requires some analysis and familiarity with business but it’s wise to count on the fact that the effect of the variation will decay with time and eventually will remain only a proportion of what it was.

Usually, it is recommended to attach monthly, quarterly, and yearly lift estimates with your test plan. These estimates should incorporate a decay model instead of being linear in nature.
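One simple way to sketch such a decay model is exponential decay with an assumed half-life. Both the functional form and the 6-month half-life below are illustrative assumptions; the right decay curve has to come from analysis of your own business data.

```python
def decayed_lift(initial_lift, months, half_life_months):
    """Lift remaining after `months`, assuming exponential decay."""
    return initial_lift * 0.5 ** (months / half_life_months)

def average_lift(initial_lift, horizon_months, half_life_months):
    """Average monthly lift over the horizon (month 0 to horizon - 1)."""
    lifts = [decayed_lift(initial_lift, m, half_life_months) for m in range(horizon_months)]
    return sum(lifts) / horizon_months

# Hypothetical: 10% observed lift, half-life of 6 months.
monthly = average_lift(0.10, 1, 6)
quarterly = average_lift(0.10, 3, 6)
yearly = average_lift(0.10, 12, 6)
```

Under these assumptions the yearly average lift is well below the 10% measured in the test, which is the gap a linear projection would miss.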

This is again where a data scientist has a role to play. If the revenue per conversion is known for the target test, a data scientist can use a confidence interval around the MDE as a lift estimate to prepare an incremental revenue impact around the test activities.

Originally published at https://www.linkedin.com.

My reflections on issues around use of AI (as of 2022)

Abhinav Sharma — Thu, 31 Mar 2022 10:24:59 GMT

Image source: Pixabay | No attribution required

In this short reflective essay, I will talk about issues and concerns around use of AI in various industries. I will start with highlighting some areas of concern and then follow it up with my own reflections and experiences. I will conclude by expressing what I think should be the future course in terms of AI adoptability and use.

Can AI make value judgements?

We are living in a unique time in history where AI engineers are suddenly at the frontline of tackling complex spiritual questions. Consider this classic philosophical dilemma that is quite relevant to the design of a self-driving car:

A car is driving itself with the owner sleeping on the back seat. Suddenly two kids who were playing jump onto the road in front of the car. The only way to save the two kids is to swerve to the side and fall off the cliff, killing the owner of the car. What should the car do?

The response would most likely be hard-coded in its system, meaning that whatever we program as the right approach is guaranteed to be implemented. Firstly, it means that it is the responsibility of an engineer (maybe supported by a team of philosophers) to get a deterministic answer to such a complex question. Secondly, given the deterministic (and in any case controversial) nature of the solution, it opens the possibility of exploitation and abuse. This is very different from how a human being would respond to such scenarios. Philosophers can debate the best way to deal with such situations and come up with a solution. But the interesting thing about us humans (including those philosophers) is that when faced with real-life scenarios requiring quick decisions, we often respond erratically and care little about the philosophy that we preach.

It could be argued that it is such randomness (or perhaps intuitiveness) that keeps this complex world in balance.

This might be a subjective example, but it effectively highlights the problems we face as we adopt more AI-reliant systems in our daily lives. Today, AI is fast replacing human involvement in fields like justice, medicine, and security. Effectively, we are replacing probabilistic systems with high-precision deterministic systems. As such, it’s paramount that the systems are built by experienced people and that the data fed into the system for training is of the highest quality.

The issue with data

Anyone living in the United Kingdom would relate to the fact that the UK’s National Health Service doesn’t link its healthcare records together as a matter of standard practice. This means that if I visit an NHS hospital for a certain issue, they won’t have explicit visibility of any of my previous visits to my GP.

Artificial intelligence has a lot of potential in disease diagnosis and pattern detection. However, if the NHS adopts a state-of-the-art AI solution for disease detection and prediction, it will mostly be ineffective, because the patient data is not centralized, collated, and collected properly.

The point is that, so far, we haven’t been able to build a solid data ecosystem for most of our AI platforms across industries. Hence, we frequently hear news about biased or erratic AI projections. These incidents further diminish the public’s faith in adopting these technologies. On the flip side, it could be argued that such centralized data collection is unethical and infringes the privacy rights of patients.

Privacy Issues

So far, organizations, both large and small, have had little incentive to build privacy-protected AI systems. Data breaches have become more common in recent years, with little fallout for the companies responsible.¹

It is still not common practice to hire a dedicated data governance officer within organizations that actively use AI applications involving personal data storage. The psychographic profiling of Facebook users and its influence on the 2016 USA presidential elections (the Cambridge Analytica scandal) is one of the best-known examples highlighting the privacy implications of AI use.

‘Mystery’ algorithms

Recent developments in deep learning and neural nets have made AI-led predictions far more accurate. However, they have also made the system a black box: while it’s possible to interpret traditional ML and AI algorithms, deep learning algorithms are more complex, and quite often it is not possible to say why, or which features, led to a certain prediction. The interpretability issue affects people’s trust in deep learning systems. It is also related to many ethical problems, e.g., algorithmic discrimination.²

Personal reflections

The last few years have seen focussed initiatives that address some issues with AI. While the public is more sensitive to its privacy and data sharing rights, organizations are swift to respond to public sentiment by giving more options to their customers. AI is coming of age in this era and the outlook is very promising.

I work as a data analyst in retail. Being familiar with some of the prominent use cases of AI in retail, I can comfortably say that, at least in this industry, AI has so far shown more promise than concern. Artificial intelligence has allowed marketers to process millions of data points together and get insights that can then be applied at a demographic or group level. Sales and customer experience management have also benefited immensely from the use of AI. With AI, it’s possible to apply heavy computation at large scale, connecting several data points together. This is beyond human capacity and was mostly unexplored before the advent of AI and machine learning.

Use cases of AI in retail such as hyper-personalization, allocation and replenishment, and pricing and markdown have been very popular lately and have been widely adopted by the industry at every level.

The issues that can come up are more around the ‘how’ than the ‘what’ of applications of AI in retail. There is a growing concern about the privacy of users and whether retailers should process demographic and behavioural data of their customers. Handling of personally identifiable information (PII) is subject to various regional laws (eg. GDPR for Europe). These laws protect citizens’ right to privacy but are not prevalent across the globe. Additionally, given the need for a large data ecosystem to train AI platforms, there is a risk that some large corporations start with an obvious advantage in this area and could create a monopoly in this field. Amazon is a good example to keep an eye on in the following years.

Looking ahead

Hannah Fry mentions a trick in her fabulous book ‘Hello World’ that can help us spot the junk algorithms —

whenever you see a story about an algorithm, see if you can swap out the buzzwords, like ‘machine learning’, and ‘artificial intelligence’, and swap in the word ‘magic’. Does everything still make grammatical sense? Is any of the meaning lost? If not, I’d be worried that something is wrong. Because — long into the future — we’re not going to ‘solve world hunger with magic’ or ‘use magic to write the perfect screenplay’ any more than we are with AI.³

Acknowledging that algorithms aren’t perfect, any more than humans are, might just have the effect of diminishing any assumptions of their authority. Given such limitations, AI systems should also be part of the usual audit processes that are applied to, for example, financial processes. Trained auditors should test these systems against potential risks of bias and publish the results on a regular basis. Also, there is still huge potential to develop new skills through human-machine symbiosis and address the missing middle gap.⁴

I already experience such a skill gap even within the organizations that I have worked for so far. There are AI creators (data scientists, machine learning engineers etc.) who create or code technical AI solutions. But they create most of their solutions for another group, the AI users. This group mostly comprises business managers who have little or no knowledge of how the solutions work and are mostly interested in the outcomes. Most often, creators and users don’t talk to each other and don’t even understand each other’s language. We need a link between them that can serve as a bridge and ensure that both groups work towards goals aligned to a common success objective. In other words, we need AI translators. This is going to be a highly sought-after skill in the near future.

References:

1 — https://www.isaca.org/resources/news-and-trends/isaca-now-blog/2021/beware-theprivacy-violations-in-artificial-intelligence-applications

2 — A Survey on Neural Network Interpretability — Yu Zhang, Peter Tiňo, Aleš Leonardis, Ke Tang | https://arxiv.org/abs/2012.14261

3 — Hannah Fry, Hello World ISBN 978–1–784–16306–8

4 — https://thedatalab.com/news/human-machine-combatting-myths/ further references Human + Machine by James Wilson & Paul Daugherty

Must know ML techniques for digital analysts: Part 3— Recommendation Engines

Abhinav Sharma — Mon, 28 Sep 2020 14:17:57 GMT

Introduction to machine learning techniques that can help you to optimize your digital analytics value chain

As a digital analyst for your organization, you are not supposed to write ML code for recommendation engines. However, even the ML engineers who write this code rely on you for visibility of how these engines are performing on the site. Additionally, the digital analytics team also influences content placement and A/B testing of content (or of different ML algorithms). So, some inherent knowledge of how recommendation engines actually work could be a valuable asset for you.

Amazon’s recommendation to me when I was browsing a certain book. Can you guess which one? Image by author

Recommendation engines — an introduction

The poster children for the advantages of a recommendation engine would be companies like Amazon, Netflix, Tinder, etc. Your Amazon homepage is the finest example of recommendations in practice. Scrolling through the homepage you can find sections like ‘Inspired by your shopping trends’, ‘Related to items you’ve viewed’ etc., all of them powered by recommendation engines in the background. Recommendation engines are not limited to presenting similar products on eCommerce pages. The engines have a more profound use: presenting relevant and personal (right) content to the right customers at the right time using the right channel.

Beyond mapping products that were bought together, the use of recommenders can be extended to -

Recommend based on customer demographics.
Recommend based on a similarity between customers.
Recommend based on product similarity.
Recommend based on the historical purchase profile of customers.

Use cases for recommendation systems are not limited to eCommerce but exist widely in domains like pharmaceuticals, finance, and travel.

Recommendation engine types

image by author

Each of the techniques shown in the diagram may be used to build a recommender system model. Let’s briefly explore the various recommendation engine categories.

Content-based filtering recommends items by comparing product attributes and customer profile attributes. The attributes of each product are represented as a set of tags or terms, typically the words that occur in a product description document. The customer profile is represented with the same terms and built by analyzing the content of products that have been seen or rated by the customer. Typically, the content-based filtering method provides a list of top N recommendations based on some similarity scores.

Collaborative filtering filters information by using the recommendations of other people. Given a database of user ratings for products, where a set of users have rated a set of products, collaborative filtering algorithms can give ratings for products yet to be rated by a particular user. This leverages the neighborhood information of the user to provide such recommendations. The underlying premise of the collaborative filtering algorithm is that if two users agree on ratings for a large set of items, they may tend to agree for other items too.

Collaborative filtering can be further classified into:

Memory-based: In this method, user rating information is used to compute the likeness between users or items. This computed likeness is then used to come up with recommendations. This differs from content-based recommendations, where item/user metadata is used to calculate similarity scores rather than user feedback (eg. ratings).
Model-based: Data mining methods are applied to recognize patterns in the data, and the learned patterns are then used to generate recommendations. We have already covered a popular technique of association mining using the apriori algorithm in Part 1.
The latent factor approach leverages matrix factorization techniques to arrive at recommendations. Recently these methods have proved themselves superior to item-based and user-based recommender systems. This was one of the winning solutions in the famous Netflix recommendation competition.
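To illustrate the memory-based flavour, here is a minimal user-based sketch: it predicts a missing rating as a similarity-weighted average of other users' ratings. The users, items, and ratings are invented, and cosine similarity over co-rated items is just one common choice of likeness measure.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None

# Hypothetical ratings on a 1-5 scale; "ann" has not rated item "D".
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 5, "B": 5, "C": 1, "D": 5},
    "cat": {"A": 1, "B": 2, "C": 5, "D": 1},
}
predicted = predict_rating(ratings, "ann", "D")
```

Because ann's ratings agree with bob's far more than with cat's, the predicted rating for item D lands closer to bob's 5 than to cat's 1.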

Finally, Hybrid filtering is the system where we combine more than one type of recommendation system to come up with final recommendations.

Singular-value decomposition approximation, most popular items, and SlopeOne are some other techniques that may be employed to build recommendation systems. Further learning on recommendation engines could be in the direction of exploring and studying these techniques and applying them to real-world problems.

Example Implementation

For standalone implementation, both Python and R have packages that bundle the most popular recommendation techniques together. The recommenderlab package in R is a one-stop shop for building recommendation engines and provides functionality to convert datasets into the required format and to train/test an ensemble of models. Its Recommender function accepts parameters for the similarity score we want to use.

As an example for this article, we will take a more organic approach and try to implement a content-based recommendation for a news aggregator website.

Consider the following use case -

A news aggregator wants to solve the following problem: When a customer browses a particular article, what other articles should we suggest to him? The challenge is we don’t have any information about customer preferences. We are either looking at the customer for the first time or we don’t have any mechanism set up yet to capture customer interaction with our products/items.

We will be pulling up data from a news aggregator dataset from UCI public repository.

When a user is browsing a particular news article, we need to give him other news articles as recommendations, based on:

The text content of the title of the article he is currently reading
The publisher of this document
The category to which document belongs
The polarity of the document (something we will calculate based on the text content of the title)

Polarity identification algorithms use text mining to get the document's opinion. We will use one such algorithm to get the polarity of our texts.
We need multiple similarity measures for this use case:

Cosine distance/similarity for comparing words in two documents
For the polarity, a Manhattan distance measure
For the publisher and category, Jaccard’s distance

Refer to GitHub for the complete markdown of this project.


Here is the breakdown of the code -

The first part of the code (up to line 62) focusses on data wrangling, and then refining the dataset to contain only a fraction of data (for scaling down the project) and further focussing only on the top 100 publishers with the maximum number of articles.

Similarity Index

We use a bag-of-words representation of all article titles to measure their similarity scores. We use cosine similarity as the scoring measure because it depends on the direction of the term vectors rather than their magnitude, so the score changes only when the mix of words in an article changes.

The process involves:

Using the tm package in R, create a document term matrix (dtm) of all the articles.
Use it to measure the cosine similarity between articles and return a document-by-document matrix.

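# cosine similarity between every pair of articles
# (tcrossprod_simple_triplet_matrix and row_sums come from the slam package)
library(slam)
sim.score <- tcrossprod_simple_triplet_matrix(dtm)/(sqrt( row_sums(dtm^2) %*% t(row_sums(dtm^2)) ))

sim.score[1:10,1:10]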

image by author
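For readers who do not use R, the same idea can be sketched in plain Python: build term counts per title and compare them with cosine similarity. This is a simplified stand-in, not a port of the tm/slam code; the first title is taken from the dataset example in this article (shortened) and the other two are made up.

```python
from collections import Counter
from math import sqrt

def bag_of_words(title):
    """Lowercased term counts for one title."""
    return Counter(title.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

titles = [
    "Ukraine crisis worries hammer German investor morale",
    "German investor morale falls on Ukraine crisis",   # hypothetical
    "New smartphone launch draws record crowds",        # hypothetical
]
bags = [bag_of_words(t) for t in titles]

# Pairwise similarity matrix, analogous to sim.score.
sim = [[cosine_similarity(a, b) for b in bags] for a in bags]
```

The first two titles share five of their seven words, so their similarity is 5/7, while the third title shares none and scores 0. Repeating a title twice doubles every count but leaves its similarity to the original at 1, which is the magnitude invariance that makes cosine a good fit here.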

Search

In this section, our aim is to filter the top 30 articles based on cosine similarity matches. For example, if our current document is article ID 16947, titled “UPDATE 1-Ukraine crisis worries hammer German investor morale” by Reuters, we pick the top 30 matching articles based on cosine similarity.

# merge title.df and others.df with match.refined (inner_join comes from dplyr):
match.refined <- inner_join(match.refined, title.df,by = "ID")
match.refined <- inner_join(match.refined, others.df,by = "ID")

head(match.refined)

image by author

Polarity Scores

We leverage the sentimentr package in R to measure the sentiment of the top articles that we have collected. A score of -1 indicates that the sentence has a very negative polarity. A score of 1 means that the sentence is very positive. A score of 0 indicates that the sentence is neutral.

# update the match.refined data frame with the polarity scores:
match.refined$polarity <- sentiment.score$sentiment
head(match.refined)

image by author

Jaccard’s distance

We also measure publisher and category similarity to the current article via the Jaccard index. The Jaccard index measures the similarity between two sets and is the ratio of the size of the intersection to the size of the union of the participating sets. Here we have only two elements, one for the publisher and one for the category, so our union is 2. We get the numerator, the intersection, by adding the two Boolean variables. We also calculate the absolute difference (Manhattan distance) in the polarity values between the articles in the search results and our search article, and then apply a min/max normalization to the difference score.

# Jaccard's distance
match.refined$jaccard <- (match.refined$is.publisher + match.refined$is.category)/2

# Manhattan distance
match.refined$polaritydiff <- abs(target.polarity - match.refined$polarity)

range01 <- function(x){(x-min(x))/(max(x)-min(x))}
match.refined$polaritydiff <- range01(unlist(match.refined$polaritydiff))
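As a small worked example of these two scores, the sketch below applies the same arithmetic to three made-up search results. The data frame toy, its flag values, and target.polarity.toy are hypothetical and only illustrate the calculation.

```r
# Hypothetical flags for three search results: does each one share the
# publisher / category of the article we searched with?
toy <- data.frame(is.publisher = c(1, 0, 0),
                  is.category  = c(1, 1, 0),
                  polarity     = c(0.10, -0.30, 0.90))
target.polarity.toy <- 0.10   # assumed polarity of the search article

# Matching attributes divided by the two attributes compared
toy$jaccard <- (toy$is.publisher + toy$is.category) / 2

# Absolute polarity gap, rescaled to the range [0, 1]
gap <- abs(target.polarity.toy - toy$polarity)
toy$polaritydiff <- (gap - min(gap)) / (max(gap) - min(gap))
toy
```

The first result matches on both publisher and category (Jaccard 1) and has the same polarity as the search article (difference 0), so it is the closest match on both scores.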

image by author

Fuzzy logic ranking

Finally, we use these 3 scores to apply the fuzzy rule to the list and get rankings for our top 30 articles. Based on the interaction between the linguistic variables cosine, Jaccard, and polarity, the ranking linguistic variables are assigned different linguistic values. These interactions are defined as rules. Having defined the linguistic variables, linguistic values, and the membership function, we proceed to write down our fuzzy rules.

# The get.ranks function is applied in each row of match.refined to get the fuzzy ranking. Finally, we sort the results using this ranking.
get.ranks <- function(dataframe){
  cosine =  as.numeric(dataframe['cosine'])
  jaccard = as.numeric(dataframe['jaccard'])
  polarity = as.numeric(dataframe['polaritydiff'])
  fi <- fuzzy_inference(ranking.system, list(cosine = cosine,  jaccard = jaccard, polarity=polarity))
  return(gset_defuzzify(fi, "centroid"))
  
}

match.refined$ranking <- apply(match.refined, 1, get.ranks)
match.refined <- match.refined[order(-match.refined$ranking),]
match.refined

image by author

This brings us to the end of our design and implementation of a simple fuzzy-induced content-based recommendation system.

There is more to the story —

Part 1 — Association Analysis

Part 2 — Customer Lifetime Value

Must know ML techniques for digital analysts: Part 3— Recommendation Engines was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Must know ML techniques for digital analysts: Part 2 — Customer Lifetime Value

Abhinav Sharma — Wed, 16 Sep 2020 13:09:52 GMT

Introduction to machine learning techniques that can help you to optimize your digital analytics value chain

image by author

As pointed out in Part 1 where we covered the concept of Association Analysis, analytics is increasingly becoming more augmented and relational. As a digital analyst, it's better for us to be aware of the connection between different data sources and be ready to produce much more than just descriptive analytics.

With this necessity in mind, we will cover another smart digital analytics concept — Customer Lifetime Value.

Calculating Lifetime Value

Customer lifetime value (CLV) is the discounted sum of future cash flows attributed to the relationship with a customer. CLV estimates the ‘profit’ that an organization will derive from a customer in the future. The CLV can be used to evaluate the amount of money that can reasonably be devoted to customer acquisition. Even without considering the dollar value, CLV modeling still helps in identifying the most important (i.e., most profitable) customer segments, which can then receive different treatment in terms of the acquisition strategy.

The taxonomy of CLV models depends heavily on the nature of the business. If the business has a contractual relationship (e.g., a subscription model) with customers, then the most important issue is retaining customers over time, and survival analysis models are used to study the time until a customer cancels. These models are sometimes also referred to as ‘gone for good’ models because they assume customers who cancel the service will not return.

A simplistic CLV formula for a typical ‘gone for good’ model:

The CLV formula multiplies the per-period cash margin, $M, by a long-term multiplier that represents the present value of the customer relationship’s expected length:

CLV = $M × [r / (1 + d − r)]

where r is the per-period retention rate and d is the per-period discount rate.
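A quick worked example may help here. The sketch below evaluates the formula for made-up inputs: a margin of $100 per period, 80% retention, and a 10% discount rate. The function name clv_simple is hypothetical.

```r
# Simple 'gone for good' CLV: margin times the long-term multiplier r / (1 + d - r)
clv_simple <- function(margin, r, d) {
  stopifnot(r >= 0, r <= 1, d >= 0)
  margin * r / (1 + d - r)
}

# Illustrative inputs only: $100 margin, 80% retention, 10% discount rate
clv_simple(margin = 100, r = 0.80, d = 0.10)
```

With these inputs the multiplier is 0.8 / 0.3, so the customer is worth roughly 2.67 periods of margin, or about $266.67. Raising retention to 0.9 with the same discount rate gives a multiplier of 4.5, which shows how sensitive the result is to r.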

The other main class of CLV models is called ‘always a share.’ These models do not assume that customer inactivity implies the customer will never return. For example, a retail customer who does not buy this month might come back next month.

It is these types of CLV models that are most relevant for an online retail business setup where users engage with the business at will, such as in e-commerce stores in which users might make purchases at any time.

In this article, I elaborate on one of the models that can be utilized to predict future CLV based on the customer’s historical transactions with the business. The model that is described in this series is best applied to predict the future value for existing customers who have at least a moderate amount of transaction history.

Buy Till you Die Probabilistic models for CLV

The BTYD models capture the non-contractual purchasing behavior of customers — or, more simply, models that tell the story of people buying until they die (become inactive as customers).

Two models are widely used:

Pareto/NBD
BG/NBD

The BG/NBD (Beta Geometric/Negative Binomial Distribution) model is easier to implement than Pareto/NBD and runs faster (I am intentionally being naive). The two models tend to yield similar results. These models assume probability distributions for the rate at which customers make purchases and the rate at which they drop out. The modeling is based on four parameters that describe these assumptions.

Statistical introduction to the assumptions is important but out of the scope of this article. A high-level idea to keep in mind is that these models assume that the interactions of the customers with your business should be at their own will (in other words, random for your interpretation). If you have any influence (campaigns or promotional offers) in acquiring your customers, then that sort of historical data is not ideal to fit in these models.

Example Implementation

Let's consider the following scenario -

A retailer is re-strategizing their CPC ads. They want to understand who their most profitable customers are and what the demography of this ideal customer is. Eventually, they want to better target their CPC ads to get the most profitable customers possible. They have provided you with the last couple of years of transaction data and expect an output in terms of three customer buckets: high, medium, and low value. They expect to feed this customer value identifier back to their data lake and understand more about the associated demographies for each bucket.

We apply the model on a public dataset (details in citation). The dataset has transactions for the year 2010–11. We will first be preparing the data to shape it in the expected ‘event log’ format for the model. An event log is basically a log of each customer’s purchase with a record of revenue and timestamp associated with it. We will then be using a probabilistic model to calculate CLV. The data is enough to extract Recency, Frequency, and Monetary (RFM) values. The solution here uses an existing BTYD library in R.

Please refer to my GitHub for the complete code:

https://medium.com/media/0f847b0104e38813f796da617e2dbdd4/href

Leaving out the upload and data cleaning part (where we convert the dataset into event log format), here is the breakdown of the remaining code —

Weekly transaction Analysis

Methods elog2cum and elog2inc take an event log as a first argument and count for each time unit the cumulated or incremental number of transactions. If the argument first is set to TRUE, then a customer’s initial transaction will be included, otherwise not.

op <- par(mfrow = c(1, 2), mar = c(2.5, 2.5, 2.5, 2.5))
# incremental
weekly_inc_total <- elog2inc(elog, by = 7, first = TRUE)
weekly_inc_repeat <- elog2inc(elog, by = 7, first = FALSE)
plot(weekly_inc_total, type = "l", frame = FALSE, main = "Incremental")
lines(weekly_inc_repeat, col = "red")
# cumulative
weekly_cum_total <- elog2cum(elog, by = 7, first = TRUE)
weekly_cum_repeat <- elog2cum(elog, by = 7, first = FALSE)
plot(weekly_cum_total, type = "l", frame = FALSE, main = "Cumulative")
lines(weekly_cum_repeat, col = "red")
# restore the previous plotting parameters
par(op)

image by author

Further, we need to convert the event log into a customer-by-sufficient-statistic (CBS) format. The elog2cbs method is an efficient implementation for the conversion of an event log into CBS data.frame, with a row for each customer. This is the required data format for estimating model parameters. Argument T.cal allows one to calculate the summary statistics for a calibration and a holdout period separately.

Instead of realistic calibration and holdout, I would like to use T.cal to sample only transactions before the holiday shopping for 2011, where there is an incremental spike. This will keep the estimated parameters realistic for future predictions.

calibration_cbs = elog2cbs(elog, units = "week", T.cal = "2011-10-01")
head(calibration_cbs)

image by author

The returned field cust is the unique customer identifier, x the number of repeat transactions (i.e., frequency), t.x the time of the last recorded transaction (i.e., recency), litt the sum over logarithmic intertransaction times (required for estimating regularity), first the date of the first transaction, and T.cal the duration between the first transaction and the end of the calibration period. The time unit for expressing t.x, T.cal, and litt is determined via the argument units, which is passed forward to method difftime and defaults to weeks. Only customers who have had at least one event during the calibration period are included.

cust: Customer id (unique key).
x: Number of recurring events in calibration period.
t.x: Time between first and last event in calibration period.
litt: Sum of logarithmic intertransaction timings during calibration period.
sales: Sum of sales in calibration period, incl. initial transaction.
first: Date of first transaction in calibration period.
T.cal: Time between first event and end of calibration period.
T.star: Length of holdout period.
x.star: Number of events within holdout period.
sales.star: Sum of sales within holdout period.
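To see how the main CBS fields relate to the raw event log, here is a rough base R sketch that derives x, t.x, sales, first, and T.cal by hand. It is an illustration only, not the elog2cbs implementation. The event log elog.toy and the date cal.end are made up, and the sketch assumes at most one purchase per customer per day.

```r
# Hypothetical event log: one row per purchase
elog.toy <- data.frame(
  cust  = c("A", "A", "A", "B"),
  date  = as.Date(c("2011-01-03", "2011-01-17", "2011-02-14", "2011-03-07")),
  sales = c(20, 35, 15, 50)
)
cal.end <- as.Date("2011-03-28")  # assumed end of the calibration period

# Elapsed time in weeks between two dates
weeks_between <- function(from, to) {
  as.numeric(difftime(to, from, units = "weeks"))
}

# One summary row per customer
cbs.toy <- do.call(rbind, lapply(split(elog.toy, elog.toy$cust), function(d) {
  first <- min(d$date)
  data.frame(
    cust  = d$cust[1],
    x     = nrow(d) - 1,                        # repeat transactions (frequency)
    t.x   = weeks_between(first, max(d$date)),  # first to last purchase (recency)
    sales = sum(d$sales),                       # includes the initial purchase
    first = first,
    T.cal = weeks_between(first, cal.end)       # first purchase to period end
  )
}))
cbs.toy
```

Customer A has three purchases, so x is 2 and t.x is 6 weeks. Customer B has a single purchase, so both x and t.x are 0 even though B is still observed for 3 weeks.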

Estimating the parameter values of the BG/NBD process.

# estimate parameters for various models
params.bgnbd <- BTYD::bgnbd.EstimateParameters(calibration_cbs) # BG/NBD
row <- function(params, LL) {
names(params) <- c("k", "r", "alpha", "a", "b")
c(round(params, 3), LL = round(LL))
}
rbind(`BG/NBD` = row(c(1, params.bgnbd),
BTYD::bgnbd.cbs.LL(params.bgnbd, calibration_cbs)))

##        k     r alpha     a     b     LL
## BG/NBD 1 0.775 7.661 0.035 0.598 -29637

Predicting on holdout period

# predicting on holdout
calibration_cbs$xstar.bgnbd <- bgnbd.ConditionalExpectedTransactions(
params = params.bgnbd, T.star = 9,
x = calibration_cbs$x, t.x = calibration_cbs$t.x,
T.cal = calibration_cbs$T.cal)
# compare predictions with actuals at aggregated level
rbind(`Actuals` = c(`Holdout` = sum(calibration_cbs$x.star)),
`BG/NBD` = c(`Holdout` = round(sum(calibration_cbs$xstar.bgnbd))))

##         Holdout
## Actuals    4308
## BG/NBD     2995

Comparing the predictions at an aggregate level, we see that the BG/NBD model underpredicts for this dataset. That is attributed to the high jump in transactions during the holdout period (Nov and Dec 2011). The aggregate-level dynamics can be visualized with the help of bgnbd.PlotTrackingInc.

nil <- bgnbd.PlotTrackingInc(params.bgnbd,
T.cal = calibration_cbs$T.cal,
T.tot = max(calibration_cbs$T.cal + calibration_cbs$T.star),
actual.inc.tracking = elog2inc(elog))

image by author

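To test the model, we can use the holdout period to calculate the mean absolute error (MAE). The helper below divides the total absolute error by the total number of actual transactions, so the result is a relative error rather than a per-customer average.

# mean absolute error (MAE), expressed relative to total actual transactions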
mae <- function(act, est) {
stopifnot(length(act)==length(est))
sum(abs(act-est)) / sum(act)
}
mae.bgnbd <- mae(calibration_cbs$x.star, calibration_cbs$xstar.bgnbd)
rbind(
`BG/NBD` = c(`MAE` = round(mae.bgnbd, 3)))

##          MAE
## BG/NBD 0.769

Parameters for gamma spend

Now we need to develop a model for the average transaction value for a customer. We will use a two-layered hierarchical model. The average transaction value will be Gamma distributed with shape parameter p. The scale parameter of this Gamma distribution is also Gamma distributed, with shape and scale parameters q and γ, respectively. Estimating these parameters requires the data to be in a slightly different format than the cbs format we used for the BG/NBD model. Instead, we simply need the average transaction value and the total number of transactions for each customer. This is easily obtained using dplyr notation on the elog object.

spend_df = elog %>%
    group_by(cust) %>%
    summarise(average_spend = mean(sales),
              total_transactions = n())

## `summarise()` ungrouping output (override with `.groups` argument)

spend_df$average_spend <- as.integer(spend_df$average_spend)
spend_df <- filter(spend_df, average_spend > 0)

head(spend_df)

image by author

Now let’s plug this formatted data into the spend.EstimateParameters() function from the BTYD package to get the parameter values for our Gamma-Gamma spend model.

gg_params = spend.EstimateParameters(spend_df$average_spend,
                                     spend_df$total_transactions)
gg_params

## [1]   2.619805   3.346577 313.666656
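To build some intuition for what these three numbers do, the sketch below evaluates the standard Gamma-Gamma closed form for a customer's expected average spend, E(M | m.x, x) = p(γ + x·m.x) / (p·x + q − 1). It assumes the printed values are in the order p, q, γ and that this is the expression the package's spend.expected.value evaluates. The helper name expected_spend and the example customer are hypothetical.

```r
# Parameters as printed above, assumed to be in the order p, q, gamma
p <- 2.619805
q <- 3.346577
gam <- 313.666656

# Expected average spend for a customer with x purchases averaging m.x each
# (standard Gamma-Gamma closed form)
expected_spend <- function(m.x, x) {
  p * (gam + m.x * x) / (p * x + q - 1)
}

# With no purchase history (x = 0) we fall back on the population mean
population_mean <- p * gam / (q - 1)

# A customer averaging 500 per order, seen after 1, 5, and 50 orders
expected_spend(m.x = 500, x = c(1, 5, 50))
```

The estimate is a weighted blend of the population mean and the customer's own average. With only one order it is pulled strongly towards the population mean, and with many orders it approaches the customer's observed average of 500.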

Applying the model to the entire cohort

With all the parameters needed to understand the transaction and average revenue behavior, we can now apply this model to our entire cohort of customers. To do so, we will need to create a cbs data frame for our entire data set (i.e., no calibration period). We can make use of the elog2cbs() function again, but omit the T.cal argument. We can then calculate expected transactions and average transaction value for the next 12 weeks for each customer.

customer_cbs = elog2cbs(elog, units = "week")
customer_expected_trans <- data.frame(cust = customer_cbs$cust,
                                      expected_transactions = 
                                        bgnbd.ConditionalExpectedTransactions(params = params.bgnbd,
                                                                              T.star = 12,
                                                                              x = customer_cbs[,'x'],
                                                                              t.x = customer_cbs[,'t.x'],
                                                                              T.cal  = customer_cbs[,'T.cal']))
customer_spend = elog %>%
  group_by(cust) %>%
  summarise(average_spend = mean(sales),
            total_transactions = n())

## `summarise()` ungrouping output (override with `.groups` argument)

customer_spend <- filter(customer_spend, customer_spend$average_spend>0)
customer_expected_spend = data.frame(cust = customer_spend$cust,
                                     average_expected_spend = 
                                        spend.expected.value(gg_params,
                                                             m.x = customer_spend$average_spend,
                                                             x = customer_spend$total_transactions))

Combining these two data frames gives us the next three months' (12 weeks') worth of customer value for each person in our data set. We can further bucket them into high, medium, and low categories.

merged_customer_data = customer_expected_trans %>%
  full_join(customer_expected_spend, by = "cust") %>%
  mutate(clv = expected_transactions * average_expected_spend,
         clv_bin = case_when(clv >= quantile(clv, .9, na.rm = TRUE) ~ "high",
                             clv >= quantile(clv, .5, na.rm = TRUE) ~ "medium",
                             TRUE ~ "low"))

merged_customer_data %>%
  group_by(clv_bin) %>%
  summarise(n = n())

image by author

Combining historical spend and forecast together and saving it as an output csv —

customer_clv <- left_join(spend_df, merged_customer_data, by ="cust")
head(customer_clv)
write.csv(customer_clv, "clv_output.csv")

image by author

Plot of CLV Clusters


customer_clv  %>% 
    ggplot(aes(x = total_transactions,
               y = average_spend,
               col = as.factor(clv_bin),
               shape = clv_bin))+
    geom_point(size = 4,alpha = 0.5)

image by author

The output csv can now be fed back to and merged with the customer database. The clv bucket metric will then be available for breaking down in terms of other demographic or behavioral information.

Citation

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

Must know ML techniques for digital analysts: Part 2 — Customer Lifetime Value was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.

Must know ML techniques for digital analysts — Part 1: Association Analysis

Abhinav Sharma — Tue, 08 Sep 2020 10:39:45 GMT

Introduction to machine learning techniques that can help you to optimize your digital analytics value chain

Data Science processes across the digital analytics value chain. Image by author

Business users are increasingly becoming self-sufficient in basic digital analysis and reporting. If you are a digital analyst, it is a smart move to start expanding or shifting towards data science now and be equipped to offer more than basic analysis and dashboarding.

This was my motivation to learn, apply, and eventually share knowledge around these data science techniques, which were part of incremental data analysis long before “data science” was a buzzword.

My approach would be to explain the concept and the use cases it can cover. I will highlight the prerequisites and follow it up with a decent example. My example implementations will be programmed in R.

In this Part 1, I talk about Association Analysis, more popularly referred to as Market Basket Analysis. This analysis is relevant for retail and other setups where a user can add multiple products to a shopping cart and eventually purchase them. The objective is to better understand what sort of products or items go well together and use that information for better cross-selling, merchandising, or targeted offers.

Associations: Finding Items That Go Together

photo by David Veksler on Unsplash

Association analysis is a statistical technique that helps you identify top association rules between your products.

What is an association rule? An example from a grocery transaction would be that the association rule is a recommendation of the form {peanut butter, jelly} => { bread }. It says that, based on the transactions, it’s expected that bread will most likely be present in a transaction that contains peanut butter and jelly. It’s a recommendation to the retailer that there is enough evidence in the database to say that customers who buy peanut butter and jelly will most likely buy bread.

Association Analysis is simply a search through the data for combinations of items whose statistics are interesting. It helps us establish rules dictating something like “If A occurs then B is likely to occur as well.”

But, what are these interesting stats that we have to look for and how should we set their values/thresholds?

Association parameters (interesting statistics!)

First, we need to consider complexity control: there are likely to be a tremendous number of co-occurrences, many of which might simply be due to chance, rather than to a generalizable pattern. A simple way to control complexity is to place a constraint that such rules must apply to some minimum percentage of the data — let’s say that we require rules to apply to at least 0.01% of all transactions. This is called the support of the association.

We also have the notion of “likely” in the association. If a customer buys the jelly then she is likely to buy the bread. Again, we may want to require a certain minimum degree of likelihood for the associations we find. The probability that B occurs given that A occurs is p(B|A), which in association mining is called the confidence of the rule (not to be confused with statistical confidence). So we might say we require the confidence to be above some threshold, such as 5% (so that 5% or more of the time, a buyer of A also buys B).

Support and confidence alone can be misleading for items that are very common or popular in the basket. Popular items are likely to end up in the same basket simply because they are popular. We need some measure of “surprise” for association analysis, and lift and leverage are two parameters that provide it. The lift of the co-occurrence of A and B is the probability that we actually see the two together, compared to the probability that we would see the two together if they were unrelated to (independent of) each other. A lift greater than one is the factor by which seeing A “boosts” the likelihood of seeing B as well. An alternative is to look at the difference between these quantities rather than their ratio. This measure is called leverage.

So, for a rule A ⇒ B over any items A and B in a transaction:

Support=p(A⋂B)

Confidence=p(B|A)=p(A⋂B)/p(A)

Lift(A,B)=p(A⋂B)/(p(A)p(B))

Leverage(A,B)=p(A⋂B)−p(A)p(B)

As a market basket analyst, your job is to search for rules with a lift greater than 1, backed by high confidence and, often, high support.
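As a worked example, the four measures can be computed by hand for the rule {peanut butter, jelly} => {bread}. The five baskets below are invented for illustration, and has() is a hypothetical helper that flags the baskets containing all of the given items.

```r
# Five hypothetical baskets
baskets <- list(
  c("peanut butter", "jelly", "bread"),
  c("peanut butter", "bread"),
  c("jelly", "bread"),
  c("peanut butter", "jelly", "bread", "milk"),
  c("milk")
)

# TRUE for every basket that contains all of the given items
has <- function(items) sapply(baskets, function(b) all(items %in% b))

A <- c("peanut butter", "jelly")   # left-hand side of the rule
B <- "bread"                       # right-hand side of the rule

p.A  <- mean(has(A))         # share of baskets containing A
p.B  <- mean(has(B))         # share of baskets containing B
p.AB <- mean(has(c(A, B)))   # share of baskets containing both

support    <- p.AB
confidence <- p.AB / p.A
lift       <- p.AB / (p.A * p.B)
leverage   <- p.AB - p.A * p.B
c(support = support, confidence = confidence, lift = lift, leverage = leverage)
```

In this toy data the rule holds in 2 of 5 baskets (support 0.4), and every basket with peanut butter and jelly also has bread (confidence 1). Bread is already in 4 of 5 baskets, so the lift is only 1.25 and the leverage 0.08, which shows how a popular item tempers an otherwise perfect-looking rule.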

Other Applications

Since we’re using the market basket as an analogy at this point, we should consider broadening our thinking of what might be an item. Why can’t we put just about anything we might be interested in finding associations with into our “basket”? For example, we might put a user’s location into the basket, and then we could see associations between purchase behavior and locations. For actual market basket data, these sometimes are called virtual items, to distinguish from the actual items that people put into their basket in the store. Association analysis finds and tells us statistically significant observations like “If A occurs then B is likely to occur as well.” Now, we can replace anything for A and B provided they happened together (can be basketed).

With the above logic, we have several other applications of association analysis beyond cross-sell opportunities in online e-commerce. It can help us answer questions like:

What seasonal or brand factors contribute towards product mix in the basket? The product mix that the association analysis highlights could vary for different industries. For example, it could be the origin and destination cities for travel operators.
Is the mix of products different for customers who purchase on their mobile device? What products are they more or less likely to purchase?

Apriori algorithm

There are several algorithmic implementations for association rule mining. Key among them is the apriori algorithm by Rakesh Agrawal and Ramakrishnan Srikant, introduced in their paper, Fast Algorithms for Mining Association Rules.

The Apriori algorithm is a commonly-applied technique in computational statistics that identifies itemsets that occur with a support greater than a pre-defined value (frequency) and calculates the confidence of all possible rules based on those itemsets.

The Apriori algorithm is implemented in the arules package, which can be installed and run in R.

Transactions

The algorithm takes transactional data as input. Transactions are purchases made by a customer on a single visit to a retail store. Typically, transaction data can include the products purchased, the quantity purchased, the price, any discount applied, and a timestamp. A single transaction can include multiple products. In some cases it may also register information about the user who made the transaction, for example when the customer allows the retailer to store their information by joining a rewards program. For mining, the transaction data is first transformed into a binary purchase incidence matrix with columns corresponding to the different items and rows corresponding to transactions. The matrix entries represent the presence (1) or absence (0) of an item in a particular transaction.
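The transformation into a binary purchase incidence matrix can be sketched in a few lines of base R. The data frame toy.orders is made up. In practice arules builds a sparse version of this matrix for you, so this only illustrates the shape of the data.

```r
# Hypothetical order lines: one row per (order, product) pair
toy.orders <- data.frame(
  order_id     = c(1, 1, 2, 3, 3, 3),
  product_name = c("Banana", "Milk", "Banana", "Milk", "Bread", "Banana")
)

# Count how often each product appears in each order ...
counts <- unclass(table(toy.orders$order_id, toy.orders$product_name))

# ... and reduce the counts to presence (1) / absence (0)
incidence <- (counts > 0) * 1L
incidence

# Column sums give the number of transactions containing each item
colSums(incidence)
```

Each row is a transaction and each column an item, so quantities are deliberately discarded. Only the fact that an item was in the basket is kept.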

Example Implementation

Association mining is based on probability measures, so generating reliable insights typically requires large volumes of transactional data. Large data sets are difficult to process without highly scalable storage and compute resources. Usually, you will be doing this exercise by sourcing data from your data lake using a cloud-based architecture. The underlying principles remain the same, however, and your objective will be to get the data into a transactional format so that rule mining can be applied. R has packages to connect to most of these systems, and you can even use SQL for data wrangling.

Let's consider the following scenario -

A retailer is planning a marketing campaign on a large scale to promote sales. One aspect of his campaign is the cross-selling strategy. Cross-selling is the practice of selling additional products to customers. In order to do that, he wants to know what items/products tend to go together. Equipped with this information, he can now design his cross-selling strategy. He expects us to provide him with a recommendation of top N product associations so that he can pick and choose among them for inclusion in his campaign.

We will implement a project where we will apply association rule mining to a retail dataset with the final objective of recommending cross-sell items. This project is based on a dataset released by Instacart in 2017. They released over 3 million anonymized orders for the machine learning community to experiment with. I will be using a subset of the training dataset for association rule mining (assuming our cross-sell use case). We will have to do some initial data wrangling to get the desired transaction format. Please refer to the citation below for dataset-related information.

Refer to GitHub for the complete code of this project.

https://medium.com/media/9d9ab74d95da6267cb01d33102cef4a2/href

Here is the breakdown of the code:

loading the required libraries

library(dplyr)
library(arules)
library(arulesViz)

Data wrangling

We are provided with 2 files: an orders csv with around 131k orders, containing order ID and product ID observations, and a product file that maps product ID to product name. First, we will create a transaction dataset containing order ID and associated product name.

# Orders csv
file1.path = "./order_products__train.csv"
orders = read.csv(file1.path)
head(orders)

image by author

# Products csv
file2.path = "./products.csv"
products = read.csv(file2.path)
head(products)

image by author

Combining both of them and forming a single transaction dataset -

# Combining both of them and forming a single transaction dataset -
data = left_join(orders, products, by = "product_id") %>% select(order_id, product_name)
head(data,50)

image by author

Let’s quickly explore our data. We can count the number of unique transactions and the number of unique products:

# We can count the number of unique transactions and the number of unique products
data %>%
 summarize(order.count = n_distinct(order_id))

image by author

data %>%
 summarize(product.count = n_distinct(product_name))

image by author

We have 131209 transactions and 39123 individual products. There is no information about the number of products purchased in a transaction. We have used the dplyr library to perform these aggregate calculations, which is a library used to perform efficient data wrangling on data frames.

writing it back to csv

# writing it back to csv
write.table(data,file =  "./data.csv", row.names = FALSE, sep = ";", quote = FALSE)

Association Rule Mining

We begin by reading the transactions stored in the csv file and creating an arules data structure called transactions.

# create an arules data structure called transactions
data.path = "./data.csv"
transactions.obj <- read.transactions(file = data.path, format = "single",
 sep = ";",
 header = TRUE,
 cols = c("order_id", "product_name"),
 rm.duplicates = FALSE,
 quote = "", skip = 0,
 encoding = "unknown")

Let's look at the parameters of read.transactions, the function used to create the transactions object. For the first parameter, file, we pass the file containing the retailer's transactions. The second parameter, format, can take either of two values, single or basket, depending on how the input data is organized. In our case, we have a tabular format with two columns: one column holds the unique identifier of the transaction and the other identifies a product present in that transaction. This format is named single by arules. Refer to the arules documentation for a detailed description of all the parameters.

On inspecting the newly created transactions object transactions.obj:

# inspecting the newly created transactions object transactions.obj
transactions.obj

## transactions in sparse format with
##  131209 transactions (rows) and
##  39121 items (columns)

We can see that there are 131209 transactions and 39121 products. The transaction count matches the dplyr output, while the item count is close to, but not identical with, the 39123 distinct product names counted earlier.

We can explore the most frequent items, that is, the items present in most of the transactions, as well as the least frequent items, which appear in far fewer transactions.

The itemFrequency function in the arules package comes to our rescue. This function takes a transaction object as input and produces the frequency count (the number of transactions containing this product) of the individual products:

data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute"), decreasing = TRUE), 10)) # Most frequent

image by author

data.frame(head(sort(itemFrequency(transactions.obj, type = "absolute"), decreasing = FALSE), 10)) # Least frequent

image by author

In the preceding code, we print the most and the least frequent items in our database using the itemFrequency function. The itemFrequency function produces all the items with their corresponding frequency and the number of transactions in which they appear. We wrap the sort function over itemFrequency to sort this output; the sorting order is decided by the decreasing parameter. When set to TRUE, it sorts the items in descending order based on their transaction frequency. We finally wrap the sort function using the head function to get the top 10 most/least frequent items.

The Banana product is the most frequently occurring across 18726 transactions. The itemFrequency method can also return the percentage of transactions rather than an absolute number if we set the type parameter to relative instead of absolute.

The purpose of this project is to focus on the method rather than the output. As the dataset source notes, the dataset includes orders from many different retailers and is a heavily biased subset of Instacart’s production data, so it is not a representative sample of their products, users, or purchasing behavior.

Another convenient way to inspect the frequency distribution of the items is to plot them visually as a histogram. The arules package provides the itemFrequencyPlot function to visualize the item frequency:

# itemFrequencyPlot function to visualize the item frequency
itemFrequencyPlot(transactions.obj,topN = 25)

image by author

The item frequency plot should give us some idea about the threshold that we should maintain for support. Usually, we should select a support threshold where the long tail starts.

Now that we have successfully created the transaction object, let’s proceed to apply the apriori algorithm to this transaction object.

The apriori algorithm works in two phases. The first phase finds the frequent itemsets, where an itemset is a group of product IDs. The algorithm makes multiple passes over the database; in the first pass, it finds the transaction frequency of every individual item. These are itemsets of order 1. This is where we introduce the first interest measure, support: the fraction of transactions that contain a given itemset.

Now, in the first pass, the algorithm calculates the transaction frequency for each product. At this stage, we have order 1 itemsets. We discard all those itemsets that fall below our support threshold. The assumption here is that items with a high transaction frequency are more interesting than the ones with a very low frequency. Items with very low support are not going to make for interesting rules further down the pipeline. Using the items that survive, we construct itemsets of two products and find their transaction frequency, that is, the number of transactions in which both items are present. Once again, we discard all the two-product itemsets (itemsets of order 2) that are below the given support threshold. We continue this way, increasing the order by one on each pass, until no itemset of the next order meets the support threshold.
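The two passes described above can be sketched in base R on a handful of made-up baskets. This is only an illustration of the level-wise idea, not how arules implements it; the baskets, the min.support value, and all variable names are assumptions for the example.

```r
# Hypothetical toy baskets (not the Instacart data)
baskets <- list(
  c("banana", "blueberries", "milk"),
  c("banana", "blueberries"),
  c("banana", "blueberries", "bread"),
  c("banana", "milk"),
  c("milk"),
  c("banana")
)
n <- length(baskets)
min.support <- 0.3  # assumed threshold for this toy example

# Pass 1: support of every single item (order 1 itemsets)
item.support <- table(unlist(lapply(baskets, unique))) / n
frequent.items <- names(item.support[item.support >= min.support])

# Pass 2: build candidate pairs only from the surviving items
candidate.pairs <- combn(frequent.items, 2, simplify = FALSE)
pair.support <- sapply(candidate.pairs, function(p) {
  mean(sapply(baskets, function(b) all(p %in% b)))
})
names(pair.support) <- sapply(candidate.pairs, paste, collapse = ", ")

# Discard the pairs that fall below the threshold
frequent.pairs <- pair.support[pair.support >= min.support]
print(frequent.items)
print(frequent.pairs)
```

Here bread is dropped in the first pass, so no pair containing bread is ever counted in the second pass. That pruning is what keeps the number of candidates manageable on a real transaction database.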

# Interest Measures
 support <- 0.005
# Frequent item sets
 parameters = list(
 support = support,
 minlen = 2, # Minimal number of items per item set
 maxlen = 10, # Maximal number of items per item set
 target = "frequent itemsets")
 freq.items <- apriori(transactions.obj, parameter = parameters)

image by author

The apriori method in arules is used here to find the frequent itemsets. It takes two arguments: transactions.obj and a named list of parameters. We create a named list called parameters. Inside the named list, we have an entry for our support threshold. We have set our support threshold to 0.005, namely, 0.5 percent of the transactions. We settled on this value by looking at the item frequency plot we drew earlier. By setting the target parameter to frequent itemsets, we specify that we expect the method to return the final frequent itemsets. minlen and maxlen set the lower and upper cutoffs on how many items we expect in our itemsets. By setting minlen to 2, we say we don’t want itemsets of order 1. While explaining phase one of apriori, we said that the algorithm makes many passes over the database and that each pass creates itemsets of order one greater than the previous pass. We also said apriori ends when no higher-order itemsets can be found. We don’t want our method to run until the end, so by using maxlen, we say that we stop once we reach itemsets of order 10. The apriori function returns an object of type itemsets.

It’s good practice to examine the created object, itemset in this case. A closer look at the itemset object should shed light on how we ended up using its properties to create our data frame of itemsets:

str(freq.items)

image by author

By calling the labels function and passing it the freq.items object, we retrieve the item names:

# Let us examine our frequent itemsets
 freq.items.df <- data.frame(item_set = labels(freq.items)
 , support = freq.items@quality)
head(freq.items.df,10)

image by author

Let’s move on to phase two, where we will induce rules from these itemsets. It’s time to introduce our second interest measure, confidence. Let’s take an itemset from the list given to us from phase one of the algorithm, {Banana, Blueberries}.

We have two possible rules here:

Banana => Blueberries: The presence of Banana in a transaction strongly suggests that Blueberries will also be there in the same transaction.
Blueberries => Banana: The presence of Blueberries in a transaction strongly suggests that Banana will also be there in the same transaction.

How often are these two rules found to be true in our database? The confidence score, our next interest measure, will help us measure this:

confidence <- 0.2 # Interest Measure
 
 parameters = list(
 support = support,
 confidence = confidence,
 minlen = 2, # Minimal number of items per item set
 maxlen = 10, # Maximal number of items per item set
 target = "rules"
 )
rules <- apriori(transactions.obj, parameter = parameters)

image by author

Once again, we use the apriori method; this time, we set the target parameter in our parameters named list to rules and also provide a confidence threshold. From the returned rules object, we build a data frame, rules.df, to explore the rules conveniently. For the given confidence threshold, it shows the set of rules produced by the algorithm:

# output data frame, rules.df
rules.df <- data.frame(rules = labels(rules), rules@quality)
head(rules.df)

image by author

Lift is also reported as another interest measure in the data frame.
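To make the three measures concrete, here is a minimal base R sketch that computes support, confidence, and lift by hand for one pair of items. The baskets and the helper name supp are made up for illustration; arules computes these values for us on the real data.

```r
# Hypothetical toy baskets (not the Instacart data)
baskets <- list(
  c("banana", "blueberries", "milk"),
  c("banana", "blueberries"),
  c("banana", "blueberries", "bread"),
  c("banana", "milk"),
  c("milk"),
  c("banana")
)

# Support: fraction of baskets that contain every item in the itemset
supp <- function(items) {
  mean(sapply(baskets, function(b) all(items %in% b)))
}

supp.banana <- supp("banana")
supp.blue   <- supp("blueberries")
supp.both   <- supp(c("banana", "blueberries"))

# Confidence depends on the direction of the rule
conf.banana.to.blue <- supp.both / supp.banana  # Banana => Blueberries
conf.blue.to.banana <- supp.both / supp.blue    # Blueberries => Banana

# Lift is symmetric: observed co-occurrence over what independence predicts
lift <- supp.both / (supp.banana * supp.blue)

print(c(conf.banana.to.blue, conf.blue.to.banana, lift))
```

The two confidence values differ because each divides the joint support by the support of a different left-hand side, while lift gives the same value in either direction.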

Alright, we have successfully implemented our association rule mining algorithm; we went under the hood to understand how the algorithm works in two phases to generate rules. We have examined three interest measures: support, confidence, and lift. Finally, we know that the lift can be leveraged to make cross-selling recommendations to our retail customers.

Given the rule A => B, lift measures how many times more often A and B occur together than we would expect if they were independent. A lift close to 1 suggests independence. There are other ways of testing this independence, such as a chi-square test or Fisher’s exact test. The arules package provides the is.significant method for both; its method parameter takes the value fisher or chisq depending on the test we wish to perform.

# is.significant method to do a Fisher test of independence
is.significant(rules, transactions.obj, method = "fisher")

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
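For intuition, the kind of 2x2 table such an independence test works on can be built by hand in base R. The sketch below uses made-up purchase indicators and base R's fisher.test; it is not the internal code of is.significant, and the one-sided alternative is an assumption chosen because we care about items appearing together more often than expected.

```r
# Hypothetical indicators: did each of 10 baskets contain the item?
avocado <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
bananas <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# 2x2 contingency table of left-hand side vs right-hand side presence
contingency <- table(lhs = avocado, rhs = bananas)
print(contingency)

# One-sided Fisher's exact test for positive association
test <- fisher.test(contingency, alternative = "greater")
p.value <- test$p.value
print(p.value)
```

With only ten baskets the p-value is large, so nothing can be concluded from this toy table; on the real data the same test runs over all the transactions for each rule.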

We have written a function called find.rules. It returns the top N rules for the given transactions, ranked by leverage; it reuses the rules and rules.df objects built above, so the support and confidence thresholds are the ones already set. We are interested in the top 10 rules.

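# Top N rules ranked by leverage
# Note: relies on the rules and rules.df objects created above
find.rules <- function(transactions, topN = 10){
 
 # Compute the additional interest measures for every rule
 other.im <- interestMeasure(rules, transactions = transactions)
 
 # Append conviction and leverage to the rules data frame
 rules.df <- cbind(rules.df, other.im[,c('conviction','leverage')])
 
 # Keep the best rules based on leverage
 best.rules.df <- head(rules.df[order(-rules.df$leverage),], topN)
 
 return(best.rules.df)
 }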

cross.sell.rules <- find.rules(transactions.obj)
cross.sell.rules$rules <- as.character(cross.sell.rules$rules)
cross.sell.rules

image by author

The first four entries have a lift value of 2 to 4, indicating that the products are not independent. These rules have a support of around 2 percent and a confidence of around 30 percent. But wait, what about leverage? These items have a leverage of about 1 percentage point. In other words, whatever is driving the co-occurrence results in a one-percentage-point increase in the probability of buying both together over what we would expect simply because they are popular items. Is that sufficient for cross-selling decisions? Maybe, maybe not; that is a business-dependent decision.

For the sake of this example, we recommend that the retailer uses these top products in its cross-selling campaign because, given the lift value, a customer who picks up an {Organic Hass Avocado} is more likely than the average customer to also pick up a {Bag of Organic Bananas}.

We have also included one other interest measure: conviction.

Conviction: Conviction is a measure to ascertain the direction of the rule. Unlike lift, conviction is sensitive to the rule direction. Conviction (A => B) is not the same as conviction (B => A). Conviction, with the sense of its direction, gives us a hint that targeting the customers of Organic Hass Avocado to cross-sell will yield more sales of Bag of Organic Bananas rather than the other way round.
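The difference between leverage and conviction is easier to see with numbers. The base R sketch below uses made-up purchase indicators (not the Instacart results) and the standard definitions: leverage is the joint support minus the product of the individual supports, and conviction of A => B is (1 - support(B)) / (1 - confidence(A => B)).

```r
# Hypothetical indicators: did each of 10 baskets contain the item?
avocado <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
bananas <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

supp.a  <- mean(avocado)            # support of avocado
supp.b  <- mean(bananas)            # support of bananas
supp.ab <- mean(avocado & bananas)  # support of both together

# Leverage is symmetric: extra co-occurrence beyond independence
leverage <- supp.ab - supp.a * supp.b

# Confidence and conviction depend on the rule direction
conf.a.to.b <- supp.ab / supp.a
conf.b.to.a <- supp.ab / supp.b
conviction.a.to.b <- (1 - supp.b) / (1 - conf.a.to.b)
conviction.b.to.a <- (1 - supp.a) / (1 - conf.b.to.a)

print(c(leverage = leverage,
        conviction.a.to.b = conviction.a.to.b,
        conviction.b.to.a = conviction.b.to.a))
```

In this toy data the conviction of avocado => bananas is higher than that of bananas => avocado, which is the kind of asymmetry that would favour targeting avocado buyers.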

Visualize the Rules

The plot.graph function is used to visualize the rules that we have shortlisted based on their leverage values. It internally uses a package called igraph to create a graph representation of the rules:

library(igraph)

# visualize the rules
plot.graph <- function(cross.sell.rules){
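 # Split each "{lhs} => {rhs}" label into its two vertex names;
 # splitting on ' => ' avoids stray spaces that would create duplicate vertices
 edges <- unlist(lapply(cross.sell.rules['rules'], strsplit, split = ' => ', fixed = TRUE))
 
 g <- make_graph(edges = edges)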
 plot(g)
}
plot.graph(cross.sell.rules)

image by author

Weighted Transactions

In its plain vanilla use, the arules package uses the frequency of the items in an itemset to measure support. We can instead provide explicit weights for the transactions, which then take the place of the plain transaction counts when computing support. This lets us force certain products into our association rules even though they may not be frequent (by assigning more weight to the transactions that contain them).

In the arules package, the weclat method uses weighted transactions to generate frequent itemsets based on these weights. We introduce the weights through the itemsetInfo data frame of the transactions object, which you can inspect with str(transactions.obj).

If explicit weights are not available, we can use an algorithm called hyperlink-induced topic search (HITS) to generate them for us. The basic idea of the HITS algorithm here is to assign weights such that a transaction with a lot of items is considered more important than a transaction with a single item.
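The idea can be sketched as a plain HITS power iteration on a small transaction-by-item matrix. This is a simplified illustration under assumed toy data, not the arules implementation: transactions act as hubs, items act as authorities, and the two scores reinforce each other until they settle.

```r
# Hypothetical 0/1 incidence matrix: rows are transactions, columns are items
incidence <- rbind(
  t1 = c(1, 1, 1),
  t2 = c(1, 1, 0),
  t3 = c(0, 0, 1)
)
colnames(incidence) <- c("banana", "blueberries", "milk")

# Start with equal hub scores for all transactions
hub <- rep(1, nrow(incidence))

for (i in 1:50) {
  # An item's authority is the sum of the hub scores of transactions containing it
  authority <- as.vector(t(incidence) %*% hub)
  # A transaction's hub score is the sum of the authorities of its items
  hub <- as.vector(incidence %*% authority)
  # Normalise so the weights sum to one
  hub <- hub / sum(hub)
}
names(hub) <- rownames(incidence)
print(round(hub, 3))
```

The three-item transaction t1 ends up with the largest weight and the single-item transaction t3 with the smallest, which matches the intuition described above.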

The arules package provides HITS through the hits method. So, for example, we can use hits to generate weights and then use the weclat method for weighted association rule mining.

# Generate transaction weights with the hits method from arules
weights.vector <- hits( transactions.obj, type = "relative")
weights.df <- data.frame(transactionID = labels(weights.vector), weight = weights.vector)

head(weights.df)

image by author

Citation

“The Instacart Online Grocery Shopping Dataset 2017”, accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on 3/09/2020

Must know ML techniques for digital analysts — Part 1: Association Analysis was originally published in TDS Archive on Medium, where people are continuing the conversation by highlighting and responding to this story.