Finding Meaningful Associations in Retail Data

Dan Isaza
Weekly Data Science
6 min read · Jun 20, 2018

Although this post will focus on finding associations in retail data, these same techniques can be applied to other types of data, including text and genetic sequences.

This post builds on a previous post, The Intuition Behind the Apriori Algorithm, where I discussed how to efficiently find frequent itemsets in shopping data. Here, I’ll focus on what to do once you’ve identified those frequent itemsets. It’s safe to assume that we’re dealing with the same data. (Example data is available in this GitHub repository.)


Vocabulary Recap

If you’ve read the Apriori post, these terms will be familiar.

  • Our dataset consists of shopping carts at time of checkout. We call these baskets.
  • Baskets consist of items. Groups of items are called itemsets.
  • The number of baskets that an itemset appears in is called the support of that itemset. (See the short code sketch just after this list.)
  • We deem an itemset to be a frequent itemset if its support is above some threshold that we choose.
  • Remember: If an itemset is frequent, all subsets of that itemset are also frequent. (This concept is discussed in detail in the Apriori post.)
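
To make the support definition concrete, here’s a minimal Python sketch with a few made-up baskets (illustrative only, not taken from the example datasets):

```python
# Support of an itemset = number of baskets containing every item in it.
def support(itemset, baskets):
    return sum(1 for basket in baskets if itemset <= basket)

toy_baskets = [
    {"eggs", "flour", "milk"},
    {"eggs", "flour"},
    {"milk", "bread"},
]

print(support({"eggs", "flour"}, toy_baskets))  # 2 -- it appears in the first two baskets
```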

What is an association rule?

An association rule implies that a particular item is likely to occur given the presence of some itemset.

Let I be the itemset {eggs, flour}. Let j be the item {milk}. Then I → j is an association rule that implies that {milk} is likely to occur in a shopping cart if {eggs, flour} occurs.

On its own, this association rule doesn’t tell us anything. We want to measure how significant or important our rule is. After all, maybe the appearance of milk is not affected by the presence of eggs and flour. Or maybe shoppers are less likely to buy milk when they buy eggs and flour.

We need some way of capturing this information.

Confidence

The confidence of an association rule is an important building block. It is the ratio between the number of baskets in which I and j appear together and the number of baskets in which I appears at all.

Confidence essentially answers the question: Out of all the times that people bought eggs and flour, how many times did they also buy milk? Or, more precisely: What proportion of the time?

Mathematically, we can say:

Confidence(I → j) = Support(I ∪ j) / Support(I)

In our example, I ∪ j is the set {eggs, flour, milk}.

Note that every time the set {eggs, flour, milk} occurs, the set {eggs, flour} also occurs. This means that Support(I) ≥ Support(I ∪ j), and since supports can never be negative, Support(I ∪ j) ≥ 0.

Thus, we know that 1 ≥ Confidence(I → j) ≥ 0.
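
In code, a rough sketch might look like this (same basket-of-sets representation as in the support sketch above; the helper names are just illustrative):

```python
def support(itemset, baskets):
    # Number of baskets that contain every item in `itemset`.
    return sum(1 for basket in baskets if itemset <= basket)

def confidence(I, j, baskets):
    # Confidence(I -> j) = Support(I ∪ {j}) / Support(I), where j is a single item.
    support_I = support(I, baskets)
    if support_I == 0:
        raise ValueError("I never appears, so Confidence(I -> j) is undefined")
    return support(I | {j}, baskets) / support_I
```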

Practice with Confidence

Let’s work through an example together.

What is the confidence of the association rule {eggs, flour} → {milk} in the dataset below?
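
For reference, here is one set of four baskets consistent with the supports used below (an illustrative reconstruction; the exact contents of the fourth basket don’t affect the calculation):

```python
# Illustrative reconstruction of the example dataset: 4 baskets.
baskets = [
    {"eggs", "flour", "milk"},
    {"eggs", "flour", "milk"},
    {"eggs", "flour", "milk"},
    {"bread", "butter"},
]
```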

First, notice that:

Support({eggs, flour}) = 3

and

Support({eggs, flour, milk}) = 3

Next, calculate the confidence:

Confidence({eggs, flour} → {milk}) = Support({eggs, flour, milk}) / Support({eggs, flour})

Confidence({eggs, flour} → {milk}) = 3 / 3 = 1
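
The same result falls out of the `confidence` helper sketched earlier, applied to the four baskets above:

```python
print(confidence({"eggs", "flour"}, "milk", baskets))  # 1.0
```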

Nice! Our rule has the highest possible confidence value. That must be good, right? …Actually, not quite.

What if {milk} occurs frequently independently of whether {eggs, flour} is present? Then we might observe a large confidence value despite the fact that I and j are independent.

What if people always buy milk?

Consider the case where every shopping cart contains milk.

Remember, confidence answers the question: What proportion of the time that I appears does j also appear?

If people always buy milk, the answer will be 1. We can clearly see that in this case, buying eggs and flour does not influence people’s decision to purchase milk, so this confidence score is not particularly helpful.

Interest

The notion of interest can help us here. It extends the idea of confidence by subtracting out the proportion of baskets in which j appears.

Interest(I → j) = Confidence(I → j) - Pr(j),

where Pr(j) = Support(j) / (Num. Baskets)

By doing this, we account for the fact that j may occur independently of I.

Since confidence is a number in the range [0, 1] and the proportion of baskets in which j occurs is also a number in the range [0, 1], the interest of an association rule is in the range [-1, 1].

When an association rule has an interest close to 0, it indicates that the presence of I does not imply much about the presence of j. When a rule’s interest has an absolute value that is relatively large (typically above 0.5 or so), it indicates that this is a meaningful association in the dataset. Negative interest indicates that the presence of I discourages the presence of j, and positive interest indicates that the presence of I encourages the presence of j.
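
Continuing the earlier sketches, interest is a small extension of confidence:

```python
def interest(I, j, baskets):
    # Interest(I -> j) = Confidence(I -> j) - Pr(j),
    # where Pr(j) is the fraction of all baskets that contain item j.
    # Uses the support() helper from the earlier sketch.
    conf = support(I | {j}, baskets) / support(I, baskets)
    pr_j = support({j}, baskets) / len(baskets)
    return conf - pr_j
```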

Practice with Interest

What is the interest of the association rule {eggs, flour} → {milk} in the dataset below? (same dataset as before)

We know from before that the confidence of this rule is 1. We can see that {milk} shows up in 3 out of the 4 baskets, so Pr(j) = 3/4, or 0.75.

Thus, Interest({eggs, flour} → {milk}) = 1 – 0.75 = 0.25.
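
The `interest` helper gives the same number on the four baskets above:

```python
print(interest({"eggs", "flour"}, "milk", baskets))  # 1.0 - 0.75 = 0.25
```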

Note that if milk had appeared in every basket, the rule would have an interest of 0.

Thus, we can see that although our rule had high confidence, it’s not actually a meaningful association in the dataset, as indicated by its low interest.

A closer look at one of the example datasets

If we look at the example dataset with 100,000 baskets and 5,000 possible items, we’ll see that almost all of the association rules have an interest of zero. This is to be expected, since I generated the dataset using this script, in which items are added to baskets independently of each other.

As expected, these are a bunch of meaningless associations.

Despite the fact that items are independent of each other, we can see that there are some association rules with non-zero interest scores. This happens purely by chance and does not imply that the rules are meaningful. That’s why we need to ensure that rules have relatively high interest scores before deeming them meaningful associations.
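
The script itself isn’t reproduced here, but a dataset like this could be generated along the following lines (the per-item probability is illustrative, not necessarily the value the script uses):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_BASKETS = 100_000
NUM_ITEMS = 5_000
P_ITEM = 0.002  # illustrative; chosen so the average basket holds about 10 items

# Each item enters each basket independently with probability P_ITEM,
# so any association between items arises purely by chance.
baskets = [
    set(np.flatnonzero(rng.random(NUM_ITEMS) < P_ITEM))
    for _ in range(NUM_BASKETS)
]
```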

Want to practice your skills?

Try finding meaningful associations in the example datasets I’ve set up on Github, or check out Instacart’s Market-Basket Analysis Challenge on Kaggle.

Coming Soon

I’ll be writing a quick post detailing why the expected interest of I → j is 0 when I and j are independent. Keep an eye out!

Thanks for reading!

Let me know what you think of this post in the comments below — and if there are any topics you’d like to see me write about, please let me know!


Dan Isaza
Weekly Data Science

Stanford Math & CS | VP of Engineering at Clever Real Estate | (he/him pronouns)