Entropy Sampling — How to balance “relevance” and “surprise” in recommendation

Reika Fujimura
10 min read · Feb 23, 2023


Photo by Priscilla Du Preez on Unsplash

This is the full version of the article originally posted by CBC Digital Labs.

Imagine opening your laptop and wandering around the internet for a while. You will probably see plenty of recommendations pop up in front of you, suggesting "these are the best items for you to check out next!" Reflecting on those moments, how many times have you found them irrelevant, biased, or simply terrible?

You might even find them intrusive, like a misplaced targeted advertisement customized for someone else. Or, just after buying a new air fryer, you might find the page filled with other air fryers that look almost identical, and wish you could tell the site that you already own a brand new one.

It is often difficult to balance relevance and diversity in real-world applications. After all, how can we get just the right amount of relevance, without filtering so aggressively that you miss the chance to discover new products, content, or items?

Indeed, there is plenty of research on how to introduce diversity into recommendations. For example, some papers suggest using maximal marginal relevance or intra-list diversity to reduce the redundancy of results while preserving their relevance. Thompson sampling tries to add diversity indirectly, by mixing some exploration into the pure recommendation results.

However, these approaches tend to be highly technical and heuristic, meaning they depend heavily on each business problem. There is still no simple, established, and generalizable method for this problem, which is why so many recommendations fail to feel "natural" to us.

Here, the Customization and Machine Learning (CaML) team at CBC tackled this problem with an original method that turns out to be not only robust but also generalizable across machine learning models. In this blog post, I will share our algorithm, which borrows the notion of "entropy" from statistical physics to balance "relevance" and "surprise" in our recommendations, and explain how to understand entropy in this new context.

Idea of Entropy

In the context of data science, entropy is frequently used as a measure of the impurity, disorder, or randomness of the information being processed. For example, the decision tree algorithm uses entropy as its impurity measure: it learns splits that make the information at each node as homogeneous as possible, minimizing the impurity of each leaf. Entropy is also the basis of (as their names suggest) relative entropy and cross entropy, which appear in many places, including dimensionality reduction algorithms such as t-SNE and UMAP.

In the broader perspective of information theory, entropy, or "Shannon entropy", refers to the "informational value" present in a variable. For example, if you read a news article reporting a typical event, such as a traffic jam on a snowy day, you are probably not surprised by the news; the "informational value" of the message is not significant. If, instead, an article pops up telling you that aliens just landed on Earth from a thousand light years away, that is a huge breaking news story. The rarer the event, the more information its occurrence carries.

The mathematical representation of Shannon entropy is the expected value of the log of the inverse probability:

$$H(p) = \sum_{i=0}^{n-1} p_i \log \frac{1}{p_i} = -\sum_{i=0}^{n-1} p_i \log p_i$$

where p = { p_0, p_1, …, p_i, …, p_{n-1} } is the probability mass function (PMF), and p_i is the probability of event i occurring.
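As a minimal sketch (the helper name and code are mine, not from our production system), this definition translates directly into NumPy:

```python
import numpy as np

def shannon_entropy(pmf: np.ndarray) -> float:
    """Shannon entropy H(p) = -sum_i p_i * log(p_i), with the convention 0 * log(0) = 0."""
    p = pmf[pmf > 0]                      # drop zero-probability events
    return float(-(p * np.log(p)).sum())  # entropy in nats
```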

If there is plenty of randomness, in other words if the probability distribution of a random variable is close to uniform, its entropy is high and approaches the maximum value log(n). Conversely, if there is little surprise or impurity, meaning the probability distribution is more skewed, the entropy shrinks toward zero. For example, the entropy of a random variable following the Bernoulli distribution with success probability p can be solved exactly as

$$H(p) = -p \log p - (1 - p) \log (1 - p)$$

and when plotted it displays this inverse relationship:

Binary entropy function

As the graph shows, the more skewed the probability distribution is, the smaller the entropy of the random variable becomes. The same holds for random variables with more than two outcomes.
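The binary entropy curve above can be reproduced with a short sketch (the plotting details are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(1e-6, 1 - 1e-6, 500)            # avoid log(0) at the endpoints
H = -(p * np.log(p) + (1 - p) * np.log(1 - p))  # binary entropy in nats

plt.plot(p, H)
plt.xlabel("p")
plt.ylabel("H(p)")
plt.title("Binary entropy function")
plt.show()
```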

Consider the multivariate case of rolling a die. We can no longer plot the entropy function as a line in two or even three dimensional space; we would need a six dimensional space! Given this limitation we need another way of understanding the entropy. One option is to visualize the shapes of the PMF and cumulative mass function (CMF) themselves, which hint at the approximate value of the entropy.

If you are rolling a fair die, the probability of getting each number is an even ⅙, so the PMF and CMF look like the following.

PMF and CMF of rolling a fair die

As shown above, when the probability distribution is completely even, the PMF is flat and so the CMF is a linear function. Here, the entropy can be calculated as

$$H = -\sum_{i=1}^{6} \frac{1}{6} \log \frac{1}{6} = \log 6$$

which is the maximum entropy.

Getting the maximum entropy means that we know nothing about which number will come up before rolling the die. The "information value" is maximized when we roll a purely random die.

Now consider a very unfair die. Its information value is minimized, because we know it will almost always return the same value, say, 1. In this case, the PMF is skewed to its limit and so its CMF becomes a step function, as shown below.

PMF and CMF of rolling a skewed die that only ever shows 1.

The entropy is calculated as

$$H = -1 \cdot \log 1 - \sum_{i=2}^{6} 0 \cdot \log 0 = 0$$

using the convention 0 · log 0 = 0. In contrast to the previous case, zero entropy means there is no additional information to be gained from rolling the die, simply because we know for sure beforehand that we will get 1.
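Reusing the shannon_entropy helper sketched earlier, both extremes are easy to verify numerically:

```python
fair_die = np.full(6, 1 / 6)                 # uniform PMF
loaded_die = np.array([1.0, 0, 0, 0, 0, 0])  # always rolls 1

print(shannon_entropy(fair_die))    # log(6) ≈ 1.792 nats, the maximum
print(shannon_entropy(loaded_die))  # 0.0, the outcome is certain
```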

Well, both of these cases sound like extremes. What happens if the entropy falls somewhere between the maximum and zero, say at a normalized entropy of 0.5?

One example of such a case is rolling a skewed die with the following probability distribution: {1: 40%, 2: 25%, 3: 16%, 4: 10%, 5: 5%, 6: 3%}

Now its PMF and CMF are shown in the graphs below.

Comparison of the PMF and CMF for rolling a skewed die, with examples of getting numbers at each x.
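The numbers behind these graphs can be reproduced with a cumulative sum (the probabilities are the ones from the example, renormalized so they sum to exactly 1):

```python
raw = np.array([0.40, 0.25, 0.16, 0.10, 0.05, 0.03])  # skewed die from the example
pmf = raw / raw.sum()   # renormalize into a valid probability distribution
cmf = np.cumsum(pmf)    # the CMF's elbow bends toward the left edge
```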

The more the PMF is skewed, the lower its entropy becomes, and the lower the entropy becomes, the closer the elbow of the CMF gets to the left edge. Put the other way around, as we decrease or increase the entropy, the outcome of each roll becomes more or less predictable.

Just as in the dice examples above, we define an "entropy" for the list of scores produced by a machine learning model, which indicates how likely each user is to like each new item. Let's take a deeper look at how we implemented this in the next section.

Balancing the relevance and diversity in the recommendation for a person who liked a coral pink cat. With too low “expected surprise”, the recommendation looks boring, whereas with too high “expected surprise” it looks irrelevant.

The Entropy Sampling Algorithm

Often in recommendation systems, the output of an algorithm is a set of lists that score items for each user. In our case of item-based collaborative filtering, the raw output of the machine learning model is a user-item matrix of similarity scores. A higher score in the user-item matrix means the item is more relevant to the user.

Item-based collaborative filtering

There are several ways to create recommendations from the raw similarity score matrix. First, let’s consider the simple and widely applicable case of recommending five CBC podcast shows to a user.

The recommendation matrix R, which collects the lists of recommendations for all users, is obtained by multiplying the user-item matrix U by the similarity score matrix S. The user-item matrix U stores each user's consumption history / ratings in its rows. So the recommendation matrix R can be thought of (considering the single-user case) as a summation of cosine similarity values for item-to-item relations, to be ranked. More precisely, this is a weighted sum whose weights are unique to each user's listening history.
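As a toy illustration (the shapes and numbers here are made up, not our production data), the whole computation is a single matrix product:

```python
import numpy as np

# U: users x items listening history (1 = listened); S: item-to-item cosine similarities
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

R = U @ S  # each row is one user's weighted sum of item-to-item similarities
```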

This recommender works like this: user A listened to show X, and show X tends to be liked by the same people who like show Y. Therefore, given that user A has not listened to show Y, it should be recommended to them. Let's look at the bare result first and see what this collaborative filtering model recommends.

Example recommendation for user A using the CBC Listen app with Top 5 sampling. A person who liked Metro Morning, which is a news show, gets a recommendation of Here and Now Toronto, Fresh Air, The Current, Ontario Today, and The Sunday Magazine, all of which are news shows.

These shows are reasonable recommendations, but they might be too obvious for the user's tastes. Furthermore, even if they are of interest, a user who opens the CBC Listen App the next day and sees the same recommendations again would find them awfully unhelpful fast!

So, let’s think about adding some randomness here to avoid the too obvious, too fixed recommendations we saw above. The most straightforward solution would be to do random sampling from the similarity score distribution. This means, from the user side, this similarity score distribution can be transformed into a probability distribution or probability mass function of getting each item in recommendations. Now, applying the random sampling technique recommendations for user A look like:

Example recommendation for user A sampled from the similarity score distribution. A person who liked Metro Morning, which is a news show, gets a recommendation of The Loop (Edmonton local news), Evil by Design (society), The Debaters (comedy show), World Report (news show), and Party Lines (politics).
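A minimal sketch of this baseline, with hypothetical scores standing in for one user's row of the matrix:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

scores = np.array([0.92, 0.85, 0.80, 0.40, 0.30, 0.25, 0.10, 0.05])  # hypothetical
pmf = scores / scores.sum()  # similarity scores -> probability distribution
recs = rng.choice(len(pmf), size=5, replace=False, p=pmf)  # sample 5 distinct shows
```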

Now another problem arises. This user liked Metro Morning, which is a news show, but got many shows that don't seem related. User A might be confused and wonder where The Loop or Party Lines came from.

Things are becoming clear: if we can adjust the "entropy" of the "PMF" (in this case, derived from the cosine similarity score matrix), we might be able to balance the randomness and achieve the best of both cases.

To adjust the PMF to achieve the optimal entropy, we defined a function that approximately inverts the procedure of calculating entropy: we input the entropy we want and get back an adjusted PMF that approximately achieves it.

If a function y = f(x) is non-negative and monotone, then the power function y = {f(x)}^γ, where γ is a positive real number, is also monotone.

If γ is zero, every value is transformed to unity, creating a uniform distribution of maximal entropy; as γ increases, the entropy is lowered and the distribution becomes more skewed. The idea is to find the optimal γ that lets us sample from the recommendation vector to the user's taste. In the implementation, we search over possible values of γ to quickly find the one that achieves the desired normalized entropy for the input PMF.

Tuning γ to get the optimal PMF.
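Here is a self-contained sketch of the whole procedure. The bisection bounds, tolerance, and target value are my own illustrative choices, not necessarily what runs in production:

```python
import numpy as np

def shannon_entropy(pmf: np.ndarray) -> float:
    """Shannon entropy in nats, with the convention 0 * log(0) = 0."""
    p = pmf[pmf > 0]
    return float(-(p * np.log(p)).sum())

def adjust_pmf(scores: np.ndarray, gamma: float) -> np.ndarray:
    """Raise the scores to the power gamma and renormalize into a PMF."""
    powered = scores ** gamma
    return powered / powered.sum()

def find_gamma(scores: np.ndarray, target: float, lo: float = 0.0,
               hi: float = 50.0, tol: float = 1e-3) -> float:
    """Bisect for the gamma whose adjusted PMF has the target normalized entropy.

    Normalized entropy H(p)/log(n) is 1 at gamma = 0 (uniform PMF) and decreases
    monotonically as gamma grows, which is what makes bisection valid.
    """
    max_entropy = np.log(len(scores))
    mid = (lo + hi) / 2
    for _ in range(100):
        mid = (lo + hi) / 2
        h = shannon_entropy(adjust_pmf(scores, mid)) / max_entropy
        if abs(h - target) < tol:
            break
        if h > target:   # still too random -> sharpen the distribution
            lo = mid
        else:            # too deterministic -> flatten it
            hi = mid
    return mid

# Entropy sampling: tune the PMF to the target entropy, then sample from it.
rng = np.random.default_rng(seed=0)
scores = np.array([0.92, 0.85, 0.80, 0.40, 0.30, 0.25, 0.10, 0.05])  # hypothetical
pmf = adjust_pmf(scores, find_gamma(scores, target=0.5))
recs = rng.choice(len(pmf), size=5, replace=False, p=pmf)
```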

How does it work?

As discussed above, we adjust the entropy of the PMF (the similarity score distribution, in this case) to balance surprise and relevance in our recommendations. After entropy sampling, the recommendations for User A look like this.

Example recommendation for user A sampled with entropy sampling. A person who liked Metro Morning, which is a news show, gets a recommendation of Podcast Playlist (Producer’s Pick), Fresh Air (news), Here and Now Toronto (news), Evil by Design (society), and Ontario Today (news).

Now we have much more relevant recommendations, like Fresh Air, Here and Now Toronto, and Ontario Today. At the same time, some new genres of shows appear. This list is not as boring as the Top 5 recommendations, and it is much more relevant than the random sampling recommendations.

Since we adjust the PMF for each user, every user's recommendations are optimized separately, and the effect of entropy sampling is robust across all users.

Conclusion

In this blog post, I showed how we balance "relevance" and "surprise" in our recommendations. Although there is plenty of research on how to introduce diversity into recommendations, we took a simple approach by introducing the concept of entropy.

As discussed in the results section, we saw an improvement in our recommendations, with better-tuned relevance and surprise at our delivery endpoint! Not only that, we find our approach robust and applicable to other recommendation models, because it is a simple post-processing method.

In classical thermodynamics, where entropy originally came from, entropy is a measure of the molecular disorder, or randomness, of a system. Our entropy sampling method lets us tune surprise much as raising the temperature lets us watch molecules mix together.
