Bayesian Penalty/Boost Function

Yingjie Miao
Keep It Up
Published in
5 min readJun 10, 2015

Scoring functions are widely used in our system, e.g. in search engine and recommendation service. In practice, a scoring function may be paired with a penalty/boost function for additional tweaks. Given an entity E to be scored, the final score is of form f(E)w(E), where f(E) is the original scoring function, and w(E) serves as the penalty/boost function. Usually, it is desirable to have a boosting/penalty function that takes account of historical data, which we shall discuss today.

To be concrete, consider a simple activity feed service. Suppose user U’s social network has a list of events/posts E_i. We would like to design a ranking algorithm to show these events to user U in certain order — and possibly discard some events if the scores turn out to be too low. Perhaps the simplest solution is to have a reverse chronological feed. This is equivalent to have a function f(t) on the timestamp of the event, such that an older event has lower score than a newer event. Of course, this kind of feed can be noisy: if a user V always has lots of events to share with his/her friends, and U almost always ignore events from V, showing V’s events to U can be annoying. Can we design a penalty/boost function based on the likelihood that U interacts with V’s events (e.g. click, comment, like)? In what follows, without loss of generality, we assume the penalty/boost function is of form w(x) = 1 + ɛ(x), where the drift function ɛ(x) measures how much the score is deviated from the original score.

Suppose we know that user V’s activities were shown to user U for a total of (m + n) times, and U clicked/liked m of them. Let θ = m / (m + n) be the ‘click through’ rate. With this data, a frequentist may go ahead and pick a monotonic increasing function ɛ(θ), e.g. ɛ(θ) = 2θ-1. Note that ɛ(1/2) = 0, i.e. if U is neutral about V’s event, there is no penalty or boost. This makes sense. A problem with this approach is that for (m, n) = (2, 1) and (m,n) = (20, 10), θ = 2/3 in both cases, which results in the same boosting effect. Intuitively, in the latter case, we have more data, and thus we are more certain that user U tends to like events from V, therefore the boosting can be stronger.

This issue can be addressed with a Bayesian approach. In general, let p(θ) be a probability density function and ɛ(θ) be a drift function, the marginalized drift is

This integral can be carried out via numerical integration or Monte Carlo methods. However, analytic solution exists when p(θ) is a beta distribution and ɛ(θ) is a polynomial.

First, a few words on beta distribution. By definition, the pdf is

where

is a normalization constant.

For m, n > 1, beta distribution B(x; m, n) has mode at (m-1)/(m+n-2). We plot three beta distributions: B(2+1, 1+1), B(4+1, 2+1), B(8+1, 4+1). (Blue, green, red, respectively.) The shape goes from more spread out to sharper. Note the three pdfs share the same mode at 2/3.

Now, let p(x) = B(x; m+1, n+1), and for simplicity, choose ɛ(x) = 2x-1, we have the marginalized drift E(m,n):

From here, one can derive a few interesting properties:
1) for m = n, E(m, n) = 0. Another way to look at this: when m = n, B(x; m+1, n+1) is symmetric about x = 1/2, while 2x-1 is anti-symmetric about x = 1/2, so the integral must be zero.
2) Let α > 1 be a constant, and consider a family of (m, n) of the form (m, n) = (α k, k). Write E(m, n) = E(α k, k) = E_k, one can show that:

This shows an advantage of the Bayesian approach over the frequentist approach: θ = α / (1 + α) for any k, so a frequentist gives same boosting value. However, in the Bayesian case, as we collect more data (k increases), the boosting effect is stronger. Similarly, if α < 1 the penalty effect becomes stronger as k increases.

3) In 2), as k goes to infinity, the pdf becomes a delta function at α / (1 + α). The integral can be carried out by plugging α / (1 + α) into ɛ(x) = 2x-1, which gives (α-1)/(α+1). The same result can be obtained by plugging (m, n) = (α k, k) into (m-n)/(m+n+3) and let k go to infinity.

Although it may be possible to generalize the above discussion into general polynomials that are anti-symmetric about x = 1/2, the details are beyond the scope of this blog post. The point is that, Bayesian approach may provide rich and interesting properties that a frequentist counterpart may not enjoy.

We wrote this post while working on Kifi — Connecting people with knowledge. Learn more.

Originally published at eng.kifi.com on June 10, 2015.

--

--