Using Probability Modeling and the Beta-Binomial Distribution to Estimate Heavy and Light Users Among a Customer Base

Kyle
affinityanswers-tech
6 min read · Jan 17, 2023

At Affinity Answers we help restaurants turn data into action with our offering, the Restaurant Insider. The primary source of data for the Restaurant Insider is customer visitation information collected from multiple restaurant locations. When combined with social and purchase data, this information provides valuable insights into consumer behavior, allowing restaurants to make informed decisions.

For our clients, it is important to understand the behavior of their customers. One way to do this is by segmenting customers by frequency of visits. However, the data we have on each user typically spans only a few months, while heavy users are defined as those visiting more than 6 times in a year. To bridge this gap, we can use probability modeling to estimate the number of heavy and light users among their customer base.

The basic idea behind probability modeling is to estimate the likelihood that each user is a heavy or light user, based on the available data. Given that we only have a couple of months of data on each user, we can make the assumption that their behavior will continue at the same rate over the next year. This makes it possible to model their visitation for the next year as a binomial distribution.

The binomial distribution is a discrete probability distribution that describes the number of successes (visits) in a fixed number of trials (days). Given that the trials are independent and the probability of success on each trial is the same, the probability of k successes in n trials can be calculated using the following probability mass function:

P(X = k) = \binom{n}{k} p^k q^{n-k}

Binomial Probability Mass Function

where p is the probability that an individual user visits the store on any given day, n = 365 is the number of trials (days in a year), and q = 1 - p is the probability that the user doesn't visit on a given day.
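In scipy terms, evaluating this PMF is a one-liner. A minimal sketch, with an arbitrary illustrative value for p:

```python
from scipy.stats import binom

p = 0.01  # illustrative daily visit probability (not a fitted value)
n = 365   # trials: days in a year

# Probability of exactly k visits in the next year
for k in range(8):
    print(k, binom.pmf(k, n, p))
```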

By using this formula, we can estimate the probability that each user visits the store k times in the next year. Then we can sum these probabilities to get an estimated number of heavy and light users among our customer base. If, for example, we have 100 users and we estimate that each of them has an 80% chance of being a light user (visiting 6 or fewer times per year), we can expect that around 80 of those users will be light users and around 20 will be heavy users on average.
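As a concrete sketch of this aggregation (the per-user daily probabilities below are made up for illustration):

```python
import numpy as np
from scipy.stats import binom

N_DAYS = 365
HEAVY_THRESHOLD = 6  # heavy user: more than 6 visits per year

# Hypothetical per-user daily visit probabilities, one entry per user
daily_p = np.array([0.002, 0.010, 0.004, 0.030, 0.001])

# P(light) = P(visits <= 6) for each user under Binomial(365, p)
p_light = binom.cdf(HEAVY_THRESHOLD, N_DAYS, daily_p)

expected_light = p_light.sum()
expected_heavy = len(daily_p) - expected_light
print(f"Expected light users: {expected_light:.1f}")
print(f"Expected heavy users: {expected_heavy:.1f}")
```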

However, we can't rely solely on each individual user's observed data to determine their probability. If we only have a single day of data on someone, it would be unrealistic to assume that they visit every place they went that day once per day on average. Similarly, if we only have a month of data on someone, it would be unrealistic to assume that they visit every place they went at least once per month on average.

If we did, we would get p = 1/30 for someone seen once in a month, and the resulting binomial distribution of annual visits would put a roughly 98% chance on them being a heavy user: unlikely to say the least.
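A quick sketch of that naive calculation (the exact percentage depends on rounding and on how the threshold is applied, but it comes out implausibly high either way):

```python
from scipy.stats import binom

# Naive estimate: one visit in 30 observed days -> p = 1/30
p_naive = 1 / 30

# Probability of more than 6 visits in the next 365 days under this p
p_heavy = binom.sf(6, 365, p_naive)
print(f"P(heavy user) = {p_heavy:.0%}")
```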

To improve the accuracy of our predictions, we can use the users that have been in our data for a full year to develop a prior on the probability of visiting the store. This prior can be represented using the Beta distribution, a probability distribution over probabilities: it describes how likely each possible value of p is.

The Beta distribution has two parameters, alpha and beta, which (when starting from the common uniform prior) represent the number of successes (days they visited) + 1 and the number of failures (days they didn't visit) + 1. By combining the Beta distribution with a Binomial distribution, we can model the probability that each user is a heavy or light user, and sum those probabilities to get an estimated number of heavy and light users. Fortunately for us, this combination is a well-known and often-implemented distribution, aptly named the Beta-Binomial distribution.
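This predictive distribution ships with scipy as scipy.stats.betabinom. A minimal sketch, with illustrative (not fitted) alpha and beta values:

```python
from scipy.stats import betabinom

# Posterior over a user's daily visit probability: Beta(alpha, beta).
# These values are illustrative placeholders, not fitted to real data.
alpha, beta = 1.5, 120.0

# Predictive distribution of visit counts over the next 365 days
p_heavy = betabinom.sf(6, 365, alpha, beta)  # P(visits > 6)
print(f"P(heavy user) = {p_heavy:.0%}")
```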

To start finding our prior, we will reparameterize the beta distribution so that instead of our two parameters meaning number of successes and number of failures, they mean average probability and confidence in our probability. We’ll call these u and v, where u = alpha / (alpha + beta) and v = alpha + beta, representing mean and sample size (i.e. how confident we are in the mean).

u = \frac{\alpha}{\alpha + \beta}, \qquad v = \alpha + \beta \qquad \Longleftrightarrow \qquad \alpha = u \cdot v, \qquad \beta = (1 - u) \cdot v

Mean and Sample Size Reparametrization
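In code, converting between the two parameterizations is trivial; a sketch:

```python
def alpha_beta_to_uv(alpha: float, beta: float) -> tuple[float, float]:
    """Beta parameters -> (mean u, sample size v)."""
    return alpha / (alpha + beta), alpha + beta


def uv_to_alpha_beta(u: float, v: float) -> tuple[float, float]:
    """(mean u, sample size v) -> Beta parameters."""
    return u * v, (1.0 - u) * v
```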

We start by finding the mean visit rate of the users for whom we have a full year of data and use that as the prior for u. Let's say this is 2 visits per year, which as a daily probability is u = 2/365. For v, we will optimize it later, but for now let's assume it's 90.

From this, we can calculate the equivalent alpha and beta values for our prior. From that point, for each user we add 1 to alpha for each visit and 1 to beta for each day they didn't visit. This gives us a probability distribution for how often each person visits the store.

For our example user who has visited once per month, the updated distribution now gives them a 31% chance of being a heavy user: much more likely than our first attempt.
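Putting the pieces together, here is a sketch that roughly reproduces this worked example, using the prior values u = 2/365 and v = 90 from above (the exact output depends on those choices):

```python
from scipy.stats import betabinom

N_DAYS = 365
HEAVY_THRESHOLD = 6  # heavy user: more than 6 visits per year

# Prior from the full-year users: mean u, "sample size" v
u, v = 2 / 365, 90
alpha0, beta0 = u * v, (1 - u) * v

# Example user: observed for 30 days, visited once
days_observed, visits = 30, 1
alpha = alpha0 + visits
beta = beta0 + (days_observed - visits)

# Predictive distribution of visits over a full year
p_heavy = betabinom.sf(HEAVY_THRESHOLD, N_DAYS, alpha, beta)
print(f"P(heavy user) = {p_heavy:.0%}")  # in the neighborhood of the ~31% above
```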

Next, we can group them into heavy users and light users using a threshold, in our case visiting more than 6 times per year. By repeating this process for all the users, we can estimate the number of heavy and light users among our customer base.

We have one last thing to take care of: optimizing v. What I did in this case was to look at the first month, two months, and three months of data from users that had a full year's worth of data. I built a Beta-Binomial model for them based on each truncated window and determined the number of heavy and light users I would have expected. Then I compared this to the actual number of heavy and light users based on their full year's worth of data, and adjusted v to get closest to the actual results.
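A sketch of that search, assuming visits_30d holds each full-year user's visit count in their first 30 days and actual_heavy is the true heavy-user count over the full year (both hypothetical names):

```python
import numpy as np
from scipy.stats import betabinom

N_DAYS = 365
HEAVY_THRESHOLD = 6

def expected_heavy(visits, days_observed, u, v):
    """Expected number of heavy users, given truncated observation windows."""
    alpha = u * v + visits
    beta = (1 - u) * v + (days_observed - visits)
    return betabinom.sf(HEAVY_THRESHOLD, N_DAYS, alpha, beta).sum()

def optimize_v(visits_30d, actual_heavy, u, candidates=np.arange(10, 500, 10)):
    """Grid-search v to best match the observed number of heavy users."""
    errors = [abs(expected_heavy(visits_30d, 30, u, v) - actual_heavy)
              for v in candidates]
    return candidates[int(np.argmin(errors))]
```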

After determining the results for each month cutoff, I can decide the minimum amount of data each user needs for me to feel comfortable with the results.

When implementing this method, we need to consider the data we have. The model assumes that people can only visit a place once per day, and that the threshold of 6 visits per year is a fixed value. If either assumption doesn't hold, or if the definition of heavy and light users is different, the model should be adjusted accordingly.

Additionally, it's important to consider other factors that might influence the number of visits, such as seasonality, holidays, or changes in the local economy. These can affect visit counts and should be taken into account when interpreting the results.

In conclusion, segmenting customers by frequency of visits is an important tool for understanding customer behavior. By using probability modeling and the Beta-Binomial distribution, we can estimate the number of heavy and light users among our customer base with a method that can be adjusted to fit the data we have and any changes in our definition of heavy and light users.

A lot of thoughtful Data Science goes into our Restaurant Insider to make the insights truly actionable for our clients.
