Modeling customer behavior with Markov chains

Manasa Gudimella
Data Science at Microsoft
May 16, 2023 · 11 min read
Image generated by a Microsoft-internal implementation of DALL·E 2.

It’s natural for customers to differ in their consumption patterns and behaviors. As cloud services professionals who serve these customers, we must understand these patterns and how they evolve over time. Developing the ability to predict customer behavior offers a strategic business advantage, as it allows cloud service providers to tailor strategies to serve customers across different economic scenarios. In fact, given the competitive nature of the cloud services industry, it may be considered a necessity to have such predictive models in place.

In this article, I cover approaches to modeling and predicting customer behavior. Specifically, I consider modeling the behaviors of cloud customers with regard to their usage of various cloud products, such as Microsoft Sentinel. The behavior is measured in terms of data consumed by the customer during usage. The unit of consumption I use in this article is GB (gigabytes).

Consumption is measured over a period of time, and the billing cycle is a practical unit to use. For the problem under consideration, the billing cycle is a month. Let 𝑥ₖ be the consumption by a customer in the kᵗʰ month. Typically, the values of 𝑥ₖ fall in a range such as 0 to 2000 GB. Our interest is in understanding the behavior of 𝑥ₖ over k = 1, 2, 3, …. In other words, if we know the past behavior of these customers, can we predict the future? For a cloud service provider, the aggregate of the customers’ behaviors is of primary interest.

It may be desirable to categorize customers into homogeneous groups so that each group can be served with relevant promotions. To aid comprehension, we discretize consumption into buckets, which enables us to model the problem as a Markov process. We shall model the problem using Markov chains and try to gain some insight into it.

Markov chains

A Markov chain (MC) is a sequence of random variables 𝑥₀, 𝑥₁, 𝑥₂, … with a relaxed concept of independence: for any 𝑛 ≥ 1, the conditional distribution of 𝑥ₙ given 𝑥₀, 𝑥₁, …, 𝑥ₙ₋₁ is the same as that of 𝑥ₙ given 𝑥ₙ₋₁ alone. As a simple example of a MC, let 𝑥ₙ be the number of heads in the first 𝑛 tosses of a fair coin. If there are seven heads in the first 10 tosses (i.e., 𝑥₁₀ = 7), then either 𝑥₁₁ = 7 or 𝑥₁₁ = 8 with equal probability; it does not matter how the seven heads were obtained in the first 10 tosses. For the following discussion, we shall assume that 𝑥ₙ, 𝑛 = 0, 1, 2, … is a MC.
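To make the definition concrete, here is a minimal Python sketch that simulates this coin-toss chain (the helper name and the fair-coin default are my own choices):

```python
import random

def heads_count_path(n_tosses: int, p: float = 0.5) -> list[int]:
    """Simulate x_n, the number of heads in the first n tosses of a coin."""
    path = [0]  # x_0 = 0: no tosses yet, no heads
    for _ in range(n_tosses):
        # The next value depends only on the current count, not on the order
        # in which earlier heads appeared: the Markov property.
        path.append(path[-1] + (random.random() < p))
    return path

print(heads_count_path(10))  # one realization of x_0, x_1, ..., x_10
```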

State Space, or 𝑆, is the set of all possible values of the 𝑥ₙ’s. The elements of 𝑆 are called states of the system. If 𝑥ₙ = 𝑖, we say that the system is in state 𝑖 at time 𝑛. For this article, we shall assume that 𝑆 is a finite set.

Transition probabilities involve the conditional probabilities 𝑝ᵢⱼ⁽ᵐ⁾(𝑘) = 𝑃(𝑥ₖ₊ₘ = 𝑗 | 𝑥ₖ = 𝑖), which are called 𝑚-step transition probabilities. A MC is said to have stationary transition probabilities provided 𝑝ᵢⱼ⁽ᵐ⁾(𝑘) is independent of 𝑘. Unless stated otherwise, we shall assume that the MCs under discussion have stationary transition probabilities. The one-step transition probabilities play a crucial role in the analysis of systems that are modeled as MCs. For brevity, the one-step transition probabilities are denoted by 𝑝ᵢⱼ instead of 𝑝ᵢⱼ⁽¹⁾. The one-step transition probabilities can be arranged in a square matrix 𝑃 in which 𝑝ᵢⱼ is the (𝑖, 𝑗)ᵗʰ element. The order of 𝑃 is 𝑁, the number of states in the system. What follows is a simple example of a MC.

Example 1

Let us consider the example of a cloud customer. In any given month 𝑘, the customer is in one of the following four states:

  • State 1 (low usage): usage is between 0 to 25 GB.
  • State 2 (moderate usage): usage is above 25 GB but ≤ 200 GB.
  • State 3 (high usage): usage is above 200 GB.
  • State 0 (left the product): stopped using the product for good.

Suppose the one-step transition probabilities among these states are collected in a matrix 𝑃. Note that the entries in each row add to 1. This is a generic feature of any one-step transition matrix (in fact, it is true of any 𝑛-step transition matrix with 𝑛 ≥ 1). Next, state 0 is a special state: once the system enters it, it is stuck there forever. Such states are called absorbing states. Because 𝑝₁₀ > 0, there is a positive chance that a system starting in state 1 eventually gets stuck in state 0 forever. For this reason, state 1 is called a transient state. Because there is a positive probability of going from 3 to 1, there is a positive probability of going from 3 to 0 as well; in other words, state 3 is also transient. And because it is possible to go from 2 to 3, state 2 is transient too.
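To make this concrete, here is a minimal NumPy sketch. The entries of 𝑃 below are illustrative assumptions of my own, not the values used in this example, but the structure (an absorbing state 0 plus three transient states) is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# States ordered 0 (left for good), 1 (low), 2 (moderate), 3 (high usage).
# Illustrative entries only; they are assumptions, not this example's values.
P = np.array([[1.00, 0.00, 0.00, 0.00],   # state 0 is absorbing
              [0.05, 0.60, 0.30, 0.05],
              [0.02, 0.20, 0.50, 0.28],
              [0.01, 0.09, 0.30, 0.60]])
assert np.allclose(P.sum(axis=1), 1.0)    # every row sums to 1

# Simulate one customer's random path for 24 months, starting in low usage.
state, path = 1, [1]
for _ in range(24):
    state = rng.choice(4, p=P[state])     # next state depends only on the current one
    path.append(int(state))
print(path)                               # trailing zeros once the customer leaves
```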

There are some important questions to consider regarding the example above. Suppose the customer generates revenue commensurate with usage: if the customer is in state 1, the charge is $3 per month (pm); for state 2 it is $7 pm; and for state 3 it is $10 pm. An interesting question is then: How much revenue can be expected from a customer with the above one-step transition matrix?

Note that the one-step transition matrix entirely determines the behavior of the customer (under the assumption of stationarity, of course). Let’s simplify and look at the revenue of this customer in the first year, assuming the customer started out in the low usage state (𝑥₀ = 1). Suppose the behavior of the customer in the first year is as follows: 𝑥₁ = 2, 𝑥₂ = 2, 𝑥₃ = 2, 𝑥₄ = 3, 𝑥₅ = 2, 𝑥₆ = 3, 𝑥₇ = 1, 𝑥₈ = 2, 𝑥₉ = 3, 𝑥₁₀ = 2, 𝑥₁₁ = 1. This means that in the first 12 months, the customer was in state 1 three times, in state 2 six times, and in state 3 three times. Therefore, the revenue in the first 12 months is 3𝑟₁ + 6𝑟₂ + 3𝑟₃, where 𝑟₁ = 3, 𝑟₂ = 7, and 𝑟₃ = 10. So, for this known behavior over the first 12 months, the revenue is $81.

As we see, it’s easy to compute revenue when the behavior is known. However, what we don’t know is the future behavior of a customer who is initially trying out a cloud product. All that we know is that the first-month consumption is low, i.e., 𝑥₀ = 1. Given this information, let us compute the expected revenue for month 𝑘. This is 𝑟₁ with probability 𝑝₁₁⁽ᵏ⁾, 𝑟₂ with probability 𝑝₁₂⁽ᵏ⁾, and 𝑟₃ with probability 𝑝₁₃⁽ᵏ⁾. Therefore, the expected revenue for month 𝑘 is 𝑅₁⁽ᵏ⁾ = 𝑟₁𝑝₁₁⁽ᵏ⁾ + 𝑟₂𝑝₁₂⁽ᵏ⁾ + 𝑟₃𝑝₁₃⁽ᵏ⁾. Note that 𝑅₁⁽ᵏ⁾ is nothing but the first entry of the matrix product 𝑃⁽ᵏ⁾𝑟, where 𝑝ᵢⱼ⁽ᵏ⁾ denotes the (𝑖, 𝑗)ᵗʰ entry of 𝑃⁽ᵏ⁾ and 𝑟 = (𝑟₁, 𝑟₂, 𝑟₃, 0)ᵀ, with the states ordered (1, 2, 3, 0) and the last entry zero because state 0 generates no revenue. An interesting and very useful property of a MC with stationary transition probabilities is that 𝑃⁽ᵏ⁾ = 𝑃ᵏ. That is, 𝑝ᵢⱼ⁽ᵏ⁾ is the (𝑖, 𝑗)ᵗʰ entry of 𝑃 raised to the power 𝑘. Therefore, the total expected revenue over the first 12 months is the first entry of the vector (∑ₖ₌₁¹² 𝑃ᵏ)𝑟.

If we wish to find the expected revenue given that the customer started with moderate usage, i.e., 𝑥₀ = 2, the answer is the second entry of the vector (∑ₖ₌₁¹² 𝑃ᵏ)𝑟.
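As a sketch of this computation, here is the 12-month revenue calculation in NumPy, reusing the illustrative matrix from above (again, assumed entries, so the numbers will not match the figures of this example):

```python
import numpy as np

# Same illustrative P as before; states ordered 0, 1, 2, 3 here, so the
# low-usage entry is index 1. The entries are assumptions.
P = np.array([[1.00, 0.00, 0.00, 0.00],
              [0.05, 0.60, 0.30, 0.05],
              [0.02, 0.20, 0.50, 0.28],
              [0.01, 0.09, 0.30, 0.60]])
r = np.array([0.0, 3.0, 7.0, 10.0])   # monthly revenue by state; state 0 earns nothing

# Total expected revenue over months 1..12: the entry of (sum_k P^k) r
# corresponding to the starting state.
S = sum(np.linalg.matrix_power(P, k) for k in range(1, 13))
print((S @ r)[1])   # starting from low usage (state 1)
print((S @ r)[2])   # starting from moderate usage (state 2)
```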

For a MC, any sequence of possible values of 𝑥₀, 𝑥₁, 𝑥₂, … is called a random path. Thus, the behavior of a customer is nothing but a random path. In the above example, we had 𝑥₀ = 1, 𝑥₁ = 2, 𝑥₂ = 2, 𝑥₃ = 2, 𝑥₄ = 3, 𝑥₅ = 2, 𝑥₆ = 3, 𝑥₇ = 1, 𝑥₈ = 2, 𝑥₉ = 3, 𝑥₁₀ = 2, 𝑥₁₁ = 1. If 𝑥₁₂ = 0, then 𝑥ₙ = 0 for all 𝑛 ≥ 12, and the resulting path is (12223231232100…). This brings us to our second question: What is the expected number of months that the customer is going to stay with the product? Fortunately, there is some nice mathematics that provides an answer. Let 𝑄 be the submatrix of 𝑃 that corresponds to only the transient states; for the problem in question, 𝑄 is the 3 × 3 block of 𝑃 with the rows and columns for states 1, 2, and 3. Compute the inverse of 𝐼 − 𝑄, where 𝐼 is the identity matrix of order 3, and then sum each row of the resulting matrix. In matrix notation, compute (𝐼 − 𝑄)⁻¹𝑒, where 𝑒 is the column vector of all ones.

The result: a customer who starts with low usage is expected to stay for 33.75 months, while a customer starting with high usage is expected to stay for 45 months.
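A sketch of this computation, with the same assumed entries as before (so the output will not reproduce the 33.75 and 45 figures):

```python
import numpy as np

# Illustrative P from the sketches above; Q keeps only transient states 1, 2, 3.
P = np.array([[1.00, 0.00, 0.00, 0.00],
              [0.05, 0.60, 0.30, 0.05],
              [0.02, 0.20, 0.50, 0.28],
              [0.01, 0.09, 0.30, 0.60]])
Q = P[1:, 1:]

# Expected months with the product before absorption in state 0: (I - Q)^{-1} e.
expected_months = np.linalg.inv(np.eye(3) - Q) @ np.ones(3)
print(expected_months)   # one entry per starting state (low, moderate, high)
```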

Example 2

Consider a product, such as a spreadsheet, offered by two different companies. A new customer wants to take a lifetime membership with one of the two companies, but before deciding, the customer tries both products. What is the state space if we model this problem as a MC? Let us define four states: (i) temporarily with Spreadsheet A (state 1); (ii) temporarily with Spreadsheet B (state 2); (iii) permanently with Spreadsheet A (state 3); and (iv) permanently with Spreadsheet B (state 4). Suppose we are given the one-step transition matrix 𝑃 for this problem.

Note that states 1 and 2 are transient, and that there are now two absorbing states, 3 and 4. As before, we can raise two questions here:

  1. How long does the customer remain in the trial period?
  2. What is the probability that the customer ultimately decides to become a lifelong customer of Spreadsheet A?

The answers are in (𝐼 − 𝑄)⁻¹, where 𝑄 is the 2 × 2 submatrix of 𝑃 corresponding to the transient states 1 and 2.

Therefore, a customer who starts with the Spreadsheet A trial is expected to take 2.083 (= 1.250 + 0.833) months before becoming a lifetime member, and one who starts with Spreadsheet B is expected to take 1.944 months. This answers the first question. For the second question, we compute (𝐼 − 𝑄)⁻¹𝑞, where 𝑞 = (0.4, 0.3)ᵀ holds the one-step probabilities of moving from the two trial states to permanent membership with Spreadsheet A. The result is (0.75, 0.67)ᵀ. So, the probability that a customer who starts with Spreadsheet A becomes a lifetime member of Spreadsheet A is 0.75, and the probability that someone who starts with Spreadsheet B becomes a lifetime member of Spreadsheet A is 0.67. To compute the corresponding probabilities of being absorbed into Spreadsheet B, use 𝑞 = (0.1, 0.3)ᵀ.
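Here is a sketch of the whole Example 2 computation. The absorption vectors 𝑞 match those quoted above, but the trial-to-trial block 𝑄 is my own assumption, so the outputs differ from the numbers in the text:

```python
import numpy as np

# Transient states: 1 (trial A), 2 (trial B). Q below is an assumed
# trial-to-trial block; q_A and q_B are the vectors quoted in the text.
Q   = np.array([[0.2, 0.3],
                [0.1, 0.3]])
q_A = np.array([0.4, 0.3])   # one-step probabilities of going permanent with A
q_B = np.array([0.1, 0.3])   # one-step probabilities of going permanent with B

N = np.linalg.inv(np.eye(2) - Q)   # the fundamental matrix (I - Q)^{-1}
print(N @ np.ones(2))              # expected trial length by starting product
print(N @ q_A)                     # probability of becoming a lifetime A customer
print(N @ q_B)                     # probability of becoming a lifetime B customer
```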

Example 3

Let us now consider a cloud customer who will never leave the company. This is the same setup as in Example 1, except that there is no state 0. In other words, the state space 𝑆 consists of three states: low usage (1), moderate usage (2), and high usage (3). Suppose we are given the one-step transition matrix 𝑃 for this problem.

We say that a state 𝑖 leads to another state 𝑗 if we can find a sequence of states 𝑖₁, 𝑖₂, …, 𝑖ₖ such that 𝑝ᵢᵢ₁, 𝑝ᵢ₁ᵢ₂, …, 𝑝ᵢₖ₋₁ᵢₖ, and 𝑝ᵢₖⱼ are all positive for some nonnegative integer 𝑘. When 𝑖 leads to 𝑗, we write 𝑖 ⇝ 𝑗. In Example 1, note that 1 ⇝ 2 and 2 ⇝ 3, as 𝑝₁₂ and 𝑝₂₃ are positive. Also, 1 ⇝ 3; to see that, take 𝑘 = 1 and 𝑖₁ = 2. A nice tool for understanding which states lead to which other states is the graphical representation of a one-step transition matrix. The 𝑃 of Example 1 can be represented as shown in Figure 1.

Figure 1: Graphical representation of a one-step transition matrix.

A MC is said to be irreducible if 𝑖 ⇝ 𝑗 for any 𝑖, 𝑗 ∈ 𝑆. The MC of Example 3 is an example of an irreducible MC. The behavior of customers represented by an irreducible MC has an interesting property: over time, they tend to develop a pattern of usage, and their behavior becomes more predictable. This condition is called the steady state condition. Under steady state conditions, we can figure out the probabilities of the customer being in different states. For the customer of Example 3, under steady state conditions, the customer is in state 1 with probability 0.17, in state 2 with probability 0.44, and in state 3 with probability 0.39. We can present this in the form of a vector as 𝜋 = (0.17, 0.44, 0.39). This 𝜋 is called the steady state distribution. If 𝑃 is the one-step transition matrix of an irreducible, aperiodic MC with a finite state space, then all the rows of 𝑃ⁿ converge to the steady state distribution as 𝑛 grows. The time taken for stabilization (the time to reach the steady state condition) depends on 𝑃. For the 𝑃 of Example 3, the rows stabilize by 𝑛 = 5; that is, the customer’s behavior becomes predictable from the fifth month onward.

An important and useful implication is that we can forecast the expected revenue from the customer under steady state conditions. The expected revenue from the customer of Example 3 under steady state conditions is 3 × 0.17 + 7 × 0.44 + 10 × 0.39 = 7.49, or $7.49 per month.
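Finally, a sketch of the steady state computation, again with an illustrative matrix of my own (so the resulting 𝜋 differs from the (0.17, 0.44, 0.39) above). The stationary 𝜋 solves 𝜋𝑃 = 𝜋 with entries summing to 1:

```python
import numpy as np

# Illustrative irreducible 3-state matrix (low, moderate, high usage);
# assumed entries, not those of Example 3.
P = np.array([[0.3, 0.5, 0.2],
              [0.1, 0.5, 0.4],
              [0.1, 0.3, 0.6]])

# Solve pi P = pi subject to sum(pi) = 1 via least squares.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi = np.linalg.lstsq(A, b, rcond=None)[0]
print(pi)                                  # steady state distribution
print(np.linalg.matrix_power(P, 50)[0])    # rows of P^n converge to pi

r = np.array([3.0, 7.0, 10.0])
print(pi @ r)                              # long-run expected revenue per month
```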

Conclusion

In this article, we have seen how Markov chains can be used to model the patterns and behavior of cloud customers. In the sequel to this article, I will show how we can apply these learnings to a real-world scenario that involves understanding the different behaviors of customers and categorizing them into homogeneous groups. Once the homogeneous groups are identified, each group can be analyzed using the models proposed in this article. A brief and nice presentation of Markov chains can be found in the book Operations Research: An Introduction by Hamdy A. Taha.

In conclusion, as a famous quote often attributed to Richard Feynman goes, “You do not really understand something unless you can explain it to your grandmother.” In the spirit of deepening our understanding, programming in Python to solve problems, such as simulating Markov chains, can be an engaging and insightful experience. It’s surprisingly easy too, requiring just a few lines of code. If you’re interested, feel free to check out my Python notebook here.

Manasa Gudimella is on LinkedIn.
