Segmentation, recommendation and marketing optimization algorithm in 24 lines of code
Have you ever thought how do recommendation systems work, how to prepare an interpretable segmentation or optimize your marketing campaign target group? I have good news for you! After reading this article, you will know the answer to all of these questions on a fundamental level. Let me introduce you to Non-negative matrix factorization (NMF) algorithm.
This algorithm is used in a vast number of fields including image processing, text mining, clustering, collaborative filtering, and community detection. Its most significant assets are speed, ease of interpretation and versatility.
The article is intended to be an introductory one into NMF and recommendation systems. In the first part, you can find some theory behind the algorithm. The next one is a walk-through a toy example of segmentation, recommendation, and marketing optimization. The subsequent part consists of some projects examples where NMF could be useful. The last part contains a list of sources I gathered while writing this article and Python code used to prepare the toy example.
It’s quite simple: you put your clients as columns and products/ratings as rows of an array (let’s call it V). As values, you should put adequate statistics like a number of purchases or rating.
Grocery purchases example:
and TV shows rating:
I tried to keep it simple, but basic linear algebra knowledge is essential to this part. Our goal in NMF is to approximate this matrix by the dot product of two arrays W and H. Dimensions of the arrays are defined by dimensions of V and number of components we set to the algorithm. If V has n rows and m columns and we want to decompose it to k components, then W has n rows, and k columns and H has k rows and m columns.
This is actually matrix factorization part of the algorithm. The Non-negative part refers to V, W, and H — all the values have to be equal or greater than zero, i.e., non-negative.
Of course usually, it’s impossible to reconstruct the initial matrix precisely. We want to be as “close” as possible to the initial array. To measure the distance, we can use Frobenius norm:
which is the default one in Python’s Scikit-learn package. For more details, please refer to the package’s documentation.
How does it look at our toy grocery example? As a toy example, I’ve prepared 3 components factorization of the grocery purchases matrix.
For the purpose of this article, we can call the W matrix a segment defining array. How to interpret it? It’s not as hard as it sounds: just look at the values (weights — note that they do not sum up to 1) in each column. The higher the weight, the more “determined” the column (segment) is by the variable in the row. This is the place where non-negative constraint pays-off. Could you think how to interpret negative values if positive corresponds to “belongs to” and zero means “does not belong”? I can not.
For instance, I’ve named one segment “Bread eaters,” because it is almost entirely driven by bread consumption. “Fruit pikers” are driven by two product categories — Fruits and Sweets. The third one is a mixed segment with leading Vegetable category.
We can also look at W matrix from another perspective. If we look at the values by rows we can interpret them as follows: provided somebody bought product X, what is the additional assignment weight to the segment. For instance, Coffee purchase contributes exclusively to “Veggies” segment and Bread for both “Bread Eaters” and “Veggies” with higher weight towards the first one.
Let’s move to the H matrix now. The higher the weight value, the more the person belongs to the specific segment. Some people like John can be assigned in 100% to one cluster, and some people like Peter belong to all the segments with some weights.
As a result of interpreting both these matrices, we obtain a customer segmentation with interpretable segments.
Recommendation (collaborative filtering)
Can we use the mechanism to prepare food recommendations for people? Yes, and it’s easier than you may think. By multiplying W and H, we obtain initial V matrix approximation:
This reconstructed matrix serves as a basis to the recommendation. The process of assigning values for previously unknown values (zeros in our case) is called collaborative filtering. We can find attraction weight towards certain products in columns of the matrix. By sorting the values in descending order, we could determine which products should be proposed to the customer to match their preferences. For instance, Mary should be offered products in the following order Bread, Fruits, and Sweets. Recommendation order for Alice: Fruits, Bread, Sweets, Vegetables, and Coffee.
We can also use the reconstructed matrix in another fashion. Let’s say we prepare Coffee marketing campaign and have funds to communicate with 4 people. How to determine who to contact?
In our toy example, only Peter bought Coffee. We can use Coffee row from the reconstructed matrix to determine the most adequate target group. For instance Peter (since he already bought it once) and Jennifer, Alice, and Greg. Note that Jennifer is predicted to be prone to buy Coffee since she has almost the same purchasing history as Peter.
This is it! In a few steps, we prepared customer segmentation, recommendation system, and marketing campaign optimization tool. How cool is that?
One of the examples of Non-negative Matrix Factorization usage is Wikipedia articles topic categorization. Similarly, you can classify any documents you have, i.e., emails, forms, correspondence or phone calls transcripts. As column names, we would use articles’ titles and as row names words. Values populating the matrix would describe the number of word occurrences in the article (or tf-idf weight in the more advanced model). In helpline example, we could discover the most popular problems groups reported.
Another example is a recommendation engine based on online behavior like purchases on Amazon, movies watched on Netflix or posts upvoted on Reddit. The matrix would look like the one from our toy example. In rows, we would see products/movies/posts. The values in the array would indicate if somebody purchased the product/watched the movie/upvoted the post. The output of the engine would be the top 3 offers/suggestions suitable for the user.
An exciting and a bit controversial project is connected with HealthTech field. Based on medical history, a recommendation system could suggest the next specialist for the patient to visit or the examination to be made. The topic is discussed in one of the articles listed in the notes section.
Some notes about the algorithm:
- The NMF algorithm may have problems if the values are not independent. You can check the video with a Haesun Park’s lecture on YouTube. This occurred to me as I was trying to prepare a toy example on Iris dataset.
- One way of dealing with missing values (like in IMDB rating example) is to omit them in cost calculation of the approximation. For details, please refer to Learning from Incomplete Ratings Using Non-negative Matrix Factorization.
- Another non-negative algorithm for matrix factorization is called Latent Dirichlet Allocation which is based on Bayesian inference.
- Comprehensive study of NMF algorithm The Why and How of Nonnegative Matrix Factorization by Nicolas Gillis.
- Deep Learning approach to recommendation systems by Jacob Schreiber — Deep matrix factorization using Apache MXNet
- Exciting Healthtech example of NMF usage — Recommender Systems for Health Informatics: State-of-the-Art and Future Perspectives
The code example in Python:
Thank you for reading the article! Hope you found it informative and useful! :)