Engineering Your Webstore with NMF
Get insights on your customer-product relationships with Linear Algebra
NMF (non-negative matrix factorization) is a powerful linear-algebra tool often used in recommender systems because it copes well with sparse data. NMF can extract latent “topics” from a matrix, such as movie genres or article subjects. It works by approximately decomposing a non-negative m x n matrix into an m x p matrix, W, and a p x n matrix, H, where p is the number of latent topics.
The W matrix represents how much each movie-watcher/customer/author belongs to each topic. The H matrix represents how much each movie/product/article belongs to each topic.
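To make the shapes concrete, here is a minimal sketch on a toy matrix. The purchase counts, the choice of p = 2, and the init and random_state settings are all made up for illustration; they are not from the RockTape data.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy purchase matrix: m = 4 customers (rows) x n = 5 products (columns).
# The values are invented purely for illustration.
V = np.array([
    [3, 2, 0, 0, 0],
    [4, 1, 0, 0, 1],
    [0, 0, 5, 2, 0],
    [0, 0, 3, 4, 0],
], dtype=float)

p = 2  # number of latent topics (an arbitrary choice for this toy example)
model = NMF(n_components=p, init="nndsvda", max_iter=500, random_state=0)

W = model.fit_transform(V)  # shape (m, p): how much each customer belongs to each topic
H = model.components_       # shape (p, n): how much each product belongs to each topic

print(W.shape, H.shape)     # (4, 2) (2, 5)
print(np.round(W @ H, 1))   # W @ H approximates V; it is not an exact copy
```

Multiplying W by H gives back an m x n matrix that approximates the original, which is why the factorization can be read as a compressed summary of the purchase history.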
This tutorial will walk you through using scikit-learn’s NMF implementation on your eCom data, and discuss a few cool things we can do with the results!
Some notes before we get started
Big thanks to RockTape for letting me use their data for this project. RockTape is a sports-medicine business that works mainly with Chiropractors, Physical Therapists and other medical professionals. RockTape makes fitness products for the general public as well, and does a lot of work specifically with the CrossFit industry.
Away we go!
The first thing we’re going to do is create an itemized historical purchase matrix: an m x n matrix, where m is the number of customers in your database and n is the number of products your store offers. Initialize this matrix with zeros, since most customers haven’t bought most things.
historical_purchase_df = pd.DataFrame(0, index = custy_df.index, columns=product_df['Product ID'])
I set the index as my customer dataframe’s index, with each column corresponding to a unique product ID. We’re going to loop through each customer, loop through each order they placed, loop through each product they purchased, and increment the corresponding cell. I know, it’s a loooot of loops.
The way BigCommerce writes order data makes this... challenging:
Product ID: 111, Product Qty: 3, Product SKU: 2012, Product Name: 2" 2012 Stars and Bars, Product Weight: 0.5000, Product Variation Details: , Product Unit Price: 15.00, Product Total Price: 45.00
If you’re working with a different eCom platform, hopefully the order details are a bit easier on the eyes. Check make_historical_purchase_matrix on my GitHub to see how I wrangled mine! You want something similar to this after all the trimming and splitting and somersaults:
historical_purchase_df.loc[customer, prod_id] += prod_qty
I have about 70,000 customers and 4,000 products, and mine takes a solid 15–20 minutes to finish.
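The nested loops described above can be sketched roughly as follows. This is not the make_historical_purchase_matrix function from my repo, just a small stand-in. The orders_df frame, its column names, and the second and third order strings are hypothetical, and the regex assumes every product entry starts with "Product ID: <number>, Product Qty: <number>" as in the sample line.

```python
import re
import pandas as pd

# Assumption: each product entry in an order string begins with
# "Product ID: <int>, Product Qty: <int>", as in the BigCommerce sample above.
PRODUCT_RE = re.compile(r"Product ID: (\d+), Product Qty: (\d+)")

def parse_order_products(details):
    """Return a list of (product_id, quantity) tuples found in one order string."""
    return [(int(pid), int(qty)) for pid, qty in PRODUCT_RE.findall(details)]

# Hypothetical order table: one row per order.
orders_df = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_details": [
        'Product ID: 111, Product Qty: 3, Product SKU: 2012, '
        'Product Name: 2" 2012 Stars and Bars, Product Weight: 0.5000, '
        'Product Variation Details: , Product Unit Price: 15.00, '
        'Product Total Price: 45.00',
        "Product ID: 222, Product Qty: 1, Product SKU: 3001",
        "Product ID: 111, Product Qty: 2, Product SKU: 2012",
    ],
})

# Zero-filled customers x products matrix (tiny stand-in for the real one).
historical_purchase_df = pd.DataFrame(0, index=[1, 2], columns=[111, 222])

for customer, details in zip(orders_df["customer_id"], orders_df["product_details"]):
    for prod_id, prod_qty in parse_order_products(details):
        # Skip products that are no longer in the catalog.
        if prod_id in historical_purchase_df.columns:
            historical_purchase_df.loc[customer, prod_id] += prod_qty

print(historical_purchase_df)
```

Using findall means an order string that lists several products yields one tuple per product, without having to know how the platform separates the entries.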
NMF Time!
Specify the number of latent topics you’d like to factor out with n_components. It can be a guess at first; we will dial in that hyperparameter in the next step.
from sklearn.decomposition import NMF as NMF_sklearn  # assumed alias for scikit-learn's NMF class
nmf = NMF_sklearn(n_components=5, max_iter=450)
W = nmf.fit_transform(historical_purchase_df)  # how much each customer belongs to each 'topic'
H = nmf.components_  # how much each item belongs to each 'topic'
NOTE: Depending on its init setting, scikit-learn’s NMF may start from random values, and its solvers (coordinate descent by default, or multiplicative updates) refine W and H iteratively. As a result, the factorization may not be reproduced identically from run to run. Consider setting a random_state, or pickling the resulting matrices if you’re doing further analysis.
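Here is a small sketch of both suggestions, run on made-up data. The matrix X, the fit_nmf helper, the seed values, and the pickle file name are all hypothetical choices for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import NMF

# Made-up purchase counts: 30 customers x 8 products.
rng = np.random.default_rng(42)
X = rng.integers(0, 4, size=(30, 8)).astype(float)

def fit_nmf(seed):
    """Fit NMF with random initialization and a fixed seed; return (W, H)."""
    model = NMF(n_components=3, init="random", max_iter=500, random_state=seed)
    W = model.fit_transform(X)
    return W, model.components_

# Two fits with the same random_state should give matching matrices.
W_a, H_a = fit_nmf(0)
W_b, H_b = fit_nmf(0)
same_seed_matches = np.allclose(W_a, W_b) and np.allclose(H_a, H_b)
print("Same seed, same result:", same_seed_matches)

# Alternatively, pickle the matrices so later analysis reuses the exact same values.
path = os.path.join(tempfile.gettempdir(), "nmf_matrices.pkl")  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump({"W": W_a, "H": H_a}, f)

with open(path, "rb") as f:
    restored = pickle.load(f)
```

Fixing the seed is the lighter option while you are still experimenting; pickling is handy once you have hand-labeled the topics and don't want them to shift under you.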
This is a nifty function I wrote to display the top products for each topic using the H matrix and your product data. With a little domain knowledge of your customer base and product catalog, you can determine the right number of topics and label each one!
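print('Here are the top products for %s topics' % n_topics)
for topic in range(n_topics):
    indices = H[topic].argsort()[-25:]  # 25 highest-weighted products, strongest last
    print('\n')
    print(product_df['Name'].iloc[indices])  # positional lookup: H's columns follow product_df's row order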
My product categories became cleanly separable with n_topics = 5. I hand-labeled them and ended up with the following ‘topics’.
1. Consumer Rehab/Single Rolls
Top products: Single rolls in popular colors, PreCut RockTape, RockSauce
People recovering from surgery, rehabbing an injury, etc. These folks tend to churn quickly.
2. Consumer Fitness
Top products: Foam Rollers, RockTape, RockBand, RockBalls Infinity, Gloves
Weekend warriors and gym rats. Sports Med for the general public. Most addresses are homes.
3. CrossFit
Top products: CrossFit Knee Sleeves, H2O Tape, Rogue and Camouflage tape, Shin Protection
CrossFit gyms/boxes and individual CrossFitters. Some overlap with the Consumer Fitness category.
4. Marketing & Promo
Top products: Sample strips, tee shirts, RockSauce samples, posters
RockTape influencers, doctors who want to test our products, trade show orders, internal RockTape orders. Very low or zero dollar average order values.
5. Movement Professionals & Education
Top products: Courses, Bulk rolls of RockTape, 6-packs of RockTape, IASTM Tools, Bulk RockSauce
RockTape’s bread and butter — our Chiropractors, Physical Therapists, LMTs. Most of these are clinical addresses (they have a suite number listed). Higher average order values.
Considering our W matrix, every single customer has a weight in each topic. For example, a Physical Therapist who does CrossFit recreationally could be expressed as [0.02, 0.0, 0.3, 0.0, 1.2]. Most of the purchasing on his account is for clinical use, but he bought all his buddies the sweet new CrossFit knee sleeves for Christmas!
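One way to read a row like that is sketched below, using the example weights and the five topic labels from above. Dividing by the row sum to get rough shares is just an illustrative choice on my part, not something NMF does for you.

```python
import numpy as np

topic_labels = [
    "Consumer Rehab/Single Rolls",
    "Consumer Fitness",
    "CrossFit",
    "Marketing & Promo",
    "Movement Professionals & Education",
]

# The example row of W for the Physical Therapist who also does CrossFit.
customer_weights = np.array([0.02, 0.0, 0.3, 0.0, 1.2])

# Rough share of this customer's purchasing attributed to each topic.
shares = customer_weights / customer_weights.sum()

# The topic with the largest weight.
dominant_topic = topic_labels[int(np.argmax(customer_weights))]

for label, share in zip(topic_labels, shares):
    print(f"{label}: {share:.0%}")
print("Dominant topic:", dominant_topic)
```

The dominant topic is a convenient single label for a customer, but the full row keeps the overlap visible, such as the CrossFit weight on an otherwise clinical account.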
Next Steps and Insights
- For my project, I simply concatenated my W matrix with my original feature matrix. This was a succinct yet effective way to digest and quantify the purchase history of my customers. If you’re going to perform more analysis (random forests, regression modeling, etc.), I’d recommend this route!
- Consider leveraging your W and H matrices to display recommended products in a “Customers like you also bought...” fashion. NMF is used for recommender systems all the time!
- RockTape’s customer base is distinctly stratified, so each topic matched an existing webstore category pretty well. If your customer base is more nebulous, consider grouping items in categories that correspond to your H matrix.
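The first two ideas in the list above can be sketched like this. Everything here is a toy: the purchase counts, the product names, the order_count feature, and the recommend helper are hypothetical, and scoring products with W @ H is one simple approach rather than a full recommender.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Made-up purchase counts: 4 customers x 5 products.
purchases = pd.DataFrame(
    [[3, 2, 0, 0, 0],
     [4, 0, 0, 0, 1],
     [0, 0, 5, 2, 0],
     [0, 0, 3, 0, 0]],
    index=["c1", "c2", "c3", "c4"],
    columns=["tape", "sauce", "sleeves", "shin_guard", "roller"],
)

# Hypothetical pre-existing feature matrix for the same customers.
features = pd.DataFrame({"order_count": [5, 3, 4, 1]}, index=purchases.index)

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(purchases.to_numpy(dtype=float))
H = model.components_

# 1) Concatenate the topic weights onto the original features.
topic_cols = [f"topic_{k}" for k in range(W.shape[1])]
W_df = pd.DataFrame(W, index=purchases.index, columns=topic_cols)
features_with_topics = pd.concat([features, W_df], axis=1)

# 2) Score every customer-product pair with the reconstruction W @ H,
#    then blank out products the customer has already bought.
scores = pd.DataFrame(W @ H, index=purchases.index, columns=purchases.columns)
scores = scores.mask(purchases > 0)

def recommend(customer, k=2):
    """Return up to k not-yet-purchased products with the highest scores."""
    return scores.loc[customer].dropna().nlargest(k).index.tolist()

print(features_with_topics)
print(recommend("c4"))
```

The enriched feature frame can go straight into whatever model you were already training, while the masked score table gives a quick first pass at “Customers like you also bought...” suggestions.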
Bringing it Back to Churn
My favorite application of NMF relates back to churn. After creating a binary feature for churn and masking your customer dataframe to just your inactives, you can re-run NMF and get a list of products that are most associated with churn.
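A rough sketch of that workflow is below. It is not the code from my repo: the purchase counts, product names, the churned column, and the choice of two topics are all invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import NMF

# Made-up purchase counts: 6 customers x 4 products.
purchases = pd.DataFrame(
    [[2, 0, 1, 0],
     [0, 3, 0, 0],
     [1, 0, 2, 0],
     [0, 0, 0, 4],
     [0, 2, 0, 1],
     [3, 0, 1, 0]],
    index=[101, 102, 103, 104, 105, 106],
    columns=["prod_a", "prod_b", "prod_c", "prod_d"],
)

# Hypothetical binary churn feature: 1 = inactive, 0 = still active.
custy_df = pd.DataFrame({"churned": [1, 0, 1, 1, 0, 1]}, index=purchases.index)

# Mask the purchase matrix down to the inactive customers only.
inactive = purchases.loc[custy_df["churned"] == 1]

# Re-run NMF on just those customers.
model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
model.fit(inactive.to_numpy(dtype=float))
H = model.components_

# Collect the highest-weighted products per topic, strongest first.
top_products = {}
for topic in range(H.shape[0]):
    top_idx = H[topic].argsort()[::-1][:3]
    top_products[topic] = list(purchases.columns[top_idx])
    print(f"Topic {topic}: {top_products[topic]}")
```

Because only inactive customers go into the factorization, the products that dominate each topic are the ones those customers tended to buy, which is the sense in which they are associated with churn.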
My list consisted of a lot of discontinued products (often denoted by an asterisk), things we frequently had to replace under warranty, and generally less popular items.
You can do this with more than just churn, too! Take a look at the get_items_associated function in my GitHub repo. If you wanted a list of products most associated with Consumer Fitness customers in the state of Indiana, a few tweaks to your NMF can provide this (or maybe you could just use SQL, lol!).