Recommendation Systems at Trunk Club: Algorithms, Challenges, and Future Outlook (Part 2)

Published in Unpacking Trunk Club · 9 min read · Jan 27, 2020

by Anshul Agarwal

In Part 1 we provided an overview of how data science is interwoven with a customer’s journey at Trunk Club through various recommendation systems. In this part we go a little deeper into the technical details of these systems.

Behind the scenes

Algorithms

The data science team uses a blend of machine learning algorithms with custom loss functions that are optimized for particular business needs. The individual elements of this blended framework are established machine learning techniques, including supervised, unsupervised, and collaborative approaches. How they are sequenced and blended, which improves prediction accuracy and effectiveness, is driven not only by data science expertise but also by domain knowledge and collaboration with other business groups. Below we describe some of the elements of the data science frameworks.

a) Collaborative approaches
Collaborative filtering (CF), which learns patterns collectively across different customers and products, is the foundation of recommendation systems. The data science team has explored a wide range of collaborative approaches, from memory-based CF, which scores items using weighted user-user and item-item similarity, to model-based CF, which includes clustering, latent semantic analysis, and machine learning or Bayesian models that predict customer-product interactions. Since the Netflix Prize for movie rating prediction, matrix factorization based approaches have become state-of-the-art for recommendation systems, and our models also use matrix factorization as one of their core components. Matrix factorization characterizes both customers and products as lower-dimensional vectors in a latent space inferred from customer-product interactions as well as their attributes. These vectors, each just a string of numbers, inherently represent customers’ style preferences and can effectively determine their disposition towards items they have never seen before. The figure below shows a particular version of matrix factorization from Rendle. Note that the problem is formulated as a learning-to-rank problem with WARP loss, where ranking items in the right order is more important than precisely predicting the customer-product interaction rating.
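To make the ranking formulation more concrete, here is a minimal NumPy sketch of matrix factorization trained with a WARP-style update. It is an illustration under stated assumptions rather than the production model: the toy interaction matrix, the function name `train_warp_mf`, and every hyperparameter are hypothetical, and customer and product attributes are left out entirely.

```python
import numpy as np


def train_warp_mf(interactions, n_factors=8, n_epochs=50, lr=0.05,
                  margin=1.0, max_trials=20, seed=0):
    """Factorize a binary customer-product matrix with a WARP-style ranking update."""
    rng = np.random.default_rng(seed)
    n_users, n_items = interactions.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))  # customer vectors
    V = rng.normal(scale=0.1, size=(n_items, n_factors))  # product vectors
    users, pos_items = np.nonzero(interactions)

    for _ in range(n_epochs):
        for idx in rng.permutation(len(users)):
            u, i = users[idx], pos_items[idx]
            negatives = np.flatnonzero(interactions[u] == 0)
            if negatives.size == 0:
                continue
            pos_score = U[u] @ V[i]
            # Sample negatives until one violates the ranking margin.
            for trial in range(1, max_trials + 1):
                j = rng.choice(negatives)
                if U[u] @ V[j] > pos_score - margin:
                    # Fewer trials needed => the positive item is ranked poorly
                    # => larger update. This rank estimate is an approximation.
                    rank_est = max(1, negatives.size // trial)
                    weight = np.sum(1.0 / np.arange(1, rank_est + 1))
                    u_vec = U[u].copy()
                    U[u] += lr * weight * (V[i] - V[j])
                    V[i] += lr * weight * u_vec
                    V[j] -= lr * weight * u_vec
                    break
    return U, V


# Hypothetical toy data: rows are customers, columns are products, 1 = purchased.
interactions = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
])
U, V = train_warp_mf(interactions)
scores = U @ V.T  # higher score = item should rank higher for that customer
```

The update only fires when a sampled negative item scores within the margin of a purchased item, so the model spends its effort on ordering mistakes near the top of the list instead of on reproducing exact interaction values.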

b) Association rules
A collaborative approach such as the one described above is a global approach: it tries to fit a model in a hard manner such that “all” available items are accurately ranked. In contrast, soft approaches such as association rules recommend based on local associations among a “few” products. Such approaches especially help make recommendations more novel and diverse. Much as the YouTube recommendation system uses them to provide varied yet relevant video choices during a session, we leverage association rules across different types and categories of clothing items to make our recommendations diverse yet selective at the same time. They help trim the search space to choices that are more relevant to the customer now than purchases made in the past. The figure below provides an example of such association rules. The second row of the table, for instance, shows that customers who tend to buy blazers and pencil skirts also prefer to buy Poncho/Cape jackets. Such rules discover co-occurrence of items, thus making recommendations more relevant.

Example of association rules for clothing items
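As a rough illustration of how a rule like “blazer and pencil skirt, therefore poncho” can be scored, the sketch below counts item co-occurrences in a handful of baskets and reports support, confidence, and lift. The baskets, thresholds, and the function name `mine_rules` are made up for this example; a production system would more likely rely on an Apriori or FP-Growth implementation over real purchase data.

```python
from collections import Counter
from itertools import combinations


def mine_rules(baskets, max_size=3, min_support=0.3, min_confidence=0.6):
    """Brute-force association rules with a single-item consequent."""
    n = len(baskets)
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))

    rules = []
    for itemset, count in counts.items():
        support = count / n
        if len(itemset) < 2 or support < min_support:
            continue
        for consequent in itemset:
            antecedent = tuple(x for x in itemset if x != consequent)
            confidence = count / counts[antecedent]
            # Lift > 1 means the items co-occur more often than chance.
            lift = confidence / (counts[(consequent,)] / n)
            if confidence >= min_confidence:
                rules.append({
                    "antecedent": antecedent,
                    "consequent": consequent,
                    "support": support,
                    "confidence": confidence,
                    "lift": lift,
                })
    return sorted(rules, key=lambda r: r["lift"], reverse=True)


# Hypothetical purchase baskets, one per customer order.
baskets = [
    ["blazer", "pencil skirt", "poncho"],
    ["blazer", "pencil skirt", "poncho"],
    ["blazer", "pencil skirt"],
    ["blazer", "jeans"],
    ["jeans", "t-shirt"],
]
rules = mine_rules(baskets)
```

Because each rule looks only at a few items at a time, it can surface a relevant product even when the customer has little overall interaction history.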

c) Content-based approaches
Several content-based approaches are used to complement the aforementioned approaches, which primarily exploit customer-product interactions. Various kinds of clustering algorithms (k-means, hierarchical, density-based, LDA) help segment customers, understand their purchase behavior, and extrapolate that behavior to newer customers. Customer clusters are used to generate cluster-specific recommendations. Clustering is also leveraged for brand segmentation. Moreover, similarity-based approaches with problem-specific distance functions are used to improve item discovery in recommendations. As mentioned in Part 1, similarity algorithms are used to display similar brands in the catalog. Item similarity helps counter the selection bias towards the most purchased items by surfacing less popular yet promising items in recommendations. The figure below demonstrates a few clusters of “similar” clothing items. Under each image we list the item_ID and its similarity score with the first item of the cluster (therefore, the score of the first item is 1.0).

Clusters of “similar” items determined from similarity-based algorithms
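The scoring convention in the figure, where the first item of a cluster has a similarity of 1.0 with itself, can be reproduced with a small sketch. Cosine similarity is used here only as a stand-in for the problem-specific distance functions mentioned above, and the item IDs and three-number feature vectors are invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_items(item_features, anchor_id, top_n=3):
    """Return the top_n items most similar to anchor_id, anchor included."""
    anchor = item_features[anchor_id]
    scored = [(item_id, cosine(anchor, vec))
              for item_id, vec in item_features.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(item_id, round(score, 3)) for item_id, score in scored[:top_n]]


# Hypothetical features per item: [formality, warmth, pattern boldness].
item_features = {
    "item_101": [0.9, 0.2, 0.1],
    "item_102": [0.8, 0.3, 0.1],
    "item_103": [0.1, 0.9, 0.4],
    "item_104": [0.7, 0.1, 0.6],
}
cluster = similar_items(item_features, "item_101")
```

Since popularity never enters the score, a rarely purchased item can rank right next to a best seller as long as its attributes are close.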

d) ML approaches
Our toolkit also comprises widely used ML algorithms such as ensemble methods (XGBoost, Random Forests), Support Vector Machines, C5.0, neural networks trained with backpropagation, Bayesian methods, etc. For instance, as described in Part 1, the fit prediction framework is a collection of such algorithms, stacked to predict the most likely fit for a customer. These algorithms are also used in an ML-driven approach that stacks various explicit features for a recommendation algorithm with a custom loss function, or that stacks complete recommendation system algorithms themselves.

Validation and A/B test metrics

For model validation and offline testing, besides standard metrics such as MSE, MAPE, Precision, Recall, AUC, etc., we observe metrics specific to recommendation systems. In particular, we measure Mean Average Precision@k (MAP@k) as an indicator of the purchase likelihood from the top k recommendations. For baseline comparison, we measure MAP@k for recommendations driven by item popularity. The larger the gap between the MAP@k of personalized recommendations and that of popularity-based recommendations, the better the model.

While we strive to optimize recommendation systems for higher MAP@k, we also desire to have greater diversity in recommended items. Diversity can be measured in three distinct terms:

There should be minimal overlap between top k personalized recommendations vs. popularity based recommendations. We call this Overlap@k or Mean Average Overlap@k (MAO@k).
The number of items recommended across all customers should cover a large portion of the total inventory of available items. We call this metric Coverage.
One item should be recommended across as few customers as possible. We call this measure Banality.

Generally there is a trade-off between MAP@k and the three diversity metrics. If we optimize models to improve MAP@k, the other three metrics tend to suffer. Thus, we develop models in order to balance all metrics as a multi-objective optimization. The figure below shows the mathematical definition of all four metrics.

Performance metrics for recommendation systems
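For readers who prefer code to formulas, the sketch below gives one plausible reading of the four metrics. Only MAP@k follows a widely used standard definition; the exact formulas for Overlap@k, Coverage, and Banality here are assumptions inferred from the verbal descriptions above, and the customers, items, and function names are hypothetical.

```python
from collections import Counter


def average_precision_at_k(recommended, purchased, k):
    """AP@k for one customer: precision at each hit, averaged over hits possible."""
    hits, precision_sum = 0, 0.0
    for position, item in enumerate(recommended[:k], start=1):
        if item in purchased:
            hits += 1
            precision_sum += hits / position
    denom = min(k, len(purchased))
    return precision_sum / denom if denom else 0.0


def map_at_k(recs, purchases, k):
    """Mean of AP@k over all customers with recommendations."""
    return sum(average_precision_at_k(recs[c], purchases.get(c, set()), k)
               for c in recs) / len(recs)


def overlap_at_k(recs, popular, k):
    """Assumed definition: mean share of a top-k list also in the popularity top k."""
    top_popular = set(popular[:k])
    return sum(len(set(items[:k]) & top_popular) / k
               for items in recs.values()) / len(recs)


def coverage(recs, catalog, k):
    """Assumed definition: share of the catalog recommended to at least one customer."""
    recommended = {item for items in recs.values() for item in items[:k]}
    return len(recommended & set(catalog)) / len(catalog)


def banality(recs, k):
    """Assumed definition: mean share of customers who receive a recommended item."""
    counts = Counter(item for items in recs.values() for item in set(items[:k]))
    return sum(c / len(recs) for c in counts.values()) / len(counts)


# Hypothetical recommendations, purchases, popularity ranking, and catalog.
recs = {"u1": ["a", "b", "c"], "u2": ["b", "d", "a"]}
purchases = {"u1": {"a", "c"}, "u2": {"d"}}
popular = ["a", "b", "e"]
catalog = ["a", "b", "c", "d", "e", "f"]

metrics = {
    "MAP@3": map_at_k(recs, purchases, 3),
    "Overlap@3": overlap_at_k(recs, popular, 3),
    "Coverage": coverage(recs, catalog, 3),
    "Banality": banality(recs, 3),
}
```

Under these definitions a model that recommends the same best sellers to everyone can score well on MAP@k while showing high Overlap@k and Banality and low Coverage, which is the trade-off described above.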

To assess the effectiveness of the various recommendation systems deployed in production, we conduct A/B tests on actual customer trunks in the field. While the validation metrics above are designed to assess offline model performance, the metrics tracked during A/B tests are aimed at measuring and quantifying business impact. Consequently, the A/B test metrics can vary considerably across Trunk Club’s recommendation systems. For instance, the following are examples of A/B test metrics that are specific to the corresponding recommendation system:

New “lead” conversions from sign-ups, and new purchases from these “leads”, with the Outfit recommender implementation.
Improvement in total revenue, total purchases per trunk, and total revenue per trunk.
Increase in customer satisfaction and reduction in negative fit feedback from the fit recommender framework.
Correlation between the rank of an item in personalized sort and its corresponding purchase rate across customers.
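The last metric in this list can be sketched as a rank correlation between an item’s position in the personalized sort and the purchase rate observed at that position. The event log, function names, and the choice of Spearman correlation are assumptions made for illustration; the article does not say which correlation measure is used.

```python
from collections import defaultdict


def purchase_rate_by_rank(events):
    """events: iterable of (rank_shown, purchased) pairs with purchased in {0, 1}."""
    shown, bought = defaultdict(int), defaultdict(int)
    for rank, purchased in events:
        shown[rank] += 1
        bought[rank] += purchased
    return {rank: bought[rank] / shown[rank] for rank in sorted(shown)}


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranked values."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log: (position in personalized sort, 1 if the item was purchased).
events = [(1, 1), (1, 1), (1, 0),
          (2, 1), (2, 0), (2, 0),
          (3, 0), (3, 0), (3, 0), (3, 1)]
rates = purchase_rate_by_rank(events)
rho = spearman(list(rates.keys()), list(rates.values()))
```

A strongly negative `rho` means items placed higher in the sort (smaller rank numbers) are purchased more often, which is the behavior a useful personalized sort should show.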

Challenges and Future Outlook

Inventory availability

The clothing inventory and its availability change constantly throughout the day. One of the challenges for recommendation systems is to have visibility into real-time availability at prediction time. Data science, engineering, services, and product teams are exploring architectures and deployment methods leveraging Elasticsearch, caching, batch uploads, and a multi-service architecture to solve this.

Sparsity

Data sparsity is one of the biggest challenges for recommendation systems. Most products in the inventory don’t have much interaction history with customers. As a result, the recommendations suggested by the models suffer from selection bias towards popular items. Product and data science teams have taken several steps to address this, particularly in the direction of bolstering data quality and volume. Style Swipes, launched last year on the Trunk Club app, is one such effort: customers swipe left or right through several pre-populated product images to indicate whether they like or dislike each product. With swipes, data science is able to collect more data on customers’ style preferences. Data science is also leveraging stylist expertise by, for instance, sending surveys to tag products and outfits, or by collecting stylists’ categorical feedback in the catalog on how relevant the model recommendations are. This feedback is then “looped back” and directly incorporated in the models.

Deep Style

Defining and learning a customer’s style is not trivial. Data science frameworks currently learn styles through the algorithms described above; however, there is huge potential to leverage Deep Learning and other advanced Artificial Intelligence algorithms (CNN, RNN, LSTM, GANs, Attention, Encoder-Decoder, HMMs, etc.). Deep Learning has the potential to push data science at Trunk Club to the next level, not only by learning to recognize different types of styles more effectively, but also by providing relevant insights into the features learned in the deep layers and embeddings of the neural network.

Real-time customer interaction

While the current Style Swipes provides an excellent platform to collect more information about the customer, it still relies on an offline, pre-selected assortment of products. The technology team is exploring ideas on how to leverage multi-armed bandit and reinforcement learning methods to deploy adaptive models that learn from customer interaction in real time. Applications of such a framework could be an interactive version of Style Swipes, or an interactive Member Preview step. Trunk Club also offers services such as Your Picks and Buy-it again, where customers can directly access Trunk Club’s merchandise on the app or web. While these products leverage recommendation systems, there are opportunities to further leverage data science in the form of either adaptive models or other frameworks such as item similarity.
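To give a feel for what an adaptive, bandit-driven Style Swipes could look like, here is a minimal simulation using Thompson sampling with Beta posteriors. Thompson sampling is just one common bandit strategy chosen for this illustration, not a description of what the team has built, and the like rates, function name, and round count are all hypothetical.

```python
import random


def thompson_swipes(true_like_rates, n_rounds=2000, seed=0):
    """Simulate adaptive image selection where each arm is one candidate image."""
    rng = random.Random(seed)
    n_arms = len(true_like_rates)
    likes = [1] * n_arms      # Beta(1, 1) prior: one pseudo-like per image
    dislikes = [1] * n_arms   # and one pseudo-dislike per image
    shown = [0] * n_arms

    for _ in range(n_rounds):
        # Draw a plausible like rate for every image from its posterior.
        samples = [rng.betavariate(likes[a], dislikes[a]) for a in range(n_arms)]
        arm = max(range(n_arms), key=samples.__getitem__)
        shown[arm] += 1
        # Simulated customer swipe: right (like) with the image's true rate.
        if rng.random() < true_like_rates[arm]:
            likes[arm] += 1
        else:
            dislikes[arm] += 1
    return shown, likes, dislikes


# Hypothetical true like rates for three candidate images.
shown, likes, dislikes = thompson_swipes([0.2, 0.5, 0.8])
```

In this simulation the image with the highest like rate ends up being shown most often, while the weaker images still get occasional exposure early on. That exploration is what a fixed, pre-selected assortment lacks.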

Computer Vision — Image processing

Processing image data and incorporating it into all kinds of models offers the biggest potential, as well as a huge challenge, for data science. While several state-of-the-art deep learning methods exist to incorporate image data in recommendation systems, their effectiveness relies significantly on the quality of the data. The data science team has been exploring, developing, and deploying Deep Learning/CNN based solutions in this space. This will continue to be a significant research area for the team.

Hidden treasure — text data

Text data, together with Natural Language Processing (NLP), is another untapped area of opportunity to improve personalization at Trunk Club. Customers offer a wealth of information via open-text feedback during Member Preview and Home try-on, as well as through email and chat communication with stylists. We’ve just begun to scratch the surface with this data. Recently, the product teams have created a “preference center” where all of a customer’s indicated preferences (either through free text or stylist communication) can be saved in a structured form so that they can be digested readily by both stylists and data science models. A plethora of opportunities exist to complement the preference center and recommendation systems with NLP. However, NLP brings its own challenges, such as how to map customer vocabulary to Trunk Club’s product vocabulary, or how to extract mixed sentiments from the same feedback text. Deep Learning based AI methods such as RNN, LSTM, Transformers, and Attention offer significant opportunities to leverage text data.

Outfitting

Finally, outfitting is another potential future opportunity. We observe that, on average, trunks with outfits are likely to yield more purchases per trunk than trunks with no outfits. Recommendation systems, especially for trunk curation and completion, can be further enhanced by incorporating outfitting algorithms that assemble a coherent assortment of items, a problem that is as exciting to solve as it is challenging.

Conclusions

This article highlights how Trunk Club has been effectively marrying research with pragmatism and productization, particularly through a nimble and productive collaboration between the Data Science, Product, and Data Engineering teams and other relevant stakeholders. While Trunk Club has been able to push the boundaries of data science in fashion retail, a plethora of challenging and exciting opportunities remain that the data science team at Trunk Club is thrilled to tackle in the near future.