Our customer’s next shopping basket?

Sonja Pfeifer
Marc O’Polo Digital Intelligence
6 min read · Apr 9, 2021

By Sonja Pfeifer, Full Stack Developer

Always start with why

The goal for us at Marc O’Polo was to provide suitable product recommendations for our customers receiving newsletter campaigns. We wanted to start with a simple machine-learning-based approach and test the results against a random product selection. Since we already use a lot of AWS Cloud Services within our tech stack, we decided to go with Amazon Personalize.

What is Amazon Personalize?

With Amazon Personalize, AWS offers an easy-to-use recommendation service. It lets you build, train and deploy a recommendation model without prior knowledge of machine learning. In the Amazon Personalize Developer Guide and on GitHub, AWS provides useful documentation and hands-on examples that enable you to build valuable recommendation models.

After binge-reading the documentation we wanted to start coding and check how easy it really is. AWS SageMaker, which provides a Jupyter Notebook environment, is the perfect starting point for that. Within SageMaker we can easily develop against AWS services such as S3 buckets and the Amazon Personalize resources, and use standard Python packages such as Pandas.
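
As a rough sketch of how such a notebook starts (the bucket name and file path are placeholders made up for illustration), a few lines of boto3 and Pandas are enough to wire everything together:

import boto3
import pandas as pd

# Clients for the services used throughout the notebook
personalize = boto3.client("personalize")
s3 = boto3.client("s3")

# Hypothetical staging location, replace with your own bucket and key
bucket = "personalize-staging-bucket"
key = "raw/interactions.csv"

# Load a raw export from S3 into a Pandas DataFrame for exploration
obj = s3.get_object(Bucket=bucket, Key=key)
interactions = pd.read_csv(obj["Body"])
print(interactions.head())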

Amazon Personalize uses three types of datasets: interaction data, customer data and item data. The interaction data is the connection between customers and items, for example when they shopped in our online shop. In the customer and item datasets you can add additional metadata to the entities, e.g. the categories and materials of an item.

Overview of the datasets:
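
Each dataset is described by an Avro-style schema. As an illustration, a minimal interactions schema created from the notebook could look like the sketch below (the schema name is an example; the customer and item schemas follow the same pattern with their additional metadata fields):

import json
import boto3

personalize = boto3.client("personalize")

# Minimal interactions schema with the three mandatory fields
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

response = personalize.create_schema(
    name="interactions-schema",
    schema=json.dumps(interactions_schema),
)
print(response["schemaArn"])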

With the CRISP-DM cycle in the back of our minds we evaluated the different possible data sources. We did a deep dive into the relevant data and resurfaced with a first version of the datasets. Loading the data and training the first model was easy using the AWS SDK for Python. The results were satisfying, and we gathered feedback by sending a list of recommended products to colleagues.
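
Loading the prepared data into Personalize boils down to a single SDK call plus some polling. A condensed sketch with placeholder ARNs and paths (the dataset group and IAM role are assumed to exist already):

import time
import boto3

personalize = boto3.client("personalize")

# Import the prepared CSV from S3 into the interactions dataset (placeholder ARNs)
import_job_arn = personalize.create_dataset_import_job(
    jobName="interactions-import",
    datasetArn="arn:aws:personalize:eu-central-1:123456789012:dataset/newsletter/INTERACTIONS",
    dataSource={"dataLocation": "s3://personalize-staging-bucket/interactions.csv"},
    roleArn="arn:aws:iam::123456789012:role/PersonalizeS3AccessRole",
)["datasetImportJobArn"]

# Wait for the import to finish before starting a training run
while True:
    status = personalize.describe_dataset_import_job(
        datasetImportJobArn=import_job_arn
    )["datasetImportJob"]["status"]
    if status in ("ACTIVE", "CREATE FAILED"):
        break
    time.sleep(60)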

In order to automatically provide recommendations for hundreds of thousands of customers each day, we needed to build a robust, scalable system within AWS. Fortunately, we found a suitable project in the AWS GitHub repositories, which kickstarted our implementation towards a robust, scalable and automated workflow.

Our approach

The solution consists of three state machines:

  • prepare the raw data
  • load the datasets and train the different solutions
  • create and prepare the recommendations

Loading and preparing data

The product data is stored within an AWS Aurora MySQL Database. In a first Python lambda we execute a simple

SELECT * FROM db.table INTO OUTFILE S3 'path/to/s3/bucket/file';

query to load the data into an S3 bucket.
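
A minimal sketch of such a Lambda handler, assuming the connection details come from environment variables and a MySQL client such as PyMySQL is packaged with the function (the table and S3 path are the same placeholders as above):

import os
import pymysql

def handler(event, context):
    # Connection details are assumed to be provided as environment variables
    connection = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )
    try:
        with connection.cursor() as cursor:
            # Aurora MySQL writes the result set directly to S3
            cursor.execute(
                "SELECT * FROM db.table "
                "INTO OUTFILE S3 'path/to/s3/bucket/file' "
                "OVERWRITE ON"
            )
        connection.commit()
    finally:
        connection.close()
    return {"status": "OK"}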

Collecting the interaction data works quite similarly, since that data is stored in AWS RDS MySQL as well. Once the data is loaded it needs to be prepared to fit our use case. We only use purchases for the model. Those purchases are on Articlenumber-Color-Size (EAN) level. For the recommendation we can aggregate to Articlenumber-Color level, as there is no need to recommend a specific size to the customer. Therefore we assign Articlenumber_Color to the ITEM_ID field in the interaction dataset and remove the EAN field. Besides that, we convert the order timestamp to the required UNIX epoch time format. The interaction data is now ready to be imported into the dataset group.
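
A condensed Pandas sketch of that preparation step (the raw column names are assumptions based on the description above, and reading from s3:// paths assumes s3fs is available):

import pandas as pd

# Raw purchases exported from RDS, one row per EAN (Articlenumber-Color-Size)
purchases = pd.read_csv("s3://personalize-staging-bucket/raw/purchases.csv")

# Aggregate to Articlenumber-Color level, the size (EAN) is irrelevant for the recommendation
purchases["ITEM_ID"] = (
    purchases["articlenumber"].astype(str) + "_" + purchases["color"].astype(str)
)
purchases = purchases.drop(columns=["ean"])

# Convert the order timestamp to UNIX epoch seconds as required by Personalize
purchases["TIMESTAMP"] = (
    pd.to_datetime(purchases["ordertimestamp"]).astype("int64") // 10**9
)

# Keep only the fields of the interactions schema
interactions = purchases.rename(columns={"customer_id": "USER_ID"})[
    ["USER_ID", "ITEM_ID", "TIMESTAMP"]
]
interactions.to_csv("s3://personalize-staging-bucket/interactions.csv", index=False)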

Due to data privacy requirements, the user data isn’t stored in AWS. For Amazon Personalize we keep a minimized set of non-personal user data in S3. Based on the interaction dataset we check whether there are new USER_IDs which are not yet part of the user dataset and load that data.

Now we select a subset of the complete product data, based on the ITEM_ID references in the interaction dataset, and store that in S3 as well.
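
A sketch of how the new users and the item subset can be derived from the interaction data (file locations and column names are again placeholders, and the product export is assumed to already carry the Articlenumber_Color key as ITEM_ID):

import pandas as pd

interactions = pd.read_csv("s3://personalize-staging-bucket/interactions.csv")
users = pd.read_csv("s3://personalize-staging-bucket/users.csv")        # existing user dataset
products = pd.read_csv("s3://personalize-staging-bucket/products.csv")  # full product export

# Users that appear in the interactions but are not yet part of the user dataset
new_user_ids = set(interactions["USER_ID"]) - set(users["USER_ID"])

# Only items that were actually purchased need to go into the item dataset
items = products[products["ITEM_ID"].isin(interactions["ITEM_ID"].unique())]
items.to_csv("s3://personalize-staging-bucket/items.csv", index=False)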

With AWS Step Functions we orchestrate the invocations of the Lambda functions sequentially. Since we use the interaction dataset fields ITEM_ID and USER_ID as the basis for the user and item datasets, we can’t save time through parallel execution. Loading the user data can take some time if many new users purchased items, therefore we add a retry for the Lambda timeout error. Within the Lambda we load and save the user data in batches, so it can be executed in iterations.

In case of success or errors, a notification Lambda is invoked and sends a message to the MS Teams alerts channel. The next step is a Choice task which checks the payload for a specific variable value. If an error occurred, the state machine ends in a failed state. Otherwise a StartExecution task is invoked to execute the state machine which trains the model, similar to the AWS example on GitHub.
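
As an illustration of that flow, here is a heavily condensed AWS CDK sketch in Python. It assumes the code lives inside a Stack, that the Lambda functions and the training state machine already exist as load_users_fn, notify_fn and training_state_machine, and that the error flag in the payload is called $.status; all of these names are made up for the example.

from aws_cdk import Duration
from aws_cdk import aws_stepfunctions as sfn
from aws_cdk import aws_stepfunctions_tasks as tasks

# Load-users step, retried when the Lambda runs into its timeout
# (function timeouts surface as "Lambda.Unknown" in Step Functions)
load_users = tasks.LambdaInvoke(
    self, "LoadUserData",
    lambda_function=load_users_fn,
    payload_response_only=True,
)
load_users.add_retry(
    errors=["Lambda.Unknown"],
    interval=Duration.seconds(30),
    max_attempts=5,
)

# Notification step, posts success/error messages to the MS Teams channel
notify = tasks.LambdaInvoke(
    self, "Notify",
    lambda_function=notify_fn,
    payload_response_only=True,
)

# Kick off the training state machine if everything went well
start_training = tasks.StepFunctionsStartExecution(
    self, "StartTraining",
    state_machine=training_state_machine,
)

# Choice task: fail the execution if the payload signals an error
check = (
    sfn.Choice(self, "CheckForErrors")
    .when(sfn.Condition.string_equals("$.status", "ERROR"), sfn.Fail(self, "PreparationFailed"))
    .otherwise(start_training)
)

sfn.StateMachine(
    self, "PrepareRawData",
    definition=load_users.next(notify).next(check),
)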

Train model

We need to train two solutions: user personalization and popular items. The popular items solution is used for new customers who haven’t bought in our online shop yet. We stuck closely to the example AWS provides on GitHub.
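
Translated into SDK calls, the two solutions use the corresponding built-in recipes; a sketch with a placeholder dataset group ARN:

import boto3

personalize = boto3.client("personalize")
dataset_group_arn = "arn:aws:personalize:eu-central-1:123456789012:dataset-group/newsletter"

# Solution 1: personalized recommendations per user
user_personalization_arn = personalize.create_solution(
    name="user-personalization",
    datasetGroupArn=dataset_group_arn,
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)["solutionArn"]

# Solution 2: popular items as a fallback for customers without purchase history
popularity_arn = personalize.create_solution(
    name="popular-items",
    datasetGroupArn=dataset_group_arn,
    recipeArn="arn:aws:personalize:::recipe/aws-popularity-count",
)["solutionArn"]

# Each training run creates a new solution version for both solutions
for solution_arn in (user_personalization_arn, popularity_arn):
    personalize.create_solution_version(solutionArn=solution_arn)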

Prepare recommendations

Once the solutions are created, it’s time to retrieve and prepare the recommendations to fit our requirements. This is triggered by a StartExecution task which calls one state machine for generating batch recommendations and one for generating popular recommendations.

Within the batch recommendation state machine we use a Parallel state to create recommendations with different filters for men and women. The user dataset is split according to business requirements and saved to S3 to make it accessible for the BatchInferenceJob. Using the retry property of the Lambda invocation task again, the state machine waits for the job to finish.
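
Creating such a batch inference job per segment is again a single SDK call; a sketch with placeholder ARNs and paths (the women’s job is shown, the men’s job looks the same with its own filter and input file):

import boto3

personalize = boto3.client("personalize")

response = personalize.create_batch_inference_job(
    jobName="newsletter-recommendations-women",
    solutionVersionArn="arn:aws:personalize:eu-central-1:123456789012:solution/user-personalization/<version>",
    filterArn="arn:aws:personalize:eu-central-1:123456789012:filter/women-only",
    jobInput={"s3DataSource": {"path": "s3://personalize-staging-bucket/batch/users-women.json"}},
    jobOutput={"s3DataDestination": {"path": "s3://personalize-staging-bucket/batch/output/"}},
    roleArn="arn:aws:iam::123456789012:role/PersonalizeS3AccessRole",
    numResults=10,
)
print(response["batchInferenceJobArn"])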

The result is a JSON file containing an object for each USER_ID with the recommended ITEM_IDs. In the next step we invoke a Lambda which loads that JSON file and sends each entry to a queue (AWS SQS). The Lambda triggered by that queue enriches the ITEM_ID, e.g. by adding the product name, description, etc., to meet the requirements of our email campaign tool. After the product data is enriched, the recommendation is sent to another queue, which triggers a Lambda that sends the data to the campaign tool.

We use queues to enable batch processing and to avoid sending each recommendation on its own.
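
A sketch of the fan-out Lambda, assuming the batch inference output is in the JSON Lines format produced by Personalize (one object per user) and the queue URL comes from an environment variable:

import json
import os
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

def handler(event, context):
    # Location of the batch inference output file (passed in by the state machine)
    bucket = event["bucket"]
    key = event["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    entries = []
    for line in body.splitlines():
        record = json.loads(line)
        message = {
            "user_id": record["input"]["userId"],
            "item_ids": record["output"]["recommendedItems"],
        }
        entries.append({"Id": str(len(entries)), "MessageBody": json.dumps(message)})

        # SQS allows at most 10 messages per batch request
        if len(entries) == 10:
            sqs.send_message_batch(QueueUrl=os.environ["QUEUE_URL"], Entries=entries)
            entries = []

    if entries:
        sqs.send_message_batch(QueueUrl=os.environ["QUEUE_URL"], Entries=entries)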

Within the campaign tool the data is now ready to be used as part of our newsletter campaigns.

Conclusion

Building a new feature from scratch always comes with the opportunity to try out new technologies and services. Testing, discussing and, if necessary, discarding these options is a standard approach for us at Marc O’Polo.

By combining various AWS services we found a stable solution which meets the goal of providing recommendations for our customers receiving newsletter campaigns.

Powered by the GitHub examples provided by AWS as well as our in-house knowledge of AWS, gained over the last couple of years, we built a robust AWS CDK application. Following the modular approach of microservices, we also achieved a solution which can easily be adapted and parametrised to serve a similar, yet different use case.

Still one question remains:

Are the recommendations really our customer’s next shopping basket?

This question is frequently analysed and answered by our Data Intelligence team.

You want to shape the digital future of the fashion industry? Start creating with us. The Digital Intelligence & Tech teams at Marc O’Polo are always looking for talented and driven software engineers, data engineers, data scientists, data analysts, ML engineers and project managers to join our team. Learn about life at Marc O’Polo and visit our career site.
