Serving Athletes* with Personalized Workout Recommendations
At Nike, our mission is to bring inspiration and innovation to every athlete* in the world. And that little asterisk is important: It signifies our belief that if you have a body, you’re an athlete. It doesn’t matter if you’ve never run a marathon or lifted a single dumbbell; sport is for everyone. This is why it’s important to us to create personalized digital experiences for our ever-growing community of athletes*.
As part of Nike’s Digital Transformation, we have been making significant effort towards engaging customers through our various digital channels. One of those channels is the Nike Training Club (NTC) mobile app. NTC is designed to be your personal trainer with 100+ workouts, audio/visual guidance and personalized plans. Every month, it helps more than 1.5 million athletes* meet their fitness goals.
Last year, a team of machine learning enthusiasts came together to explore how we can make NTC even more dynamic and personal for our users in the Top Picks for You section of the app. Prior to this new effort, we were serving workouts based on a user’s gender and fitness level, which they had selected when they first signed into the app. The workouts were curated by our roster of in-house master trainers after extensive user training. The curation was done through intensive research and with guidance from industry professionals with years of experience in the workout domain. With the addition of machine learning, we set out to improve these recommendations further and to personalize them at the individual athlete level.
Here, we will explore the engineering effort and the science behind recommending workouts to athletes*. We will also share insight into the A/B testing results and take a look at the architecture that enabled this experience in the app.
To personalize workouts in NTC, we leveraged a wide variety of data, iterated on the strategy and then finally opted to go deeper on these two techniques: Collaborative Filtering and a Neural Network. The idea is to utilize users’ collective workout histories and profiles to train the model; however, data is only included if the user granted us permission to use it through the app’s privacy settings. We also use workout metadata obtained from Nike’s rich proprietary workout catalog created by industry experts specifically for the Nike Training Club app. This includes the types, duration, intensity level and focus (endurance, mobility, strength, etc.) of each drill. We were excited to jump into this project, as this was our first consumer-facing, machine-learning-backed experience.
For the first model, we picked Collaborative Filtering, a fast and easy-to-implement algorithm used widely in recommender systems. Collaborative Filtering predicts which workouts a user is most likely to complete by analyzing the workouts of other individuals with similar workout histories. It aims to fill the user-by-workout matrix with the likelihood of each user completing each workout.
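To make the intuition concrete, here is a minimal sketch of the "users with similar histories" idea. It is a simple neighborhood-style illustration, not the matrix-factorization approach described in the next section, and the user names, workout IDs and the `recommend` helper are all hypothetical.

```python
from math import sqrt

# Hypothetical completion histories: user -> set of completed workout IDs.
histories = {
    "ana": {"core_10", "yoga_flow", "hiit_20"},
    "ben": {"core_10", "yoga_flow"},
    "cam": {"hiit_20", "run_prep"},
}

def cosine(a, b):
    """Cosine similarity between two sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))

def recommend(user, histories, top_n=3):
    """Score unseen workouts by the similarity of the users who completed them."""
    seen = histories[user]
    scores = {}
    for other, workouts in histories.items():
        if other == user:
            continue
        sim = cosine(seen, workouts)
        if sim == 0.0:
            continue  # users with no overlap contribute nothing
        for workout in workouts - seen:
            scores[workout] = scores.get(workout, 0.0) + sim
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [workout for workout, _ in ranked[:top_n]]

print(recommend("ben", histories))
```

In this toy data, "ben" overlaps only with "ana", so the one workout "ana" has completed that "ben" has not is the only candidate.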
Model Training and Validation
We chose the Spark MLlib implementation of Collaborative Filtering for this model, as Spark fits nicely into a big data and machine learning architecture and can deliver recommendations at scale. We’ll discuss the high-level architecture of our production pipeline later in this post. The default implementation of the collaborative filtering approach in Spark uses “explicit” ratings. The explicit approach is typically used when you have actual ratings of products, such as the movie ratings on Netflix. NTC, on the other hand, doesn’t ask users to rate the workouts they complete. We opted to use workout completion as the indicator of how much a user “liked” a workout. As the feedback is implicit in our case, we switched to the implicit implementation of the model, which led to a significant gain in accuracy.
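As a rough illustration of what "implicit" means here, the sketch below converts completion counts into a binary preference and a confidence weight, following the standard implicit-feedback formulation (confidence = 1 + alpha × count) that Spark's implicit ALS is based on. The completion log and the `implicit_signals` helper are hypothetical; in PySpark the equivalent switch is `implicitPrefs=True` (with `alpha`) on `pyspark.ml.recommendation.ALS`.

```python
from collections import Counter

ALPHA = 10.0  # alpha value reported later in the post for the tuned model

# Hypothetical completion log: one (user_id, workout_id) pair per completed workout.
completions = [
    ("u1", "core_10"), ("u1", "core_10"), ("u1", "yoga_flow"),
    ("u2", "hiit_20"),
]

def implicit_signals(completions, alpha=ALPHA):
    """Turn completion counts into (preference, confidence) pairs.

    Pairs that never occur are implicitly preference 0 with confidence 1.
    """
    counts = Counter(completions)
    signals = {}
    for (user, workout), count in counts.items():
        preference = 1.0                   # completed at least once
        confidence = 1.0 + alpha * count   # more completions, more confidence
        signals[(user, workout)] = (preference, confidence)
    return signals

signals = implicit_signals(completions)
```

The model then fits preferences rather than ratings, weighting each observation by its confidence, so a workout completed twice counts for more than one completed once.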
We utilized users’ workout completion data and split it into training and validation sets by fixed time intervals. Workout data is unusual in that the recommendations we make require a longer feedback period: the hope is that users will finish the workout at some point in the future. Hence, the data is split temporally, rather than with the random split that is usual in machine learning, to reflect the real-world scenario.
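A temporal split of this kind can be sketched in a few lines. The events, dates and cutoff below are made up for illustration; the only point is that everything before the cutoff trains the model and everything on or after it is held out.

```python
from datetime import date

# Hypothetical completion events: (user_id, workout_id, completion_date).
events = [
    ("u1", "core_10", date(2018, 1, 5)),
    ("u1", "yoga_flow", date(2018, 2, 20)),
    ("u2", "hiit_20", date(2018, 1, 28)),
    ("u2", "core_10", date(2018, 3, 2)),
]

def temporal_split(events, cutoff):
    """Train on events before the cutoff, validate on events from the cutoff on."""
    train = [e for e in events if e[2] < cutoff]
    validation = [e for e in events if e[2] >= cutoff]
    return train, validation

train, validation = temporal_split(events, cutoff=date(2018, 2, 1))
```

Unlike a random split, no validation event can precede a training event, which mirrors how the model is used in production: it only ever sees the past when predicting the future.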
We tried various combinations of the parameters (over 200 permutations of [rank, iterations, lambda, alpha]) to find the best fit. Using knowledge of the parameters and examining the results, we were able to intelligently prune certain permutations from consideration. After training the model, workouts were predicted for the users in the validation sets, and the predictions were compared to the actuals to determine the model’s accuracy. The graph below shows the parameters we used and the accuracy they achieved when recommending workouts for the cross-validation set. For every combination, we used a fixed number of iterations: 20.
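The search described above can be pictured as enumerating a grid and skipping combinations that are known to be unpromising. In this sketch the candidate values and the pruning rule are assumptions chosen only so the grid has a little over 200 entries; the fixed iteration count of 20 comes from the text.

```python
from itertools import product

# Hypothetical candidate values; the post does not list the actual grid.
RANKS = [5, 10, 25, 50, 100]
LAMBDAS = [0.01, 0.1, 1, 2, 5, 10]
ALPHAS = [0.1, 1, 5, 10, 20, 40, 80]
ITERATIONS = 20  # fixed for every combination, as stated above

def candidate_params(prune=None):
    """Yield every parameter combination, skipping those the prune rule rejects."""
    for rank, lam, alpha in product(RANKS, LAMBDAS, ALPHAS):
        params = {"rank": rank, "iterations": ITERATIONS, "lambda": lam, "alpha": alpha}
        if prune is not None and prune(params):
            continue
        yield params

def too_flexible(params):
    """Hypothetical pruning rule: very high rank with very weak regularization."""
    return params["rank"] >= 100 and params["lambda"] < 0.1

full_grid = list(candidate_params())
pruned_grid = list(candidate_params(prune=too_flexible))
```

Each surviving combination would then be trained and scored on the validation set, keeping the one with the best accuracy.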
To put the model’s accuracy in context, we compared it to a simple popularity-based model, which recommends the globally most popular workouts to everyone, and to an even simpler model that recommends workouts at random. The popularity model had an accuracy of 15 percent on the test set, and the random model had an accuracy of 2.2 percent. Our collaborative filtering method was able to significantly outperform both of these uninformed, non-machine-learning-based models.
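The two baselines are simple enough to sketch directly. The post does not define its accuracy metric precisely, so this example assumes a hit rate: the share of users for whom at least one recommended workout appears among the workouts they actually completed. All data and function names are hypothetical.

```python
import random
from collections import Counter

def hit_rate(recommendations, actuals):
    """Share of users with at least one recommended workout among their actuals."""
    users = [u for u in actuals if actuals[u]]
    if not users:
        return 0.0
    hits = sum(1 for u in users if set(recommendations.get(u, [])) & actuals[u])
    return hits / len(users)

def popularity_baseline(train, users, k):
    """Recommend the k globally most completed workouts to every user."""
    top = [w for w, _ in Counter(w for _, w in train).most_common(k)]
    return {u: list(top) for u in users}

def random_baseline(catalog, users, k, seed=0):
    """Recommend k workouts drawn uniformly at random for every user."""
    rng = random.Random(seed)
    return {u: rng.sample(catalog, k) for u in users}

# Hypothetical training completions and held-out actuals.
train = [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u1", "c")]
actuals = {"u1": {"a"}, "u2": {"b"}, "u3": {"d"}}
catalog = ["a", "b", "c", "d"]

popular = popularity_baseline(train, actuals, k=1)
randomized = random_baseline(catalog, actuals, k=2)
print(hit_rate(popular, actuals), hit_rate(randomized, actuals))
```

Any learned model is expected to beat both of these numbers on the same held-out data before it is worth testing with real users.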
After several rounds of tuning hyper-parameters, we were satisfied with a model that trained with a rank of 25, a lambda of 2, an alpha of 10, and an accuracy on the cross-validation set of 26.37 percent! The model became stable with those hyper-parameters, and we then rolled it out to the public for true validation.
Every hypothesis in the Recommendation domain at Nike is verified through A/B testing. The Workout Recommender is no exception. As our users were already experiencing and engaging with the curated recommendations from our in-house experts, we used that experience as a baseline. We ran an A/B test with two variants: 50 percent baseline as the Champion and 50 percent Collaborative Filtering as the Challenger. Our primary KPI for the test was engagement, measured by how many more users started a workout.
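One common way to produce a stable 50/50 split is to hash the user ID into a bucket. The sketch below is an assumption about how such bucketing could work, not a description of Nike's A/B tool; the experiment name and the `assign_variant` helper are hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment="picks_for_you_cf", challenger_share=0.5):
    """Deterministically map a user to 'champion' or 'challenger'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a uniform value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "challenger" if bucket < challenger_share else "champion"

variants = [assign_variant(f"user-{i}") for i in range(10_000)]
challenger_fraction = variants.count("challenger") / len(variants)
```

Because the assignment depends only on the experiment name and user ID, a user sees the same variant on every visit, and including the experiment name keeps buckets independent across experiments.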
The A/B testing results showed that workouts from the Picks for You section were started 57 percent more often when powered by the Collaborative Filtering model. We periodically retrained the model on all of the workouts completed by all NTC users to create a new matrix and include new workouts. Over time, our model became more intelligent with its accuracy reaching up to 32 percent. Between these two variants, Collaborative Filtering was declared the new Champion.
At Nike, we relentlessly focus on serving users. Although the Collaborative Filtering model proved to be a successful experiment, the team was eager to build an even better experience for our athletes*. Collaborative Filtering models are limited by the fact that they work on only three inputs: user, product and rating. Our NTC users, on the other hand, also provide us with valuable information such as their workout level, gender and age. Additionally, the workouts offered in NTC are developed at Nike. This gives us access to the metadata associated with those workouts, such as duration, drills, focus and intensity.
For the second round, we decided to go the deep-learning route with a Neural Network model. A Neural Network model can accept many different kinds of inputs. In Collaborative Filtering, we could only feed in the user’s workout history, while a Neural Network can also weigh the user’s profile information, even in the absence of a workout history.
The TensorFlow library was chosen for this technique. TensorFlow is an open source software library for high-performance numerical computation. Originally developed by researchers and engineers from the Google Brain team within Google’s AI organization, it has good support for deep learning and is capable of running on GPUs. TensorFlow also provides a feature named TensorBoard, which is great for comparing models.
Model Training and Validation
The Neural Network was trained on two groups of inputs:
- User Data — These are unique to each user.
  - Workout Completion — This was also used in Collaborative Filtering.
  - Workout Level — NTC users provide their workout level during onboarding, selecting from Beginner, Intermediate or Advanced.
  - Rated Perceived Exertion (RPE) — NTC users provide an RPE after they complete a workout. This tells us how intense they perceived the workout to be.
- Workout Metadata — This is the data from the NTC Workout Catalog. It is constant for all users.
  - Intensity — Low, Moderate, High
  - Focus — Strength, Mobility, Endurance
  - Equipment Group — None, Basic, Full
  - Drills — Similarity of drills utilized across workouts
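Categorical inputs like the ones listed above are typically turned into numeric vectors before being fed to a network. The following is a minimal one-hot encoding sketch under that assumption; the post does not describe the actual featurization, RPE and drill similarity are left out for brevity, and the helper names are hypothetical.

```python
LEVELS = ["Beginner", "Intermediate", "Advanced"]
INTENSITIES = ["Low", "Moderate", "High"]
FOCUSES = ["Strength", "Mobility", "Endurance"]
EQUIPMENT = ["None", "Basic", "Full"]

def one_hot(value, vocabulary):
    """Encode a categorical value as a vector with a single 1.0."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if value == item else 0.0 for item in vocabulary]

def user_features(level, completed, catalog):
    """Workout level plus a binary completion flag per catalog workout."""
    history = [1.0 if workout in completed else 0.0 for workout in catalog]
    return one_hot(level, LEVELS) + history

def workout_features(intensity, focus, equipment):
    """Concatenate the one-hot encodings of the workout metadata fields."""
    return (one_hot(intensity, INTENSITIES)
            + one_hot(focus, FOCUSES)
            + one_hot(equipment, EQUIPMENT))

# A brand-new user still gets a meaningful vector from their profile alone.
new_user = user_features("Beginner", completed=set(), catalog=["core_10", "yoga_flow"])
```

This is what lets the model say something useful about a user with no history: the completion flags are all zero, but the profile portion of the vector still carries signal.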
Similar to Collaborative Filtering, the backend accuracy was calculated on the validation dataset by predicting workouts and comparing them with the actual workouts completed. The dataset was again split temporally to reflect the real-world behavior of a user possibly completing a workout in the future. We evaluated several Stochastic Gradient Descent (SGD) optimization techniques. The graph below shows the various optimizers and their loss and accuracy results.
- Light Blue = Adam
- Purple = RMSProp
- Dark Blue = Adagrad
- Red = SGD with decayed learning rate
- Yellow = Adadelta
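The optimizers in the legend above differ mainly in how they scale each update step. As a rough illustration only, the sketch below implements two of them, SGD with a decayed learning rate and Adam, on a toy one-dimensional loss. The loss function, learning rates and decay schedule are assumptions and have nothing to do with the actual training setup.

```python
from math import sqrt

def grad(x):
    """Gradient of the toy loss f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)

def sgd_with_decay(x, steps, lr0=0.1, decay=0.01):
    """Plain gradient steps with a learning rate that shrinks over time."""
    for t in range(steps):
        lr = lr0 / (1.0 + decay * t)
        x -= lr * grad(x)
    return x

def adam(x, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: steps scaled by running averages of the gradient and its square."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (sqrt(v_hat) + eps)
    return x

x_sgd = sgd_with_decay(0.0, steps=500)
x_adam = adam(0.0, steps=500)
```

Both reach the minimum at x = 3 on this easy problem; the differences between optimizers only become visible on real, high-dimensional losses, which is why an empirical comparison is needed.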
We decided to go with the Adam optimizer based on the above comparison and on literature published by other researchers in the industry. Other than the optimizer, we tuned the network’s remaining hyperparameters.
We utilized the Hyperopt Python library to tune hyperparameters and chose the Tree-structured Parzen Estimator algorithm implemented by Hyperopt, which has been shown in the literature to outperform random and grid searches. After several iterations, we ultimately went with a three-layer neural network, with 1400, 700 and 450 nodes in the three layers, which had a validation accuracy of ~32 percent.
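To get a feel for the size of such a network, the sketch below counts trainable parameters for the reported hidden layer sizes. It assumes fully connected layers with biases, and the input and output dimensions are hypothetical placeholders, since the post does not state them.

```python
HIDDEN_LAYERS = [1400, 700, 450]  # node counts reported above

def dense_parameter_count(n_inputs, hidden_layers, n_outputs):
    """Weights plus biases for a stack of fully connected layers."""
    sizes = [n_inputs, *hidden_layers, n_outputs]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

# Hypothetical dimensions: 300 input features, one output per workout (150).
total = dense_parameter_count(300, HIDDEN_LAYERS, 150)
print(total)
```

Most of the parameters sit between the first two hidden layers, so those layer widths are the ones that matter most for training time and memory when tuning.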
With Collaborative Filtering being the Champion from the first round, we ran another A/B test with the Neural Network model as the new Challenger. We kept the KPI the same: driving engagement by encouraging more users to start a workout.
The A/B testing results showed that workouts from the Picks for You section were started five percent more often when powered by the Neural Network model than by Collaborative Filtering. This way we were able to gain five percent more engagement on top of the 57 percent engagement gain from the initial model! We periodically retrain the model and, over time, our model has become more intelligent, with its accuracy reaching 37 percent. Between the two variants, the Neural Network model was declared the new Champion.
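As a back-of-the-envelope check, relative lifts from successive tests multiply rather than add. The sketch below combines the two reported lifts under the assumption that the tests are directly comparable (same metric, similar populations), which the post does not claim; treat the result as illustrative only.

```python
def compound_lift(*lifts):
    """Combine successive relative lifts, e.g. 0.57 then 0.05."""
    total = 1.0
    for lift in lifts:
        total *= 1.0 + lift
    return total - 1.0

# 57 percent over the curated baseline, then 5 percent over Collaborative Filtering.
overall = compound_lift(0.57, 0.05)
print(f"{overall:.2%}")
```

Under that assumption, the Neural Network variant would sit roughly 65 percent above the original curated baseline.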
The whole process of delivering machine learning recommendations in production consists of several steps. The component diagram below shows the various systems and how they connect to each other to serve the end users.
The above components enable collection, aggregation and featurization of the data. After data preparation, the training system picks it up for hyperparameter tuning and produces artifacts such as model checkpoints, weights, and X and Y matrix values in the case of the Collaborative Filtering model. The scoring system then leverages the data and the model artifacts to produce scores for each workout per user. The score reflects the probability that the user will complete the workout. The serving system then picks up those scores and exposes them through an API, which is accessible by the client (in this case, the NTC app). The app is then able to serve the recommendations to the end user. In the sections below, we will look at the technology and architecture of each of these systems.
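The hand-off from scoring to serving boils down to ranking each user's scored workouts and keeping the best few. A minimal sketch, with hypothetical scores and an optional exclusion set that is an assumption rather than something the post describes:

```python
import heapq

def top_picks(scores, exclude=frozenset(), n=5):
    """Return the n highest-scoring workout IDs, skipping any excluded ones."""
    candidates = {w: s for w, s in scores.items() if w not in exclude}
    return heapq.nlargest(n, candidates, key=candidates.get)

# Hypothetical per-user scores: workout ID -> estimated completion probability.
user_scores = {"core_10": 0.42, "yoga_flow": 0.31, "hiit_20": 0.77, "run_prep": 0.12}

picks = top_picks(user_scores, n=2)
```

Because the scores are computed ahead of time in batch, the serving side only has to look up and return a short precomputed list, which keeps response times low.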
Training and Scoring
The training and scoring systems of Workout Recommender consist of several big data and machine learning jobs. Using a user’s latest workout history, we calculate workout recommendations for them daily. To enable this, we put several of the AWS services and open source tools to work.
- Languages — Python and Scala
- Data Processing and Machine Learning — The AWS Elastic MapReduce (EMR) service enables us to process large amounts of data at scale with dynamically scalable EC2 instances. Our big data and machine learning jobs run on EMR with Apache Spark libraries. The Spark SQL and Dataset libraries are used for data preparation, and Spark MLlib is used for the Collaborative Filtering model.
- Deep Learning — GPU-enabled AWS EC2 instances run the deep learning workload with the TensorFlow library. Data is converted to Protocol Buffers during the preparation step to allow for faster processing at large scale.
- Storage — Data is stored with the AWS S3 service at several steps. Apache Spark provides robust hooks to read and write data from S3. Additionally, the Parquet columnar storage format is used, which improves performance even further.
- Data Access — The Nike Consumer Data Engineering team provides several teams at Nike with access to customer data. The data is accessed through Hive Metastore, and the curated tables are used in the data preparation jobs.
- Workflow Management — As there are several data processing jobs involved in the recommender systems, we use Apache Airflow to create, schedule and manage the workflows. Airflow also allows us to connect the various workflows to facilitate the dependency management between the jobs. Several Airflow operators construct the workflow. For example, the EMR operator is used to configure and spin up an EMR cluster. The Genie operator is used to manage job submissions to those clusters via Genie services.
- Logging, Monitoring and Alerting — An in-house built and hosted Elasticsearch, Logstash and Kibana (ELK) stack is used for logging. There are email and PagerDuty alerts in place to notify the team when the jobs miss the Service-Level Agreement (SLA) due to duration or unexpected behaviors. One example is degraded backend accuracy of a model on retraining. We continue to make improvements in the logging and monitoring parts of our machine learning pipelines.
- CI/CD — Like other engineering systems, training and scoring systems also follow the CI/CD process with Bitbucket, Pull Requests, Jenkins and Automated Testing. We are continuing to make improvements in this area as well.
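The dependency management mentioned under Workflow Management amounts to running jobs in an order that respects their prerequisites. The sketch below shows that idea with Python's standard library rather than Airflow itself; the job names and their dependencies are hypothetical. In Airflow the same relationships are declared between operators, for example with the `>>` syntax.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each job maps to the set of jobs it depends on.
dependencies = {
    "prepare_data": set(),
    "train_model": {"prepare_data"},
    "score_users": {"prepare_data", "train_model"},
    "publish_scores": {"score_users"},
}

def run_order(dependencies):
    """Return an execution order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())

order = run_order(dependencies)
print(order)
```

A scheduler adds retries, timing and alerting on top, but the core guarantee is the same: scoring never starts before the data and the model it needs are ready.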
The high-level architecture diagram below shows how the stacks listed above connect to bring the training and scoring systems to life. Note that the diagram doesn’t show the complexity of building the data lake, as that could be an entire blog post in itself. To avoid redundancy, the various inter-dependencies between the scoring and training jobs are not shown in the component diagram.
Serving Up Recommendations
As we highlighted in our DynamoDB post, Nike has been operating in the microservices world for a couple of years now. For this new system, we created a new service with the purpose of serving Workout Recommendations to the Nike Training Club system. Similar to the machine learning systems, we utilize several open source and AWS services along with a few vendor solutions to build the recommendation serving system.
- Language — Scala, Python
- Framework — Nike has a homegrown framework called Scalpel for building microservices and data processing pipelines in Scala atop the Akka actor system. The framework utilizes Akka HTTP for both the server and client-side HTTP stacks. Scalpel also provides additional tools and conventions for building HTTP services at Nike, such as logging and service discovery.
- Service Discovery — We use Eureka for registration and service discovery.
- Storage — AWS S3 was used here, as well, since it’s well supported by the machine learning tools, and its performance is acceptable for our response time requirements. We applied a custom S3 key indexing strategy and performance-tested the response time. We were then able to roll out the service with the least amount of storage overhead. The infrastructure supports Redis, a faster access storage system, in case we need to switch in the future for batch recommendations.
- Logging, Monitoring and Alerting — We use Splunk for logging with a homegrown distributed tracing library called Wingtips, which has been open sourced for the community. We also use New Relic for measuring and monitoring the performance of the service and PagerDuty for Alerting.
- CI/CD — Like several other services at Nike, the Workout Recommender service also follows the CI/CD process with Bitbucket, pull requests, Jenkins and automated testing. We use AWS CloudFormation to follow the practice of infrastructure as code.
- Security — The recommendation endpoints are secured via JSON Web Tokens (JWT).
- Performance — Services are deployed in Auto Scaling groups to handle increases and decreases in traffic. We used Gatling for performance testing.
- A/B Testing — User bucketing and A/B experiment activations were done with a homegrown A/B tool. An existing analytics system sends data to the Nike data lake. We were able to use the same system to measure the health of the experiment. Since the success of Workout Recommender System, we have invested in more advanced tooling for A/B testing, which is already providing value in other production systems.
Apart from serving the machine learning recommendations, the Workout Recommender service provides workouts curated for our new users by our in-house trainers. This lets us test new models with a baseline created by the experts in the health and exercise fields.
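The fallback behavior described above can be sketched as a simple lookup: return the model's picks when they exist, otherwise the trainer-curated list. The data structures, keys and the `recommendations_for` helper are hypothetical; keying curated lists by gender and fitness level follows the description of the original curated experience earlier in the post.

```python
def recommendations_for(user_id, gender, level, ml_picks, curated):
    """Prefer machine learning picks; fall back to curated workouts."""
    picks = ml_picks.get(user_id)
    if picks:
        return picks
    return curated.get((gender, level), [])

# Hypothetical precomputed picks and curated lists.
ml_picks = {"u1": ["hiit_20", "core_10"]}
curated = {("female", "Beginner"): ["yoga_flow", "core_10"]}

returning_user = recommendations_for("u1", "female", "Beginner", ml_picks, curated)
new_user = recommendations_for("u9", "female", "Beginner", ml_picks, curated)
```

Keeping the curated path in the same service also means the baseline variant of an A/B test is served through exactly the same code path as the challenger.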
The high-level architecture diagram below shows how the stack listed in the section above connects to bring the serving system to production. The details of some of the boilerplate systems were omitted to keep the diagram less busy.
Give it a Try!
We hope this blog gives you a good understanding of the science and engineering behind the NTC Workout Recommender. We also hope you download the NTC app and give our Workout Recommender a try! We are excited to continue to deliver more experiences to our athletes* in the future with real-time inference and more intuitive ways to connect with Nike.
Credit: Thank you Kristen Harrison, Nike Data Science Manager, for your contribution in the Neural Network section of this post.