In the previous blog post we described the business problem and what we learned while tackling it. In this second post I want to shed light on the technical solution, share a few learnings and walk through a basic code example. Each of the topics touched upon here is worth a blog post of its own, so please refer to the links at the bottom of this post for further reading.
As a reminder, the overall problem is to recommend which apps to display ads on for a given advertiser app. Assume that all you know is:
- The advertiser app
- A set of apps where you could display ads for that advertiser app
- Description texts about every app (these can easily be scraped from the app stores)
That would give us the following dataset, where one of the apps is the advertiser app and the rest are apps we can display ads on. The category is unknown at this point but displayed to make the example easier to follow.
This setup reduces our options for the recommender system quite a bit, since all we know about a user is that they have a preference for the particular app they are being targeted in. Knowing which category each app belongs to would allow us to recommend apps that are similar to the one the user is already using. If you also knew your users' usage history, these categories could be used to provide insights such as: users who like to play racing games also like to play action games.
Building the app / text classifier
In the code example we have six apps and their description texts. As a first step we extract the words we deem descriptive of an app's content. For this demonstration we stick to extracting just the nouns, since they tend to be descriptive words related to content (e.g. candy, race, car). To find out which words in a sentence are nouns we use part-of-speech (POS) tagging. Finally, we keep only the word stems of the nouns so that different forms of the same word are treated as one. Below we see that each app is now represented by a list of stemmed nouns that are hopefully descriptive enough of what the app is all about.
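The noun extraction described above can be sketched roughly as follows. This is not the code from the linked repository: the function `noun_stems` is a hypothetical helper, the tagger and stemmer are passed in as plain callables, and `toy_tag` and `toy_stem` are deliberately crude stand-ins (a lookup table and a plural stripper) so the sketch runs without any NLP library. The only assumption about a real tagger is that it returns (word, tag) pairs with Penn Treebank noun tags.

```python
import re

# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_stems(text, tag, stem):
    """Return the lower-cased, stemmed nouns of `text` in order of appearance.

    `tag` maps a list of tokens to (token, pos_tag) pairs,
    `stem` maps a single word to its stem.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [stem(word.lower()) for word, pos in tag(tokens) if pos in NOUN_TAGS]


# Toy stand-ins (assumptions for illustration only, not real NLP components).
TOY_LEXICON = {"cars": "NNS", "tracks": "NNS", "race": "VB", "fast": "JJ"}


def toy_tag(tokens):
    # Look each token up in a tiny hand-written lexicon; unknown words get "XX".
    return [(t, TOY_LEXICON.get(t.lower(), "XX")) for t in tokens]


def toy_stem(word):
    # Strip a trailing plural "s"; a real stemmer handles far more cases.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


print(noun_stems("Race fast cars on twelve tracks", toy_tag, toy_stem))
```

With NLTK installed and its tagger data downloaded, `nltk.pos_tag` and `nltk.stem.PorterStemmer().stem` have the shapes this function expects and could replace the two toy callables.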
These nouns can now be used to compute pairwise similarities between apps. Here we use the cosine similarity metric, which is well suited for text data. At this point we could have called it a day. However, in the vast majority of cases having labeled data is going to yield better results than sticking to unsupervised methods. Collecting labeled data, however, is time-consuming. But this problem can be solved! The pre-computed similarity scores allow us to retrieve candidate training data for any given advertiser app in an unsupervised fashion. This is a neat trick to speed up the training data collection process. The code example visualises the idea: to find other racing games to label, you simply pass in an app you know is a racing app (here: Nitro Nation 6) and receive a sorted list which, if all goes well, has racing apps at the top. The person responsible for creating the labeled dataset then only has to sift through the top results and remove the bad apples by looking at the description texts. Tesla has recently showcased the usefulness of speeding up the training process in this way, training their neural networks on edge cases efficiently by querying their fleet's data for relevant training samples. I think this is a simple but powerful idea.
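A minimal sketch of this retrieval step, assuming each app is already represented by a list of stemmed nouns. The app names and noun lists are made up for illustration, and `cosine_similarity` and `rank_candidates` are hypothetical helpers written from scratch on raw word counts rather than taken from the original code.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bags of words given as token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(seed, apps):
    """Sort all apps except `seed` by descending similarity to `seed`."""
    scores = [
        (name, cosine_similarity(apps[seed], stems))
        for name, stems in apps.items()
        if name != seed
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical apps and noun stems, invented for this example.
apps = {
    "Racing A": ["car", "race", "track", "speed"],
    "Racing B": ["car", "race", "engine"],
    "Match3 A": ["candy", "puzzle", "level"],
    "Match3 B": ["jewel", "puzzle", "level", "candy"],
}

ranking = rank_candidates("Racing A", apps)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The labeler would start from the top of `ranking` and stop once the results no longer look like the seed category. Raw counts keep the sketch short; weighting schemes such as TF-IDF are a common refinement.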
Finally, to classify our apps we use the Naive Bayes classifier. This simple classifier is known to be very efficient at solving text classification problems involving a lot of classes, and that held true here. The classifier in our example below has been trained on the above racing and match3 apps and predicts these categories well on new description texts. Interestingly, using only the nouns would have greatly decreased the classifier's accuracy, so the description texts evidently contain useful information beyond the nouns. Trouble arises when we introduce categories that the classifier has never seen during training (here, a solitaire card game): it still assigns the app to one of the known categories, and its probability estimates are too confident.
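To make the classification step concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is an illustration under assumptions, not the code behind the results above: the class `NaiveBayes`, the training documents and the test inputs are all invented, and words never seen in training are simply ignored at prediction time.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, tokens):
        n_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / n_docs)  # class prior
            for w in tokens:
                if w in self.vocab:  # skip words unseen during training
                    score += math.log((counts[w] + self.alpha) / total)
            log_scores[label] = score
        # Normalise in a numerically stable way so the values sum to 1.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}


# Hypothetical training data: stemmed tokens per app description.
docs = [
    ["car", "race", "track", "speed"],
    ["car", "race", "engine", "drift"],
    ["candy", "puzzle", "level", "match"],
    ["jewel", "puzzle", "level", "swap"],
]
labels = ["racing", "racing", "match3", "match3"]
model = NaiveBayes().fit(docs, labels)

racing_like = model.predict_proba(["race", "car", "road"])
solitaire_like = model.predict_proba(["card", "deck", "level"])
print(racing_like)
print(solitaire_like)
```

The second call shows the limitation mentioned above: the probabilities are normalised over the known classes only, so a solitaire-like description is still pushed into racing or match3, here on the strength of a single shared word.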
Now we have the ability to take a set of apps and understand which ones are most likely to belong to which of our pre-defined categories. As you can see, this complex task can be solved with surprisingly few steps. The real world is of course more complicated than the toy example presented here. First, there are the common problems with text data such as varying languages, undescriptive texts and hard-to-catch edge cases (think golf-cart racing). Second, as the solitaire card game shows, the classifier struggles with apps from categories it has never seen during training, and its probability estimates are not very reliable. Finally, coming up with useful categories such as racing game, golf game etc. proved to be a difficult and quite subjective choice.
The code (link provided below) used for this demonstration is kept as short as possible, and the list of possible improvements is long. The app store description texts certainly make for an interesting dataset to train your data science muscles on. Scrapers to get the raw data are readily available on GitHub if you don't want to code it all up yourself.
Links explaining topics mentioned in this post:
- Part of speech tagging (medium)
- Stemming (datacamp)
- Cosine Similarity (MLplus)
- Text Feature extraction with Scikit
- Naive Bayes Classifier in text classification (towardsdatascience)