Machine Learning Summary (Topics)
Every e-commerce company and content provider needs a recommender system that suggests related products or relevant content to users. Even though there is plenty of room for improvement, recommender systems are one of the most successful stories in ML. However, as Mike Tyson said, “Everyone has a plan until they get punched in the mouth.” The recommender systems deployed in real life are built from years of lessons learned. They deal with far more diversified objectives and tougher scalability constraints. We are not minimizing the MSE of our predictions on historical datasets, and our objective is not winning the leaderboard of a competition. Because of the scope, we will cover the basics only. But let’s point out the elephant in the room first: real production systems are tedious, with many ensemble methods applied in multiple steps. This is a hard topic with conflicting objectives.
Content-based filtering recommends items based on the similarity between an item’s features and a user’s preference. For example, we can extract features from a movie description and match them against what a user prefers: the movie “Spy” can be tagged as a McCarthy movie and will match users who watch McCarthy movies regularly. Information about a movie can be extracted from the movie title, the description, and information provided by the studio. A user preference can be determined from explicit information given by the user, like demographic information, or from information collected indirectly from what the viewer watches, interacts with, and purchases. Many recommender systems also need to deal with the cold start problem, when not enough information has been collected for a new user. Deriving user meta-information from user interactions is therefore important. For example, we can collect the descriptions of movies that a user watched and use them as input to extract a user preference.
There are many feature extraction techniques. We can create a bag of words from the movie description or from the words collected for the user preference.
Then we use the TF-IDF method to compute each value in this vector. TF-IDF reduces the importance of words that occur frequently across movie descriptions.
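As a minimal sketch of the idea, the snippet below computes raw TF-IDF weights (term frequency times log of inverse document frequency) over a few hypothetical movie descriptions. Real libraries such as scikit-learn use smoothed variants, so the exact numbers differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute sparse TF-IDF vectors for a list of tokenized documents.

    tf = raw term count in the document,
    idf = log(N / document frequency). This is the unsmoothed
    textbook formula, not the variant used by any particular library."""
    n = len(docs)
    df = Counter()                      # in how many documents each word appears
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

# hypothetical tokenized movie descriptions
docs = [["funny", "spy", "comedy"],
        ["spy", "action", "thriller"],
        ["funny", "comedy", "romance"]]
vecs = tfidf(docs)
```

A word like “spy” that appears in two of the three descriptions gets a lower weight, log(3/2), than a word like “action” that appears in only one, log(3).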
We compare the similarity between the vector of a movie and that of a user preference to make recommendations. For example, we can use cosine similarity to match a user to different movies.
Other similarity measures, including the Pearson coefficient, L1/L2 distance, and Jaccard similarity, can be tried depending on how the vector is calculated.
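A short sketch of cosine matching on sparse vectors, with made-up TF-IDF weights for a user profile and a movie:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# hypothetical TF-IDF weights for a user preference and one movie
user_profile = {"spy": 0.8, "comedy": 0.5}
movie = {"spy": 0.4, "comedy": 0.9, "drama": 0.2}
score = cosine_similarity(user_profile, movie)
```

The movie with the highest score against the user profile is the one we would recommend first.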
As in many ML methods, data preprocessing is important. We review the vocabulary used in preparing the bag of words and limit its size. Stop words, like “the”, should be removed. Review the top words manually and remove those that are irrelevant in your problem domain. We may also match phrases and treat each as a single word in the vocabulary. In addition, variants of a word should map to the same word: for example, all plurals can be treated as singulars.
Nevertheless, even though TF-IDF is frequently mentioned in the recommender system literature, it reduces the importance of common category words like “comedy” that can be important in making recommendations. TF-IDF plays an important role in query-style search, but in some contexts it produces too narrowly focused recommendations and hurts their diversity. So some systems eliminate common and rare words that have little impact on recommendations, and then simply use the frequency or the occurrence of the remaining words instead of TF-IDF. The exact choice is influenced by the application context: the words selected for the bag of words, the number of item suggestions, and the functions the application provides are all important factors. This is a good example of why we cannot take textbook recommendations for granted, particularly in the field of recommender systems. We are dealing with far more complex objectives, and they often conflict with each other.
Collaborative Filtering (User to user or Item to item)
Content-based filtering is limited by the quality of the descriptions provided by the content provider. Technically, there is a limit to what features can be extracted from a small amount of content information. Extracting “content” information from the user is hard too. In reality, it is easier to judge people by what they do than by what they say. Behavioral information, like what users view and how they rate items, is much easier to collect. In collaborative filtering, we focus on making predictions from user behaviors instead of extracting content features. The abundance of this information makes collaborative filtering appealing.
In collaborative filtering, we are provided with labels but not content features. The system collects behavioral information on how you interact with items: what you rate, how you rate it, what you view, and/or what you purchase. One typical approach is to create a user rating matrix containing the users’ ratings on movies. We then find similarities and make recommendations.
For example, we can locate people who give similar movie ratings to yours and suggest movies that you have not watched but that these people rated highly (user-to-user). Alternatively, instead of finding similarities between people, we can find similarities between movies (item-to-item) and make suggestions based on what you already watch. Regarding the cold start problem, item-to-item suggestions may be easier when information about a new user is limited.
One possible implementation of collaborative filtering is to locate the k-nearest neighbors and use their average rating as our prediction. Alternatively, we can compute the similarity (u) between you and every other user and use these similarities to compute a weighted average for your possible ratings. Of course, some approximation or simplification is needed to scale the second approach.
Some people are more generous in their ratings. Therefore, we may need to normalize each person’s ratings before computing their similarity. Below, we use the Pearson correlation to compute similarity.
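The two ideas above (Pearson similarity and a similarity-weighted average of mean-centered ratings) can be sketched as follows. The users and ratings are hypothetical, and a production system would restrict the sum to the nearest neighbors rather than all users:

```python
import math

def pearson(u, v):
    """Pearson correlation over the items two users both rated.
    Mean-centering cancels out how generous each rater is."""
    common = [i for i in u if i in v]
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    du = math.sqrt(sum((u[i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((v[i] - mv) ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

def predict(target, others, item):
    """Similarity-weighted average of other users' mean-centered ratings,
    added back onto the target user's own mean rating."""
    mt = sum(target.values()) / len(target)
    num = den = 0.0
    for other in others:
        if item not in other:
            continue
        s = pearson(target, other)
        mo = sum(other.values()) / len(other)
        num += s * (other[item] - mo)
        den += abs(s)
    return mt + num / den if den else mt

# hypothetical ratings: predict alice's rating for the unseen movie "m4"
alice = {"m1": 5, "m2": 3, "m3": 4}
others = [{"m1": 4, "m2": 2, "m3": 3, "m4": 5},
          {"m1": 1, "m2": 5, "m3": 2, "m4": 1}]
pred = predict(alice, others, "m4")
```

The first user rates every common movie exactly one point below Alice, so their correlation is 1 and their high rating for “m4” pulls the prediction up.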
Collaborative Filtering (Matrix factorization)
In a previous ML article, we used matrix decomposition to find the principal components of the data. This is an abstract linear algebra concept that is extremely important in ML. The key idea is that we can factorize a matrix into simpler matrices.
SVD above is one of the matrix factorization methods. It decomposes data into independent principal components like the one below.
In a nutshell, from a rating matrix, we learn the latent factors (the principal components) that users use in rating movies. For example, these latent factors may be the genre of the movie, the year of release, or the actors in the movie. But we don’t define them explicitly, and we do not know what latent factors were learned unless we examine the results manually. We let the computer learn them from the data, just like in other ML methods. Matrix decomposition is a big topic and we will not elaborate on it further. A high-level discussion can be found here, and a more detailed one here.
The number of latent factors can be huge, but many of them may not be important. In PCA, we simply truncate those that are not important (those with a very small singular value σᵢ compared to the others).
So even if the rating matrix is large, we can reduce it to latent factors that represent a user or a movie in a much lower dimension. This is why the concept is also termed low-rank matrix factorization: we are reducing the rank of the matrix (its dimension) through factorization.
With the concept of SVD, any user rating matrix can be decomposed into a matrix P containing the user latent factors and a matrix Q containing the item latent factors. A user’s rating for a movie can be reconstructed by multiplying the corresponding latent factors of the user and the movie.
But how can we find these matrices if there are missing entries in the user rating matrix? Users rate or interact with only a handful of movies. To solve that, we apply SGD to learn the decomposed matrices Z and W.
But we include errors for valid entries in the rating matrix R only. We do not add an error term to the cost function for entries with missing ratings.
In addition, individual users have rating biases, and some cult movies are rated much higher than average. We add these biases when making a prediction and add regularization terms for them in the cost function.
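Putting the last three paragraphs together, here is a minimal sketch of biased matrix factorization trained by SGD. The toy data, factor dimension, learning rate, and regularization strength are all illustrative choices; note the loop runs only over observed (user, item, rating) triples, so missing entries contribute no error:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.1, epochs=500):
    """SGD matrix factorization with user/item biases.

    ratings: list of (user, item, rating) for OBSERVED entries only.
    Prediction: global mean + user bias + item bias + Z[u] . W[i]."""
    random.seed(0)
    Z = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    W = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    bu = [0.0] * n_users                 # user rating biases
    bi = [0.0] * n_items                 # item rating biases
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global mean
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(Z[u][f] * W[i][f] for f in range(k))
            e = r - pred
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            for f in range(k):
                zu, wi = Z[u][f], W[i][f]
                Z[u][f] += lr * (e * wi - reg * zu)
                W[i][f] += lr * (e * zu - reg * wi)
    return Z, W, bu, bi, mu

# toy 3-user x 3-movie data; user 2 has never rated item 2
obs = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 0, 5), (2, 1, 4)]
Z, W, bu, bi, mu = factorize(obs, 3, 3)
# predict the missing entry from the learned factors and biases
pred = mu + bu[2] + bi[2] + sum(Z[2][f] * W[2][f] for f in range(2))
```

After training, the model fits the observed entries better than the global-mean baseline, and the same formula fills in the missing cell.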
Many Markov processes can be solved with random walks. Say we drop off 100 shoppers at random around the downtown area of San Francisco. We provide a transition matrix with the probability of where a shopper heads next from the current position. Eventually, we can spot where the most interesting shops are located. Recommender systems may use a similar concept, except we drop all shoppers in one place and check where they may be after a certain time.
We can represent each movie as a node in a graph. Based on many factors, including which products users may select next, which items are viewed or purchased together, which products a user rated highly together, and other user behaviors, we can formulate a transition probability from one product to another. These probabilities can be represented as the transition matrix below. We want to know the probability distribution over where we may land as time approaches infinity. These probabilities serve as the ranking for making product recommendations.
Fortunately, we can find this solution in a finite number of steps. If we keep multiplying by the transition matrix above, the result converges, and in many cases it converges very fast (we don’t need to perform infinite steps). The vector on the R.H.S. gives the probability of each state we may be in.
Alternatively, we can find the eigenvectors of the transition matrix, but this is usually hard to scale for a real e-commerce system. In practice, we sample transitions, and after many trials we estimate the probability distribution from where we land in each trial.
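The repeated-multiplication approach can be sketched in a few lines. The 3-item transition matrix is made up; each row holds the probabilities of moving from that item to every item:

```python
def stationary(P, steps=100):
    """Estimate the stationary distribution of a Markov chain by
    repeatedly applying the row-stochastic transition matrix P
    (P[i][j] = probability of moving from state i to state j)."""
    n = len(P)
    dist = [1.0 / n] * n                 # start shoppers spread uniformly
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

# hypothetical 3-item transition matrix (each row sums to 1)
P = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.4, 0.1]]
ranking = stationary(P)    # higher probability = stronger recommendation
```

Once the distribution stops changing under multiplication, it is the stationary distribution we rank by.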
One of the most difficult parts is computing the transition probabilities from user behaviors. Like many engineering solutions, this involves a lot of trial and error, heavy experimentation, and tuning. That is why many recommender systems are tedious, with many known or unknown legacies behind them.
Google PageRank utilizes the inbound and outbound links of a page to rank search results. The basic idea is similar to a random walk on the Markov chain that we just discussed. The details of PageRank are discussed in the context of eigenvectors here.
By finding the steady state of such a random walk, the corresponding probability of each state can be used for the final ranking. The problem is equivalent to finding the eigenvectors in a matrix factorization.
In practice, solving it directly for all the web pages in the world is hard. Instead, we keep multiplying by the matrix R and hope that it converges fast. The PageRank paper demonstrated that with 322 million page links, the solution converges to a tolerable limit in 52 iterations. So the solution scales unexpectedly well (details).
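Here is a sketch of that iteration on a tiny, made-up link graph. It includes the damping factor from the original PageRank paper, which mixes in a small probability of jumping to a random page so the walk cannot get trapped:

```python
def pagerank(links, d=0.85, iters=50):
    """Iterative PageRank. links[p] = list of pages that p links to.
    d is the damping factor; (1 - d) models a random jump to any page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if not outs:                     # dangling page: spread rank evenly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

# hypothetical link graph: C has many inbound links, D has none
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
```

Pages with many inbound links from well-ranked pages end up with higher rank, which is exactly the stationary-distribution idea from the previous section.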
Here is an example comparing the Coaches Poll NCAA basketball ranking with one computed using the random-walk concept of a Markov process. It is impressive that the computer’s findings are close to the coaches’ poll.
The transition probabilities are updated by the following equations. If team A wins over team B, the transition probability from B to A increases. It increases further if the game is a blowout.
Transaction data shows a lot of user behaviors and patterns. By making associations between items in millions of transactions, we can design deals, product placements, and recommendations better.
Let’s define a few terms first. Support S is the probability of a set of events (say, the probability of purchasing S₁, S₂, …, Sk ∈ S).
Confidence is how often T happens when S happens.
In association analysis, we look for high support and high confidence.
For example, through association analysis, an e-commerce company can display items associated with those in your shopping cart as related products.
First, we start with the transaction records of customers. Starting from single items, we compute the corresponding support. (For simplicity, we use counts instead of probabilities in our examples.) To avoid an exponential growth of combinations, entries with a count smaller than or equal to a threshold are eliminated.
Next, we create 2-item combinations from the items remaining on the left above. This stops our tree from growing exponentially.
Then we continue to build 3-, 4-, and 5-item combinations until there are no more items left. After this exercise, we have the frequent itemsets (A, P, K, M, AP, AM, PM, KM & APM).
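The level-wise construction above can be sketched as an Apriori-style miner. The transactions below are hypothetical, chosen so that the surviving itemsets match the A, P, K, M, AP, AM, PM, KM and APM listed above:

```python
from itertools import combinations

def apriori(transactions, min_count=2):
    """Level-wise frequent-itemset mining: count 1-item sets, prune those
    below min_count, then extend to 2-item sets, 3-item sets, and so on.
    A candidate can be frequent only if all of its subsets are frequent."""
    frequent = {}
    items = sorted({i for t in transactions for i in t})
    current = []
    for i in items:                         # 1-itemset counts
        c = sum(1 for t in transactions if i in t)
        if c >= min_count:
            frequent[(i,)] = c
            current.append((i,))
    k = 2
    while current:
        candidates = sorted({tuple(sorted(set(a) | set(b)))
                             for a in current for b in current
                             if len(set(a) | set(b)) == k})
        current = []
        for cand in candidates:
            # prune: every (k-1)-subset must itself be frequent
            if any(sub not in frequent for sub in combinations(cand, k - 1)):
                continue
            c = sum(1 for t in transactions if set(cand) <= set(t))
            if c >= min_count:
                frequent[cand] = c
                current.append(cand)
        k += 1
    return frequent

transactions = [{"A", "P", "M"}, {"A", "P", "M"}, {"A", "P"}, {"A", "M"},
                {"P", "M"}, {"K", "M"}, {"K", "M"}]
freq = apriori(transactions, min_count=2)
```

The subset-pruning step is what keeps the tree from growing exponentially: a combination like AKM is never even counted, because its subset AK already fell below the threshold.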
Our next task is to create association rules. For the combination APM, the possibilities are
But as the number of items grows, the number of such combinations also grows exponentially. So we need to prioritize which itemsets to investigate first. Usually, we start with the itemset with the highest confidence.
Nevertheless, confidence alone can be misleading. The confidence P(Y|X) can be high simply because Y is popular. We want to know the additional confidence in seeing Y given X. Lift recalibrates the confidence by dividing it by P(Y).
If the lift is greater than one, users are more likely to purchase itemset 𝑌 given that they buy itemset 𝑋, and the opposite if it is smaller than one. Conviction compares how often X would occur without Y under independence to how often the rule actually fails. A conviction of 1.3 means the rule’s predictions would be wrong 30% more often if X and Y were related purely by chance.
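The four metrics can be computed directly from transaction counts. A sketch on hypothetical transactions, using the standard definitions (lift = confidence / P(Y), conviction = (1 − P(Y)) / (1 − confidence)):

```python
def rule_metrics(transactions, X, Y):
    """Support, confidence, lift and conviction for the rule X -> Y.
    X and Y are sets of items; transactions is a list of item sets."""
    n = len(transactions)
    px = sum(1 for t in transactions if X <= t) / n    # P(X)
    py = sum(1 for t in transactions if Y <= t) / n    # P(Y)
    pxy = sum(1 for t in transactions if (X | Y) <= t) / n
    conf = pxy / px                                    # P(Y|X)
    lift = conf / py
    conviction = (1 - py) / (1 - conf) if conf < 1 else float("inf")
    return pxy, conf, lift, conviction

# hypothetical transactions; evaluate the rule {A} -> {P}
transactions = [{"A", "P", "M"}, {"A", "P"}, {"A", "M"}, {"P", "M"}, {"K", "M"}]
support, conf, lift, conv = rule_metrics(transactions, {"A"}, {"P"})
```

Here the lift is slightly above one, so buying A makes P a little more likely than its base popularity alone would suggest.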
Let’s look at an example of applying the Apriori algorithm to form association rules. The example below tries to find associations among the responses to a 14-question survey. For questions with categorical answers, each possible answer is treated as a separate item. For an ordinal answer, this example treats it as either smaller than the median, or greater than or equal to the median. So the possible events look like (male, female, single, married, divorced, income < $40,000, income ≥ $40,000, …). How many different associations can we make?
Below is the general description of Star Wars from Wikipedia. Given the topics below, we can link words in the document to different topics. We soon realize Star Wars is about politics as much as entertainment.
In topic modeling, we are dealing with a chicken-and-egg problem. We have a corpus of documents, but we don’t know what topics these documents cover. Each document, however, should cover only a small set of topics. So by parsing all the documents, we may find clusters of words in sets of documents that discuss similar topics. But these clusters of words and clusters of documents are interdependent: changing one changes the other. Latent Dirichlet allocation helps us identify the set of topics and the set of words belonging to each topic. As always, we don’t know in advance what topics it may discover. Only after training can we evaluate the set of words a topic has and guess what it is.
So given a corpus, we want to find the distribution of topics over these documents and the distribution of words for each topic. For a new document, we then find the distribution of topics within it.
By performing topic modeling on NYT articles, the following 10 topics were found, each with its top 10 associated words. If we examine them closely, it is not too hard to label what the topics are.
Latent Dirichlet allocation
To solve our topic modeling, we model a distribution of topics θd for each document.
We also model the distribution of words βk for each topic.
Let’s create a generative model for a word in a document. First, for document d, we sample which topic the word belongs to in this document. Then we generate a word according to the word distribution for that topic.
To model these distributions, we use the Dirichlet distribution. This distribution can be viewed as a distribution of probability distributions.
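The generative process just described can be sketched as follows. The two topics, their word lists, and the word probabilities are entirely made up for illustration; a Dirichlet sample is drawn via the standard normalized-Gamma construction:

```python
import random

def dirichlet(alpha):
    """Sample from a Dirichlet distribution by normalizing Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

def generate_document(n_words, topic_words, alpha=0.1):
    """Toy LDA generative process: draw the document's topic mixture
    theta_d from a Dirichlet, then for each word sample a topic z from
    theta_d and a word from that topic's word distribution beta_k."""
    k = len(topic_words)
    theta = dirichlet([alpha] * k)        # topic mixture for this document
    doc = []
    for _ in range(n_words):
        z = random.choices(range(k), weights=theta)[0]
        words, probs = topic_words[z]
        doc.append(random.choices(words, weights=probs)[0])
    return doc

# two hypothetical topics with their word distributions beta_k
topics = [(["senate", "empire", "vote"], [0.5, 0.3, 0.2]),      # "politics"
          (["jedi", "lightsaber", "force"], [0.4, 0.3, 0.3])]   # "entertainment"
random.seed(1)
doc = generate_document(20, topics)
```

A small alpha makes theta concentrate on few topics, matching the assumption that each document covers only a small set of topics. Training LDA inverts this process: given only documents, it recovers the theta and beta distributions.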
Non-negative matrix factorization
Visualization & dimension reduction
Multi-dimensional scaling (MDS)
MDS visualizes data in a low dimension. PCA extracts independent features orthogonal to each other. While this is good for model training, it often loses the structure of the data and makes visualization hard. MDS focuses on creating a mapping that preserves the relative distances between data points: if two points are close in the feature space, they should be close in the latent factor space.
We create a cost function that maintains the relative distance as
In Sammon mapping, the loss function is recalibrated by the distance between the input features, so we do not miss the fine-grained structure.
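As a sketch of the MDS idea, the snippet below embeds points by gradient descent on the stress, the sum of squared differences between the given distances and the embedded distances. The distance matrix and hyperparameters are illustrative:

```python
import math
import random

def mds(D, dim=2, lr=0.05, iters=500):
    """Metric MDS by gradient descent on the stress
    sum over i<j of (D[i][j] - ||y_i - y_j||)^2, so points that are
    close in the original space stay close in the embedding."""
    n = len(D)
    random.seed(0)
    Y = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    for _ in range(iters):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dij = math.dist(Y[i], Y[j]) or 1e-12
                c = 2 * (dij - D[i][j]) / dij
                for k in range(dim):
                    grad[i][k] += c * (Y[i][k] - Y[j][k])
        for i in range(n):
            for k in range(dim):
                Y[i][k] -= lr * grad[i][k]
    return Y

# hypothetical pairwise distances between 3 points lying on a line
D = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
Y = mds(D, dim=2)
```

Sammon mapping would divide each stress term by D[i][j], so small original distances are weighted more heavily and fine detail survives.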
Data points may lie on a manifold, like the Swiss roll below. We can discover the structure and cluster data points better by following the manifold instead.
First, we need to redefine the distance between two points. The geodesic distance is the distance along the manifold. So even though the Euclidean distance between points A and R is smaller than between A and C below, AC has a much shorter geodesic distance: AC has a geodesic distance of 2, while AR has a geodesic distance of 17.
Here is the high-level IsoMap algorithm:
- Find the close neighbors around each point (points within a fixed radius or K nearest neighbors).
- Construct a graph connecting neighboring nodes with edge length equals the Euclidean distance.
- Compute the shortest path between two nodes using a shortest path algorithm — this is the geodesic distance.
- Run MDS using the computed geodesic distance.
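The first three steps above, building the neighbor graph and computing geodesic distances, can be sketched as follows; the resulting distance matrix would then be fed to MDS. The quarter-circle points and the choice k=1 are illustrative:

```python
import math
from heapq import heappush, heappop

def geodesic_distances(points, k=2):
    """Build a k-nearest-neighbor graph with Euclidean edge lengths,
    then run Dijkstra from every node to get all-pairs geodesic
    distances (shortest paths along the graph)."""
    n = len(points)
    graph = [[] for _ in range(n)]
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j)
                       for j in range(n) if j != i)
        for d, j in dists[:k]:          # k nearest neighbors of point i
            graph[i].append((j, d))
            graph[j].append((i, d))     # keep edges symmetric
    geo = [[math.inf] * n for _ in range(n)]
    for src in range(n):                # Dijkstra from every source
        geo[src][src] = 0.0
        heap = [(0.0, src)]
        while heap:
            d, u = heappop(heap)
            if d > geo[src][u]:
                continue
            for v, w in graph[u]:
                if d + w < geo[src][v]:
                    geo[src][v] = d + w
                    heappush(heap, (d + w, v))
    return geo

# hypothetical points along a quarter circle: the geodesic distance
# between the endpoints follows the arc, not the straight chord
points = [(math.cos(t), math.sin(t)) for t in (0, 0.5, 1.0, 1.5)]
geo = geodesic_distances(points, k=1)
```

For the endpoints of the arc, the geodesic distance (the sum of short chords along the curve) is longer than the straight-line Euclidean distance, which is exactly the manifold structure IsoMap preserves.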
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE is an MDS variant with special functions for d1, d2, and d3.
t-SNE produces better gaps between clusters.
E-mail spam filter
We choose words that differentiate spam from non-spam more easily than others. Then we use a bag of words to represent each email.
Then we compute the probability that the email is spam or not spam using Naive Bayes’ theorem.
To avoid underflow when multiplying many small numbers, we take the logarithm of the equation, which turns the probability multiplications into additions. People may be more tolerant of false negatives than false positives when classifying email as spam. So, instead of comparing the two probabilities directly, we can require the spam probability to clear a higher threshold before classifying an email as spam.
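The log-space computation can be sketched as a tiny Naive Bayes classifier with Laplace (add-one) smoothing. The training emails are hypothetical:

```python
import math
from collections import Counter

def train(spam_docs, ham_docs):
    """Fit per-class word counts and class priors from tokenized emails."""
    vocab = {w for d in spam_docs + ham_docs for w in d}
    counts = {"spam": Counter(w for d in spam_docs for w in d),
              "ham": Counter(w for d in ham_docs for w in d)}
    totals = {c: sum(counts[c].values()) for c in counts}
    n = len(spam_docs) + len(ham_docs)
    priors = {"spam": len(spam_docs) / n, "ham": len(ham_docs) / n}
    return vocab, counts, totals, priors

def log_posterior(doc, cls, model):
    """Sum of log probabilities instead of a product, to avoid underflow.
    Add-one smoothing keeps unseen words from producing log(0)."""
    vocab, counts, totals, priors = model
    lp = math.log(priors[cls])
    for w in doc:
        lp += math.log((counts[cls][w] + 1) / (totals[cls] + len(vocab)))
    return lp

# hypothetical tokenized training emails
spam = [["win", "money", "now"], ["free", "money"]]
ham = [["meeting", "tomorrow"], ["project", "meeting", "notes"]]
model = train(spam, ham)
doc = ["free", "money", "now"]
is_spam = log_posterior(doc, "spam", model) > log_posterior(doc, "ham", model)
```

To favor false negatives over false positives, the comparison on the last line can be tightened, for example by requiring the spam log-posterior to exceed the ham log-posterior by some margin.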
With Gaussian Distribution
Through clustering, outliers are points that are either not part of any cluster or far away from their centroid or neighbors.
But the neighbor distances (r) differ from cluster to cluster. To make a fair comparison, we normalize the distance by the average neighboring distance. Conceptually, we use the distance scale of the closest cluster in determining outlier-ness.
But when two clusters are close together, we want the normalization to include only the nodes of the relevant cluster. That can be done with a reverse lookup: first, we find a point’s k-neighbors; then we check whether each of these nodes considers the original point one of its own k-neighbors. If not, we exclude it from the normalization.
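The normalized score with the reverse lookup can be sketched as below. This is a simplified, LOF-like scheme on made-up points, not any library's exact definition:

```python
import math

def knn(idx, points, k):
    """Indices of the k nearest neighbors of points[idx]."""
    d = sorted((math.dist(points[idx], points[j]), j)
               for j in range(len(points)) if j != idx)
    return [j for _, j in d[:k]]

def outlier_score(idx, points, k=2):
    """Average distance to the k nearest neighbors, normalized by those
    neighbors' own average k-neighbor distance. Only mutual neighbors
    (whose k-NN lists include idx back) enter the normalization."""
    neighbors = knn(idx, points, k)
    own = sum(math.dist(points[idx], points[j]) for j in neighbors) / k
    mutual = [j for j in neighbors if idx in knn(j, points, k)]
    if not mutual:
        mutual = neighbors               # fall back if no reverse neighbor
    norm = 0.0
    for j in mutual:
        js = knn(j, points, k)
        norm += sum(math.dist(points[j], points[m]) for m in js) / k
    norm /= len(mutual)
    return own / norm if norm else float("inf")

# a tight cluster plus one far-away point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = [outlier_score(i, points, k=2) for i in range(len(points))]
```

Points inside the cluster score close to 1 (their neighbor distances match the local scale), while the isolated point scores far higher and stands out as the outlier.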