In policy-based methods, instead of learning a value function that tells us the expected sum of rewards given a state and an action, we learn the policy function directly: a mapping from states to actions that lets us select actions without consulting a value function.
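As a rough sketch of what that looks like in code, the network below maps a state directly to a probability distribution over actions, and we simply sample from it; the layer sizes and the Cart-Pole-style dimensions (4 state features, 2 actions) are illustrative assumptions, not details from this post:

```python
import torch
import torch.nn as nn

# A minimal policy network: maps a state vector directly to a
# probability distribution over actions. No value function involved.
class PolicyNet(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):  # Cart-Pole-sized, illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
            nn.Softmax(dim=-1),  # output action probabilities
        )

    def forward(self, state):
        return self.net(state)

policy = PolicyNet()
state = torch.randn(4)                         # a dummy state
action = torch.multinomial(policy(state), 1)   # sample an action from the policy
```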
Let us consider trying to personalize the image we use to depict the movie Good Will Hunting. Here we might personalize this decision based on how much a member prefers different genres and themes. Someone who has watched many romantic movies may be interested in Good Will Hunting if we show the artwork containing Matt Damon and Minnie Driver, whereas a member who has watched many comedies might be drawn to the movie if we use the artwork containing Robin Williams, a well-known comedian.
If you followed the Prize competition, you might be wondering what happened to the final Grand Prize ensemble that won the $1M two years later. It was a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline, but the additional accuracy gains we measured did not seem to justify the engineering effort needed to bring them into a production environment. Also, by then our focus on improving Netflix personalization had shifted to the next level. In the remainder of this post we will explain how and why it has shifted.
A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They reported more than 2000 hours of work to come up with the final combination of 107 algorithms that earned them the prize. And they gave us the source code. We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (which the community generally called SVD, for Singular Value Decomposition) and Restricted Boltzmann Machines (RBM). SVD by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of the two reduced the error to 0.88. To put these algorithms to use, we had to work to overcome some limitations: for instance, they were built to handle 100 million ratings rather than the more than 5 billion that we have, and they were not built to adapt as members added more ratings. But once we overcame those challenges, we put the two algorithms into production, where they are still used as part of our recommendation engine.
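For a flavor of what the factorization and the blend involve, here is a minimal sketch of SGD-trained matrix factorization over (user, item, rating) triples, plus a linear blend of two models' predictions; the rank, learning rate, and regularization values are illustrative assumptions, nothing like the tuned production versions:

```python
import numpy as np

def factorize(ratings, n_users, n_items, rank=20, lr=0.01, reg=0.05, epochs=20):
    """SGD matrix factorization (the 'SVD' of the Prize community).

    ratings: iterable of (user_index, item_index, rating) triples.
    Returns user factors P and item factors Q; predicted rating = P[u] @ Q[i].
    """
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            pu = P[u].copy()                           # snapshot before update
            err = r - pu @ Q[i]                        # prediction error
            P[u] += lr * (err * Q[i] - reg * pu)       # gradient step with
            Q[i] += lr * (err * pu - reg * Q[i])       # L2 regularization
    return P, Q

def blend(pred_svd, pred_rbm, w=0.6):
    # A linear blend of two models' predictions; w is tuned on held-out
    # data, which is how a blend can score below either model alone.
    return w * pred_svd + (1 - w) * pred_rbm
```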
Reinforcement learning is like trying to hit a moving target, as the true values are calculated by the same network we are training. This is the big discrepancy between supervised learning — another subset of machine learning that includes tasks like image classification and sentiment analysis — and reinforcement learning. Supervised learning uses datasets that are labeled, meaning that the target values are manually set by humans and assumed to be accurate and unchanging. Reinforcement learning creates its own, ever-shifting dataset, both because the network generates its own target values and because the action choices of the network directly impact which states it will reach in its environment, and therefore what it will have to learn about. To help manage that, we actually take two extra stabilization measures. First, we duplicate the neural network, using the frozen copy as a target network that generates the target Q-values and is only periodically synced with the network being trained.
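A minimal sketch of that first measure might look like this (the tiny network, its sizes, and the update code are illustrative assumptions, not the post's actual implementation):

```python
import copy
import torch
import torch.nn as nn

# A tiny Q-network stand-in (4 state dims, 2 actions; purely illustrative).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# First stabilization measure: duplicate the network. The copy ("target
# network") produces the TD targets, so the targets stop moving on every
# gradient step and only jump when we sync.
target_net = copy.deepcopy(q_net)

def td_targets(rewards, next_states, dones, gamma=0.99):
    # rewards, dones: float tensors of shape (batch,); next_states: (batch, 4).
    with torch.no_grad():  # no gradients flow through the targets
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1 - dones) * next_q

def sync_target():
    # Periodic hard update: copy the trained weights into the target network.
    target_net.load_state_dict(q_net.state_dict())
```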
The agent can’t really see the future, and can’t assign a unique Q value to each of the possible states. That’s where the deep learning comes in. By mapping pixel images to Q values, our neural network acts as a Q-function approximator, so that while it can’t see the future, with enough training it can learn to predict it.
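A pixel-to-Q-values approximator might look something like the following small convolutional network; the layer shapes follow the commonly used DQN recipe (4 stacked 84x84 grayscale frames in, one Q value per action out) and are an assumption, not necessarily the exact architecture used here:

```python
import torch
import torch.nn as nn

class ConvQNet(nn.Module):
    """Maps a stack of 4 grayscale 84x84 frames to one Q value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 84x84 input shrinks to 7x7 here
            nn.Linear(512, n_actions),
        )

    def forward(self, frames):
        return self.head(self.features(frames))

q = ConvQNet(n_actions=4)
print(q(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```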
With a three-layer Q-network, it took only 493 episodes to solve the problem! For a simple problem like Cart-Pole, the Q-table method was definitely faster. However, the Q-network method has the potential to tackle much harder problems!
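For concreteness, a Cart-Pole training loop with a three-layer Q-network might be sketched as below. This assumes the gymnasium package; the hyperparameters, fixed epsilon, and the bare one-step TD update (no replay buffer or target network, to keep it short) are simplifying assumptions rather than the setup behind the 493-episode result:

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
# Three layers, as quoted above (hidden sizes are assumptions).
q_net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, eps = 0.99, 0.1

for episode in range(500):
    obs, _ = env.reset()
    done = False
    while not done:
        state = torch.as_tensor(obs, dtype=torch.float32)
        # Epsilon-greedy action selection.
        if torch.rand(()) < eps:
            action = env.action_space.sample()
        else:
            action = int(q_net(state).argmax())
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # One-step TD target; zero future value at terminal states.
        with torch.no_grad():
            next_q = 0.0 if terminated else q_net(
                torch.as_tensor(obs, dtype=torch.float32)).max()
        target = reward + gamma * next_q
        loss = (q_net(state)[action] - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
```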
Because of this huge variation in the location of the information, choosing the right kernel size for the convolution operation becomes tough. A larger kernel is preferred for information that is distributed more globally, and a smaller kernel is preferred for information that is distributed more locally.
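A quick way to see the trade-off is to run several same-padded kernel sizes over the same feature map and compare what each output pixel can see; the channel counts here are arbitrary assumptions:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # a dummy feature map (assumed sizes)

# Same-padded convolutions with different kernel sizes: per output pixel,
# the 5x5 kernel sees a wider (more global) context, while the 1x1 kernel
# sees only a single location (purely local).
conv1 = nn.Conv2d(64, 32, kernel_size=1)
conv3 = nn.Conv2d(64, 32, kernel_size=3, padding=1)
conv5 = nn.Conv2d(64, 32, kernel_size=5, padding=2)

# One way around committing to a single kernel size is to run several in
# parallel and concatenate the results along the channel dimension.
out = torch.cat([conv1(x), conv3(x), conv5(x)], dim=1)
print(out.shape)  # torch.Size([1, 96, 32, 32])
```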