Fantastic Embedding

Qiang Chen
Machine Learning and Math
5 min read · Jan 5, 2019
Embedding

Foreword

Recently, I have often seen articles about BERT, the language model from Google AI Language, and its results are impressive. I studied BERT at my company's Hackathon event to see whether it could be used in a recommendation system. You may have some doubts: this is a language model, so how can it be applied to a recommendation system? This article does not aim to discuss topics such as the evolution of BERT in detail; if you are interested, an article by Zhang Junlin¹ explains BERT very well. Instead, this article summarizes my recent thinking about language models and recommendation systems, and looks at fantastic Embedding from a higher-level perspective.

Fantastic Embedding

Fantastic Embedding can be used to solve problems in various fields.

Item Search: Airbnb uses Embedding in search ranking². The method was first published at KDD under the title Real-time Personalization using Embeddings for Search Ranking at Airbnb. You can also refer to an article on zhihu.com³.

Natural language processing: We all know that machine learning cannot be applied directly to human language; text first has to be converted into vectors. One of the most fundamental problems in natural language processing is how to represent a sentence as a vector. The most classic approach is the Bag of Words model. Later, Google introduced the Word2Vec model, which maps words with similar semantics to nearby positions in the vector space; for example, the word vectors of Girl and Lady end up relatively close to each other. In 2018, Google proposed BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. BERT can take into account that the same word has different meanings in different contexts. Using this language model, performance improves on multiple language tasks such as question answering and text sentiment analysis. A more detailed history of language models can be found in Zhang Junlin's article¹.
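
To make the contrast concrete, here is a minimal sketch with a made-up corpus (my own toy example, not taken from the cited papers), comparing a Bag of Words vector built with scikit-learn against Word2Vec embeddings trained with gensim (version 4 or later assumed):

```python
# A toy contrast between the two representations discussed above.
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

corpus = ["the girl reads a book",
          "the lady reads a novel",
          "a dog chases a ball"]

# Bag of Words: one count per vocabulary word, no notion of word similarity
bow = CountVectorizer().fit_transform(corpus)
print(bow.toarray())

# Word2Vec: dense vectors; on a real corpus, related words such as
# "girl" and "lady" tend to end up close to each other in the vector space
w2v = Word2Vec([s.split() for s in corpus],
               vector_size=16, window=3, min_count=1, epochs=200, seed=1)
print(w2v.wv.similarity("girl", "lady"))
```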

Recommendation system: Matrix factorization can learn separate Embeddings for items and users; a classic reference is Matrix Factorization Techniques for Recommender Systems⁴. This family of models rose to prominence through the Netflix Prize competition launched in 2006 and became very popular. When Word2Vec became popular, some people applied its ideas to recommendation systems, and Item2Vec⁵ appeared. Word2Vec is, to some extent, equivalent in nature to matrix factorization⁶ ⁷.
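
For readers who want to see the mechanics, below is a minimal matrix-factorization sketch in plain numpy with made-up ratings. It is a simplified stochastic-gradient variant, not the exact algorithm from the cited paper; the learning rate, regularization, and dimensions are arbitrary choices.

```python
# Learn d-dimensional user embeddings P and item embeddings Q so that
# the dot product P[u] . Q[i] approximates the observed rating r_ui.
import numpy as np

ratings = [  # (user, item, rating) triples; toy data
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
    (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0),
]
n_users, n_items, d = 3, 3, 4
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, d))   # user embeddings
Q = 0.1 * rng.standard_normal((n_items, d))   # item embeddings

lr, reg = 0.05, 0.02
for _ in range(200):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]                   # prediction error
        P[u] += lr * (err * Q[i] - reg * P[u])  # SGD step with L2 regularization
        Q[i] += lr * (err * P[u] - reg * Q[i])

print(P[0] @ Q[0])  # approaches the observed rating 5.0
```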

More examples

Facebook proposed StarSpace: Embed All The Things!⁸, meaning that an Embedding can be generated for almost anything, and open-sourced StarSpace⁹. Anything that can form a network structure can have an Embedding representation. Chih-Ming Chen and others maintain a list of related articles¹⁰, which is very valuable.

Formalization

Embedding takes different forms in industrial implementations of different problems, but they can all be described with the same mathematical notation.

As in a previous post, consider the basic classification problem¹¹ in machine learning with data (X, Y), where X is the feature matrix and Y the labels: X ∈ ℝ^(N×S), meaning there are N data points, each with S features, and Y ∈ ℝ^(N×|γ|), where γ is the set of labels. Suppose there is additional information connecting the N data points: they are related to each other and form a network, which can be represented by a graph G = (V, E), where V represents the N data points, |V| = N, and E ⊆ V × V. With traditional machine learning, you can learn a mapping from X to the labels in γ. Because the network G carries such important extra information, you can use G to learn an Embedding of the N data points, that is, their representation in the network: Xₑ ∈ ℝ^(N×d), where each Embedding is a d-dimensional vector. We hope to use this d-dimensional vector to improve the performance of the final classification task¹².
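
As an illustration of this setup, the sketch below uses entirely synthetic data and a deliberately cheap graph embedding (a truncated SVD of the adjacency matrix, standing in for proper network-embedding methods) to show how Xₑ learned from G can be concatenated with X to help a classifier. All names and numbers here are my own choices.

```python
# N data points with random features X and labels y, plus a graph G whose
# edges mostly connect same-label points (the "extra" information).
import numpy as np
import networkx as nx
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

N, S, d = 60, 5, 8
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=N)            # labels
X = rng.standard_normal((N, S))           # N x S feature matrix (pure noise here)

G = nx.Graph()
G.add_nodes_from(range(N))
for i in range(N):
    for j in range(i + 1, N):
        if rng.random() < (0.15 if y[i] == y[j] else 0.01):
            G.add_edge(i, j)

A = nx.to_numpy_array(G, nodelist=list(range(N)))                     # adjacency matrix
X_e = TruncatedSVD(n_components=d, random_state=0).fit_transform(A)   # N x d embeddings

X_aug = np.hstack([X, X_e])               # features plus graph embedding
clf = LogisticRegression(max_iter=1000)
print("X only: ", clf.fit(X, y).score(X, y))        # training accuracy, illustrative only
print("X + X_e:", clf.fit(X_aug, y).score(X_aug, y))
```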

The most typical example is a social network. We can improve machine learning tasks on social networks by learning a vector representation of each node in the network. Below is an example from DeepWalk¹². In the figure below, each node represents a user, and an edge indicates that two users are friends. The number on a node can be used as the user's ID. The colors are manually assigned and have no special meaning; nodes with the same color should end up close to each other in the embedding space. By learning from this network, a vector representation of each node is obtained.

As shown in the figure above, each node's position in the network is represented by a two-dimensional Embedding. We hope that nodes that are close to each other in the network are also close to each other in the Embedding space.
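
Below is a minimal DeepWalk-style sketch, not the authors' implementation: uniform random walks over the karate club graph built into networkx are treated as sentences and fed to gensim's Word2Vec, so each node gets a two-dimensional vector as in the figure. Walk length, walk count, and window size are arbitrary small values.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()
random.seed(0)

def random_walk(G, start, length=10):
    """One uniform random walk, returned as string tokens for Word2Vec."""
    walk = [start]
    while len(walk) < length:
        walk.append(random.choice(list(G.neighbors(walk[-1]))))
    return [str(n) for n in walk]

# 20 walks per node play the role of "sentences"
walks = [random_walk(G, node) for node in G.nodes() for _ in range(20)]
model = Word2Vec(walks, vector_size=2, window=4, min_count=1, sg=1, epochs=20, seed=1)

print(model.wv["0"])                        # two-dimensional embedding of node 0
print(model.wv.most_similar("0", topn=3))   # nodes with a similar position in the graph
```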

Following this idea, many approaches can be explained in a unified way.

Item Search: Using the network formed by items and users through click relationships, or a co-occurrence network of items that appear together in a user's historical access list, you can extract item Embeddings, provide better signals for the item search problem, and improve the click-through rate of search results.
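
As a rough illustration, not Airbnb's actual pipeline: one simple way to exploit such co-occurrence is to treat each user's click session as a "sentence" of item IDs and train Word2Vec on the sessions, so that items clicked in similar contexts receive similar Embeddings. The session data below is made up.

```python
from gensim.models import Word2Vec

# Each inner list is one user's click session, in order
sessions = [
    ["item_1", "item_2", "item_3"],
    ["item_2", "item_3", "item_4"],
    ["item_1", "item_3", "item_4"],
    ["item_7", "item_8", "item_9"],
]

model = Word2Vec(sessions, vector_size=16, window=5, min_count=1, sg=1, epochs=100, seed=1)
print(model.wv.most_similar("item_2", topn=2))   # items that co-occur with item_2
```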

Natural language processing: Word2Vec uses a co-occurrence network: if two words appear near each other in a sentence, they are connected in the network. By extracting word Embeddings, you get richer word features and improve the performance of various natural language processing tasks, whereas the Bag of Words model only uses the words themselves as features. As for the latest BERT model, to some extent BERT considers a directed network between words, compared with Word2Vec's undirected network. Although BERT does not have an obvious relationship with network-embedding theory, I feel that thinking in this direction may lead to some results.
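
As an illustration of the contextual part, the sketch below assumes the Hugging Face transformers library and the pretrained bert-base-uncased checkpoint, neither of which is part of this article: the same word receives different vectors in different sentences, unlike a static Word2Vec embedding.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Contextual vector of `word` inside `sentence` (last hidden layer)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = word_vector("he sat on the bank of the river", "bank")
v2 = word_vector("she deposited money at the bank", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))   # below 1: context changes the vector
```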

Recommendation system: The problem in recommendation systems is similar to the item search problem. You can also use the network of relationships between items, or between items and users, to generate Embedding representations and use them to improve recommendation performance. Embeddings¹³ can even be generated for metadata such as categories and tags; the authors also open-sourced their code¹⁴.
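
Here is a minimal sketch close to LightFM's own quickstart, simplified and with hyperparameters of my own choosing: user and item Embeddings are learned from MovieLens interactions, and genre metadata is passed through the item_features argument, which is what enables the cold-start behaviour described in the cited paper.

```python
from lightfm import LightFM
from lightfm.datasets import fetch_movielens
from lightfm.evaluation import precision_at_k

# MovieLens 100k with genre metadata as item features (downloads on first run)
data = fetch_movielens(min_rating=4.0, genre_features=True)

model = LightFM(no_components=32, loss="warp")
model.fit(data["train"], item_features=data["item_features"], epochs=10)

print(precision_at_k(model, data["test"],
                     item_features=data["item_features"], k=5).mean())
print(model.item_embeddings.shape)   # one embedding row per item feature
```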

Conclusion

Embedding is a typical means of using unsupervised information to improve the performance of supervised learning. Embedding not only works in the field of natural language processing but also shines in the field of recommendation systems.

References

  1. Zhang Junlin: From Word Embedding to the BERT Model: A History of Pre-training Techniques in Natural Language Processing
  2. Airbnb Engineering Team: Embedding Techniques in Airbnb Listing Ranking
  3. Wu Haibo: A Different Reading of the 2018 KDD Best Paper: Embeddings at Airbnb
  4. Matrix Factorization Techniques for Recommender Systems
  5. Item2Vec: Neural Item Embedding for Collaborative Filtering
  6. Neural Word Embedding as Implicit Matrix Factorization
  7. Wu Haibo: A Question Worth Discussing: The Relationship Between word2vec and SVD/LSA
  8. StarSpace: Embed All The Things!
  9. facebookresearch/StarSpace
  10. chihming/awesome-network-embedding
  11. Classification, Sigmoid function
  12. DeepWalk: Online Learning of Social Representations
  13. Metadata Embeddings for User and Item Cold-start Recommendations
  14. LightFM
