Vector in Machine Learning

Jun Xie
Aug 11, 2022

A vector is a list of numbers. Vectors are widely used in machine learning because they are the foundation for many ML methods, in both supervised and unsupervised learning. In recent years, ML has improved significantly thanks to embeddings, which deep learning algorithms generate to capture a semantic representation of the input. In a nutshell, embeddings are just n-dimensional vectors, so the similarity between two inputs can be calculated mathematically from their vectors.
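To make "calculate the similarity mathematically" concrete, here is a minimal sketch using cosine similarity, one common choice among several (dot product and Euclidean distance are others). The three embeddings and their names are made up for illustration and do not come from any real model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings, invented for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten))  # close to 1: similar direction
print(cosine_similarity(cat, car))     # close to 0: nearly orthogonal
```

A score near 1 means the two vectors point in almost the same direction, which is what we expect from embeddings of semantically similar inputs.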

Before deep learning, machine learning applications defined a list of features and filled a vector with the corresponding feature values, one position per feature. Take house price prediction as an example. Before we apply linear regression to train a prediction model, we define a list of features, such as the number of bedrooms, the lot size, and the number of bathrooms. Each sold house then has a vector representation like [3, 5100, 2.5], and that vector together with the sold price forms one data point in the training data.
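The sketch below shows how a house record could be turned into such a feature vector, and how a linear model would score it. The feature names, sold prices, weights, and bias are all hypothetical. In practice the weights would be learned by linear regression from many sold houses rather than written by hand.

```python
# Hypothetical feature order; every vector uses the same order.
FEATURES = ["bedrooms", "lot_size", "bathrooms"]

def to_vector(house):
    """Fill a vector with the house's feature values in FEATURES order."""
    return [float(house[name]) for name in FEATURES]

def predict(weights, bias, x):
    """Linear model: bias plus the weighted sum of the feature values."""
    return bias + sum(w * v for w, v in zip(weights, x))

# Made-up sold houses: (features, sold price).
sold_houses = [
    ({"bedrooms": 3, "lot_size": 5100, "bathrooms": 2.5}, 650_000),
    ({"bedrooms": 4, "lot_size": 7200, "bathrooms": 3}, 820_000),
]

# Each training data point pairs a feature vector with its sold price.
training_data = [(to_vector(house), price) for house, price in sold_houses]
print(training_data[0])

# Made-up parameters standing in for what linear regression would learn.
weights = [50_000.0, 60.0, 40_000.0]
bias = 100_000.0
print(predict(weights, bias, training_data[0][0]))
```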

Today, machine learning applications can use deep learning to learn a semantic representation of the input as a vector, and then use a nearest-neighbor search (KNN) algorithm to quickly find relevant items for a specific user. Take video recommendation at YouTube as an example, shown in Figure 1. We first use deep learning algorithms to transform each video into an embedding, so that each video is represented as a vector. Those vectors are stored and indexed in a serving platform. When a user visits YouTube, we can also represent the user as an embedding based on the user's browsing history. Say the user has watched video 1 and video 2 before. We can take the embeddings of video 1 and video 2 and average them to form the query embedding. This query embedding is then matched against all the video vectors, and the most relevant videos are returned. Those videos can go through a ranking phase that returns the final best recommendations to the user.
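The retrieval step described above can be sketched in a few lines. This is a brute-force illustration only: the video ids and 3-dimensional embeddings are made up, cosine similarity is an assumed choice of relevance score, and a real serving platform would typically use an approximate nearest-neighbor index instead of scanning every vector.

```python
import math

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k, exclude=()):
    """Return the ids of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query, vector), video_id)
        for video_id, vector in index.items()
        if video_id not in exclude
    ]
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:k]]

# Hypothetical index: video id -> embedding (invented values).
video_index = {
    "video_1": [0.9, 0.1, 0.0],
    "video_2": [0.7, 0.3, 0.0],
    "video_3": [0.8, 0.2, 0.1],
    "video_4": [0.0, 0.2, 0.9],
    "video_5": [0.1, 0.9, 0.1],
}

# The user watched video 1 and video 2; average them into a query embedding.
watched = ["video_1", "video_2"]
query = average([video_index[v] for v in watched])

# Retrieve candidates, skipping videos the user has already seen.
candidates = top_k(query, video_index, k=2, exclude=watched)
print(candidates)
```

The candidates returned here would then be passed to the ranking phase, which is outside the scope of this sketch.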

Figure 1. Pipeline to generate vector

As for the deep learning algorithms that generate embeddings, there are many pre-trained models on HuggingFace and TensorFlow Hub for different use cases. You can write a short program that loads a model, feeds your data into it, and gets back the embedding vector.

Jun Xie

Founder and ex-Snap software engineer. I am interested in Machine Learning and Databases. Feel free to drop me an email: xiejuncs@gmail.com