Vector Database for AI

Building Personalized Recommender Systems with Milvus and PaddlePaddle

How to build a deep learning powered recommender system

Milvus
Vector Database for AI

--

Background

With the continuous development of network technology and the ever-expanding scale of e-commerce, the number and variety of goods grow rapidly and users need to spend a lot of time to find the goods they want to buy. This is information overload. In order to solve this problem, recommender system came into being.

The recommender system is a subset of the information filtering system, which can be used in a range of areas such as movies, music, e-commerce, and Feed stream recommendations. The recommender system discovers the user’s personalized needs and interests by analyzing and mining user behaviors and recommends information or products that may be of interest to users. Unlike search engines, recommender system do not require users to accurately describe their needs, but model their historical behavior to proactively provide information that meets users’ interests and needs.

In this article we use PaddlePaddle, a deep learning platform from Baidu, to build a model, and combine Milvus, a vector database for AI, to build a personalized recommender system that can quickly and accurately provide users with interesting information.

Data Preparation

We take MovieLens Million Dataset (ml-1m) [1] as an example. The ml-1m dataset contains 1,000,000 reviews of 4,000 movies by 6,000 users, collected by the GroupLens Research lab. The original data includes feature data of the movie, user feature, and user rating of the movie, you can refer to ml-1m-README [2] .

ml-1m dataset includes 3 .dat articles: movies.dat、users.dat and ratings.dat.movies.dat includes movie’s features, see example below:

MovieID::Title::Genres
1::ToyStory(1995)::Animation|Children's|Comedy

This means that the movie ID is 1, and the title is Toy Story, which is included in three categories. These three categories are animation, children, and comedy.

users.dat includes user’s features, see the example below:

UserID::Gender::Age::Occupation::Zip-code
1::F::1::10::48067

This means that the user ID is 1, the user is female and is younger than 18 years old. The occupation ID is 10.

ratings.dat includes the feature of movie rating, see the example below:

UserID::MovieID::Rating::Timestamp
1::1193::5::978300760

It means that the user with ID 1 evaluates the movie 1193 as 5 points.

Fusion Recommendation Model

In the film personalized recommender system we used the Fusion Recommendation Model [3] which PaddlePaddle has implemented. This model is created from its industrial practice.

  • First, take user features and movie features as input to the neural network, where:

a. The user features incorporate four attribute information: user ID, gender, occupation, and age.

b. The movie feature incorporates three attribute information: movie ID, movie type ID, and movie name.

  • For the user feature, map the user ID to a vector representation with a dimension size of 256, enter the fully connected layer, and do similar processing for the other three attributes. Then the feature representations of the four attributes are fully connected and added separately.
  • For movie features, the movie ID is processed in a manner similar to the user ID. The movie type ID is directly input into the fully connected layer in the form of a vector, and the movie name is represented by a fixed-length vector using a text convolutional neural network. The feature representations of the three attributes are then fully connected and added separately.
  • After obtaining the vector representation of the user and the movie, calculate the cosine similarity of them as the score of the personalized recommendation system. Finally, the square of the difference between the similarity score and the user’s true score is used as the loss function of the regression model.

System Overview

Combined with PaddlePaddle’s fusion recommendation model, the movie feature vector generated by the model is stored in Milvus, the vector database, and the user feature is used as the target vector to be searched. A similarity search is performed in Milvus to obtain the query result as the recommended movies for the user.

The inner product (IP) method is provided in Milvus to calculate the vector distance. After normalizing the data, the inner product similarity is consistent with the cosine similarity result in the fusion recommendation model.

Application of Personal Recommender System

There are three steps in building a personalized recommender system with Milvus, details on how to operate please refer to Mivus Bootcamp [4].

Step 1:Model Training

# run train.py
$ python train.py

Running this command will generate a model recommender_system.inference.model in the directory, which can convert movie data and user data into feature vectors, and generate application data for Milvus to store and retrieve.

Step 2:Data Preprocessing

# Data preprocessing, -f followed by the parameter raw movie data file name
$ python get_movies_data.py -f movies_origin.txt

Running this command will generate test data movies_data.txt in the directory to achieve pre-processing of movie data.

Step 3:Implementing Personal Recommender System with Milvus

# Implementing personal recommender system based on user conditions
$ python infer_milvus.py -a <age>-g <gender>-j <job>[-i]

Running this command will implement personalized recommendations for specified users.

The main process is:

  • Through the load_inference_model, the movie data is processed by the model to generate a movie feature vector.
  • Load the movie feature vector into Milvus via milvus.insert.
  • According to the user’s age/gender /occupation specified by the parameters, it is converted into a user feature vector, milvus.search_vectors is used for similarity retrieval, and the result with the highest similarity between the user and the movie is returned.

Prediction of the top five movies that the user is interested in:

TopIdsTitleScore
03030Yojimbo2.9444923996925354
13871Shane2.8583481907844543
23467Hud2.849525213241577
31809Hana-bi2.826111316680908
43184Montana2.8119677305221558

Summary

By inputting user information and movie information into the fusion recommendation model we can get matching scores, and then sort the scores of all movies based on the user to recommend movies that may be of interest to the user. This article combines Milvus and PaddlePaddle to build a personalized recommender system. Milvus, a vector database, is used to store all movie feature data, and perform similarity search with user features. The search result is the movie ranking recommended by the system to the user.

Milvus’ [5] vector similarity search is compatible with various deep learning platforms, as it can search billions of vectors in milliseconds. You can explore more possibilities of AI applications with Milvus with ease!

Reference

[1] MovieLens Million Dataset (ml-1m): http://files.grouplens.org/datasets/movielens/ml-1m.zip

[2] ml-1m-README: http://files.grouplens.org/datasets/movielens/ml-1m-README.txt

[3] Fusion Recommendation Model by PaddlePaddle: https://www.paddlepaddle.org.cn/documentation/docs/zh/beginners_guide/basics/recommender_system/index.html#id7

[4] Bootcamp: https://github.com/milvus-io/bootcamp/blob/master/demo/recommender_system

[5] Milvus: https://milvus.io/en/

--

--

Milvus
Vector Database for AI

Open-source Vector Database Powering AI Applications. #SimilaritySearch #Embeddings #MachineLearning