Tag-based real-time short video recommendation system | depth

Dũng Cảm Lắng
33 min read · Jul 1, 2020

Author | gongyouliu

Reprinted from big data and artificial intelligence (ID: ai-big-data)

Introduction: In the article “Content-based Recommendation Algorithm”, the author gave a detailed explanation of content-based recommendation. One of the most important content-based algorithms is the tag-based inverted index algorithm, which is widely used in industry, especially in news and short video products. In this article, the author draws on TV Cat business scenarios and engineering practice to explain in detail the principle of the tag-based inverted index algorithm and how it is put into production.

This article introduces the tag-based real-time video recommendation system from six aspects: the application scenarios of the tag-based recommendation algorithm, the principle of the algorithm, the overall architecture and engineering implementation, the recall and ranking strategies, the cold start strategy, and directions for future optimization.

It is hoped that after reading this article, readers can fully understand the product form, algorithm principle, and engineering implementation scheme of the label-based inverted index algorithm, and be able to build a label-based algorithm system from scratch based on the ideas of this article.

As mentioned in the previous article, “Erlang-based similar video recommendation system”, TV Cat has six categories of video, covering both long and short video. Long video has relatively low real-time requirements, so this article uses real-time personalized recommendation of short videos as its running example.

1. Application scenarios of label-based recommendation algorithm

Before discussing the algorithm's principle and engineering practice, we briefly introduce the product forms that a tag-based recommendation algorithm can support, so that readers know which business scenarios this type of algorithm suits and have an intuitive picture that makes the later explanations easier to follow. TV Cat has shipped all of these product forms in real business scenarios, and the illustrations below use TV Cat's products as examples.

In the third section of “Content-based Recommendation Algorithm”, we briefly described the application scenarios of content-based recommendation algorithms. Tag-based recommendation is a type of content-based recommendation, so its application scenarios are similar: fully personalized recommendation, related-item recommendation (similar video recommendation), and theme recommendation are all feasible. We briefly explain these three business scenarios below.

1.1 Fully personalized recommendation

Fully personalized recommendation generates different recommendation results for each user. The following figure shows TV Cat's real-time personalized recommendation for short videos. Based on the user's tag-based interest portrait, the system recommends videos that match the user's interest preferences. The user can scroll right indefinitely to get more recommendations of interest. (TV Cat is a living-room video application operated with a remote control, so its interaction differs from the pull-down interaction of mobile apps such as Toutiao.) The algorithm updates the user's recommendation results in real time as the user's interests change.

Figure 1: Real-time personalized recommendation for TV cat videos

1.2 Related-item recommendation (similar video recommendation)

Short video similarity recommendation builds similarity between videos based on video tags, and recommends similar videos for each video.

The following picture shows similar recommendation for TV Cat short videos. The product form is continuous-play recommendation: after the user finishes the main video, related videos are played one after another in the order of the similarity list to maximize the user experience.

Figure 2: TV Cat short video continuous-play recommendation

1.3 Theme recommendation

Theme recommendation builds a user interest portrait from the user's playback history. Here the portrait is built from the tags of the programs the user watched, and for each of the user's strongest interest tags, the programs associated with that tag are recommended.

The following picture shows theme recommendation on the TV Cat music channel. Based on the music videos the author watched recently, the system recommends short music videos on two themes, “Mandarin” and “Instrument Teaching”.

Figure 3: TV Cat music channel theme recommendation

Having seen these tag-based product forms, readers should have a more intuitive understanding of tag-based recommendation. So how do we implement these product forms in a real business, and how do we construct a suitable tag-based recommendation algorithm? The next section explains the basic principles of the algorithm in detail.

2. Introduction to the principle of label-based recommendation algorithm

We briefly introduced the principle of the tag-based personalized recommendation algorithm in “Content-based Recommendation Algorithm”, so readers of that article should already have an impression of it. Readers who are unfamiliar with it need not worry: this section explains in detail the principles behind the three product forms from the previous section (personalized recommendation, similar video recommendation, and theme recommendation).

2.1 Personalized recommendation (completely personalized paradigm)

The recommendation process of the tag-based personalized recommendation algorithm is shown in Figure 4 below. We take the user's interest tags from the user portrait and, for each tag, look up the associated programs in the tag->program inverted index table. Tags thus link the user to programs. Both the user's interest tags and the programs associated with each tag carry weights.

Figure 4: Video recommendation based on inverted index

Assume that the user's interest tags and the corresponding weights are

$$\{(T_1, S_1), (T_2, S_2), \dots, (T_k, S_k)\}$$

where $T_i$ is a tag and $S_i$ is the user's preference weight for that tag.

Assume that the videos associated with tag $T_i$ are

$$T_i \leftrightarrow \{(O_{i1}, w_{i1}), (O_{i2}, w_{i2}), \dots, (O_{in_i}, w_{in_i})\}$$

where $O_{ij}$ and $w_{ij}$ are, respectively, a video and its corresponding weight. Then

$$U = \sum_{i=1}^{k} S_i\, T_i = \sum_{i=1}^{k} S_i \sum_{j=1}^{n_i} w_{ij}\, O_{ij}$$

In the above formula, $U$ is the user's preference over videos. Here we regard the videos $O_{ij}$ as a basis of a vector space, which is why the formula can be written this way. Different tags can be associated with the same video (because a single video can carry several tags), so like terms to the right of the last equals sign must be merged by adding the coefficients in front of the same basis element. After merging, the coefficient in front of each video (basis element) is the user's preference for that video. Sorting these preferences in descending order gives topN recommendations for the user.
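The weighted merge described above can be sketched in a few lines of Python. This is only an illustration under assumed data: the dictionaries `user_tags` and `tag_to_videos`, the function name `recommend_by_tags`, and all tag names, video ids, and weights are hypothetical and do not come from the TV Cat system.

```python
from collections import defaultdict

def recommend_by_tags(user_tags, tag_to_videos, top_n=3):
    """Score each video as the sum over tags of (user tag weight * video tag weight)."""
    scores = defaultdict(float)
    for tag, tag_weight in user_tags.items():
        for video, video_weight in tag_to_videos.get(tag, {}).items():
            # The same video reached through several tags accumulates its terms.
            scores[video] += tag_weight * video_weight
    # Descending by score; ties are broken by video id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical user portrait and tag->video inverted index.
user_tags = {"comedy": 0.5, "music": 0.25}
tag_to_videos = {
    "comedy": {"v1": 0.5, "v2": 1.0},
    "music": {"v2": 1.0, "v3": 0.5},
}
recs = recommend_by_tags(user_tags, tag_to_videos)
```

Here `v2` is reachable through both tags, so its two terms are added before sorting, which is the "merge like terms" step of the formula.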

The above is only the principle for generating recommendations from the user's interest portrait. In a real business, users have both long-term and short-term interests, and we must also provide diverse recommendations and adjust the results based on real-time feedback during playback, so the actual project is considerably more complicated. We explain this in detail in Section 3 (architecture and engineering implementation) and Section 4 (recall and ranking).

2.2 Similar video recommendation (related-item paradigm)

In this section, we first explain how to use the label of the video to calculate the similarity between the two videos. With the similarity between the videos, it is easy to make similar recommendations for the video.

Suppose that the video collection is

$$\{O_1, O_2, \dots, O_n\}$$

where $O_i$ is a video. Assume that the set of all video tags is $T = \{T_1, T_2, \dots, T_m\}$, where $T_j$ is a tag. Generally, $n$ and $m$ are very large numbers, ranging from hundreds of thousands to millions or more. Each video has only a few tags, so if a video is expressed as a vector over tags, it is a sparse vector. We can use the cosine similarity of the tag vectors to calculate the similarity between two videos. The calculation is as follows.

Assume that the vectors of two videos $P$ and $Q$ are expressed as follows (we encode the vectors in the order of the tags in $T$):

$$P = (p_1, p_2, \dots, p_m), \qquad Q = (q_1, q_2, \dots, q_m)$$

where $p_j$ and $q_j$ are the corresponding weights. If one-hot encoding is used, $p_j = 0$ or $p_j = 1$. If the tags are weighted, $p_j$ is the weight of the corresponding tag.

We can use the following cosine similarity formula to calculate the similarity between $P$ and $Q$:

$$\mathrm{sim}(P, Q) = \frac{P \cdot Q}{\|P\|\,\|Q\|} = \frac{\sum_{j=1}^{m} p_j q_j}{\sqrt{\sum_{j=1}^{m} p_j^2}\,\sqrt{\sum_{j=1}^{m} q_j^2}}$$

We can calculate the similarity of $O_i$ with all other videos (excluding $O_i$ itself):

$$\{\mathrm{sim}(O_i, O_j) \mid 1 \le j \le n,\ j \ne i\}$$

Then the similar recommendations for $O_i$ are obtained by sorting the above list in descending order and taking the topN as the final recommendation list.
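As a rough illustration of the calculation above, the sketch below stores each video's sparse tag vector as a dictionary, so only tags that are actually present take part in the dot product. The function names `cosine` and `top_n_similar` and the sample videos are assumptions made for this example.

```python
import math

def cosine(p, q):
    """Cosine similarity of two sparse tag vectors stored as {tag: weight}."""
    dot = sum(weight * q[tag] for tag, weight in p.items() if tag in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def top_n_similar(target, videos, n=2):
    """Rank every other video by similarity to `target` and keep the top n."""
    sims = [(other, cosine(videos[target], vec))
            for other, vec in videos.items() if other != target]
    return sorted(sims, key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical videos with one-hot tag weights.
videos = {
    "A": {"nba": 1.0, "sports": 1.0},
    "B": {"nba": 1.0, "sports": 1.0},
    "C": {"sports": 1.0},
    "D": {"cooking": 1.0},
}
similar_to_a = top_n_similar("A", videos)
```

Video `D` shares no tag with `A`, so its similarity is exactly 0 and it never enters the list.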

2.3 Theme recommendation

With the principle of personalized recommendation from Section 2.1 in hand, theme recommendation is easy to explain.

First, we obtain the user's strongest interest tags from the user portrait. Each interest tag is a theme, and it suffices to recommend the programs associated with each interest tag to the user. A brief explanation follows.

Assume that the user's interest tags and the corresponding weights are

$$\{(T_1, S_1), (T_2, S_2), \dots, (T_k, S_k)\}$$

where $T_i$ is a tag and $S_i$ is the user's preference weight for that tag.

We sort this set by weight in descending order and select the few highest-weight tags (the user's favorites) as the themes to recommend. For each selected tag, we then choose programs from those associated with it and recommend them to the user. (In the actual implementation, we construct the tag->program inverted index table in advance so that programs can be looked up from a tag.)
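A minimal sketch of this selection, assuming the portrait and the inverted index are plain dictionaries. The function name `theme_recommendations`, its parameters `num_themes` and `per_theme`, and the sample data are hypothetical.

```python
def theme_recommendations(user_tags, tag_to_videos, num_themes=2, per_theme=2):
    """Return {tag: [videos]} for the user's highest-weight tags."""
    top_tags = sorted(user_tags, key=lambda t: (-user_tags[t], t))[:num_themes]
    themes = {}
    for tag in top_tags:
        ranked = sorted(tag_to_videos.get(tag, {}).items(),
                        key=lambda kv: (-kv[1], kv[0]))
        themes[tag] = [video for video, _ in ranked[:per_theme]]
    return themes

# Hypothetical portrait and inverted index.
user_tags = {"Mandarin": 0.9, "Instrument Teaching": 0.7, "News": 0.1}
tag_to_videos = {
    "Mandarin": {"m1": 0.8, "m2": 0.6, "m3": 0.2},
    "Instrument Teaching": {"i1": 0.9},
    "News": {"n1": 1.0},
}
themes = theme_recommendations(user_tags, tag_to_videos)
```

Unlike personalized recommendation, the per-tag lists are kept separate rather than merged into one score per video.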

Above, we briefly explained the algorithm principles of the three types of label-based recommendation algorithms. Below, we will combine the practical experience of TV cats to explain how these three types of recommended products are implemented in engineering.

3. The overall structure and engineering realization

In this section, we will explain in detail the overall architecture, core functional modules and engineering implementation of the above three types of algorithms.

Here we focus on the architecture and implementation of two recommendation products: personalized recommendation and similar video recommendation. Theme recommendation is very similar to personalized recommendation, so we explain it only briefly.

TV Cat's tag-based personalized short video recommendation is implemented on the Spark platform. Stream processing uses Spark Streaming, offline processing uses Spark, and the whole code base is integrated into the Doraemon framework (see the “Engineering Implementation” article). Each piece of processing logic in the architecture diagrams below is abstracted as an operator and encapsulated in the Doraemon framework, which makes business reuse, extension, and engineering maintenance easier.

To decouple the modules, we make heavy use of message queues (RabbitMQ and Kafka) to pass messages (data), which makes the whole recommendation system more modular. As long as the data contract between two modules (operators) is defined, each submodule can be optimized and upgraded independently without affecting the others.

The program inverted index and user portraits are stored in an HBase cluster so that the algorithms can read them in a distributed manner. The HBase data structure is shown in the following figure; readers unfamiliar with it can look it up online. The final recommendation results are stored in CouchBase and Redis. Product forms that generate a recommendation result for every user, such as personalized recommendation and theme recommendation, produce a large amount of data, so their results are stored in CouchBase (a distributed document database that scales horizontally with ease). Similar video data is relatively small and is stored in Redis, an in-memory key-value database.

Figure 5: HBase data structure

With the above background knowledge, now we officially introduce the engineering implementation of various algorithms.

3.1 Personalized recommendation

Personalized recommendation is divided into two parts: an offline module and a real-time module. The offline part runs once a day and generates recommendation results for all users, while the real-time part updates the recommendation list in real time based on the user's latest behavior. Offline and real-time recommendation cooperate, in effect taking turns, to provide users with a 24/7 recommendation service (see Figure 6 below). Strictly speaking they do not alternate: while the offline task is running, real-time recommendation also runs as long as someone is using the product. However, the offline task generally runs in the early morning, does not take long, and coincides with a period of few users, while real-time recommendation does the work the rest of the time, so we describe the two as alternating for brevity.

Figure 6: Offline and real-time recommendation “alternate”: the offline job updates once a day, and real-time recommendation is used between two offline runs

The following figure shows the overall architecture of tag-based personalized recommendation. It has two pipelines: one generates the inverted index of program tags from the media asset system, and the other generates tag-based user interest portraits from the user behavior logs. Finally, the recommendation program (operator 5) uses the inverted index and the user portraits to generate recommendations for the user. For simplicity, we consider only recommendations based on user portraits here and leave other recall strategies to Section 4.

Figure 7: Overall architecture of personalized recommendations based on tags

The algorithm implementation consists of five core modules (corresponding to the five operators marked 1, 2, 3, 4, and 5 in the figure above). Each operator runs as an independent program and does not affect the others. Operator 5 is the core recommendation module. We describe the core function and engineering implementation of each module in turn.

(1) New program and label injection

The media asset system is a content management system for the video industry, responsible for the management, operation, and output of all content. The recommendation system relies on it as its content source. This module publishes the information that recommendation depends on, namely new or modified programs and their tags, as messages to a fixed topic in the message queue. The tag-based video recommendation system listens to that topic and uses the messages to build the tag <-> program inverted index tables.

Figure 8 below shows a simplified version of the message. Messages are organized as JSON and include type (whether this is a newly added program or an update to an existing program's tags), sid (the program's unique identifier), title (the program title), and tags (the tags).

Figure 8: Structure of messages in the information queue

Each tag also has a unique identifier, the tid in the figure above, analogous to the sid of a video. Using tids when building the inverted index and the user portrait simplifies comparison and processing logic and reduces storage.

Tags can also be weighted or hierarchical. TV Cat's tags form a three-level hierarchy, from coarse to fine: category tags -> column tags -> content tags. Such hierarchies depend heavily on the industry, and different industries use different hierarchy strategies and methods. A tag's weight measures how important the tag is to the program. Both kinds of information can be integrated into the actual algorithm to make it more accurate. For simplicity, this article ignores hierarchical tags and considers only flat, single-level tags.

Obtaining messages through a message queue has two benefits. First, it decouples the media asset system from the recommendation system (generally two different teams are responsible for them), so the two systems can be extended and upgraded independently; as long as the message format stays unchanged, neither side's business is affected. Second, passing information through the message queue makes the system more real-time.

In our project, the message queue for module (1) is RabbitMQ. The media asset team can provide and maintain this module as a basic service, and the algorithm team can give the media asset team its requirements so that data is delivered with the fields and specifications the recommendation algorithm needs.
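To make the message layout concrete, here is a small sketch that parses one such message. Only the field names type, sid, title, tags, and tid come from the description above; the nesting of each tag as an object, the `name` and `weight` fields, the value `"add"`, and all ids are assumptions made for illustration, since the original figure is not reproduced here.

```python
import json

# Hypothetical message; real values and tag layout may differ.
raw = '''
{
  "type": "add",
  "sid": "sid_0001",
  "title": "Example short video",
  "tags": [
    {"tid": "tid_101", "name": "sports", "weight": 0.8},
    {"tid": "tid_205", "name": "nba", "weight": 0.6}
  ]
}
'''

def parse_program_message(text):
    """Validate the required fields and return (type, sid, {tid: weight})."""
    msg = json.loads(text)
    missing = {"type", "sid", "title", "tags"} - set(msg)
    if missing:
        raise ValueError(f"message is missing fields: {sorted(missing)}")
    # Tags without an explicit weight default to 1.0 (one-hot style).
    tag_weights = {tag["tid"]: tag.get("weight", 1.0) for tag in msg["tags"]}
    return msg["type"], msg["sid"], tag_weights

msg_type, sid, tag_weights = parse_program_message(raw)
```

Downstream operators then only need the sid and the tid-to-weight mapping, which keeps comparisons cheap because they operate on identifiers rather than tag text.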

(2) Generate inverted index of label programs

This step obtains the programs' tag information from the message queue in near real time and builds the tag <-> program inverted indexes for each program, so that we can go from a program to its tags and from a tag to its programs. We build the indexes with Spark Streaming so that they are updated in real time, and we store them in the HBase cluster so that downstream real-time programs can read them in a distributed manner.

The storage format of the tag->program inverted index is shown below, where tid is the tag's unique identifier, sid is the program's identifier, publishTime is the program's release time, and hot (news), game (games), and sports (sports) are different short video types. The program->tag inverted index has a similar structure.

Figure 9: HBase storage structure of label->program

Based on the message structure in Figure 8, operator 2 (a Spark Streaming program) processes newly added programs from the message queue in near real time (with a time window of a few seconds), extracts the correspondence between tags and programs, and updates the tag->program inverted index table. The processing is very simple, so we do not elaborate here.
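The index update itself can be sketched with two in-memory dictionaries standing in for the two HBase tables. This is an illustrative simplification, not the Spark Streaming operator: the names `tag_to_programs`, `program_to_tags`, and `apply_message` are hypothetical, and the choice to clear a program's old postings before rewriting them is an assumption about how tag updates might be handled.

```python
from collections import defaultdict

tag_to_programs = defaultdict(dict)   # tid -> {sid: weight}
program_to_tags = defaultdict(dict)   # sid -> {tid: weight}

def apply_message(sid, tag_weights):
    """Insert or update one program in both index directions."""
    # Drop stale postings first so an update can remove tags as well as add them.
    for tid in program_to_tags.pop(sid, {}):
        tag_to_programs[tid].pop(sid, None)
        if not tag_to_programs[tid]:
            del tag_to_programs[tid]
    for tid, weight in tag_weights.items():
        tag_to_programs[tid][sid] = weight
        program_to_tags[sid][tid] = weight

apply_message("s1", {"t_sports": 1.0, "t_nba": 0.5})
apply_message("s2", {"t_sports": 0.8})
apply_message("s1", {"t_nba": 0.5})  # update: s1 no longer carries t_sports
```

Keeping both directions in step means a later lookup from program to tags (for portraits) and from tag to programs (for recall) see the same state.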

(3) User behavior ETL and injection into the message queue

User behavior logs go through simple ETL that extracts the key information and inserts it into the corresponding message queue, where the downstream user portrait module consumes it to build user portraits.

The core information in a user behavior log must include the user's unique identifier, the program sid, and the user's preference for the program, which can be measured by viewing time (see Figure 10 below). With the program sid, we can find the corresponding tags in the program->tag inverted index table.

Figure 10: User core behavior information

Here we use Kafka to receive the user behavior logs. TV Cat's logging is split into a batch path and a streaming path: batch logs enter the data warehouse through ETL on an hourly basis, while streaming logs enter Kafka in real time for consumption by backend services such as real-time recommendation, real-time reports, and business monitoring.

(4) Generate user portrait & playback history

This module obtains user behavior data from the message queue in real time, and generates user portraits and playback history records based on tags for users.

To reflect the user's long-term and short-term interests, we can generate several portraits over different time spans, such as a long-term portrait (based on behavior over the past few months or longer), a medium-term portrait (on the scale of days), and a short-term portrait (minutes to hours). Long-term and medium-term portraits can be generated in batch once a day. Short-term portraits are best built with stream processing so that changes in user interest are captured in real time.

The user's playback history records the content that the user has played or skipped. Such content no longer has value for that user, so recording it lets us filter it out of the final recommendations and improve the user experience.

The following figure shows the HBase data structures for short-term user portraits and user historical behavior. Operator 4 reads real-time user behavior logs from Kafka, obtains the program sid, tags, and other information from them, and then generates the real-time user portrait and updates the user's playback history.

To avoid misunderstanding, note that Figure 7 shows only the process of generating user portraits from the message queue in real time with Spark Streaming. The offline portrait generation is not shown: it uses Spark to read offline behavior data directly from the data warehouse and applies similar processing to produce the medium- and long-term user portraits, which are stored in separate HBase user portrait tables.

Figure 11: Short-term user portrait (Persona) and user historical behavior (action) HBase data structure
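The portrait update performed for each play event can be sketched as follows. The sketch assumes that preference is the fraction of the video watched, capped at 1.0, which is one possible reading of "measured by viewing time"; the names `on_play_event`, `portraits`, and `history`, and the sample data, are hypothetical.

```python
from collections import defaultdict

# Hypothetical program->tag index: sid -> {tid: weight}.
program_to_tags = {"s1": {"t_sports": 1.0, "t_nba": 0.5}, "s2": {"t_music": 1.0}}

portraits = defaultdict(lambda: defaultdict(float))  # user -> {tid: weight}
history = defaultdict(set)                           # user -> {sid, ...}

def on_play_event(user, sid, watched_seconds, duration_seconds):
    """Fold one play event into the user's short-term portrait and history."""
    # Assumed preference score: watched fraction of the video, capped at 1.0.
    preference = min(watched_seconds / duration_seconds, 1.0)
    for tid, weight in program_to_tags.get(sid, {}).items():
        portraits[user][tid] += preference * weight
    # Played or skipped content is recorded so it can be filtered out later.
    history[user].add(sid)

on_play_event("u1", "s1", watched_seconds=30, duration_seconds=60)
on_play_event("u1", "s2", watched_seconds=60, duration_seconds=60)
```

A production version would also age out old contributions so that the short-term portrait really covers only the last minutes to hours; that part is omitted here.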

(5) Recommend to users based on user portrait and label program inverted index

With the tag-based user portrait and the tag->program inverted index, we can generate recommendation results for users in real time: the user portrait gives the user's preferred tags, and the tag->program inverted index gives the programs related to those tags.

Here we briefly describe how to compute recommendations for users offline with Spark (real-time recommendation is covered in Section 4). First, Spark reads all user behavior data from HBase. We divide users into N partitions and update the personalized recommendations of the users in each partition (see Figure 12 below for the process). The final recommendation results are inserted into the CouchBase cluster through Kafka, where the recommendation interface reads them and returns them to the front end for display. Dividing users into N partitions enables distributed computing, and inserting results into CouchBase through Kafka decouples the recommendation computation from the interface's serving process.

Generating personalized recommendations for a single user (the personalized recommendation algorithm of Section 2) can be encapsulated as an independent operator, which each partition calls in a loop to generate recommendations for all of its users.

Figure 12: Calculate recommended business flow for users based on Spark Streaming
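The partition-then-loop structure can be imitated in plain Python, without Spark, to show the shape of the computation. Everything here is a stand-in: `partition_of`, `recommend_partition`, and `recommend_all` are hypothetical names, the CRC32-based assignment is just one stable way to split users, and `recommend_one` represents the single-user operator.

```python
import zlib

def partition_of(user_id, num_partitions):
    """Stable partition assignment (Python's built-in hash() is salted per process)."""
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions

def recommend_partition(users, recommend_one):
    """Run the single-user operator for every user in one partition."""
    return {user: recommend_one(user) for user in users}

def recommend_all(user_ids, recommend_one, num_partitions=4):
    partitions = [[] for _ in range(num_partitions)]
    for user in user_ids:
        partitions[partition_of(user, num_partitions)].append(user)
    results = {}
    for users in partitions:  # on a cluster these partitions would run in parallel
        results.update(recommend_partition(users, recommend_one))
    return results

# A trivial stand-in for the real single-user recommendation operator.
all_recs = recommend_all(["u1", "u2", "u3"], lambda user: [f"video_for_{user}"])
```

Because each user lands in exactly one partition, the per-partition results can be merged without conflict resolution.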

In addition to CouchBase, the final recommendation results must also be written to HBase, so that the real-time recommendation module can adjust to user interest in real time on the basis of these results.

The difficulty here lies in generating personalized recommendations from the user's interest portraits over different time spans while ensuring content diversity and integrating real-time feedback, so that users receive near real-time personalized recommendations. The recall, ranking, and real-time update strategies in the next section analyze this in detail.

3.2 Similar video recommendation

The following figure shows the overall architecture of similar video recommendation. It has three parts (corresponding to operators 1, 2, and 3 in the figure below). Parts 1 and 2 are exactly the same as in personalized recommendation and are not explained again here.

Only part 3 is explained below.

Figure 13: Overall architecture of similar video recommendation based on tags

Generate similar recommendations based on inverted index

In Section 2.2 we explained how to calculate video similarity. Here we briefly describe the business process around that calculation.

When a new video is injected into the message queue, we take all programs and their tags from the program inverted index table and calculate their similarity with the new program to obtain its TopN most similar programs. We insert one copy of this similar recommendation list into HBase (the data structure is shown in Figure 14 below) and, through the Kafka message queue, another copy into Redis. The Redis copy is the final recommendation result, which the interface reads and returns to the front end for the user. The HBase copy is used by real-time personalized recommendation to update the user's recommendation list according to the user's real-time behavior; how it is used is explained in the real-time update strategy in the next section.

Figure 14: Data structure of similar videos in HBase

Here we briefly explain how to use Spark Streaming to calculate the topN most similar programs for a single program A. First, take all the programs whose similarity with A needs to be calculated and store them in an RDD, so that they are distributed across N partitions. We calculate the topN programs most similar to A within each partition separately, and finally merge the N per-partition topN lists to obtain the final topN recommendation. Figure 15 below shows the whole process.

Figure 15: Algorithm logic for calculating topN similarity based on Spark Streaming
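The per-partition topN followed by a global merge can be sketched with the standard library's `heapq.nlargest`. Plain lists stand in for RDD partitions here, and the function names and similarity values are made up for the example.

```python
import heapq

def top_n_in_partition(partition, n):
    """partition: list of (sid, similarity) pairs for one slice of the library."""
    return heapq.nlargest(n, partition, key=lambda pair: pair[1])

def merged_top_n(partitions, n):
    """Merge the per-partition winners into the global top n."""
    local = [top_n_in_partition(p, n) for p in partitions]
    candidates = [pair for part in local for pair in part]
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical similarities between program A and programs in three partitions.
partitions = [
    [("b", 0.9), ("c", 0.1), ("d", 0.4)],
    [("e", 0.8), ("f", 0.7)],
    [("g", 0.2)],
]
top = merged_top_n(partitions, n=3)
```

The merge is correct because any program in the global top n must also be in the top n of its own partition, so at most N times n candidates ever need to be compared in the final step.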

For short videos with high timeliness requirements, such as news and sports, there is no need to take every video out of the library; taking only the last few days greatly reduces the amount of calculation. Even among the videos taken out, we can first filter out the programs that share no tag with A (similarity is computed from tags, so if B has no tag in common with A, their similarity must be 0) and only then calculate similarity. Because tags are sparse, this removes a lot of computation.

Beyond the calculation above, one more case must be handled: updating the similarity lists of videos whose similarity has already been calculated. A newly added program A may be more similar to an existing program B than some of the programs already in B's similarity list, in which case B's list must be updated. We do not discuss the update strategy here; “Similar Video Recommendation System Based on Erlang Language” explains it in detail. The update process with Spark is similar, though the implementation differs.

The architecture described above generates similar recommendation lists for new videos in real time. When a project first starts, or when similar recommendations are added for a new short video type, we need to calculate the similarity of all videos at once. There are two feasible methods. One is to import all videos into the message queue and let the real-time similarity program process them. The other is to implement a separate offline similarity program that is used only at project start-up or when a new video type is added. The first method may cause the queue to back up for a while, especially if the total number of videos is large. Our team chose the second option.
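Both pruning steps can be expressed directly against the tag->program inverted index. In this sketch the index maps each tid to a set of sids, `publish_time` holds an arbitrary numeric timestamp per program, and `candidate_programs` is a hypothetical helper name.

```python
def candidate_programs(new_tags, tag_to_programs, publish_time, min_publish_time):
    """Programs that share at least one tag with the new video and are recent enough."""
    candidates = set()
    for tid in new_tags:
        # Only programs listed under one of A's tags can have non-zero similarity.
        candidates.update(tag_to_programs.get(tid, ()))
    return {sid for sid in candidates if publish_time[sid] >= min_publish_time}

# Hypothetical index and publish times (larger means more recent).
tag_to_programs = {"t_nba": {"s1", "s2"}, "t_news": {"s3"}, "t_food": {"s4"}}
publish_time = {"s1": 100, "s2": 10, "s3": 120, "s4": 130}
cands = candidate_programs({"t_nba", "t_news"}, tag_to_programs,
                           publish_time, min_publish_time=50)
```

Here `s4` is dropped because it shares no tag with the new video, and `s2` is dropped because it is too old.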
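The referenced article's update strategy is not reproduced here. Purely as an illustration of the idea, the sketch below shows one simple way to decide whether a new program displaces an entry in an existing top-n list; the function name `update_similar_list` and the data are hypothetical, and this is not claimed to be the strategy the author uses.

```python
def update_similar_list(similar, new_sid, new_sim, n):
    """Insert (new_sid, new_sim) into a descending top-n list if it qualifies."""
    # Remove any previous entry for the same program, then re-rank.
    entries = [(sid, sim) for sid, sim in similar if sid != new_sid]
    entries.append((new_sid, new_sim))
    entries.sort(key=lambda pair: -pair[1])
    return entries[:n]

# Hypothetical existing similarity list of program B.
similar_of_b = [("c", 0.9), ("d", 0.5), ("e", 0.3)]
updated = update_similar_list(similar_of_b, "a", 0.6, n=3)    # "a" displaces "e"
unchanged = update_similar_list(similar_of_b, "a", 0.1, n=3)  # "a" is too dissimilar
```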

3.3 Theme recommendation

The overall architecture for generating topic recommendations for users is similar to personalized recommendations. We need to obtain a batch of user preference tags and associate them with a group of programs through the tags.

The only difference is that personalized recommendation merges all tags and their associated programs by weight into one combined recommendation list, whereas theme recommendation turns each preferred tag into a theme, and the programs associated with that tag form the recommendations for that theme. We do not elaborate here.

4. Recall and ranking strategies for personalized recommendations

In the overall architecture section, we explained how to make personalized recommendations from user portraits and the inverted index of program tags; the focus there was on generating recommendations that match the user's interest preferences.

In this section, we look in depth at how to use additional recall strategies to generate more diverse content that meets users' varied interests, and at how to capture changes in users' interests in real time. Because each short video is brief, these strategies are necessary. Recommending only from the user's known interests makes the recommendations progressively narrower, which hurts both content distribution and the user experience. Recommending a variety of content both broadens the user's interests and helps content distribution.

The following figure shows the recall and ranking process for short video recommendation. First, several recall strategies generate candidate recommendations for the user; a ranking strategy then combines these candidates into the list recommended to the user. We briefly explain the recall strategies and the ranking strategy below.

Figure 16: Personalized recommendation recall and ranking

4.1 Recall strategy

For short videos, in addition to recommending users based on their interests, there are many ways to make recommendations for users. Specifically, there are 6 types of feasible recall strategies:

(1) Recall based on users’ recent interests

For short videos, especially news, the user's interest changes over time, so we need an interest portrait built from a short recent period (a few days or even less) in order to integrate the user's recent interests into the recommendations.

(2) Recall based on the user’s long-term interest

The user's interests are also partly stable and slow to change, so we also generate a longer-term interest portrait (a few months or longer) and integrate the user's long-term interests into the recommendations.

(3) Recall based on user’s region

In the TV Cat app, we infer the user's region from the user's IP address. Much content has regional attributes, and users tend to pay attention to local information, so we can recall content matching the user's region (some content carries geographic tags).

(4) Relevant recall based on the user’s last program

The last program the user liked (one the user finished watching, indicating a strong preference) represents the user's most recent point of interest. We have good reason to guess that the user will like programs similar to it, so we can recall those similar programs. TV Cat's real-time personalized recommendation uses this recall strategy.

(5) Recall based on new and popular content

People are naturally curious about the unknown, so they are interested in new things, and the herd instinct means we are likely to enjoy what everyone else likes. Recalling new and popular content is therefore a very safe strategy. This kind of recall is generally also used as the default recommendation for new users to address the cold start problem.

(6) Recall based on differentiated categories

To keep recommendations from becoming too narrow, we need to recommend diverse content and surface new interests. We can divide the content into several categories by label (such that content in different categories differs substantially), randomly select a few programs from each category, and aggregate them into a mixed list that serves the user’s need for variety.

If a product lets users follow a channel or an author, content from those channels or authors can also serve as a recall source. Time also influences a user’s interests: different content may suit different times of day, so time-based recommendations can serve as another recall strategy.
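As a minimal sketch of strategies (1) and (2), the same viewing history can be turned into a recent profile and a long-term profile simply by changing the time window, and either profile can then drive recall through the tag inverted index. The function names, the unit weight per play, and the sample data are illustrative assumptions, not the TV Cat implementation.

```python
from collections import defaultdict

DAY = 86400  # seconds in a day


def build_profile(events, now, window_days):
    """Sum tag weights over plays inside the time window.

    events: list of (timestamp, tags) tuples, one per video the user watched.
    Each play adds 1.0 to every tag it carries (an assumed weighting).
    """
    profile = defaultdict(float)
    cutoff = now - window_days * DAY
    for timestamp, tags in events:
        if timestamp >= cutoff:
            for tag in tags:
                profile[tag] += 1.0
    return dict(profile)


def recall_by_profile(profile, inverted_index, watched, limit):
    """Score each unwatched video by the summed weight of the profile tags it carries."""
    scores = defaultdict(float)
    for tag, weight in profile.items():
        for video_id in inverted_index.get(tag, []):
            if video_id not in watched:
                scores[video_id] += weight
    # Highest score first; ties broken by id so the output is deterministic.
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return ranked[:limit]


# Hypothetical history: one recent football play, two older cooking plays.
now = 100 * DAY
events = [
    (now - 1 * DAY, ["football"]),
    (now - 50 * DAY, ["cooking"]),
    (now - 60 * DAY, ["cooking"]),
]
index = {"football": ["v1", "v2"], "cooking": ["v3", "v4"]}

recent_profile = build_profile(events, now, window_days=3)
long_term_profile = build_profile(events, now, window_days=90)
recent_recall = recall_by_profile(recent_profile, index, watched={"v1"}, limit=10)
long_term_recall = recall_by_profile(long_term_profile, index, watched=set(), limit=2)
```

With this data the recent window sees only football, while the long-term window ranks cooking videos first because that tag accumulated more weight.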

4.2 Sorting strategy

With so many recall strategies available, how should their results be presented to the user? Showing everything at once is clearly not possible. We need to merge, filter, and sort the recalled content into a more refined list before recommending it. This is the problem the sorting strategy solves, and its ultimate goal is to increase the click-through rate of the recommendation list and improve the user experience. Sorting strategies generally fall into rule-based sorting and model-based sorting, which we briefly introduce here.

(1) Rule-based sorting

Rule-based sorting relies mainly on operational or manually defined strategies. It is subjective and requires some business knowledge and industry experience. For example, we can take one item from each of the six recall strategies above in turn, cycling through them until we reach the number of items to recommend. Given the six recall lists, the merged list then consists of the first item of each list, followed by the second item of each list, and so on.

The above is only the most intuitive and simple sorting strategy. Depending on the product and business, other sorting and merging strategies are possible. For example, each queue can be given a weight and the next queue chosen with the corresponding probability, or different numbers of programs can be taken from different queues.
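The cyclic rule described above can be sketched as a round-robin merge. The function name and the choice to skip items already taken from another queue are assumptions made for illustration.

```python
def round_robin_merge(queues, limit):
    """Take one item from each recall queue in turn, skipping duplicates."""
    merged, seen = [], set()
    longest = max((len(queue) for queue in queues), default=0)
    position = 0
    while len(merged) < limit and position < longest:
        for queue in queues:
            if position < len(queue) and queue[position] not in seen:
                seen.add(queue[position])
                merged.append(queue[position])
                if len(merged) == limit:
                    break
        position += 1
    return merged


# Three hypothetical recall queues; "a1" is recalled by two strategies.
queues = [["a1", "a2"], ["b1", "a1", "b3"], ["c1"]]
merged = round_robin_merge(queues, limit=5)
```

A queue whose item at the current position is a duplicate simply contributes nothing in that round. A weighted variant would replace the fixed cycle with a random choice of queue.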

(2) Model-based sorting

Model-based sorting differs from the rule-based approach above. A machine learning model (logistic regression, deep learning, etc.) is trained on user behavior data. For each (user, program) pair, the model outputs the user’s predicted preference for the program, as a probability or a rating. We sort all programs in the recall queues in descending order of this score and recommend the top N to the user.

The model-based method is more objective and is less affected by human subjectivity. It can draw on all of the user’s behavior data in the product as well as user and item features, so it generally performs better. We do not elaborate here; learning to rank will be covered separately in a future article.
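To make the scoring step concrete, here is a minimal sketch that ranks recalled programs with an already trained logistic regression. The feature names, weights, and bias are invented for illustration; in practice they would come from training on user behavior data.

```python
import math


def predict_click_probability(weights, bias, features):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(weights, bias, candidates, top_n):
    """candidates maps video id -> feature dict for one (user, video) pair."""
    scored = [
        (predict_click_probability(weights, bias, features), video_id)
        for video_id, features in candidates.items()
    ]
    # Descending probability; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [video_id for _, video_id in scored[:top_n]]


# Hypothetical model parameters and candidate features.
weights = {"tag_match": 2.0, "is_new": 0.5}
bias = -1.0
candidates = {
    "v1": {"tag_match": 1.0},
    "v2": {"is_new": 1.0},
    "v3": {"tag_match": 1.0, "is_new": 1.0},
}
top_two = rank_candidates(weights, bias, candidates, top_n=2)
```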

Different recall strategies may recall the same content, so duplicates also need to be filtered out in the sorting stage. The sorting strategy also depends on how the product interacts with the user. For example, the Toutiao app uses swipe-to-refresh: each swipe loads 12 new items, selected by a unified ranking over the various recall results. For OTT products such as TV Cat, which are operated with a remote control, we use the “infinite” right-scroll interaction shown in Figure 1.

Having explained the recall and sorting strategies, we now turn to TV Cat’s personalized short video recommendation and explain in detail how the recommendation list is updated in near real time based on the user’s real-time behavior.

4.3 Personalized real-time update strategy for TV cats

The following briefly describes the sorting scheme behind TV Cat’s real-time personalized short video recommendation, for reference. Our recommendations are divided into offline and real-time recommendations. In the offline stage, we generate a recommendation list of 200 programs for each user every day according to the rules above. While the user is using the product, the list is updated in real time to reflect changes in the user’s interests.

The following figure shows the architecture of TV Cat’s real-time update. Operator 1 turns user behavior logs into real-time messages on the message queue. Operator 2 reads from the message queue which users need updating and what they did, and updates the original recommendation list according to certain rules. A copy of each recommendation list is kept in HBase. When a user’s recommendations are updated, the user’s list is read from HBase, adjusted to reflect the user’s real-time interest changes, and written back to HBase. Kafka also synchronizes a copy to CouchBase, where the recommendation interface reads it and returns it to the front end, so the user actually sees the updated list.

Figure 17: Real-time personalized recommendation architecture for TV cats

Let’s talk about how to update the recommendation list based on the user’s recent behavior.

We treat the 200 videos recommended to a user as a ring (see Figure 18 below), with every 20 programs forming one page. The content of each page comes from the offline stage, generated by the different recall strategies and the rule-based sorting strategy. When the user starts watching, we observe the user’s behavior on the first page (its 20 programs): the user plays what is interesting and skips what is not. We process these events with Spark Streaming. Assuming a 5-second window, when the next window is computed we insert programs similar to those the user showed interest in on the first page at the front of the second page. The number of inserted programs equals the number of programs the user played or skipped in the first window, and those played and skipped programs are removed from the ring. Because the numbers deleted and inserted are equal, the queue stays at 200 programs. The new first page then starts from the user’s current playing position, which returns the queue to its original state. The whole process forms a ring that can be scrolled right “infinitely”.

Figure 18: Real-time update scheme for TV Cat real-time personalized recommendation
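The ring update for one streaming window can be sketched in plain Python, leaving out Spark Streaming, HBase, and Kafka. The function name, the argument layout, and the rule for choosing which similar videos to insert are assumptions; the sketch only illustrates the bookkeeping that keeps the list length constant.

```python
def update_ring(ring, played, skipped, similar, page_size=20):
    """Return the recommendation list after one streaming window.

    ring:    current list; index 0 is the start of the user's first page
    played:  ids on the first page the user watched in this window
    skipped: ids on the first page the user skipped in this window
    similar: maps a played id to its similar videos, best first
    """
    present = set(ring)
    consumed = (set(played) | set(skipped)) & present
    needed = len(consumed)

    # Collect as many new similar videos as were consumed.
    fresh = []
    for video_id in played:
        for candidate in similar.get(video_id, []):
            if len(fresh) >= needed:
                break
            if candidate not in present and candidate not in fresh:
                fresh.append(candidate)

    # Drop consumed videos, then place the new ones at the front of page two.
    first_page = [v for v in ring[:page_size] if v not in consumed]
    rest = [v for v in ring[page_size:] if v not in consumed]
    return first_page + fresh + rest


# Small hypothetical ring of 10 videos with 4 per page.
ring = [f"v{i}" for i in range(10)]
similar = {"v0": ["s1", "v5", "s2", "s3"]}
new_ring = update_ring(ring, played=["v0"], skipped=["v1"], similar=similar, page_size=4)
```

If fewer similar videos are available than were consumed, this sketch lets the list shrink; a production system would presumably backfill from another recall source.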

5. Cold start strategy

Tag-based similar video recommendation has essentially no cold start problem: every newly ingested video carries tags, and we compute similar videos for new programs in near real time, so a new program gets its similar-video recommendations within a very short time. In this section, we discuss the cold start problem for real-time personalized recommendation.

Because this is content-based recommendation, the cold start problem is not severe. As soon as a user has watched one video, that video’s labels become the user’s interest labels, and we can recommend programs carrying those labels. But what should we recommend to a user who has not watched anything yet?

We can adopt the following three strategies:

(1) Use new hot programs as recommendations;

(2) Generate relevant recommendation lists for users based on user characteristics (such as user region);

(3) Select videos from different categories and recommend them to the user; at least one is likely to appeal.
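The three strategies above can be combined into a single fallback for users without history. The sketch below is one possible arrangement, with hypothetical function and parameter names and an assumed interleaving order (hot, then regional, then one pick per category).

```python
import random


def cold_start_recommend(region, hot_videos, region_videos, category_videos, limit, seed=None):
    """Blend new-and-hot, regional, and one-per-category picks for a user with no history."""
    rng = random.Random(seed)
    # One random video from each non-empty category, in a stable category order.
    per_category = [
        rng.choice(videos) for _, videos in sorted(category_videos.items()) if videos
    ]
    sources = [hot_videos, region_videos.get(region, []), per_category]

    merged = []
    for position in range(max(len(source) for source in sources)):
        for source in sources:
            if position < len(source) and source[position] not in merged:
                merged.append(source[position])
    return merged[:limit]


# Hypothetical inputs; the region name and video ids are made up.
hot = ["h1", "h2"]
by_region = {"Shanghai": ["r1"]}
by_category = {"sports": ["s1"], "news": ["n1"]}
with_region = cold_start_recommend("Shanghai", hot, by_region, by_category, limit=4)
without_region = cold_start_recommend(None, hot, by_region, by_category, limit=4)
```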

6. Future optimization direction

The label-based recommendation algorithm performs well overall in the TV Cat app, but there is still plenty of room for improvement. Below we list some possible optimizations that we plan to pursue, for your reference.

6.1 Add model sorting module

Although the algorithm has many recall strategies, the final ordering shown to users follows manual rules, and the real-time updates are rule-based as well, which makes the result somewhat subjective. A feasible optimization is to add a real-time model-based sorting layer: the results of the manually defined recall strategies are passed to the sorting module, which ranks them algorithmically and recommends the ranked results to the user.

A model-based ranking strategy is trained on user click behavior and various features, so it better reflects what users click and raises the probability of a click. The FTRL (Follow-the-Regularized-Leader) algorithm proposed by Google can be used to build a real-time ranking model that sorts the results of multiple recall types. It is widely applied at domestic Internet companies; interested readers can see Reference 11. Many deep learning algorithms (such as Wide & Deep) are also widely used in recommendation ranking.
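For readers who want a feel for FTRL, below is a compact sketch of the per-coordinate FTRL-Proximal update for logistic regression, following the algorithm described in Reference 11. The class name, hyperparameter values, and feature names are illustrative assumptions; this is not part of the TV Cat system, which the article says currently ranks by rules.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse features."""

    def __init__(self, alpha=0.1, beta=1.0, l1=0.1, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # adjusted gradient sums
        self.n = defaultdict(float)  # squared gradient sums

    def weight(self, feature):
        """Compute the current weight lazily from z and n."""
        z = self.z[feature]
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps this weight at exactly zero
        sign = 1.0 if z > 0 else -1.0
        step = (self.beta + math.sqrt(self.n[feature])) / self.alpha + self.l2
        return -(z - sign * self.l1) / step

    def predict(self, x):
        """x maps feature name -> value; returns the predicted click probability."""
        score = sum(self.weight(f) * v for f, v in x.items())
        score = max(min(score, 35.0), -35.0)  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x, clicked):
        """Learn from one impression; clicked is 1 or 0."""
        p = self.predict(x)
        for f, v in x.items():
            g = (p - clicked) * v  # gradient of the log loss for this coordinate
            w = self.weight(f)
            sigma = (math.sqrt(self.n[f] + g * g) - math.sqrt(self.n[f])) / self.alpha
            self.z[f] += g - sigma * w
            self.n[f] += g * g


# Hypothetical stream: football videos get clicked, cooking videos do not.
model = FTRLProximal()
for _ in range(50):
    model.update({"tag:football": 1.0}, clicked=1)
    model.update({"tag:cooking": 1.0}, clicked=0)
```

Because each impression updates the model immediately, the predicted probabilities can be used to re-rank the recall results as new behavior arrives.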

6.2 Filter duplicate programs

News and short video apps in particular obtain content from different sources, and content from different sources may be duplicated. The simplest method is to decide whether two items are duplicates by their titles. This is easy but not always reliable: two videos may have quite different titles while their content is nearly identical. In that case the video content (or article text) is needed to detect duplication, but processing it is costly, especially for video. Comparing titles is therefore the cheaper option, and its accuracy is acceptable.

There are generally two ways to handle duplicates: pre-processing and post-processing. Pre-processing checks the whole program library for duplicates when a new video is ingested; the new video is discarded if a duplicate exists and inserted otherwise. A fingerprint can be generated for each video to make the comparison easier. Post-processing filters the recommendation list after it is generated, keeping only one video from each group of duplicates.
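Both approaches can be sketched with a title-based fingerprint. The normalization rule (lowercasing and stripping punctuation and whitespace) and the use of SHA-256 are assumptions chosen for illustration; a content-based fingerprint would need a different technique.

```python
import hashlib
import re


def title_fingerprint(title):
    """Hash a normalized title: lowercase, punctuation and whitespace removed."""
    normalized = re.sub(r"[\W_]+", "", title.lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def add_to_library(library, video_id, title):
    """Pre-processing: store the video only if its fingerprint is new."""
    fingerprint = title_fingerprint(title)
    if fingerprint in library:
        return False
    library[fingerprint] = video_id
    return True


def dedupe_recommendations(video_ids, titles):
    """Post-processing: keep the first video seen for each fingerprint."""
    seen, result = set(), []
    for video_id in video_ids:
        fingerprint = title_fingerprint(titles[video_id])
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(video_id)
    return result


# Hypothetical titles; v1 and v2 differ only in case and punctuation.
titles = {"v1": "Breaking News: Flood!", "v2": "breaking news flood", "v3": "Cooking at home"}
library = {}
inserted = [add_to_library(library, vid, titles[vid]) for vid in ["v1", "v2", "v3"]]
deduped = dedupe_recommendations(["v2", "v3", "v1"], titles)
```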

6.3 Integrating user negative feedback

If the user starts a video and immediately switches to the next one, or stops after only a short time, this signals that the user does not like it. How can we integrate such negative feedback into a label-based algorithm? One feasible strategy is to penalize the labels of that video: if the user profile contains one of those labels, we subtract a value from its weight. Our algorithm does not currently integrate a negative feedback mechanism.
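The penalty idea can be sketched in a few lines. The penalty size and the floor at zero are assumed values; the article does not specify them, and the mechanism is not part of the current system.

```python
def apply_negative_feedback(profile, video_tags, penalty=0.2, floor=0.0):
    """Subtract a penalty from each profile tag carried by a quickly skipped video.

    Tags that are not already in the profile are left out rather than added.
    """
    updated = dict(profile)
    for tag in video_tags:
        if tag in updated:
            updated[tag] = max(floor, updated[tag] - penalty)
    return updated


# Hypothetical profile and a skipped video tagged horror, comedy, and sports.
profile = {"horror": 2.0, "comedy": 0.25}
penalized = apply_negative_feedback(profile, ["horror", "comedy", "sports"], penalty=0.5)
```

Here the horror weight drops to 1.5, the comedy weight is clamped at the floor, and sports stays out of the profile.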

6.4 Optimization for tags

In a label-based recommendation algorithm, the quality of the labels directly determines the quality of the recommendations. In our actual business the labels have some problems, mainly the following:

(1) Some tags are correlated; for example, horror and thriller have similar meanings;

(2) Some tags appear very frequently while others appear very rarely.

For (1), we can merge tags with similar meanings as far as possible, so that the remaining tags are clearly distinct from one another.

For (2), we can remove very rare tags (for example, tags that appear on only a few videos). Such tags may be dirty data and contribute little to the similarity calculation. Tags that appear too frequently (carried by a large share of programs) discriminate poorly, and we suggest removing them as well.
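Both clean-up steps can be sketched together. The synonym table and the two thresholds (`min_videos`, `max_share`) are hypothetical parameters that would need tuning on real data.

```python
from collections import Counter


def clean_tags(video_tags, synonyms, min_videos=2, max_share=0.5):
    """Merge synonymous tags, then drop tags that are too rare or too common.

    video_tags: maps video id -> list of raw tags
    synonyms:   maps a tag to the canonical tag it is merged into
    """
    merged = {
        video_id: {synonyms.get(tag, tag) for tag in tags}
        for video_id, tags in video_tags.items()
    }
    counts = Counter(tag for tags in merged.values() for tag in tags)
    total = len(merged)
    keep = {
        tag for tag, count in counts.items()
        if count >= min_videos and count / total <= max_share
    }
    return {video_id: sorted(tags & keep) for video_id, tags in merged.items()}


# Hypothetical library: "video" is on everything, "typo_tag" on a single item.
video_tags = {
    "v1": ["horror", "video"],
    "v2": ["thriller", "video"],
    "v3": ["comedy", "video"],
    "v4": ["comedy", "video", "typo_tag"],
}
cleaned = clean_tags(video_tags, synonyms={"thriller": "horror"})
```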

7. Closing remarks

This concludes our explanation of the tag-based real-time video recommendation system. The algorithm and engineering details described here are largely drawn from our experience with TV Cat short video recommendation.

The label-based algorithm is a very commonly used recommendation algorithm. Its principle is simple, it is highly interpretable, and it is widely used in real businesses. In our team’s experience it performs well. Toutiao’s recommendation system also uses label-based recommendation as one of its core modules.

The biggest problem of the label-based recommendation algorithm is its strong dependence on label quality, which directly affects how well the algorithm performs.

To do label-based recommendation well, you need to define a complete label system in advance for the business at hand, which takes substantial manual effort and places high demands on the team’s NLP capabilities.

References:

1. Real-Time Top-N Recommendation in Social Streams

2. TencentRec: Real-time Stream Recommendation in Practice

3. Real-time Video Recommendation Exploration

4. Tag-aware recommender systems based on deep neural networks

5. Tag-aware recommender systems by fusion of collaborative filtering algorithms

6. Tag-Aware Personalized Recommendation Using a Hybrid Deep Model

7. Content-based recommendation in social tagging systems

8. Real-time Attention Based Look-alike Model for Recommender System

9. [Book] Big Data: Principles and Best Practices of Scalable Realtime Data Systems

10. Real-time Personalization using Embeddings for Search Ranking at Airbnb

11. Ad Click Prediction: a View from the Trenches

(*This article was reproduced by AI Technology Base Camp; please contact the original author for reprint permission.)
