YouTube Video Recommendation System Using Basic Math

Parth Shah
Published in Analytics Vidhya
5 min read · Nov 11, 2019

YouTube is one of the most popular video-sharing platforms on the internet. You can find videos about movies, songs, web series, education, technology, science fiction, and much more. But can you remember the last time you opened the application, watched a single video, and closed it? It is hard to recall, because around 70% of users watch multiple videos whenever they use the application. The reason behind this phenomenon is that YouTube uses a recommendation system to suggest more videos to its users.

Use of Recommendation System in business

Whenever you visit a shopping mall or grocery shop, you may have noticed that bread and jam are placed near each other, so that a customer who buys bread will most probably buy jam as well. Whenever you go to an electronics shop and buy a laptop or PC, you get discounts on accessories, because electronics companies hold a large amount of customer purchase data. In the same way, when a user is watching a video on YouTube, the recommendation system suggests more videos so that the user spends more time in the application and YouTube can show more advertisements.

Problem statement

Suppose user u1 has watched video v1 at time t1 and video v25 at time t2. We now have to suggest a set S of videos that the user may be interested in, and autoplay those videos.

Data-Matrix

YouTube has a <users-videos> dataset. Suppose user u1 has watched videos v1, v23, and v45; we can write this as u1 = {v1, v23, v45}. Assume there are n users and m videos, so we can draw an n × m matrix that shows which user has viewed which video. If user u23 has watched video v45, then data_matrix[23,45] = 1; otherwise the entry is 0.
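As a minimal sketch of this matrix, the snippet below builds a 0/1 table from watch histories. The `watch_history` data and the `build_data_matrix` helper are hypothetical names used only for illustration, and a plain list of lists stands in for whatever storage a real system would use.

```python
# Hypothetical watch histories: user index -> set of video indices watched.
watch_history = {
    0: {0, 2},
    1: {1},
    2: {0, 1, 3},
}

def build_data_matrix(watch_history, n_users, n_videos):
    """Return an n_users x n_videos table; entry [u][v] is 1 if user u watched video v."""
    data_matrix = [[0] * n_videos for _ in range(n_users)]
    for user, videos in watch_history.items():
        for video in videos:
            data_matrix[user][video] = 1
    return data_matrix

data_matrix = build_data_matrix(watch_history, n_users=3, n_videos=4)
for row in data_matrix:
    print(row)
```

Because most users watch only a tiny fraction of all videos, almost every entry is 0, so storing one set of videos per user (or one set of users per video) is usually more compact than the full matrix.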

Similarity of videos

Suppose videos v18 and v34 are very similar and v22 is not, so a user watching v18 should be recommended v34. A similarity function sim(vi, vj) returns the Similarity Index between two videos. In this case we can say that sim(v18, v34) > sim(v18, v22). But how can we find the similarity between two items?

Using the matrix above, we can find the similarity between two videos. Treat each video as the set of users who have watched it:
sim(vi, vj) = |vi ∩ vj|
According to this equation, two videos are similar to each other if they have a high number of common viewers.
Consider an example with n users and 3 videos (vi, vj, vk).
Here sim(vi, vj) = 3 and sim(vi, vk) = 1, so if a user is watching video vi, then vj will be recommended to watch next. With this approach, the top 10–20 most similar videos are recommended.
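The common-viewer count can be sketched in a few lines. The viewer sets below are made up so that they reproduce sim(vi, vj) = 3 and sim(vi, vk) = 1; the `viewers` dictionary and the `recommend` helper are illustrative names, not part of any YouTube API.

```python
# Hypothetical data: video -> set of users who watched it.
viewers = {
    "vi": {"u1", "u2", "u3", "u5"},
    "vj": {"u1", "u2", "u3"},
    "vk": {"u3", "u4"},
}

def sim(viewers, a, b):
    """Similarity Index: number of users who watched both video a and video b."""
    return len(viewers[a] & viewers[b])

def recommend(viewers, current, top_n=10):
    """Rank all other videos by their common-viewer count with the current video."""
    others = [v for v in viewers if v != current]
    others.sort(key=lambda v: sim(viewers, current, v), reverse=True)
    return others[:top_n]

print(sim(viewers, "vi", "vj"))   # 3
print(sim(viewers, "vi", "vk"))   # 1
print(recommend(viewers, "vi"))   # ['vj', 'vk']
```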

To make this approach more specific, suppose a user has watched 3 videos v1, v2, and v3. We then find the top 10 recommended videos for each of v1, v2, and v3, take the intersection of those lists, and order the result by decreasing Similarity Index.
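One possible way to combine several watched videos is sketched below. The text does not say how to order a candidate that has a different Similarity Index for each watched video, so this sketch assumes the indexes are summed; the data and the helper names `top_similar` and `recommend_for_history` are likewise hypothetical.

```python
def sim(viewers, a, b):
    """Number of users who watched both video a and video b."""
    return len(viewers[a] & viewers[b])

def top_similar(viewers, video, exclude, top_n):
    """Top-n videos most similar to `video`, skipping anything in `exclude`."""
    candidates = [v for v in viewers if v not in exclude]
    candidates.sort(key=lambda v: (-sim(viewers, video, v), v))
    return candidates[:top_n]

def recommend_for_history(viewers, watched, top_n=10):
    """Intersect the per-video top-n lists, then order by summed similarity (an assumption)."""
    watched = set(watched)
    if not watched:
        return []
    per_video = [set(top_similar(viewers, w, watched, top_n)) for w in watched]
    common = set.intersection(*per_video)

    def total(candidate):
        return sum(sim(viewers, w, candidate) for w in watched)

    return sorted(common, key=lambda c: (-total(c), c))

# Hypothetical data: video -> set of users who watched it.
viewers = {
    "v1": {"u1", "u2", "u3"},
    "v2": {"u1", "u2", "u4"},
    "v3": {"u2", "u3", "u4"},
    "v4": {"u1", "u2", "u3", "u4"},
    "v5": {"u2"},
    "v6": {"u9"},
}
print(recommend_for_history(viewers, ["v1", "v2", "v3"], top_n=2))  # ['v4', 'v5']
```

A strict intersection can come back empty when the watched videos have little in common, so a real system would likely fall back to a union or a weighted merge in that case.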

  • Problem

(1) The first problem with this approach is cold start. Whenever a new video is added to the dataset, nobody has watched it yet, so its Similarity Index with every other video is zero and it can never be recommended.

(2) The second problem is popularity. If a video has been watched by most users (like Gangnam Style or Despacito), then its Similarity Index with almost every other video will be high. For example, if vk is a popular video, the similarity of vi with vk is high even though vi and vj are actually more similar to each other.

  • Jaccard Similarity Index

To address the popularity problem, we can use the Jaccard Similarity Index. The Jaccard Similarity Index compares the members of two sets to see which members are shared and which are distinct. It is a measure of similarity between the two sets, with a range from 0% to 100%; the higher the percentage, the more similar the two sets. Although it is easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.

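With each video again treated as the set of users who have watched it, the index is:
J(vi, vj) = |vi ∩ vj| / |vi ∪ vj|
By this measure, vj is more similar to vi than vk is, because the large audience of the popular video vk inflates the union in the denominator.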
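The effect on a popular video can be checked with a small sketch. The viewer sets are invented for illustration: vk is watched by ten users, so it shares more viewers with vi than vj does, yet its Jaccard index is lower.

```python
def jaccard(a, b):
    """Jaccard index |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical data: vk is a very popular video watched by u1..u10.
vi = {"u1", "u2", "u3", "u4"}
vj = {"u1", "u2", "u3"}
vk = {f"u{n}" for n in range(1, 11)}

print(len(vi & vj), len(vi & vk))  # raw common-viewer counts: 3 4
print(jaccard(vi, vj))             # 0.75
print(jaccard(vi, vk))             # 0.4
```

Note that the cold-start problem remains: a brand-new video with no viewers still scores 0 against every other video.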

  • User based similarity

In the cases above we found the similarity between two videos (items). In a second approach, we can find the similarity between two users. Suppose we take user u1, compute the similarity with all other users, and find that u32 has the highest similarity. Then the videos watched by u32 but not by u1 will be recommended to u1, and vice versa.
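A rough sketch of the user-based variant follows. The text does not name the similarity measure for users, so this example assumes the Jaccard index over each user's set of watched videos; the `watched` data and the `recommend_from_nearest_user` helper are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_nearest_user(watched, target):
    """Find the user most similar to `target` and return (that user, videos target has not seen)."""
    others = [u for u in watched if u != target]
    if not others:
        return None, []
    nearest = max(others, key=lambda u: jaccard(watched[target], watched[u]))
    return nearest, sorted(watched[nearest] - watched[target])

# Hypothetical data: user -> set of videos watched.
watched = {
    "u1": {"v1", "v2", "v3"},
    "u32": {"v1", "v2", "v3", "v7"},
    "u5": {"v1", "v9"},
}
print(recommend_from_nearest_user(watched, "u1"))  # ('u32', ['v7'])
```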

  • Problem with this approach

It seems that with the help of the <users-videos> dataset we can simply recommend the videos with the highest similarity. But in reality YouTube has around 1 billion users and 100 million videos. If we compute the full similarity matrix of videos, its size will be 100 M × 100 M, which is 10¹⁶ cells, each storing a 4-byte floating point value. That is 4 × 10¹⁶ bytes, or around 40,000 TB, which is impractical to store.

To solve this problem, we keep only the top few most similar videos for each video. If we keep the first 10, the size of the matrix will be 10⁸ × 10 = 10⁹ cells, so the storage required is 4 × 10⁹ bytes, which is 4 GB.
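The storage arithmetic can be verified with a quick calculation. It uses the figures from the text (100 million videos, 4-byte floats, top 10 neighbours) and decimal units (1 TB = 10¹² bytes, 1 GB = 10⁹ bytes); the variable names are only for illustration.

```python
BYTES_PER_FLOAT = 4
n_videos = 100_000_000  # 10^8 videos, as assumed in the text
top_k = 10              # neighbours kept per video

# Full video-by-video similarity matrix.
full_cells = n_videos ** 2
full_bytes = full_cells * BYTES_PER_FLOAT

# Only the top_k similarity scores per video.
trimmed_cells = n_videos * top_k
trimmed_bytes = trimmed_cells * BYTES_PER_FLOAT

print(full_bytes / 10**12, "TB")    # 40000.0 TB
print(trimmed_bytes / 10**9, "GB")  # 4.0 GB
```

This counts only the similarity scores. Storing the ID of each neighbouring video alongside its score would add to the total, but the result stays in the gigabyte range.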

In practice, YouTube uses all of its data, such as user views, comments, search history, video category, and even the content of the video, and applies machine learning and deep learning algorithms to recommend a new video. I will discuss that in the next blog, but this is the core concept those algorithms build on.

Thank you so much for taking the time to read this blog. If you have any feedback or suggestions, please reach me by email (shahparth032@gmail.com) or comment below.

Once again, thank you!
