Project Similarity

Published in Jellysmacklabs, Sep 9, 2019

Data science plays a crucial role at Jellysmack: it is used to analyze millions of data points every day, and to create and find the best and most popular videos published on all major social networks.

Thomas, a Data Scientist, is currently working on a project called Similarity. Its aim is to extract, from a set of videos, the ones that are particularly popular and, most importantly, to understand why they are popular and what they have in common.

In other words, the goal of the Similarity project is:

To find videos that ‘out-performed’ the others. These top videos are called ‘outliers’ [1].
To find the commonality among the outliers.

In the first phase of the project, two types of data common to all videos were used:

The tags: each video is characterized by a series of keywords or tags.
The title of each video.

Tags are more explicit than titles and therefore relatively easier to work with.

Titles must first be “cleaned up”, for example by removing all pronouns (e.g., you, this), changing plural words into singular ones, and reducing verbs to their root (e.g., ‘playing’ becomes ‘play’).
The remaining elements of the title are then tokenized, that is, the text is broken up into its individual words. Finally, single words (unigrams) are analyzed, as well as all pairs of consecutive words (bigrams) and longer sequences (trigrams, etc.).
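
The cleaning and tokenizing steps can be illustrated with a minimal sketch. The stop-word list, the suffix rules, and the function names below are assumptions made for this example only; the project's actual cleaning rules are not shown in this article, and a real pipeline would likely rely on an NLP library instead.

```python
import re

# Illustrative stop-word list (assumption); includes the pronouns cited above.
STOP_WORDS = {"you", "this", "the", "a", "an", "is", "are", "in"}


def normalize(word):
    """Very rough normalization: 'playing' -> 'play', 'goals' -> 'goal'."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def tokenize(title):
    """Lowercase the title, split it into words, drop stop words, normalize."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return [normalize(w) for w in words if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("You are playing the best goals this season")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(tokens)
print(bigrams)
```

With this made-up title, the tokens left after cleaning are play, best, goal and season, and the bigrams are the three pairs of neighbouring tokens.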

Although they are more difficult to work with than lists of tags, titles revealed interesting results. For example, soccer game videos with a question in the title are outliers most of the time; in other words, those videos tend to have a higher number of viewers.

Outliers

To identify outliers among a set of videos, Thomas chose a relatively novel method based on binary decision trees called Isolation Forest. Like the well-known Random Forest, it builds a collection of decision trees, but it also calculates the path length needed to isolate each observation in a tree. Outliers reside closer to the root of the tree and have shorter paths than normal observations.

The outlier score [3] for the Isolation Forest method is defined as:

s(x, n) = 2^{-E(h(x)) / c(n)}

h(x) : the path length of observation x
E(h(x)) : the average of h(x) over all the trees of the forest
c(n) : the average path length of unsuccessful search in a binary search tree
n : the number of external nodes

Thus, each observation has a score.
- The score is close to 1 for an anomaly.
- For a normal observation, the score is smaller than 0.5.
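
To make the score concrete, here is a small sketch that computes s(x, n) = 2^(-E(h(x)) / c(n)) following the definition in [3], where c(n) = 2H(n - 1) - 2(n - 1)/n and H(i) is approximated by ln(i) + 0.5772156649 (Euler's constant). The path lengths are invented for illustration, and n is taken here as the number of observations used to build each tree.

```python
import math

EULER_GAMMA = 0.5772156649


def c(n):
    """Average path length of an unsuccessful search in a binary search tree."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(path_lengths, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)), with E(h(x)) averaged over the trees."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / c(n))


n = 256
short_paths = [2, 3, 2, 3]      # hypothetical: isolated after very few splits
typical_paths = [c(n)] * 4      # path length exactly equal to the average
deep_paths = [20, 22, 21, 19]   # hypothetical: buried deep in every tree

for name, paths in [("short", short_paths),
                    ("typical", typical_paths),
                    ("deep", deep_paths)]:
    print(name, round(anomaly_score(paths, n), 3))
```

An observation whose average path length equals c(n) scores exactly 0.5; shorter paths push the score toward 1 and longer paths push it toward 0.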

Main Steps

The main steps to analyze videos are:

Define rules and run the Isolation Forest algorithm to cluster the videos of a Jellysmack channel (such as Oh My Goal or Gamology), or any other set of videos, into 2 groups: ‘top’ and ‘basic’.
Display observations (points) in a 2-D graph with the CPM on the ordinate axis and 60-second views on the abscissa.
Find similarities within the “top” cluster:

in the tags
in the title (unigrams, bigrams, trigrams…)
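
The last step, finding what the ‘top’ videos have in common, could be sketched as a simple frequency comparison between the two clusters. The videos, tags, and the "difference in share" ranking below are assumptions chosen for illustration; the article does not describe how the project actually measures similarity. The same counting works for title n-grams if each video carries a list of n-grams instead of tags.

```python
from collections import Counter

# Hypothetical videos: each has a cluster label and a list of tags.
videos = [
    {"cluster": "top", "tags": ["makeup", "tutorial", "lipstick"]},
    {"cluster": "top", "tags": ["makeup", "hack"]},
    {"cluster": "top", "tags": ["makeup", "tutorial"]},
    {"cluster": "basic", "tags": ["makeup", "review"]},
    {"cluster": "basic", "tags": ["review", "haul"]},
]


def share_by_cluster(videos, cluster):
    """Fraction of the videos in the given cluster that carry each tag."""
    group = [v for v in videos if v["cluster"] == cluster]
    counts = Counter(tag for v in group for tag in set(v["tags"]))
    return {tag: count / len(group) for tag, count in counts.items()}


top_share = share_by_cluster(videos, "top")
basic_share = share_by_cluster(videos, "basic")

# Tags that are much more frequent in 'top' than in 'basic' are candidates
# for what the outliers have in common.
difference = {tag: share - basic_share.get(tag, 0.0)
              for tag, share in top_share.items()}
ranked = sorted(difference.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

In this toy data, "makeup" appears in every top video but also in half of the basic ones, so "tutorial" ranks first: it is common in the top cluster and absent from the basic one.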

Example

Here is an analysis of a cluster of videos from the Jellysmack channel Beauty Studio in April 2019:

Conclusion

The scikit-learn library [4] was used for this project. Its implementation is easy to use, the documentation is clear, and the Python code is open source [5].

The algorithm is relatively simple and has only a few parameters.
The signature of the Isolation Forest class in scikit-learn is:
class sklearn.ensemble.IsolationForest(n_estimators=100, max_samples='auto', contamination='legacy', max_features=1.0, bootstrap=False, n_jobs=None, behaviour='old', random_state=None, verbose=0, warm_start=False)

Thomas wrote the Python program that runs the algorithm and defines the rules and parameters.
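
That program is not reproduced in this article. As a rough illustration only, the sketch below runs scikit-learn's IsolationForest on synthetic data with two columns standing in for 60-second views and CPM. The data, the contamination value, and the "above the median on both metrics" rule are all assumptions. The `behaviour` argument and the `'legacy'` contamination value shown in the signature above belong to the scikit-learn release of that time and are not accepted by recent releases, so they are left out here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in data: columns are 60-second views and CPM.
typical = rng.normal(loc=[10_000, 2.0], scale=[2_000, 0.3], size=(200, 2))
standout = np.array([[60_000, 6.0], [55_000, 5.5], [70_000, 7.0]])
X = np.vstack([typical, standout])

model = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = model.fit_predict(X)      # -1 for outliers, 1 for inliers
scores = -model.score_samples(X)   # higher means more anomalous

# Isolation Forest flags unusual points in any direction, so an extra rule
# (hypothetical here) keeps only outliers above the median on both metrics.
medians = np.median(X, axis=0)
is_top = (labels == -1) & (X > medians).all(axis=1)
groups = np.where(is_top, "top", "basic")

print("videos labelled 'top':", int(is_top.sum()))
```

The extra rule matters because an outlier is not necessarily a good video: a video with unusually low views is just as easy to isolate as one with unusually high views.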

References

[1] https://machinelearningmastery.com/how-to-identify-outliers-in-your-data/

[2] https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561

[3] Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008, December). Isolation forest. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on (pp. 413–422). IEEE.

[4] https://scikit-learn.org/stable/modules/outlier_detection.html#isolation-forest

[5] https://github.com/scikit-learn/scikit-learn/blob/7813f7efb/sklearn/ensemble/iforest.py#L29

Written by Gerard Picouleau