The Birth of a Data Analyst (Part 1)

I. Motivation

As a Computer Engineering student living in a time of information overload, my strategy for acquiring knowledge is to become sufficiently proficient in as many disciplines as possible, so that whenever I face a new challenge, I have some knowledge to build on instead of starting from nothing.

Therefore, when I was presented with the opportunity to intern at a start-up working on Big Data, I aimed to get the most out of my exposure to the field of data science.

As someone with no prior experience in this field, I am creating this series of articles to document my two-month learning process.

II. Task

To better understand the task at hand, consider the following scenario:

You visit a news website to read up on your favorite topics (art, politics, economics, etc.). You click on an article that interests you; keep in mind that an article can belong to one or more topics. After you finish reading, a recommended-articles section suggests new articles that you might be interested in.

Our task is to understand, and then implement, the logic behind that recommendation engine; that is, to figure out how the suggestion section knows which articles would interest you most.

III. Research Phase

Since I had no previous experience in the field of data analysis, the search started with the most basic Google query: “recommendation engine for articles”. One of the top results was a piece published on a New York Times blog, Building the Next New York Times Recommendation Engine by Alexander Spangher, which gave me a sneak peek into the general logic of a recommendation engine and a way to narrow down my next search.

In his article, Spangher discusses three types of filtering used to recommend articles to readers:

A. Content-based filtering

This approach recommends articles based on a content model of each article. Spangher explores two strategies for building that model:

· Tag-based content model, which pairs two signals: the keyword tags added to the article by its author, and the reader’s reading history (a short sketch of this idea follows after this list). Spangher illustrates the model with the following example:

The approach has intuitive appeal: If a user read ten articles tagged with the word “Clinton,” they would probably like future “Clinton”-tagged articles. And this technique performs as well on fresh content as it does on older content, since it relies on data available at the time of publishing. — Alexander Spangher

· Topic-based content model, which represents the article as a mixture of “topics”. Here, each topic is characterized by a predefined set of words associated with it, so an article (whose words come from multiple topics) is modeled as a distribution over topics.

Keep in mind that such methods of grouping similar articles have their drawbacks, including, but not limited to, rare tags being given a higher weight than tags that appear in many articles. Topic models also ignore context and consider only the literal words, which can lead to false positives when a word is ambiguous.
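To make the tag-based idea concrete, here is a minimal Python sketch under some hypothetical assumptions (the article IDs, tags, and helper names are my own, not Spangher’s implementation): it builds a tag profile from a reader’s history and ranks unread articles by how many of their tags the reader has already encountered.

```python
from collections import Counter

# Hypothetical toy data: article -> author-assigned keyword tags.
ARTICLE_TAGS = {
    "a1": {"clinton", "election", "politics"},
    "a2": {"clinton", "email", "politics"},
    "a3": {"jungle", "wildlife", "nature"},
    "a4": {"markets", "economy"},
}

def tag_profile(reading_history):
    """Count how often each tag appears in the articles a reader has read."""
    counts = Counter()
    for article_id in reading_history:
        counts.update(ARTICLE_TAGS[article_id])
    return counts

def recommend(reading_history, top_n=2):
    """Score unread articles by how well their tags match the reader's tag profile."""
    profile = tag_profile(reading_history)
    scores = {}
    for article_id, tags in ARTICLE_TAGS.items():
        if article_id in reading_history:
            continue  # do not recommend what was already read
        scores[article_id] = sum(profile[tag] for tag in tags)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend(["a1"]))  # the "clinton"/"politics"-tagged a2 ranks first
```

A real system would also normalize these counts (for example with TF-IDF-style weighting) so that a single rare tag does not dominate the score, which is exactly the drawback noted above.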

B. Collaborative-based filtering

Unlike the content-based approach, which focuses on the article itself (body, title, metadata, etc.), collaborative filtering emphasizes reader behavior. If we take each reader’s history as our metric, we can recommend articles read by users whose histories are similar to ours, following the logic of:

If one reader’s preferences are very similar to another reader’s, articles that the first reader reads might interest the second, and vice versa. — Alexander Spangher

But like any other filtering technique, this one has its drawbacks: new articles that few readers have seen yet are rarely recommended (the cold-start problem), and the reader can be trapped in a narrow viewpoint.
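Here is a minimal Python sketch of that logic, again with hypothetical data and helper names: it finds the reader whose history overlaps ours the most (using Jaccard similarity) and suggests the articles they have read that we have not.

```python
# Hypothetical toy data: reader -> set of articles they have read.
HISTORIES = {
    "alice": {"a1", "a2", "a5"},
    "bob":   {"a1", "a2", "a3"},
    "carol": {"a4", "a6"},
}

def jaccard(a, b):
    """Similarity between two reading histories: shared articles / all articles."""
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(reader):
    mine = HISTORIES[reader]
    # Find the other reader whose history is most similar to ours.
    most_similar = max((r for r in HISTORIES if r != reader),
                       key=lambda r: jaccard(mine, HISTORIES[r]))
    # Suggest what they have read and we have not.
    return HISTORIES[most_similar] - mine

print(recommend("alice"))  # bob is most similar -> {'a3'}
```

Note that a brand-new article appears in no history yet, so this sketch can never recommend it, which is the cold-start drawback mentioned above.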

C. Hybrid filtering

This brings us to a mix of content-based and collaborative filtering that aims to reduce the problems of both methods described above. Consider, for example, a problem such as the following:

Take an economics article titled Equilibrium in the Jungle by Michele Piccione and Ariel Rubinstein, which contains many jungle references used as figures of speech. A topic-based content model given two topics, economy and nature, returns a distribution of 40% economy and 60% nature (visualized in the figure below). The problem with the topic-based content model is clear: Equilibrium in the Jungle should score much higher on economy than on nature.

To resolve that issue, we correct for the topic-model error by taking into account the fact that readers of this piece regularly read economics articles. Looking at the reading history of readers who have read Equilibrium in the Jungle, we can add an offset to the article’s original position (blue dot) and move it closer to a more logical position (yellow dot).

[Figure: Hybrid Filtering Technique — the article’s topic-model position (blue dot) shifted toward a corrected position (yellow dot) using reader histories]
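The following toy Python sketch (hypothetical numbers and helper names, not the actual New York Times algorithm) illustrates that correction: the topic model’s estimate for Equilibrium in the Jungle is blended with the average topic profile of its readers, which pulls the article toward economy.

```python
# Topic model's estimate for the article (the "blue dot").
TOPIC_MODEL_ESTIMATE = {"economy": 0.4, "nature": 0.6}

# Average topic interests of readers who read this article, derived from
# their reading histories (hypothetical numbers).
READER_PROFILE = {"economy": 0.9, "nature": 0.1}

def adjust(estimate, reader_profile, weight=0.5):
    """Blend the content-model position with the collaborative signal.

    weight controls how far the article moves toward the readers' profile.
    """
    return {topic: (1 - weight) * estimate[topic] + weight * reader_profile[topic]
            for topic in estimate}

print(adjust(TOPIC_MODEL_ESTIMATE, READER_PROFILE))
# {'economy': 0.65, 'nature': 0.35}  <- the "yellow dot": closer to economy
```

The weight parameter decides how much we trust reader behavior over the content model; in a real system it would be tuned against actual engagement data rather than fixed by hand.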

IV. Conclusion

At this point, we have formed a general idea of how recommendation engines suggest articles to visitors of news websites. As a next step, we can narrow our research down to one of the two filtering methods discussed above.

Since this research was part of an internship, I was tasked with tackling the content-based filtering side of the system. The upcoming articles will therefore explore content-based filtering further and figure out how exactly it should be implemented.