Week 5 - MOOC Recommendation

Muhammet Ali Şentürk
AIN311 Fall 2022 Projects
5 min read · Dec 25, 2022

What is up, ladies and gentlemen! We hope everything is going well for you. Welcome back to yet another blog post about our recommendation system project.

This week, we took a different approach to the problem. The approach is not exactly about recommendation; it is about comment analysis. What do we mean by that? We will explain soon. So let’s dive into the details.

By the way, if you want to remember what we did last week, check this link -Week 4 — MOOC Recommendation-

— Analysing the Comments —

At the end of last week’s post, we hinted at trying some different applications. We wanted to do -so to speak- an unexpected thing that would show our project is different from other recommendation system projects. Fortunately, our data set was appropriate for this variation.

Recall that our data set consists of 2 separate files. One file contains the courses and the other contains the user reactions. Comments are included in these reactions as well. Besides the rating a user has given, there is -mostly- a comment from this user. Using these comments, we will try to build a rating generator whose output ranges between 0 and 5.

This idea is slightly similar to sentiment analysis, but not the same. If you are analyzing the sentiments in user comments, you are -in fact- doing classification. In such a process, you have to classify the sentiments as positive-negative, happy-sad, etc., so you end up with a categorical value. The result of the process in our idea, on the other hand, will be a rational number.

— Preprocessing —

So let’s start doing some analysis. The comments file contains 9 million entries, which is an extremely large number. To get a better idea of the distribution of ratings by the number of comments, we can look at this pie chart.

  • There is a little typo in the chart label, by the way. The label of the purple area should indicate that the rating is bigger than 4. The pie chart clearly shows that almost 75% of the ratings are over 4.

Besides that, since our data set is so huge, we took 5,000 samples for each rating interval (25,000 in total). Five of the comments rated less than 1 are shown below:
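The per-interval sampling step could be sketched roughly like this. Note that the column names `rating` and `comment` and the function name are our assumptions for illustration, not necessarily what our notebook uses:

```python
import pandas as pd

def sample_by_rating_interval(reviews: pd.DataFrame,
                              per_interval: int = 5000,
                              seed: int = 42) -> pd.DataFrame:
    """Take an equal-size random sample from each unit-wide rating interval.

    Assumes a "rating" column with values in [0, 5].
    """
    samples = []
    for low in range(5):  # intervals [0,1), [1,2), ..., [4,5]
        high = low + 1
        # the last interval is closed on the right so a 5.0 rating is kept
        upper = reviews["rating"] <= high if high == 5 else reviews["rating"] < high
        subset = reviews[(reviews["rating"] >= low) & upper]
        n = min(per_interval, len(subset))
        samples.append(subset.sample(n=n, random_state=seed))
    return pd.concat(samples)
```

With the full 9-million-row file this yields at most 25,000 rows, matching the 5 × 5,000 sample described above.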

After this step, we needed to set a threshold for filtering out some comments. Some comments have only one or two words (“worst”, “very good”, etc.), and some have none at all (such cases contain only special characters like dots, question marks, commas, etc.). Since these comments are so short, the language detector does not work properly on them. In order to keep only longer comments, we set the threshold to 20. This means that a comment containing fewer than 20 characters will not be considered.
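The length filter itself is a one-liner. Here is a minimal sketch, again assuming a `comment` column:

```python
import pandas as pd

MIN_CHARS = 20  # threshold described above

def filter_short_comments(reviews: pd.DataFrame) -> pd.DataFrame:
    # Drop comments with fewer than 20 characters; very short or
    # punctuation-only comments confuse the language detector.
    lengths = reviews["comment"].astype(str).str.len()
    return reviews[lengths >= MIN_CHARS].reset_index(drop=True)
```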

Also, some of the comments were not written in English. We needed to remove them from the data. But first, we need to detect them, right? We used the detect function from the langdetect library. Here are 5 of the comments written in other languages:

And this is the filtered version of the data containing comments rated less than 1:
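The language filter could look something like the sketch below. We pass the detector in as a parameter (e.g. `langdetect.detect`) so the logic is easy to test; the try/except is there because langdetect raises an exception on text it cannot classify:

```python
import pandas as pd

def filter_non_english(reviews: pd.DataFrame, detect_fn) -> pd.DataFrame:
    """Keep only rows whose comment is detected as English.

    detect_fn is a callable like langdetect.detect, which returns an
    ISO 639-1 code such as "en" or raises on undetectable input.
    """
    def is_english(text: str) -> bool:
        try:
            return detect_fn(text) == "en"
        except Exception:  # detector fails on e.g. punctuation-only text
            return False

    mask = reviews["comment"].astype(str).map(is_english)
    return reviews[mask].reset_index(drop=True)
```

In our setting this would be called as `filter_non_english(sampled, detect)` after `from langdetect import detect`.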

After removing them from the data, we decided to look at the distribution of the data with the pie chart again. Here is the chart:

  • Yeah, it seems like the distribution is more balanced.

Since we did the filtering process for the comments in all 5 intervals, we needed to concatenate them into a single data frame. While doing that, we had to remember the key thing that will provide a balanced distribution in the train and test sets when modeling later: shuffling. Otherwise, it is highly possible to train the model on comments rated less than 4 and test it on comments rated higher than 4. Also, we had to reset the indexes since they cause eye bleeding. So this is the result of this step:
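The concatenate-shuffle-reset step above can be sketched in a few lines of pandas (the function name is ours; `sample(frac=1)` is a common idiom for shuffling a data frame):

```python
import pandas as pd

def combine_and_shuffle(frames, seed: int = 42) -> pd.DataFrame:
    # Concatenate the per-interval frames, shuffle the rows so every
    # rating interval is mixed through the train/test split, and reset
    # the index so it runs 0, 1, 2, ... again.
    combined = pd.concat(frames, ignore_index=True)
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)
```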

  • As you can see, the first 10 comments cover all of the rating intervals.

That concludes the preprocessing section. We have done many things so far, and our data is ready for modeling after the train-test split. The remaining steps are modeling, testing and evaluating the results, but we will cover those steps next week.

By the way, next week’s post will probably be the last one. On January 3rd, this project will be presented, and afterwards we will finalize it with a short video and a project report paper.

So yeah, that is all for this week. Until we meet again, stay awesome, take care, and bye.
