Week 3 — MOOC Recommendation

Arif Enes Aydın
AIN311 Fall 2022 Projects
3 min read · Dec 11, 2022

Hello everyone. Last week, we talked about the difficulties we had with our original dataset and introduced our new dataset. This week we will talk about how we processed it.

Since there is no user ID in the dataset, we decided to group reviews by display name and treat each name as a user identity. Here are the unique names in the dataset.

Too many entries shared the same name. To work around this problem, we reasoned intuitively: almost nobody purchases and reviews more than 10–15 courses. Based on this idea, we discarded all entries whose display name appears more than 15 times.

We also found that the vast majority of courses had very few reviews. We removed the entries for these courses as well; otherwise, the utility matrix we need would become enormous.
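The two filtering steps above can be sketched roughly as follows. This is a minimal illustration, not our actual code: the `(display_name, course_id, rating)` record layout, the function name `filter_reviews`, the sample data, and the `min_per_course` threshold are all assumptions made for the example. Only the limit of 15 reviews per display name comes from the text.

```python
from collections import Counter

def filter_reviews(reviews, max_per_name=15, min_per_course=2):
    """Drop over-used display names, then drop rarely reviewed courses.

    reviews: list of (display_name, course_id, rating) tuples (assumed layout).
    min_per_course is a hypothetical threshold chosen for illustration.
    """
    # Step 1: a name with more than max_per_name reviews is probably
    # several different people, so all of its entries are discarded.
    name_counts = Counter(name for name, _, _ in reviews)
    kept = [r for r in reviews if name_counts[r[0]] <= max_per_name]

    # Step 2: courses with too few remaining reviews are removed to keep
    # the utility matrix at a manageable size.
    course_counts = Counter(course for _, course, _ in kept)
    return [r for r in kept if course_counts[r[1]] >= min_per_course]

# Made-up sample: "John" appears 20 times and is dropped entirely.
sample = [("Ali", 1, 5), ("Ali", 2, 4), ("Ayse", 1, 3), ("Ayse", 3, 5)]
sample += [("John", course, 4) for course in range(100, 120)]

filtered = filter_reviews(sample, max_per_name=15, min_per_course=2)
print(filtered)
```

Note that the order matters in this sketch: course counts are computed after the name filter, so reviews from discarded names do not keep a course alive.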

After these changes, we assigned an ID number to each remaining user. As a result, we have 100k+ reviews from 10k+ users for 3k+ courses. These figures may change depending on the needs of the recommendation system.

Figure: a fraction of the utility matrix
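To make the ID assignment and the utility matrix concrete, here is a small sketch under stated assumptions. The record layout, the function name `build_utility_matrix`, and the sample reviews are hypothetical; a real matrix of this size would more likely be stored in a sparse format than as nested lists.

```python
def build_utility_matrix(reviews):
    """Assign integer IDs to display names and build a user x course matrix.

    reviews: list of (display_name, course_id, rating) tuples (assumed layout).
    Returns (user_ids, course_index, matrix); missing ratings are None.
    """
    user_ids = {}       # display name -> row index (the new user ID)
    course_index = {}   # course ID -> column index
    for name, course, _ in reviews:
        # IDs are handed out in order of first appearance.
        user_ids.setdefault(name, len(user_ids))
        course_index.setdefault(course, len(course_index))

    # One row per user, one column per course, empty cells left as None.
    matrix = [[None] * len(course_index) for _ in user_ids]
    for name, course, rating in reviews:
        matrix[user_ids[name]][course_index[course]] = rating
    return user_ids, course_index, matrix

# Made-up reviews, only for illustration.
reviews = [("Ali", 101, 5), ("Ali", 102, 3), ("Ayse", 101, 4), ("Mehmet", 102, 2)]
user_ids, course_index, matrix = build_utility_matrix(reviews)
print(user_ids)
print(matrix)
```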

Since our progress is a little delayed, we have only just started implementing the recommendation system. This week, we explored the needs and capabilities of the system. First, we normalized the matrix and then calculated the cosine similarity. The results can be used to find which courses are most similar to a given course. Here are the courses most similar to the course with id 15639:
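For readers who want to see the idea in code, the normalization and cosine similarity steps might look like the sketch below. It assumes that "normalizing" means subtracting each user's mean rating and treating missing ratings as zero afterwards, which is one common choice and not necessarily the one we ended up using. The function names and the tiny matrix are hypothetical.

```python
import math

def mean_center_rows(matrix):
    """Subtract each user's mean rating; missing ratings (None) become 0.0."""
    centered = []
    for row in matrix:
        rated = [r for r in row if r is not None]
        mean = sum(rated) / len(rated) if rated else 0.0
        centered.append([r - mean if r is not None else 0.0 for r in row])
    return centered

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar_courses(matrix, course_col, top_n=3):
    """Rank the other course columns by cosine similarity to course_col."""
    centered = mean_center_rows(matrix)
    columns = [list(col) for col in zip(*centered)]  # one vector per course
    target = columns[course_col]
    scores = [(j, cosine(target, col))
              for j, col in enumerate(columns) if j != course_col]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_n]

# Made-up 3 users x 3 courses matrix: courses 0 and 1 are rated alike.
toy_matrix = [
    [5, 4, 1],
    [4, 5, 2],
    [1, 2, 5],
]
ranked = most_similar_courses(toy_matrix, course_col=0)
print(ranked)
```

In this toy example, course 1 ranks first for course 0 because the same users rate both above their personal average, while course 2 gets a negative score.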

That is all for today. Next week we will do a literature review to consider other options to include in our system, and we will prepare the progress report for our project. It is hard to decide which way to go, since there are not many sources on this topic compared to other machine learning methods.

Thank you for reading. Hope to see you next week.
