Week 2 — MOOC Recommendation
Hello again. After a short break, we are on the move :)
This week we had to revisit a few points we missed earlier. Our dataset turned out to be unsuitable for our purpose, so we replaced it with a new one. Let’s have a look in detail.
Click the link to read last week’s post (Week 1 — MOOC Recommendation)
— Problem with the data —
As we briefly said above, our old dataset had some issues. Collaborative filtering methods rely on both user and item information. Our data had only item information; the user information was missing. Without user information, we could still measure how similar two courses are with a simple formula and recommend courses that way. But, as you may guess, that does not count as a machine learning solution. To solve this issue, we found a new dataset on Kaggle.
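To make the "simple formula" idea concrete, here is a minimal sketch of item-only similarity. We assume cosine similarity as the formula, which is our choice for illustration and not something fixed by the old dataset. The course titles and the three feature flags are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as "not similar"
    return dot / (norm_a * norm_b)

# Hypothetical item features: [is_programming, is_design, is_beginner_level]
courses = {
    "Python Basics": [1, 0, 1],
    "Advanced Python": [1, 0, 0],
    "Logo Design": [0, 1, 1],
}

def most_similar(title, catalog):
    """Return the other course whose feature vector is closest to `title`."""
    target = catalog[title]
    scores = [
        (name, cosine_similarity(target, features))
        for name, features in catalog.items()
        if name != title
    ]
    return max(scores, key=lambda pair: pair[1])[0]
```

Notice that no user appears anywhere in this sketch. It only compares courses with each other, and nothing is learned from data, which is why it falls short of a machine learning solution.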
— New Dataset —
The new dataset folder contains two CSV files: course info and comments.
- Course info has 20 columns. These columns include features such as course id, course title, number of reviews, average rating, category, etc. We have approximately 210,000 courses in this data.
- Comments data has 6 columns. As the name implies, this data mainly contains comments from users, but our priority will be the ratings. The other columns are not important at this stage. Let’s not forget that this data has *9 MILLION* records. “Wow, that is huge.”
— Data Preparation *Part-1* —
- The first thing we need to do is remove the redundancies. As we just said, comments data has 6 columns, but only 3 of them are useful. We dropped every column except user, course id, and rating.
- Since our data is so large, we cannot inspect duplicate values and missing entries one by one; that would take too much time. Instead, we simply dropped the duplicate rows and the rows with missing entries.
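The two steps above can be sketched with pandas as follows. This is a minimal sketch and not our exact notebook code: the column names (`id`, `course_id`, `user`, `rating`, `date`, `comment`) and the tiny in-memory table are assumptions that stand in for the real comments file.

```python
import pandas as pd

# Tiny made-up stand-in for the comments data; column names are assumptions.
comments = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "course_id": [10, 10, 10, 20, 20],
    "user": ["ann", "bob", "bob", None, "cem"],
    "rating": [5.0, 4.0, 4.0, 3.0, None],
    "date": ["2022-01-01"] * 5,
    "comment": ["great", "good", "good", "ok", "fine"],
})

def prepare_ratings(df):
    """Keep user, course id and rating, then drop duplicates and missing rows."""
    ratings = df[["user", "course_id", "rating"]]  # keep the 3 useful columns
    ratings = ratings.drop_duplicates()            # remove repeated (user, course, rating) rows
    ratings = ratings.dropna()                     # remove rows with any missing value
    return ratings.reset_index(drop=True)

ratings = prepare_ratings(comments)
print(ratings)
```

The order matters here. In this toy table the two "bob" rows differ only in `id`, so they become exact duplicates only after the unused columns are gone. For the real file, passing `usecols` to `pd.read_csv` should let us load just the three needed columns and save memory.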
— Conclusion —
This week was all about data selection. Data is the most important component of a machine learning model. Recall that:
“Wrong data leads to the wrong model, the wrong model leads to wrong decisions, and wrong decisions lead to undesirable results.” — Project Team 05
Next week we will continue the data preparation step with the second part. Until next week, stay awesome, take care and bye.
Authors
- Arif Enes Aydın (@Arif Enes Aydın)
- Muhammet Ali Şentürk