Week 6 — Eye Tracking and Prior Knowledge

ali utku
AIN311 Fall 2022 Projects
Jan 5, 2023

by Alper Özöner and Ali Utku Aydın

This week we implemented the tf-idf algorithm to get better score initialization for each algorithm, and tried the Random Forest algorithm to work around the "linearly inseparable" data problem of SVMs.

Tf-Idf Implementation

After trying SVM prediction, our score initialization based on the word-count sum of each bag of words hadn't gone well, so we decided to use the tf-idf algorithm for score initialization instead.

from math import log

def calculate_tf_idf(documents):
    # documents: a list of bag-of-words dicts mapping word -> count
    # Total number of documents in the collection
    N = len(documents)
    # Document frequency: n, the number of documents that contain each word
    df = {}
    for doc in documents:
        for word in doc:
            df[word] = df.get(word, 0) + 1
    # Calculate the tf-idf values for each word in each document
    tf_idf = []
    for doc in documents:
        total = sum(doc.values())
        scores = {}
        for word, count in doc.items():
            # Term frequency (tf)
            tf = count / total
            # Inverse document frequency (idf), defined as log(N / n)
            idf = log(N / df[word])
            scores[word] = tf * idf
        tf_idf.append(scores)
    return tf_idf
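To make the formula concrete, here is a small hand-worked example (the numbers are made up for illustration): a word that appears 3 times in a 10-word document, and occurs in 2 of the 4 documents in the collection.

```python
from math import log

# Term frequency: 3 occurrences out of 10 words in the document
tf = 3 / 10
# Inverse document frequency: log(N / n) with N = 4 documents, n = 2
idf = log(4 / 2)
# tf-idf score for this word in this document
print(round(tf * idf, 3))  # 0.208
```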

After constructing the method, we modified the scores for each bag of words and formed a new pandas dataframe as shown below.

Train set for the World Cup subject: the X columns are tf-idf scores, the y column is the rating for each subject
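A minimal sketch of how such a dataframe can be assembled from per-document tf-idf scores (the column names, scores, and ratings here are invented placeholders, not our actual data):

```python
import pandas as pd

# Hypothetical tf-idf scores per document, plus the subject's rating (y)
rows = [
    {"goal": 0.21, "match": 0.05, "rating": 4},
    {"goal": 0.00, "match": 0.12, "rating": 2},
    {"goal": 0.17, "match": 0.00, "rating": 5},
]
df = pd.DataFrame(rows)
X = df.drop(columns=["rating"])  # tf-idf feature columns
y = df["rating"]                 # target ratings
print(X.shape, y.shape)  # (3, 2) (3,)
```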

It boosted the SVM's accuracy to 0.20, and boosted our confidence too. Tf-idf will probably remain our score-initialization method going forward.

Random Forest

A Random Forest in Baden-Württemberg, Germany occupies over 2,000 square miles.

As we suspected, SVM may give a large error because our data is not linearly separable. So to move forward, we searched for new algorithms and decided to use Random Forest (RF). Because RF can be used for both classification and regression problems and does not need hyperparameter tuning at the start (consider that we are still testing a hypothesis, so finding good hyperparameters would be hard), we chose RF for our problem.

We chose n_estimators = 50 because we have few but dense datapoints. The method gave us 0.08 accuracy on the French dataset, while the others stayed at 0.06; the others are low compared to the French dataset and we don't yet know why.
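A sketch of this setup using scikit-learn's RandomForestClassifier with the n_estimators value mentioned above (the synthetic data here is a stand-in for illustration, not our eye-tracking features):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 5 features, 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 3, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators=50, as in our experiments; RF needs little tuning up front
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out split
```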

Nevertheless, we now have concrete results and much better solutions for our dataset. For the future, our plan is to find the best parameters for each algorithm, find new subjects on campus, and conclude our project for AIN311. Stay tuned, see you next week!
