NLP | Automatic Essay Grading using TF-IDF, SVD, and K Nearest Neighbors Classifiers

Vinh Bui · Aug 9, 2020

In the previous blog, I wrote about applying Bag-of-Words to automatic essay grading. The accuracy was not high, so in this blog I will improve the model by using TF-IDF, SVD, and additional features extracted from the data.

The flow of my work will be: vectorize the essays with TF-IDF, add hand-crafted features, reduce the dimensionality with SVD, predict scores with K Nearest Neighbors, and validate with Cohen's Kappa.

The source code and data can be found on my GitHub.

Photo by fabio on Unsplash

TF-IDF

According to Wikipedia, TF-IDF stands for Term Frequency–Inverse Document Frequency and “is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.” I will not explain the details or the formula of the concept in this blog, but you can find more detail here.

In a large text corpus, some words will be very present (e.g. “the”, “a”, “is” in English) hence carrying very little meaningful information about the actual contents of the document. If we were to feed the direct count data directly to a classifier those very frequent terms would shadow the frequencies of rarer yet more interesting terms.

In short, the more frequently a word appears across the corpus, the less valuable information it provides. On the other hand, if a word appears too rarely, it may not be important either (it could even be a typo). Bag-of-Words provides a good starting point for text classification; TF-IDF improves performance by filtering out this “noise” for automatic grading.

The code results in a feature matrix that excludes words appearing in less than 10% or more than 85% of the documents in the corpus, and it keeps only the 5,000 most important words.
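A minimal sketch of what that vectorizer setup could look like with scikit-learn (the DataFrame `df` and its `essay` column are my assumptions; the actual code is on GitHub):

```
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep words that appear in at least 10% and at most 85% of the essays,
# and cap the vocabulary at the 5,000 most important terms.
vectorizer = TfidfVectorizer(min_df=0.10, max_df=0.85, max_features=5000)

# df["essay"] is assumed to hold the raw essay texts.
tfidf_matrix = vectorizer.fit_transform(df["essay"])
print(tfidf_matrix.shape)  # (number_of_essays, up to 5000 features)
```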

Nevertheless, this can be further improved by adding my custom features to the data.

Feature Engineering

Let's see the relationship between the number of words, the number of sentences, and the scores.

Relationship between “Score” and “Number Of Sentences”
Relationship between “Score” and “Number of Words”

Based on this exploration, longer essays tend to have higher scores and shorter essays tend to have lower scores. Therefore, I add the number of words and the number of sentences as two extra features to improve the accuracy of the model.
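A rough sketch of how these two features could be computed (splitting on whitespace for words and on end-of-sentence punctuation for sentences is a simplification, and `df["essay"]` is again an assumption):

```
import re

# Word count: split on whitespace.
df["num_words"] = df["essay"].apply(lambda text: len(text.split()))

# Sentence count: split on ., ! and ? and keep the non-empty pieces.
df["num_sentences"] = df["essay"].apply(
    lambda text: len([s for s in re.split(r"[.!?]+", text) if s.strip()])
)
```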

SVD

SVD stands for Singular Value Decomposition. According to scikit-learn

Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.

The main reason I use SVD in this project is to reduce the dimensionality of the feature matrix while keeping its “core” information. The smaller the matrix, the faster the model trains.

After applying Singular Value Decomposition, the matrix of 5,000 TF-IDF features is reduced to 100 features (100 is a number I picked by hand).
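A minimal sketch of this step, assuming the `tfidf_matrix` from the TF-IDF section above (the `random_state` is only there for reproducibility):

```
from sklearn.decomposition import TruncatedSVD

# Reduce the 5,000-dimensional TF-IDF matrix to 100 components.
svd = TruncatedSVD(n_components=100, random_state=42)
reduced_matrix = svd.fit_transform(tfidf_matrix)
print(reduced_matrix.shape)  # (number_of_essays, 100)
```

The two hand-crafted features from the previous section can then be stacked alongside these 100 components before training.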

Training and validation

Applying Nearest Neighbors and MEAN to classify

K Nearest Neighbors is a method that finds the data entries most similar to a given entry, and the known scores of those neighbors can then be used to make a prediction. More information about Nearest Neighbors can be found here.

Image from DataCamp.com

I use the Nearest Neighbors algorithm to find the n nearest neighbors of each test entry in the training data. The predicted score is the mean of the scores of those nearest neighbors.
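A minimal sketch of that idea, assuming `X_train`/`X_test` are the reduced feature matrices and `y_train` is a NumPy array of training scores (the choice of 5 neighbors and the final rounding are my assumptions):

```
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit the neighbor search on the training features.
nn = NearestNeighbors(n_neighbors=5)
nn.fit(X_train)

# For each test essay, find its nearest training essays and average their scores.
_, neighbor_idx = nn.kneighbors(X_test)
y_pred = np.rint([y_train[idx].mean() for idx in neighbor_idx]).astype(int)
```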

Validation

Reference: Automatic Text Scoring Using Neural Networks

In text classification, it is important to use Cohen's Kappa score to validate the model. It basically tells how much better the model's predictions are compared to random predictions made according to the frequency of the outcomes. My model achieves a score of 87.42%.
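A minimal sketch of that validation step, assuming `y_test` and the rounded `y_pred` from the previous snippet:

```
from sklearn.metrics import cohen_kappa_score

# Compare predicted scores against the true scores of the test set.
kappa = cohen_kappa_score(y_test, y_pred)
print(f"Cohen's Kappa: {kappa:.4f}")

# A quadratic-weighted variant gives partial credit for near-miss grades:
weighted_kappa = cohen_kappa_score(y_test, y_pred, weights="quadratic")
```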

Next step

In the next step, I will implement a web service hosted on Heroku so that users can write their essays and the model can make predictions.

There are many better models with higher accuracy, such as neural networks. However, I use this approach because of its transparency to stakeholders.

If you have feedback, please kindly respond so that I can improve.

If you like this article, please give me claps and follow me.
