Teacher’s Aid: An NLP tool to help teachers evaluate student reflections.
By: Jon-Frederick Landrigan
Research has shown that goal setting (Turkay, 2014) and reflecting on one’s work and the strategies used can lead to improved academic performance. Given this, the company Sown to Grow is trying to leverage technology to harness and implement these findings at scale for students all across the country.
Sown to Grow is a tech company that provides students with a platform to set goals, reflect on their work, and interact with teachers. Based on the students’ reflections, teachers can give them feedback and suggest strategies for improvement. While providing students with this platform and service is an admirable goal, the company has run into the issue of having many more students than teachers and, in turn, far more reflections than teachers can review. Given this, Sown to Grow was looking for a way to automate scoring the quality of reflections in an effort to aid teachers (e.g. by prioritizing reflections, since a high-quality reflection may not need as much attention as a low-quality one, or by helping track student progress over time).
Example Reflections with Scores in Parentheses
I did bad. (1)
This was hard for me because I don’t like reading. (2)
I tried to understand the problems by rereading them and putting them in my own words. (3)
For this test the strategies that worked best for me were writing down notes as I read and then studying from my notes and the book. (4)
On this assignment I did all the readings and made sure to review my notes a few times. I will try to do this more often in the future. (5)
The data that Sown to Grow shared with me included 1344 student reflections, most of which had been scored by two raters using a rubric the company had developed to assess reflection quality (i.e. 2614 scored reflections including duplicates). Now I have to admit the first thing I wanted to do upon receiving the data was to apply some general NLP techniques (e.g. bag of words) and model it as a classification problem, but as will be discussed, it is a good thing I didn’t. The first step I took in tackling this challenge was two-fold: (1) to look at the inter-rater reliability to see if the scoring was consistent, as this would play a role in validating the performance of the model, and (2) to look at the distribution of the scores. Inter-rater reliability, or in other words the percent agreement between the two raters, was unfortunately only 50% (though raters were within one point of each other about 90% of the time). As for the distribution of the scores, it was quite skewed, with the majority of ratings at three and barely any fives.
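The two agreement numbers above can be computed with a simple helper. A minimal sketch, using made-up score pairs rather than the actual data:

```python
def agreement(rater_a, rater_b):
    """Return (exact agreement, within-one-point agreement) as fractions."""
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, within_one

# Illustrative scores only, not the real ratings.
a = [1, 3, 3, 4, 2, 3, 5, 3]
b = [2, 3, 4, 4, 2, 2, 4, 3]
exact, close = agreement(a, b)  # exact agreement vs. within-one agreement
```

On this toy data the raters agree exactly half the time but are always within a point of each other, mirroring the pattern in the real ratings.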
Given the low inter-rater reliability and the unequal distribution of scores, I went back to the rubric Sown to Grow had given me to see if I could determine why this was the case. Looking it over, it was apparent that the difference between a reflection scored as a four and one scored as a five was minimal.
Incorporating Domain Knowledge
Knowing these two things, I talked with Sown to Grow to get more information about the reflections, the scoring process, and the rubric. Two main things came out of this conversation. The first was that the company agreed that the difference between a four and a five was minimal; in fact they had already considered collapsing the rubric down to a four-point range and were fine with me modeling it on a four-point scale. Second, while developing the rubric they had gone through a few iterations to get rater agreement to at least 50%, and said the main goal was to keep raters within a point of each other, since human rating is an inherently subjective practice. This gave me an important launching point: rather than treating the ratings as distinct bins in a classification problem, I could treat them as points on a continuum and model this as a regression problem, which could help the model deal with border cases (i.e. cases of disagreement between the raters, e.g. one rater scores a reflection as a 2 and the other a 3).
Processing the Reflections and Finding Features
In processing the data, one of the first issues that had to be handled was misspelled words (unsurprisingly there were a lot; the reflections were written by grade school kids, after all). Ignoring these would have either led to increased noise in the data or to words being dropped from the analysis, because during tokenization some natural language processing toolkits drop words they do not recognize or, in the case of contractions, break them apart. To get around this issue, and in an effort to preserve as much text as possible, I implemented a semi-customized tokenizer that would not drop words or mistakenly split up contractions, and used a spelling corrector to fix misspellings.
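A minimal sketch of that preprocessing step: a regex tokenizer that keeps contractions whole, plus a toy dictionary-based corrector. The regex and the correction table here are illustrative stand-ins; the project used a full spelling corrector.

```python
import re

# Match runs of letters, optionally with one internal apostrophe,
# so contractions like "didn't" survive as single tokens.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

# Hypothetical lookup table; a real corrector handles arbitrary misspellings.
CORRECTIONS = {"studdy": "study", "raed": "read"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())

def correct(tokens):
    """Replace known misspellings, passing other tokens through unchanged."""
    return [CORRECTIONS.get(t, t) for t in tokens]

tokens = correct(tokenize("I didn't studdy enough, so I'll raed more."))
```

The key design point is that nothing is dropped: unknown words pass through the corrector untouched rather than being discarded.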
With the text cleaned, the next challenge was to find the most informative features. Talking with the company and using the rubric as a basis gave me some key starting points in this process. More specifically, I felt it would be important to have interpretable features in the model, given that it would be scoring students’ work. Therefore I decided to look at a number of things that could be quantified about the reflections and interpreted by non-machine-learning experts. One of the biggest indicators of quality was simply the length of the reflection (not too surprising), which ranged from a single word to 98 words in total.
Although the length of the reflection correlated quite strongly with the score (R = .5), more features were needed to really parse apart the scores. Next I looked at the tenses used in the reflections (i.e. past, present, future), the reasoning being that if a student was truly reflecting on what they did, you would expect some past-tense verbs (e.g. “I wrote down all my questions and then looked up the answers after I read.”).
The average number of past-tense verbs per reflection was also indicative of quality, which, given the nature of the reflections, is not that surprising. Following this, I generated lists of key words (e.g. “read”, “study”) and phrases (e.g. “strategy that worked”) that appeared in the higher-quality reflections (i.e. reflections scored as a 3 or 4). Finally, I included some basic word embedding features, such as the sum of the word embedding vectors for each reflection, to add a coarse measure of semantic similarity. Including these features and others as the predictors in a random forest regression model yielded decent performance (i.e. it agreed with at least one of the human raters about 60% of the time and was within a point of the raters about 90% of the time).
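Putting those ideas together, one reflection becomes a small vector of interpretable features. A sketch, assuming the toy keyword and irregular-verb lists below (the real lists were derived from the high-scoring reflections, and past tense was detected with proper part-of-speech tagging rather than this suffix heuristic):

```python
# Hypothetical word lists for illustration only.
KEYWORDS = {"read", "study", "notes", "strategy"}
IRREGULAR_PAST = {"wrote", "read", "did", "made", "took", "went"}

def featurize(tokens):
    """Map a tokenized reflection to interpretable numeric features."""
    return {
        "length": len(tokens),
        "past_verbs": sum(t in IRREGULAR_PAST or t.endswith("ed") for t in tokens),
        "keywords": sum(t in KEYWORDS for t in tokens),
    }

feats = featurize("i studied my notes after i read the chapter".split())
```

Vectors like this, one per reflection, were what fed the first random forest regressor; each column has a plain-language meaning a teacher could inspect.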
Moving Beyond the Engineered Features
Although the performance of the model was reasonable given the low inter-rater reliability, I wanted to try to improve it. To do this I developed another model based on term frequencies using tf-idf (term frequency-inverse document frequency). Essentially this technique is similar to a bag-of-words approach, except that it attempts to model how important a given word is to a document by measuring how frequently it is used in the document and then offsetting that by how often it appears in the corpus (for more information on this technique visit https://buhrmann.github.io/tfidf-analysis.html). The main reason for taking this approach was that although the engineered features accounted for a good amount of the variability in the scoring and were fairly interpretable, given the small set of reflections I wanted to capture as much of the linguistic variation as possible, since each new reflection would be completely different from the ones the models were trained on and could contain new words. Another random forest regression, based solely on the tf-idf sparse matrix, on average achieved 65% agreement with the raters and was within a point of the raters about 95% of the time.
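The tf-idf weighting itself is simple enough to sketch from scratch (in practice a library implementation such as scikit-learn’s TfidfVectorizer does this, with some extra smoothing; the three tiny documents below are made up):

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by in-document frequency times inverse document frequency."""
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    # Term frequency (count / doc length) scaled by idf, per document.
    return [{t: c / len(doc) * idf[t] for t, c in Counter(doc).items()}
            for doc in docs]

docs = [["i", "did", "bad"],
        ["i", "tried", "rereading"],
        ["rereading", "helped", "me"]]
vecs = tfidf(docs)
```

Note how “did”, which appears in only one document, gets a higher weight than “i”, which appears in two: distinctive vocabulary is exactly what the second model was meant to pick up.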
Finally, seeing that the two models were performing pretty much on par with each other, I decided to see if combining them in an ensemble would produce a boost in performance. It did: on average the ensemble agreed with one of the raters at least 72% of the time and was within a point of the raters 98% of the time. Although it may seem odd to combine the two approaches (i.e. predictors from feature engineering and tf-idf), the main argument for doing so is that either model alone could struggle with new input; by getting a prediction from both and producing the final score through a simple linear combination of their predictions (i.e. the predictions from the two random forest models were fed into a linear regression model), large mis-scores become less likely.
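The ensemble step reduces to a weighted sum of the two predictions. A sketch, where the weights stand in for coefficients a fitted linear regression would actually produce:

```python
# Hypothetical fitted coefficients; in the project these came from
# a linear regression trained on the two models' predictions.
W_FEATURES, W_TFIDF, BIAS = 0.45, 0.55, 0.0

def ensemble_score(pred_features, pred_tfidf):
    """Linearly combine the two model predictions, clipped to the rubric range."""
    raw = W_FEATURES * pred_features + W_TFIDF * pred_tfidf + BIAS
    return min(max(raw, 1.0), 4.0)  # scores live on the collapsed 1-4 scale

score = ensemble_score(2.8, 3.2)
```

Clipping to the 1-4 range is a small extra safeguard: even if both base models drift, the final score can never leave the rubric’s scale.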
In this post I presented the approach I took in creating a tool to help the company Sown to Grow automate the scoring of student reflections. Overall, I used two types of features: (1) engineered features based on the rubric and insights from the company, and (2) tf-idf to extract as much signal from the reflection vocabulary as possible. The model performed fairly well, agreeing with the trained raters 72% of the time and falling within a point of the raters nearly always (98% of the time). One of the key factors that aided the process was staying in touch with the company and relying on their domain expertise to gain a deeper understanding of the reflections and the rubric. This ultimately allowed me to tune my model to the problem and provide Sown to Grow with a tool to help improve educational outcomes. To see the code for this project, visit this GitHub repo: https://github.com/JFLandrigan/STG_Scoring.
About the Author
I am currently a fellow at Insight Data Science. Prior to this I got my PhD in Applied Cognitive and Brain Sciences from Drexel University where I studied semantic memory and language impairments.