TDNN: Building AES Models with Limited Training Data

Kai Hui (Deep Learning Center of Excellence Berlin)

Oct 19, 2018

Writing is a creative process that involves several sophisticated abilities still unique to humans, such as linguistic competence, critical thinking, reasoning, argumentation, and originality. Essay writing is required in standardized tests such as the Test of English as a Foreign Language (TOEFL) and the Graduate Record Examination (GRE) to evaluate these abilities. These tests receive a huge number of essays on a regular basis, making manual grading extremely costly and time-consuming.

With the ongoing developments in natural language processing (NLP), automated essay scoring (AES) methods have made strides and reached significant levels of accuracy. They already enjoy extensive use in large-scale tests such as the TOEFL and GRE, and several universities and schools in the US have started to employ so-called “robo-graders.” Most state-of-the-art supervised AES models, however, need a massive amount of training data to perform well and to avoid overfitting.

State-of-the-art AES Methods: Shortcomings and Challenges

Most state-of-the-art supervised AES methods are prompt-dependent, meaning that the model is trained on a large number of human-graded essays written for the same prompt (i.e., topic) as the essays it will score. For the model to learn to evaluate essays of different writing quality, a sufficient number of human-graded essays for that prompt must be available as training data. This limits the use of AES, especially when graded essays for a target prompt are difficult to obtain.

Additionally, prompt-dependent AES models evaluate essays by capturing their vocabulary and semantic content with prompt-dependent features, such as the tf-idf weights of subject-related terms and n-gram features. For example, essays discussing climate change draw on a different range of vocabulary, subject-related terminology, and semantic content than essays discussing cultural gaps between China and Germany. Training on human-graded essays from non-target prompts would therefore result in overfitting and poor generalization across prompts.
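As a toy illustration of this vocabulary mismatch (the two one-line “essays” below are invented examples, not data from our experiments), a tf-idf vocabulary fitted on one prompt barely overlaps with essays written for another prompt:

```python
# Why prompt-dependent features transfer poorly: a tf-idf vocabulary
# fitted on one prompt shares almost no features with another prompt.
from sklearn.feature_extraction.text import TfidfVectorizer

climate_essay = "rising emissions warm the climate and melt polar ice caps"
culture_essay = "cultural gaps between China and Germany shape workplace etiquette"

# Fit the vocabulary on the climate prompt only, then featurize a culture essay.
vec = TfidfVectorizer(stop_words="english").fit([climate_essay])
print(sorted(vec.vocabulary_))             # climate-specific terms only
print(vec.transform([culture_essay]).nnz)  # 0: no overlapping features
```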

How can we overcome these challenges and build AES models with limited training data?

TDNN: A Two-stage Learning Framework

In our ACL 2018 paper, TDNN: A Two-stage Deep Neural Network for Prompt-independent Automated Essay Scoring, we address the challenge of limited training data by proposing a novel two-stage AES model, coined TDNN. Our approach combines prompt-independent and prompt-dependent AES methods, removing the reliance on human-graded essays for the target prompt.

A teacher can take a quick glance at an essay and give a general judgment of whether it is good or poor by spotting grammar or punctuation mistakes or poor sentence structure, before delving deeper into the content to assess argumentation, reasoning, and other factors. We train an AES model to follow the same rationale in a two-stage learning framework.

First, we train a shallow model to ‘scan’ essays and recognize the difference between extremely good and extremely bad ones, capturing basic grammatical and syntactic features from human-graded essays of extreme quality. Second, we train a deep model to read closely for a fine-grained understanding of the content, providing an in-depth analysis of the text with respect to its prompt along with the ultimate score of the essay.

To achieve prompt-independent AES, the shallow model, trained on essays for non-target prompts, picks out the target-prompt essays of extreme quality; these then serve as training data for the fine-grained model in the second stage.

Figure 1. The architecture of the TDNN framework for prompt-independent AES.

First Stage (Prompt-independent): Learning to Identify Essays of Extreme Quality

The first stage is prompt-independent, meaning that we use human-graded essays of extreme quality from various non-target prompts to train a shallow model. Intuitively, we give the model a bird’s-eye view so that it learns to tell good essays from poor ones without even understanding the content, using basic generic features such as essay length, the average number of clauses per sentence, and the number of spelling, grammar, and punctuation errors. In particular, we use the human-graded essays with the highest and lowest scores for non-target prompts to train a RankSVM. The trained model then assigns pseudo labels to the essays for the target prompt: positive labels for the extremely good ones and negative labels for the extremely poor ones, according to its predictions.
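The sketch below approximates this stage with the standard pairwise transform plus a linear SVM from scikit-learn, a common stand-in for RankSVM; the feature set named in the comments and the selection fractions are illustrative assumptions, not the paper’s exact configuration.

```python
# Sketch of the first stage: pseudo-labeling target-prompt essays with a
# shallow ranker. RankSVM is approximated here by the pairwise transform
# plus a linear SVM (an illustrative stand-in, not the paper's exact setup).
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_transform(X, y):
    """Turn ranking into binary classification on feature differences."""
    X_diff, y_diff = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue  # ties carry no ranking signal
        X_diff.append(X[i] - X[j])
        y_diff.append(1 if y[i] > y[j] else -1)
    return np.asarray(X_diff), np.asarray(y_diff)

# X_src: shallow features (essay length, clauses per sentence, error counts, ...)
# y_src: human grades for NON-target prompts, restricted to extreme scores.
def fit_shallow_ranker(X_src, y_src):
    X_diff, y_diff = pairwise_transform(X_src, y_src)
    ranker = LinearSVC(C=1.0)
    ranker.fit(X_diff, y_diff)
    return ranker

def pseudo_label(ranker, X_tgt, top_frac=0.1, bottom_frac=0.1):
    """Score target-prompt essays; label the extremes, discard the middle."""
    scores = X_tgt @ ranker.coef_.ravel()   # higher score = better essay
    order = np.argsort(scores)
    n = len(scores)
    neg = order[: int(bottom_frac * n)]     # extremely poor -> negative label
    pos = order[n - int(top_frac * n):]     # extremely good -> positive label
    return pos, neg
```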

Second Stage (Prompt-dependent): Learning to Evaluate Essays Closely

With the pseudo-labeled essays as training data, a hybrid deep model (summarized in Figure 2) makes a more fine-grained judgment of the essays for the target prompt. The hybrid model delves deeper into the structure and content of the essay and learns to conduct a fine-grained, content-based, prompt-dependent assessment, considering semantic, part-of-speech (POS), and syntactic features. Specifically, the model captures the semantic meaning of an essay (Sem) by encoding it as a sequence of word embeddings. The part-of-speech information (POS) is modeled as a sequence of POS tags to capture the writing and grammatical style. The structural connections between different components of an essay (e.g., terms or phrases) are considered via a syntactic network (Synt) to learn the organization of the essay’s structure. Several stacked bi-LSTMs encode these three sequences separately; their outputs are concatenated and fed into several dense layers before the final assessment is generated as a scalar.

Figure 2. The model architecture of the hybrid deep learning model in the second stage.
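To make the architecture concrete, here is a minimal PyTorch sketch of the three-channel encoder; the vocabulary sizes, embedding and hidden dimensions, and the mean-pooling step are illustrative assumptions rather than the paper’s exact hyperparameters.

```python
# Minimal sketch of the second-stage hybrid model: three stacked bi-LSTM
# encoders (words, POS tags, syntactic-network tokens) whose pooled outputs
# are concatenated and passed through dense layers to emit a scalar score.
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hidden=64, layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)

    def forward(self, ids):                  # ids: (batch, seq_len)
        out, _ = self.lstm(self.emb(ids))    # (batch, seq_len, 2*hidden)
        return out.mean(dim=1)               # pool over the sequence

class TDNNStage2(nn.Module):
    def __init__(self, word_vocab, pos_vocab, synt_vocab):
        super().__init__()
        self.sem = BiLSTMEncoder(word_vocab)    # semantic channel (Sem)
        self.pos = BiLSTMEncoder(pos_vocab)     # part-of-speech channel (POS)
        self.synt = BiLSTMEncoder(synt_vocab)   # syntactic channel (Synt)
        self.head = nn.Sequential(              # dense layers -> scalar score
            nn.Linear(3 * 2 * 64, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, words, pos, synt):
        z = torch.cat([self.sem(words), self.pos(pos), self.synt(synt)], dim=-1)
        return self.head(z).squeeze(-1)         # one score per essay
```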

How Does Our Model Perform?

We compared TDNN with several state-of-the-art prompt-dependent models trained under a prompt-independent configuration. We also considered different variants of TDNN by including some or all of the Sem, POS, and Synt components in the second stage. Our model demonstrated superior results, outperforming the baselines by a clear margin in the correlations between the automatic grades and the ground-truth manual ratings. Overall, TDNN (Sem+Synt) outperforms the baselines by 10%-25% under different correlation metrics. Additionally, compared with TDNN, the established models suffer from fluctuating performance when tested on different prompts, suggesting that complicated models trained directly on non-target prompts may overfit, dampening their performance on essays for some target prompts.
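For reference, this style of correlation-based evaluation can be computed as follows; the scores are made-up placeholders, and Pearson, Spearman, and Kendall are common choices for this task rather than a claim about the exact set reported in the paper.

```python
# Correlation between automatic grades and human ratings, the evaluation
# used in this section. All numbers below are illustrative placeholders.
from scipy.stats import pearsonr, spearmanr, kendalltau

human = [8, 6, 9, 4, 7, 5]               # ground-truth grades (invented)
model = [7.5, 6.2, 8.8, 4.4, 6.9, 5.6]   # model predictions (invented)

print("Pearson:  %.3f" % pearsonr(human, model)[0])
print("Spearman: %.3f" % spearmanr(human, model)[0])
print("Kendall:  %.3f" % kendalltau(human, model)[0])
```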

What’s Next?

Following transductive transfer learning methods, we envision that the proposed TDNN could be further improved by migrating the non-target training data to the target prompt, directly using the labeled essays for the non-target prompts. Furthermore, incorporating a small number of graded essays, e.g., one or two essays for the target prompt, into the current TDNN would also be of interest, enabling the model to learn from the very few examples provided by humans.

About the Author: Kai Hui is a Data Scientist at the SAP Deep Learning Center of Excellence. His research focuses on information retrieval, text mining, and neural IR models for ad-hoc retrieval.
