Automated Readability Prediction

Kate Loginova · Published in Voice Tech Podcast · 19 min read · Sep 30, 2019

Hi! I am Kate, a PhD student at the University of Gent, Belgium. I am working with educational data mining, focusing on language learning. After drowning in preparation for yet another literature review, I decided to share some overviews, because I believe this fascinating area of research deserves a bit more attention :) I plan to soon publish more posts on other aspects of computer-assisted language learning, such as second language acquisition modelling and automatic question generation. As a part of my project, I am developing methods to assess how readable texts are. So, this is the first post in the series, enjoy!

Photo: https://www.pexels.com/photo/books-sculpture-write-reading-34627/

Task

What is it? Readability prediction models score texts based on how easily a reader can extract the information from them [1]. This is a rather subjective definition — but so are many other widely used terms, such as sentiment in natural language processing. If we have access to pairwise comparisons of texts’ difficulty, we can define readability in a more elegant and strict way: it is the probability the text would be assessed as easier than any other text, by any assessor [2]. Even more pragmatically, it is the proportion of times it was assessed as the easier text in a pair.
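To make the last, pragmatic definition concrete, here is a tiny sketch (the pairwise judgments are made up) that scores each text by the proportion of comparisons in which it was judged easier:

```python
from collections import defaultdict

# Each judgment is (easier_text, harder_text), as decided by one assessor.
judgments = [("A", "B"), ("A", "C"), ("B", "A"), ("A", "B"), ("C", "B")]

wins, appearances = defaultdict(int), defaultdict(int)
for easier, harder in judgments:
    wins[easier] += 1
    appearances[easier] += 1
    appearances[harder] += 1

# Readability of a text = share of pairwise comparisons it was judged easier in.
readability = {text: wins[text] / appearances[text] for text in appearances}
print(readability)  # {'A': 0.75, 'B': 0.25, 'C': 0.5}
```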

Where is it used? Assessing readability is an important task for media and educational applications because it allows us to tailor content to the readers. For instance, teachers can find interesting texts at an appropriate reading level for language learning classes. Automating readability assessment cuts the time and effort spent on finding suitable content. It can also be a supportive tool during writing, highlighting unreadable passages and suggesting how to reformulate them (like Grammarly).

How is it connected to other concepts? Readability is closely related to coherence. In [9], it turned out that human annotators tend to give the same rating to these two dimensions of a text. And if you are wondering about the link with machine comprehension, recent research shows that readability scores do not correlate with machine comprehension performance [3]. So, machines and humans still see text difficulty in different ways.

On which granularity level do we do it? While most of this post is about text readability on the document level, we can also predict it sentence- and word-wise.

Sentences are used in Computer-Assisted Language Learning (CALL) to generate exercises and vocabulary examples. Sentence readability assessment uses more local features and is actually more difficult than document-level prediction [4]. I will briefly describe the approaches for sentence-level prediction further on, but for a more detailed overview of feature importance ranking and model results allow me to refer you to [5].

On the word level, there is a separate subtask of Complex Word Identification (CWI). It stems from the text simplification task: deciding which words should be simplified in a given text is usually the first step in the pipeline. The first shared CWI task was held at SemEval 2016. Among the research questions of the challenge were: “to learn which words challenge non-native English speakers and to understand what their traits are” and “to investigate how well one’s individual vocabulary limitations can be predicted from the overall vocabulary limitations of others in the same category” [6]. As you can see, there is once again a clear link with educational applications.

Scores

So how do we actually measure readability? The most popular options are:

  • School grade (usually grades 5–12 of American schools) and similar grade levels for other languages
  • “English as a second language” levels — the most widely used are the CEFR levels (A1, A2, … C1, C2). You can find their descriptions on Wikipedia. They are also rather vague, but textbooks can provide you with a nicely annotated dataset.
  • The age group of the intended reader — in the simplest binary case, is the text written for an adult or a child? More detailed age bands can be used as well.

[7] provides a more in-depth discussion about whether reading difficulty corresponds to interval, ordinal or nominal scale data. Interval scale assumes that data is both ordered and evenly spaced, which allows for fewer parameters, but it might be too strong of an assumption. The model operating on ordinal data gets the best score, so the authors conclude that “reading difficulty appears to increase steadily but not linearly with grade level”.

Accordingly, there are three different ways to make a prediction with a model [8]. The first one is regression + rounding: the prediction is rounded to the closest integer and clamped to the appropriate range. We can also learn cut-off boundaries which bin the ranking scores into levels. And probably the most straightforward is multi-class classification on the ranking scores. We can also use pairwise ranking: consider pairs of texts and predict which one is more readable. This, too, can be formulated as a classification problem: given two texts, is text 1 more readable than text 2? We will see later how this approach is used for labelling data. A minimal sketch of the first strategy is shown below.
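To illustrate the first strategy, here is a minimal sketch (the two features and the grade range 2–12 are placeholders) that fits a regression model, then rounds and clamps its predictions to valid grade levels:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy features (avg. sentence length, avg. syllables per word) and gold grade levels;
# in practice these come from a labelled corpus.
X_train = np.array([[8.0, 1.1], [12.0, 1.3], [18.0, 1.5], [24.0, 1.8]])
y_train = np.array([2, 5, 8, 12])

model = LinearRegression().fit(X_train, y_train)

def predict_grade(features, lo=2, hi=12):
    """Regression + rounding: round to the nearest integer, clamp to the grade range."""
    raw = model.predict(np.array([features]))[0]
    return int(np.clip(round(raw), lo, hi))

print(predict_grade([20.0, 1.6]))
```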

Features

There is a wide range of features which we can use for readability prediction. [9] and [10] use similar feature categories, which I also follow here. The features can be lexico-semantic, morphological, cognitive, syntactic, semantic, and discourse.

Lexico-semantic

(relating to words and statistical language models)

  • language modelling features — such as the average log-likelihood ratio (discovers keywords which differentiate between corpora) and language model perplexity (on POS and/or word-token, usually 1–5 gram). We can also sort words by information gain using language models.
  • the mean TF-IDF value of all tokens in the texts.
  • relative frequency in a large representative corpus. Usually, we would just use the frequency score, but recent research demonstrates that richer representations lead to better results [11]. Even just adding the standard deviation already improves the results, but we can also encode separate means for words based on their frequency band or cluster. We can also use character bigram and trigram frequencies or calculate the proportion of words occurring in a list of the most frequently used words.
  • the average number of words per sentence, the average number of syllables and characters per word, the proportion of words with three or more syllables.
  • out-of-vocabulary (OOV) score.
  • type-token ratio (TTR) — measures lexical richness (+ modifications of the formula, such as Root TTR or Corrected TTR).
  • word maturity level — it considers how and when a word’s frequency changes with learning stage. This feature is thus well-suited for personalised text difficulty assessment (remember the part about tailoring the content to users?). This feature also accounts for the word’s usage in context by representing it as a feature vector of LSA topics [9].
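To make a few of the shallow lexical features above concrete, here is a minimal sketch (with deliberately naive tokenisation) computing average sentence length, average word length and the type-token ratio:

```python
import re

def lexical_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
        "avg_chars_per_word": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

print(lexical_features("The cat sat on the mat. It was a very fluffy cat!"))
```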

Morphological

(dealing with morphemes — smallest meaningful units of language, for example, suffixes like “-tion” in “ablation”)

  • suffixes, prefixes — proven to be effective for German and some agglutinative languages.

Cognitive

(based on psycholinguistic and pedagogical research)

  • age of acquisition [12], concreteness, degree of polysemy (a word having several meanings).
  • Coh-Metrix is a popular software to produce discourse and lexical indices for the text. It can be used, for example, for calculating the level of concreteness or lexical ambiguity of words [13].

Syntactic

(dealing with syntactic parse trees of the sentences)

  • simple average count features: average parse tree height, the average number of noun phrases per sentence, the average number of verb phrases per sentence, and the average number of subordinate clauses per sentence [14]. [15] found that the strongest correlation is between readability and the number of verb phrases. (A rough sketch of such counts follows this list.)
  • derivative syntactic indicators like clause centre embeddings depth, or number of words per nominal phrase [16].
  • automatically extracted frequencies of subtree patterns.
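A rough way to approximate such counts in code is to use a dependency parse as a proxy for the constituency tree; a minimal sketch with spaCy (assuming the en_core_web_sm model is installed, and with noun chunks standing in for noun phrases) might look like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def tree_depth(token):
    """Depth of the dependency subtree rooted at this token."""
    children = list(token.children)
    return 1 if not children else 1 + max(tree_depth(c) for c in children)

def syntactic_features(text):
    doc = nlp(text)
    sents = list(doc.sents)
    depths = [tree_depth(s.root) for s in sents]
    # Noun chunks per sentence as a crude stand-in for noun phrases.
    noun_phrases = len(list(doc.noun_chunks)) / max(len(sents), 1)
    return {"avg_parse_depth": sum(depths) / max(len(depths), 1),
            "avg_noun_phrases_per_sentence": noun_phrases}

print(syntactic_features("The report, which nobody had read, was finally approved."))
```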

Semantic

  • semantic role features: the average number of arguments, modifiers, locatives, …
  • quality of semantic networks (which consist of conceptual nodes linked by semantic relations, MultiNet-like) [16]. We can also use them to estimate how plausible a sentence is from a semantic point of view and count the number of connections they contain. This way, we can operationalize concepts of polysemy and abstractness [10].
  • word embeddings
  • higher-level semantics: use of unusual senses, idioms, or subtle connotation. Here, domain or world knowledge is required to comprehend a text.

Discourse

(for example, use of arguments and connections in the text)

  • entity features: entity density and the distribution of entity transitions, grammatical functions and salience, as captured by the entity grid model [17].
  • number of coreference chains per document; average chain span per document; the number of large chain spans within a document [18].
  • log-likelihood of the language model over discourse relations: text represented as a bag of relations.
  • cohesion & coherence — describe inner logic and structure of the text [9], such as: topic continuity from sentence to sentence: the actual word overlap, average cosine similarity; count the number of connectives included in a text-based on lists or to calculate the causal cohesion by focussing on connectives and causal verbs; the number of pronouns per sentence and the number of definite articles per sentence. An interesting recent development is lexical coherence graph [19]: coherence as semantic connectedness between words modelled by word embeddings. This method allows extracting large subgraphs capturing coherence patterns, increasing interpretability of results.

Pragmatic

(contextual or subjective language influenced by genre, e.g. sarcasm)

So which features are the best? This discussion remains open. The comparison in [18] lists the following features with significant predictive power: POS features (in particular, nouns), verb phrases, average sentence length, and language models trained directly on the corpus. However, the conclusions are still inconsistent: in another analysis [7], the authors conclude that “grammatical features alone can be effective predictors of readability”.

There is a complex interplay between features. As an example, [15] states that “the entity grid factors which individually have a very weak correlation with readability combine well, while adding the three additional discourse features to the likelihood of discourse relations actually worsens performance slightly.” Overall, traditional shallow features are strong single predictors of readability. At the same time, ablation studies show that more innovative features also have a significant impact on performance [20]. [15] observe that removing syntactic features causes the largest negative impact on performance, while removing the cohesion features actually boosts performance. There is some evidence that grammatical and cognitive features may play a more important role in second-language readability prediction.

Best results are usually obtained by using a combination of features. Filter and wrapper feature selection methods (such as forward selection and backward elimination) can be used to find the optimal feature set. They can be combined with genetic algorithms to perform hyperparameter optimisation simultaneously in an efficient way [21].
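As a hedged illustration of wrapper-style selection (forward selection here, backward elimination by flipping a flag), a minimal sketch with scikit-learn’s SequentialFeatureSelector (available since version 0.24; the features and labels are random placeholders) could look like this:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Placeholder data: 200 texts, 10 candidate features, 4 grade-level classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",   # use "backward" for backward elimination
    cv=5,
)
selector.fit(X, y)
print("Selected feature indices:", np.where(selector.get_support())[0])
```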

Concerning generalizability, we have two main sources of variety: genre and language. Readability can be measured by different text characteristics depending on the specific language [10], and finding out which features are the most predictive for each language is an active area of research. For example, Arabic, Dari and Pashto are explored in [22], Italian in [23], and Russian in [24]. However, [10] also mention that some major features, such as lemma frequency, are shared across most languages, and [25] claim that almost all the features generalize well across corpora.

Methods

Classical readability formulas

Traditional readability measures use easy-to-compute proxy variables to estimate the lexical and syntactic complexity of a sentence [9]. They are usually represented as a simple regression formula with up to five predictors. The predictors are shallow or surface features, such as the number of words. As an example, the most popular formula is the Flesch reading-ease score (from the Flesch-Kincaid family):

The Flesch reading-ease score: Reading Ease = 206.835 − 1.015 × (total words / total sentences) − 84.6 × (total syllables / total words)

This formula returns a score from 0 to 100, which is binned according to school levels. For example, a score of 30 corresponds to the college graduate level, and 80 to the 6th grade. Other popular formulas are SMOG, the Coleman-Liau index, Dale-Chall and Gunning fog. They are implemented in many libraries and included in text processors.
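For illustration, a minimal implementation of the reading-ease formula (with a crude vowel-group syllable counter, so the numbers are approximate) could look like this:

```python
import re

def count_syllables(word):
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / max(len(sentences), 1)
            - 84.6 * syllables / max(len(words), 1))

print(flesch_reading_ease("The cat sat on the mat. The dog barked at it."))  # higher = easier
```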

These formulas are motivated by the following observations: longer sentences have proven to be more difficult to process than short ones [13]; a sentence is difficult to read if its syntactic structure is complex; and vocabulary words are not distributed evenly across grade levels [26].


However, these popular and simple formulas have a number of drawbacks.

1. Advanced features, capturing semantic and syntactic relations, are not considered. Discourse flow, topical dependencies, even the ordering of words and sentences are typically ignored [9].

2. Some of them lack an absolute output value.

3. The underlying assumption of regression between readability and the text characteristics has also been criticised.

4. They are unreliable for short texts (less than 300 words), and allow no noise, assuming well-formed sentences [9]. As such, there is a body of research investigating measures better suited for Web-texts [25].

Statistical approaches

We can use more advanced models to exploit patterns of word use in language and incorporate more information about the document content. This way, we alleviate the limitations of the traditional approach. As a nice bonus, many of these models provide a probability distribution of prediction outcomes across all grade levels [9].

One of the earliest statistical approaches is a model combining a unigram language model and a sentence length model [27]. Such a model assumes that words are generated independently of one another. It was shown to outperform the Flesch-Kincaid readability formula.
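A heavily simplified sketch of the idea, not the exact model from [27], is shown below: one Laplace-smoothed unigram model per grade level, with a new text assigned to the level whose model gives it the highest log-likelihood (the training data is a toy placeholder and the sentence-length component is omitted):

```python
import math
from collections import Counter

# Placeholder training data: tokenised texts grouped by grade level.
corpus = {
    "grade_3": ["the cat sat on the mat".split(), "the dog ran home".split()],
    "grade_8": ["the committee postponed the decision".split()],
}

def train_unigram(texts, alpha=1.0):
    counts = Counter(w for t in texts for w in t)
    total = sum(counts.values())
    vocab = len(counts) + 1
    # Laplace-smoothed log-probability for any word.
    return lambda w: math.log((counts[w] + alpha) / (total + alpha * vocab))

models = {level: train_unigram(texts) for level, texts in corpus.items()}

def predict_level(text):
    words = text.lower().split()
    scores = {level: sum(model(w) for w in words) for level, model in models.items()}
    return max(scores, key=scores.get)

print(predict_level("the cat ran on the mat"))  # likely "grade_3"
```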

This pioneering work was followed by the application of support vector machines (SVMs) [14]. Apart from the new method, the authors also added syntactic features to the language model perplexity scores and shallow features. They used a traditional formula as a feature, an approach replicated in several later studies. Expanding on this work, [15] added discourse relations to the feature set. They observed that readability predictors behave differently depending on the task (readability ranking versus classification), but the added discourse relations exhibit robustness across these two tasks.

Sentence-level. As I mentioned in the beginning, we can also classify sentences, not only texts. Graph methods based on a word-coupled TF-IDF matrix were applied there [28]. An interesting and important note is that using only word-frequency features was almost as predictive as the combination of all features at the document level, but the latter made more accurate predictions for sentences [4]. A sentence readability model can also be used for evaluating text simplification via pairwise ranking [29], [30].

Word-level. The best G-score (harmonic mean of accuracy and recall) in the first shared task on complex word identification was 0.774, and the best F-score was 0.35 [6]. The organisers conclude that “the most effective way to determine a word’s complexity is by searching for its frequency in corpora.” Concerning the methods, decision trees and minimalistic threshold-based strategies perform the best. The task was extended to a multilingual and multi-genre setting [31] in 2018, with best scores of approximately 0.05–0.07 MAE, depending on the genre.

Classification:

  • The usual: Decision Trees, Support Vector Machines, and Logistic Regression. Some authors also experiment with Multi-Layer Perceptron [32].
  • Pair-wise ranking with neural models [33], inspired by advances in coherence prediction. The architecture is as follows: 1. run an LSTM over the words in sentences to incorporate context information; 2. select the most similar LSTM states in two adjacent sentences to encode the salient semantic concept that relates the sentences and compute their average; 3. apply a convolutional layer to automatically extract and represent patterns of semantic change in the text.

Regression: linear regression / logistic regression, including proportional odds model [7].

Clustering:

  • k-nearest neighbours, using deep syntactic and semantic indicators represented by dependency trees and semantic networks [16].
  • K-means on word embeddings using Euclidean distance [34], combined with SVM regression. The authors mention that the method “can correctly identify sentence pairs of similar meaning but written in different vocabulary and grammatical structure”, which is an exciting and useful addition.
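A rough interpretation of that combination, not the exact pipeline of [34], is sketched below: K-means clusters word embeddings, each text is represented by its histogram of cluster assignments, and an SVM regressor maps the histogram to a readability score (the embeddings here are random stand-ins for real word vectors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR

rng = np.random.default_rng(0)
vocab = ["the", "cat", "committee", "postponed", "ran", "decision"]
embeddings = {w: rng.normal(size=50) for w in vocab}  # stand-in for real word vectors

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(np.array(list(embeddings.values())))

def text_to_histogram(text):
    vectors = [embeddings[w] for w in text.split() if w in embeddings]
    labels = kmeans.predict(np.array(vectors))
    return np.bincount(labels, minlength=kmeans.n_clusters) / max(len(labels), 1)

X = np.array([text_to_histogram(t) for t in ["the cat ran", "the committee postponed the decision"]])
y = np.array([2.0, 9.0])  # toy grade levels
reg = SVR().fit(X, y)
print(reg.predict(X))
```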

Evaluation

To evaluate the performance of our readability prediction models, we can use:

  • correlation (typically Spearman’s rank correlation or Pearson’s r) between the predicted difficulty levels and the ‘gold standard’ difficulty levels on the same reference texts [9]
  • prediction accuracy (rounded to the nearest integer level if needed), RMSE (which can be understood as the average difference between the predicted grade level and the expected grade level [32]), or F-score [9] (a small sketch of these metrics follows this list)
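All of these are standard metrics; a minimal sketch with scipy and scikit-learn (the gold and predicted levels are made up) looks like this:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, mean_squared_error

gold = np.array([2, 3, 5, 7, 9, 11])
predicted = np.array([2.4, 3.1, 4.2, 7.8, 8.9, 10.5])

rho, _ = spearmanr(gold, predicted)
rounded = np.rint(predicted).astype(int)

print("Spearman's rho:", rho)
print("Accuracy after rounding:", accuracy_score(gold, rounded))
print("RMSE:", np.sqrt(mean_squared_error(gold, predicted)))
```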

[35] found that it is easier to accurately distinguish content at lower levels, and similar observations were made in [24]. On the other hand, classes in the middle of the scale were harder to distinguish in [4]. As such, sometimes the adjacency correction is used: for example, we allow the B1 text to be classified as A2 or B2 [7].

Data

There is little data with readability labels for generic texts, as most of the existing datasets are based on educational content. [9] provides an overview of the existing publicly available corpora:

  • The WeeBit corpus created by [32] is one of the largest datasets for readability analysis. It is composed of articles targeted at readers of different age groups from the Weekly Reader magazine and the BBC-Bitesize website. The Weekly Reader data covers non-fictional content for four grade levels, corresponding to children aged 7–8, 8–9, 9–10 and 10–12 years. The BBC-Bitesize data covers two grade levels, ages 11–14 and 14–16.
  • One public resource recently cited in readability evaluations is the collection of texts known as Common Core, comprising 168 documents that span levels roughly corresponding to U.S. grade levels 2–12. The passages are tagged by both level and genre (speech, literature, informative, etc.).
  • Another resource is the 114 articles from Encyclopedia Britannica written in two styles, for adults versus children [36].
  • The Cambridge English Exams are designed specifically for L2 learners, and the A2–C2 levels assigned to each reading paper can be treated as the reading difficulty of the documents for L2 learners.
  • The OneStopEnglish corpus: 189 texts, each in three versions (567 in total), freely available [37].
  • [2] collected 105 texts from the British National Corpus and Wikipedia in four different genres: administrative, informative, instructive, and miscellaneous. 10907 pairs of texts are labelled with five fine-grained categories by human annotators.

Labelling. Absolute scores by human annotators are unavoidably subjective, but we level out differences in readers’ knowledge and attitude by collecting multiple scores per text. The good news is that, according to [2], crowdsourcing is a viable alternative to expert labelling.

As manual labelling is still expensive even when crowdsourced, automatic corpus construction is being researched. [38] proposed a framework for the automatic generation of large readability corpora. It incorporates a readability filter in combination with a supervised approach to collect texts at a specific level. The full pipeline: 1. identification of an appropriate set of seed URLs, 2. post-crawl cleaning, 3. readability assessment, 4. near-duplicate detection and removal, and 5. annotation. The authors observe some clear, useful patterns which distinguish different levels, such as “personal pronouns are more frequent in the lower levels.” [39] continued this line of work, extending the feature set and concluding that readability models generalize adequately to a new corpus.

Future research

User-centric models: as an example, there is active research in the field of learner-specific word difficulty [40]. By using user-centric models, we can adapt users to content (by providing personalized training) or content to users. Personalised training can include identifying important terms in the text that the user is not likely to know and either explaining or simplifying them. Such systems could “seek optimal strategies and methods for augmenting content or user knowledge in order to actively reduce the ‘knowledge gap’ between the author and a reader” [9].

Public datasets. A prominent issue is the lack of sizeable, freely available, high-quality corpora for computational readability evaluation. Part of this challenge is readability on the Web: the basic readability properties of Web texts and the influence of readability on user interactions with content are under-researched.

Knowledge-based models: In general, there are few approaches incorporating higher-level semantic and pragmatic features. World knowledge represented by knowledge bases and graphs like DBPedia is a valuable source of information. The dependencies between concepts are especially important in an educational setting because algorithms should understand what a user needs to know before moving on to the next concept.

Specifics for L2

My own projects are currently concerned with the first proposed avenue of research: intelligent tutoring applications that retrieve content on interesting topics in the zone of proximal development. Hence, I am adding some specifics of readability prediction for second language learners.

Self-directed language learning is explored in [10], and the authors make a number of interesting remarks. First of all, there is a mismatch between the levels on native and L2 data: “school grade levels indicating the readability of L1 texts cannot be directly mapped to foreign language learning, but rather need to be learned individually from L2 data”. [8] notice the same and recommend adapting the pairwise ranking algorithms to ensure that the preference pairs are only created from the same domain. Apart from that, “the output of readability measures has to be more fine-grained than standard school grades, … readability measures should account for the native language of the learner and should be adapted to groups of users sharing a common mother tongue”.

If you are interested in collaborating on this topic, drop me a message!

Thank you for reading and please don’t hesitate to point out if I missed something :)

References

[1] R. J. Kate et al., “Learning to predict readability using diverse linguistic features,” Coling 2010–23rd Int. Conf. Comput. Linguist. Proc. Conf., vol. 2, no. August, pp. 546–554, 2010.

[2] O. De Clercq, V. Hoste, B. Desmet, P. Van Oosten, M. De Cock, and L. Macken, “Using the crowd for readability prediction,” Nat. Lang. Eng., vol. 20, no. 3, pp. 293–325, 2014.

[3] M. Benzahra and F. Yvon, “Measuring text readability with machine comprehension: a pilot study,” Proc. of the Fourteenth Work. Innov. Use of NLP Build. Educ. Appl., pp. 412–422, 2019.

[4] I. Pilán, S. Vajjala, and E. Volodina, “A Readable Read: Automatic Assessment of Language Learning Materials based on Linguistic Complexity,” 2016.

[5] F. Dell’Orletta, M. Wieling, G. Venturi, A. Cimino, and S. Montemagni, “Assessing the Readability of Sentences: Which Corpora and Features?,” pp. 163–173, 2015.

[6] G. H. Paetzold and L. Specia, “SemEval 2016 task 11: Complex word identification,” SemEval 2016–10th Int. Work. Semant. Eval. Proc., pp. 560–569, 2016.

[7] M. Heilman, K. Collins-Thompson, and M. Eskenazi, “An analysis of statistical models and features for reading difficulty prediction,” no. June, pp. 71–79, 2010.

[8] M. Xia, E. Kochmar, and T. Briscoe, “Text Readability Assessment for Second Language Learners,” pp. 12–22, 2016.

[9] K. Collins-Thompson, “Computational assessment of text readability: A survey of current and future research,” ITL — Int. J. Appl. Linguist., vol. 165, no. 2, pp. 97–135, 2014.

[10] L. Beinborn, T. Zesch, and I. Gurevych, “Towards fine-grained readability measures for self-directed language learning,” Proc. 1st Work. NLP Comput. Lang. Learn., vol. 80, no. October, pp. 11–19, 2012.

[11] X. Chen and D. Meurers, “Characterizing Text Difficulty with Word Frequencies,” pp. 84–94, 2016.

[12] V. Kuperman, H. Stadthagen-Gonzalez, and M. Brysbaert, “Age-of-acquisition ratings for 30,000 English words,” Behav. Res. Methods, vol. 44, no. 4, pp. 978–990, 2012.

[13] A. C. Graesser, D. S. McNamara, M. M. Louwerse, and Z. Cai, “Coh-Metrix: Analysis of text on cohesion and language,” Behav. Res. Methods, Instruments, Comput., vol. 36, no. 2, pp. 193–202, 2004.

[14] S. E. Schwarm and M. Ostendorf, “Reading level assessment using support vector machines and statistical language models,” ACL-05–43rd Annu. Meet. Assoc. Comput. Linguist. Proc. Conf., no. June, pp. 523–530, 2005.

[15] E. Pitler and A. Nenkova, “Revisiting readability: A unified framework for predicting text quality,” EMNLP 2008–2008 Conf. Empir. Methods Nat. Lang. Process. Proc. Conf. A Meet. SIGDAT, a Spec. Interes. Gr. ACL, no. October, pp. 186–195, 2008.

[16] T. V. Der Brück, S. Hartrumpf, and H. Helbig, “A readability checker with supervised learning using deep indicators,” Inform., vol. 32, no. 4, pp. 429–435, 2008.

[17] R. Barzilay and M. Lapata, “Modeling Local Coherence: An Entity-Based Approach,” Math. Comput. Model., 2008.

[18] L. Feng, M. Jansche, M. Huenerfauth, and N. N. Elhadad, “A comparison of features for automatic readability assessment,” Proc. 23rd Int. Conf. Comput. Linguist. Posters, vol. 2, no. August, pp. 276–284, 2010.

[19] M. Mesgar and M. Strube, “Lexical coherence graph modeling using word embeddings,” 2016 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. NAACL HLT 2016 — Proc. Conf., pp. 1414–1423, 2016.

[20] T. François and E. Miltsakaki, “Do NLP and Machine Learning Improve Traditional Readability Formulas?,” Proc. First Work. Predict. Improv. Text Readability Target Read. Popul., no. Pitr, pp. 49–57, 2012.

[21] O. De Clercq and V. Hoste, “All Mixed Up? Finding the Optimal Feature Set for General Readability Prediction and Its Application to English and Dutch,” Comput. Linguist., 2016.

[22] E. Salesky and W. Shen, “Exploiting Morphological, Grammatical, and Semantic Correlates for Improved Text Difficulty Assessment,” pp. 155–162, 2015.

[23] L. Forti, A. Milani, L. Piersanti, F. Santarelli, V. Santucci, and S. Spina, “Measuring Text Complexity for Italian as a Second Language Learning Purposes,” pp. 360–368, 2019.

[24] R. Reynolds, “Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories,” pp. 289–300, 2016.

[25] S. Vajjala and D. Meurers, “On The Applicability of Readability Models to Web Texts,” Proc. 2nd Work. Predict. Improv. Text Readability Target Read. Popul., pp. 59–68, 2013.

[26] K. Collins-Thompson and J. P. Callan, “A Language Modeling Approach to Predicting Reading Difficulty,” Proceedings Annu. Conf. North Am. Chapter Assoc. Comput. Linguist., pp. 193–200, 2004.

[27] L. Si and J. Callan, “A statistical model for scientific readability,” Int. Conf. Inf. Knowl. Manag. Proc., pp. 574–576, 2001.

[28] Z. Jiang, G. Sun, Q. Gu, T. Bai, and D. Chen, “A Graph-based Readability Assessment Method using Word Coupling,” no. September, pp. 411–420, 2015.

[29] S. Vajjala and D. Meurers, “Readability-based Sentence Ranking for Evaluating Text Simplification,” 2016.

[30] S. Štajner, S. P. Ponzetto, and H. Stuckenschmidt, “Automatic assessment of absolute sentence complexity,” IJCAI Int. Jt. Conf. Artif. Intell., no. October, pp. 4096–4102, 2017.

[31] S. M. Yimam et al., “A Report on the Complex Word Identification Shared Task 2018,” pp. 66–78, 2018.

[32] S. Vajjala and D. Meurers, “On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition,” in The 7th Workshop on the Innovative Use of NLP for Building Educational Application, 2012, pp. 163–173.

[33] M. Mesgar and M. Strube, “A Neural Local Coherence Model for Text Quality Assessment,” pp. 4328–4339, 2019.

[34] M. Cha, Y. Gwon, and H. T. Kung, “Language Modeling by Clustering with Word Embeddings for Text Readability Assessment,” Proc. 2017 ACM Conf. Inf. Knowl. Manag. — CIKM ’17, pp. 2003–2006, 2017.

[35] J. Nelson, C. Perfetti, D. Liben, and M. Liben, “Measures of Text Difficulty: Testing their Predictive Value for Grade Levels and Student Performance,” p. 58, 2012.

[36] R. Barzilay and N. Elhadad, “Sentence alignment for monolingual comparable corpora,” pp. 25–32, 2003.

[37] S. Vajjala and I. Lucic, “OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification,” pp. 297–304, 2018.

[38] J. Silva, R. Ribeiro, A. Adami, P. Quaresma, and A. Branco, “Crawling by Readability Level,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 9727, pp. 306–318, 2016.

[39] J. A. Wagner Filho, R. Wilkens, and A. Villavicencio, “Automatic Construction of Large Readability Corpora,” pp. 164–173, 2016.

[40] Y. Ehara, I. Sato, H. Oiwa, and H. Nakagawa, “Mining Words in the Minds of Second Language Learners for Learner-specific Word Difficulty,” J. Inf. Process., vol. 26, no. 0, pp. 267–275, 2018.
