Andreas Simons
2 min read · Apr 14, 2017

--

Thank you so much for your response, Andrew; I very much appreciate your criticism!

The main focus of my article was to research the composition of “core vocabulary”, and I really do believe that using the 5,000 most common words in English is an accurate way to determine it. Using a mere 100 words would only yield the following 25 nouns: “time, person, year, way, day, thing, man, world, life, hand, part, child, eye, woman, place, work, week, case, point, government, company, number, group, problem and fact”, which I believe is nowhere near sufficient to be considered the “core” of English vocabulary. I agree that trends should be used instead, but the trends do indicate a Romance dominance in the core vocabulary. The graph itself is just a pretty visualization; only the 3,000–5,000-word benchmarks should be used. The Anglo-Saxon area does not remain the core simply because it is the earliest layer; that is merely an artifact of the visualization. Had I used a cumulative flow graph instead, Old English would most definitely not look like the core. The use of percentages favors Old English.
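To make the benchmark idea concrete, here is a minimal sketch of the kind of calculation involved: given a frequency-ranked word list tagged with etymological origin, compute the share of each origin among the top-k words at several cutoffs. The tiny inline dataset and the origin labels are invented for illustration; the actual article worked from the 5,000 most common English words.

```python
# Sketch of a per-benchmark origin-share calculation. The ranked_words
# sample below is invented; a real run would load a frequency list of
# thousands of words with researched etymologies.
from collections import Counter

# (word, origin) pairs, ordered by frequency rank -- illustrative only
ranked_words = [
    ("the", "Old English"), ("time", "Old English"),
    ("person", "Romance"), ("year", "Old English"),
    ("government", "Romance"), ("company", "Romance"),
    ("number", "Romance"), ("world", "Old English"),
]

def origin_shares(words, top_k):
    """Percentage of each etymological origin among the top_k words."""
    counts = Counter(origin for _, origin in words[:top_k])
    total = sum(counts.values())
    return {origin: 100 * n / total for origin, n in counts.items()}

# With real data the cutoffs would be e.g. 1,000 / 3,000 / 5,000.
for k in (4, 8):
    print(k, origin_shares(ranked_words, k))
```

The point of running several cutoffs is exactly the trend argument above: the interesting signal is how the shares shift as the benchmark grows, not the share at any single cutoff.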

Maybe to elaborate some more on the concept of “core vocabulary”: vocabulary in the English language can be divided into two categories, fringe vocabulary and core vocabulary (as explained here: https://aaclanguagelab.com/resources/core-vocabulary). Fringe vocabulary is the long tail of the least frequently used words in English (the remaining 245,000 out of 250,000); despite comprising the vast majority of word types, it accounts for only 5–10% of running text, and its words are generally not transferable between topics: they carry the “details” of your message. Core vocabulary is then interpreted as the vocabulary necessary for understanding the “core” of the language. Knowing just 100–1,000 words would mean missing around 50% of the message in any given English source. A vocabulary of 3,000–5,000 words lets the reader understand roughly 80% of any given source, omitting the 20% “non-core”, a share that 245,000 words compete to fill.
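These coverage figures can be roughly reproduced with a back-of-the-envelope model. The sketch below is my own illustration, not from the article: it assumes an idealized Zipf distribution (frequency proportional to 1/rank) over a 250,000-word vocabulary and estimates what share of running text the top-k word types cover. Real corpora deviate from pure Zipf, so treat the outputs as ballpark numbers only.

```python
# Estimate text coverage of the top-k word types under an idealized
# Zipf distribution: frequency of rank r is proportional to 1/r.

def harmonic(n: int) -> float:
    """Sum of 1/r for r = 1..n."""
    return sum(1.0 / r for r in range(1, n + 1))

VOCAB_SIZE = 250_000          # total vocabulary size used in the discussion
total = harmonic(VOCAB_SIZE)  # normalization constant for the distribution

def coverage(top_k: int) -> float:
    """Fraction of word tokens covered by the top_k most frequent types."""
    return harmonic(top_k) / total

for k in (100, 1_000, 3_000, 5_000):
    print(f"top {k:>5} words: ~{coverage(k):.0%} of running text")
```

Under this toy model, 100 words cover roughly 40% of tokens and 5,000 words roughly 70%; that the true figure for 3,000–5,000 words sits nearer 80% reflects how real frequency distributions fall off faster in the tail than pure Zipf.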

I am aware of Swadesh lists as a means of determining relations between languages, and that Swadesh attempted to gather some form of “basic concepts”, but they are inaccurate and of little scientific use (trust me, I tried using them for machine learning). The one for the English language was not established through proper statistical calculations (bad data, a small sample size and too much of his own human input) and was never intended to reflect a set large enough to be considered the “core” of a language, but rather universal concepts that would be present in any means of communication. I don’t believe his standard should be upheld as core vocabulary.

I wholeheartedly agree with your comment on how vocabulary ≠ language, which is why I do not state my opinion on the classification of the English language. I was just genuinely surprised to see so much commotion around the whole Germanic/Romance debate (I think particularly of this video) while nobody had any facts to back their opinions on the core vocabulary. Hence I saw an opportunity to do some calculations.

EDIT: I’ve removed the “vocabulary/english” claim; that was a bad conclusion. Thank you for pointing it out.
