DISCLAIMER: I personally do not have an opinion on the classification of English and that is not what this article is about, I am not a linguist. This article only researches the statistics behind English vocabulary as there is currently no such data available.
The English language and its origins have been a topic for fierce debate among many linguists. English is classified as a (West) Germanic language, meaning that it is closely related to other Germanic languages such as Swedish, Dutch and German. The other dominant language family in Western Europe is the group of Romance languages: French, Italian, Spanish… all languages that have sprouted from Latin somewhere throughout history.
Unlike other Germanic languages, English shares a large portion of their vocabulary with French and Latin, often attributed to the period of Norman French dominance in England after 1066. The size of this Romance influence on English, along with some other technical aspects such as pronunciation and syntax, has led some radical linguists to believe that English should in fact not be seen as a Germanic language, but rather as a Romance-Germanic hybrid. However, the general consensus is that the overall English vocabulary is only a third of Old English origin (so, Germanic) but that the core vocabulary is entirely Old English. The keyword here is core, as most linguists claim that French and Latin influence only enters the language through a handful of basic words but a vast majority of academic terms. For many, this seems to be the most important criterion for its classification as a Germanic language.
I personally don’t care much about these classifications, but I was very surprised to discover that in fact no-one recently has actually bothered to research the origins of English vocabulary, let alone the core. The latest research was done in 1975 by Joseph M. Williams, where he examined the 10,000 most frequently used words in English, based on a rather small sample size of corporate letters. Here are my issues with his research:
- the research carries a bias towards French and Latin, as companies are more likely to use academic language
- proper names were not removed, possibly diluting the results for an etymological composition
- he used the 10 000 most common words in that corpus of letters, not really “core vocabulary”
And core vocabulary is precisely what this whole debate is all about, so I decided to do my own little research using Python to see how I could provide some statistics behind these claims.
The quest for etymology
Gathering the data
The Oxford Dictionary claims that there are roughly 250 000 distinct words in English vocabulary. But what share represents the core vocabulary? What does that even mean? The Oxford Dictionary uses the following table with some insight on the relation of the most common words in English to the appearance of words in English sources:
This table shows us a rather large problem: the actual occurrence of words in applied English does not reflect the (core) vocabulary or even the language as a whole. 50% of any given text in English will use the exact same linkers/pronouns, even though those 100 words only reflect 0.04% of distinct English vocabulary. A word such as “the” alone makes up 6% of any given source in English. This disproportionate use of extremely basic structural words deceives the reader into thinking that English vocabulary is of an entirely different etymological composition. This is why to determine the composition of a language, or in my case just the core vocabulary, we have to use frequency lists, and not simply count words in sources. A frequency list, a list with the most commonly used words in a language, lets us accurately determine the core vocabulary of a language.
Now that I’ve established that I need to find a frequency list, I need to know how large that set of words has to be in order to have the core vocabulary. The definition of core vocabulary is not strictly defined, but the numbers seem to vary from roughly 3,000 to the 5,000 most common words in English, as used by various dictionaries, English learning sites and literature. Based on those estimates, I decided to use a sample size with the 5,000 most frequently used words in English, representing the top 2% of distinct vocabulary and making up 85% of all words in any English source.
Acquiring the actual 5,000 words turned out to be a lot harder than initially anticipated; there are almost no clean, unbiased (and uh, free) datasets. On one hand you have large datasets of all subtitles ever uploaded on OpenSubtitles.org, and on the other you have sources such as Project Gutenberg. Unfortunately is either the source severely outdated such as Project Gutenberg or is it severely biased, such as the subtitles; when words such as “kill” and “f*ck” are among the top entries, you know that you are not handling reliable data.
I ended up going with the frequency list made available by Wordfrequency, a dataset that they believe to be the “most accurate frequency data of English”. It is based on the Corpus of Contemporary American English, containing 450 million words sourced from colloquial speech to literature to magazines.
Processing the data
Next, I had to find a way to automate the process of retrieving the original language (or in other words, the etymology) of each word in my dataset. For this, I had to find reliable online dictionaries. The most obvious source for etymological data is Etymonline, an extremely accurate dictionary that was compiled and verified manually, and therefore used for 80% of this dataset. The interface looks like this:
Because the language of origin is mentioned right away, I scrape the first 15 words of each entry using Python’s urllib and BeautifulSoup. The language that is mentioned first is then taken as the language of origin. If there is no language present, it tries the next method.
Note that sometimes a word of Latin origin will return “French” using my method. This is because Etymonline always mentions French before Latin if the word entered English through French and the word changed sufficiently from the root. A word such as “origin” (from “origo”) will therefore return French, whereas a word such as “provide” (from “providere — provideo”) will return Latin.
The second source is Memidex, an index of online dictionaries. Memidex returns the many possible origins of each word based on its indexed dictionaries.
I simply scrape the first 8 words after every mention of “Origin:” on the webpage and then return the language of origin that is mentioned the most. During every step of the process, the percentage of each language that makes up English is then logged into a list for later visualization and interpretation.
Visualizing the data
The following area graph contains the results of my research:
As you can interpret, French and Latin enter English vocabulary much earlier and in much greater quantities than previously thought, it only takes the top 1,627 words in English for Germanic languages to lose the majority share of vocabulary and at exactly the 1,875 most commonly used words do French and Latin dominate English vocabulary. It seems that Romance languages enter English at an exponentially fast rate but eventually do solidify at a level that I cannot reliably determine at a sample size of 5,000 words; it does appear that it eventually reaches the percentages earlier provided by other researchers such as Mr. Williams.
Interestingly enough are French and Latin not the first foreign languages to enter this composition by frequency. For the first 200 words, Old Norse makes up 5–10% of the vocabulary. Words such as “they”, “big” and “die” all seem to come from Scandinavian sources that entered English around the Viking age, a trait that seems to be unique to the English language.
Furthermore does Greek live up exactly to expectations; the share of Greek words in English increases linearly through rather academic words such as “photograph” and “character”.
French and Latin make up the largest portion of English core vocabulary. After the 1,875 most frequently used words out of the 250,000 words in distinct English vocabulary do French and Latin dominate the English language, achieving a share of 56% at the core vocabulary level, 5,000 words. Both languages enter English at an exponentially fast rate whereas Greek’s share seems to stay limited to a small linear growth due to its influx consisting of academic words. Furthermore is a considerable share of very basic English of Old Norse origin.
The source code with detailed guidance for every step can be found here on GitHub.