An AI analysis of Esperanto etymology
Overview
As a language enthusiast, I'm naturally intrigued by Esperanto. It has an extremely simple grammar (free of the irregularities found in natural languages), logical word formation, intuitive phonetic spelling, and a limited vocabulary.
Zamenhof, the creator of Esperanto, was an ophthalmologist by trade, but his real passion was language. He spoke Yiddish, Russian, French, German, Hebrew, Polish, Belarusian, Latin, Greek, Aramaic, English, Italian, Lithuanian, and Volapük. Naturally, he drew on his working knowledge of these languages when creating Esperanto.
Esperanto is Eurocentric, deriving predominantly from the Indo-European language group. There are no borrowings from languages like Chinese, Hindi, Arabic, or Swahili. And according to Wikipedia, approximately 80% of its vocabulary is derived from Romance languages alone:
Esperanto’s vocabulary, syntax and semantics derive predominantly from languages of the Indo-European group. A substantial majority of its vocabulary (approximately 80%) derives from Romance languages, but it also contains elements derived from Germanic, Greek, and Slavic languages [1]
A different Wikipedia article notes:
[A]bout two-thirds of this original vocabulary is Romance, and about one-third Germanic [2]
In 1987, Geraldo Mattos calculated that 84% of basic vocabulary was Latinate, 14% Germanic, and 2% Slavic or Greek.
I wanted to see for myself, using AI, what the etymological makeup of Esperanto really was. I decided to use Python and the OpenAI API to complete this task.
The Results
counter: 2673
germanic: 648
romance: 1676
slavic: 157
uralic: 99
semetic: 14
invented: 79
% Derived from Romance Languages: 63
% Derived from Germanic Languages: 24
% Derived from Slavic Languages: 6
% Derived from Uralic Languages: 4
% Derived from Semetic Languages: 1
% Invented: 3
Analysis
Whereas Wikipedia says approximately 80% of Esperanto words derive from Romance languages, the OpenAI model categorized only 63% that way.
However, looking over the output.txt file, I saw many words categorized as Germanic that appeared to be Latin-based. There were too many words for me to inspect each one individually, but at a broad glance, I would estimate that half of the words categorized as Germanic were false positives that should have been categorized as Romance. So at most 24% of Esperanto vocabulary is Germanic, and I would estimate the real figure to be closer to 12%, which is close to the 14% Mattos calculated in 1987.
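To make that adjustment concrete, here is a small sketch of the arithmetic. It assumes, as a rough guess rather than a measured figure, that exactly half of the 648 words labeled Germanic belong in the Romance bucket. The function name `adjusted_shares` and the `misfiled_fraction` parameter are mine, introduced only for this illustration.

```python
# Counts reported in "The Results".
counts = {
    "germanic": 648,
    "romance": 1676,
    "slavic": 157,
    "uralic": 99,
    "semitic": 14,
    "invented": 79,
}


def adjusted_shares(counts, misfiled_fraction=0.5):
    """Move a fraction of the Germanic count to Romance; return rounded percentages.

    misfiled_fraction is an assumption (an eyeball estimate), not a measurement.
    """
    total = sum(counts.values())
    moved = round(counts["germanic"] * misfiled_fraction)
    adjusted = dict(counts)
    adjusted["germanic"] -= moved
    adjusted["romance"] += moved
    return {family: round(n / total * 100) for family, n in adjusted.items()}


shares = adjusted_shares(counts)
print(shares["germanic"], shares["romance"])
```

Under that assumption the Germanic share drops to about 12% and the Romance share rises to about 75%, which is still below the roughly 80% quoted from Wikipedia.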
The Process
I first found a word list of the 3,000 most common Esperanto words and copied it into a text file called esperanto3000words.txt (it can be viewed and downloaded here).
I read that text file into a list of strings, taking the second whitespace-separated token on each line:
def open_file():
    with open("assets/esperanto3000words.txt") as file:
        dict_ary = [line.split()[1] for line in file]
    return dict_ary
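The list comprehension above takes the second token on each line, which suggests each line holds a rank followed by the word. I'm assuming that layout here; the sample lines and the helper name `parse_wordlist` are made up for illustration. This sketch does the same extraction but skips blank or short lines instead of raising an `IndexError`.

```python
def parse_wordlist(lines):
    """Return the second token of every line that has at least two tokens."""
    words = []
    for line in lines:
        tokens = line.split()
        if len(tokens) >= 2:  # skip blank lines and lines with only one token
            words.append(tokens[1])
    return words


# Hypothetical sample in the assumed "rank word" layout.
sample = ["1 la", "2 kaj", "", "3 esti", "4"]
print(parse_wordlist(sample))
```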
I imported the openai library
from openai import OpenAI
I created a function that analyzed each word from the list, and categorized it
def ask_chatgpt(client, esperantoWord):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You will be provided one Esperanto word. You must provide linguistic/etymological context about the provided word. You will provide a response in this format: 'Given Word: <WORD GIVEN>, Parent Language: <ANSWER>, Grandparent Lanauge: <ANSWER>, Family Tree: <ANSWER>'. The Parent Language MUST be a living language (e.g. Italian, Spanish, etc). The Grandparent Language MUST be a dead language (e.g. Latin). As an example, the Esperanto word 'fari' is given to you, and your answer is 'Given Word: fari, Parent Language: Italian, Grandparent Lanauge: Latin, Family Tree: Indo-European'. If you are not sure of the answer, you can say 'UNKNOWN'. Or if the Esperanto word has no known root and is considered a newly invented word from Esperanto, you can say 'INVENTED'. Remember, the Parent Language cannot be 'Latin'. "},
            {"role": "user", "content": esperantoWord}
        ]
    )
    return response.choices[0].message.content
I looped through all the words in the list and called the OpenAI API on each one to get its analysis. I saved the results to output.txt (it can be viewed and downloaded here).
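# OPENAI_API_KEY is assumed to be defined elsewhere (for example, loaded from the environment).
openApiClient = OpenAI(api_key=OPENAI_API_KEY)
esperanto_words = open_file()
# The with block closes the file even if an API call raises partway through.
with open("assets/output.txt", "w") as output:
    for esperanto_word in esperanto_words:
        # Skip one-letter entries.
        if len(esperanto_word) > 1:
            response = ask_chatgpt(openApiClient, esperanto_word)
            output.write(response + '\n')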
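Since each word costs one API call, it can help to dry-run the loop before spending any credits. The sketch below is a variation on the loop above, not what I ran: `classify_words` and `fake_ask` are hypothetical names, and the canned reply is invented. It takes the lookup function as a parameter, so a stub can stand in for `ask_chatgpt`.

```python
import io


def classify_words(words, ask, out):
    """Write one response line per word, skipping one-letter entries."""
    written = 0
    for word in words:
        if len(word) > 1:
            out.write(ask(word) + "\n")
            written += 1
    return written


def fake_ask(word):
    # Stand-in for the real API call; always returns a canned reply.
    return f"Given Word: {word}, Parent Language: UNKNOWN"


buffer = io.StringIO()
count = classify_words(["la", "a", "kaj"], fake_ask, buffer)
print(count)
print(buffer.getvalue(), end="")
```

With the real client, the call would look like `classify_words(esperanto_words, lambda w: ask_chatgpt(openApiClient, w), output)`.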
I then processed the output.txt file by categorizing each word into one of the major language families:
germanic_languages = [
    "german",
    "old high german",
    "dutch",
    "icelandic",
    "old norse",
    "english",
    "danish",
    "swedish",
    "norwegian"
]
romance_languages = [
    'latin',
    'french',
    'spanish',
    'portuguese',
    'italian',
    'romanian',
    'romance'
]
slavic_languages = [
    "russian",
    "czech",
    "polish",
    "old church slavonic",
    "serbian",
    "croatian",
    "slovenian",
    "ukrainian",
    "bulgarian"
]
uralic_languages = [
    'finnish',
    'hungarian',
    'estonian',
    'votic (finnic)'
]
afro_asiatic_languages = [
    'arabic',
    'hebrew',
    'maltese'
]
import re

with open("assets/output.txt") as file:
    word_ary = [line for line in file]

counter = 0
germanic = 0
romance = 0
slavic = 0
uralic = 0
semetic = 0
invented = 0

for line in word_ary:
    line_lower = line.lower()
    # The prompt spells the third label "Lanauge", so accept either spelling.
    match = re.match(r'given word: (.*?), parent language: (.*?), grandparent (?:language|lanauge): (.*?), family tree: (.*)', line_lower)
    if match:
        word_info = {
            "given word": match.group(1).strip(),
            "parent language": match.group(2).strip(),
            "grandparent language": match.group(3).strip(),
            "family tree": match.group(4).strip()
        }
        # Language names are searched for anywhere in the line, and the first list that hits wins.
        if any(lang in line_lower for lang in germanic_languages):
            counter = counter + 1
            germanic = germanic + 1
        elif any(lang in line_lower for lang in romance_languages):
            counter = counter + 1
            romance = romance + 1
        elif any(lang in line_lower for lang in slavic_languages):
            counter = counter + 1
            slavic = slavic + 1
        elif any(lang in line_lower for lang in uralic_languages):
            counter = counter + 1
            uralic = uralic + 1
        elif any(lang in line_lower for lang in afro_asiatic_languages):
            counter = counter + 1
            semetic = semetic + 1
        elif "invented" in line_lower:
            counter = counter + 1
            invented = invented + 1
        # else, don't add to counter, as the other word categories I'm counting as chatgpt errors (like indonesian, etc)

# Print totals
print("counter:", counter)
print("germanic:", germanic)
print("romance:", romance)
print("slavic:", slavic)
print("uralic:", uralic)
print("semetic:", semetic)
print("invented:", invented)
print("")

# Calculate ratios:
print("% Derived from Romance Languages:", round((romance / counter) * 100))
print("% Derived from Germanic Languages:", round((germanic / counter) * 100))
print("% Derived from Slavic Languages:", round((slavic / counter) * 100))
print("% Derived from Uralic Languages:", round((uralic / counter) * 100))
print("% Derived from Semetic Languages:", round((semetic / counter) * 100))
print("% Invented:", round((invented / counter) * 100))
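One thing to note about the tally above: it searches the whole response line for language names and checks the Germanic list first. A line such as "Parent Language: English, Grandparent Language: Latin" therefore counts as Germanic, and the substring "german" also matches a given word like "germana" or a family tree of "Germanic". A possible alternative, which I have not run against the full output, is to read the two language fields separately and treat any word with a Latin grandparent as Romance. The helper names `field` and `classify_line`, the abbreviated language sets, and the sample lines are all assumptions for illustration.

```python
import re

# Abbreviated sets for illustration; the full lists above could be used instead.
FAMILIES = {
    "germanic": {"german", "dutch", "english", "danish", "swedish", "norwegian"},
    "romance": {"french", "spanish", "portuguese", "italian", "romanian"},
    "slavic": {"russian", "polish", "czech"},
}

# \b keeps "parent language" from matching inside "grandparent language".
PARENT = re.compile(r"\bparent language:\s*([^,]+)")
# \w+ tolerates the "Lanauge" spelling used in the prompt.
GRANDPARENT = re.compile(r"\bgrandparent \w+:\s*([^,]+)")


def field(pattern, line):
    """Return the captured field value, or an empty string if it is absent."""
    match = pattern.search(line)
    return match.group(1).strip() if match else ""


def classify_line(line):
    """Classify one response line by its language fields, not the whole line."""
    line = line.lower()
    parent = field(PARENT, line)
    grandparent = field(GRANDPARENT, line)
    if parent == "invented":
        return "invented"
    if grandparent == "latin":  # heuristic: a Latin ancestor counts as Romance
        return "romance"
    for family, languages in FAMILIES.items():
        if parent in languages:
            return family
    return None  # unrecognized; left out of the totals


# Hypothetical response lines, not taken from output.txt.
samples = [
    "Given Word: germana, Parent Language: Italian, Grandparent Lanauge: Latin, Family Tree: Indo-European",
    "Given Word: justa, Parent Language: English, Grandparent Language: Latin, Family Tree: Indo-European",
    "Given Word: hundo, Parent Language: German, Grandparent Language: Old High German, Family Tree: Indo-European",
    "Given Word: ajn, Parent Language: Maori, Grandparent Language: UNKNOWN, Family Tree: Austronesian",
]
for sample in samples:
    print(classify_line(sample))
```

The first two sample lines would be counted as Germanic by the whole-line search, but come out as Romance here. Whether the Latin-grandparent rule is the right call is debatable; it simply mirrors the "Latinate" bucket Mattos used.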
Disclaimer
There were some obvious AI errors in the word categorizations. Some words were wrongly labeled as Indonesian, Maori, etc. I tried to filter out these “junk” results, which is why the “counter” above shows only 2673 words accounted for.
It is also difficult to say which Esperanto words are truly “new” or “invented” (i.e. not derived from any parent language). I’ve heard that words like “edzo” were created by Zamenhof without any clear source language [3]. I would also consider all the tabelvortoj (correlatives) to be newly created Esperanto words without any prior linguistic source.
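To see what was dropped, one option is to tally the parent language field across the output file and look at the labels that fall outside the lists. The sketch below assumes the response format from the prompt; `parent_language_counts` is a hypothetical helper and the sample lines are invented, not taken from output.txt.

```python
import re
from collections import Counter

# \b keeps the pattern from matching inside "Grandparent Language".
PARENT = re.compile(r"\bparent language:\s*([^,]+)", re.IGNORECASE)


def parent_language_counts(lines):
    """Count how often each parent language label appears."""
    counts = Counter()
    for line in lines:
        match = PARENT.search(line)
        if match:
            counts[match.group(1).strip().lower()] += 1
    return counts


# Hypothetical response lines for illustration.
sample_lines = [
    "Given Word: fari, Parent Language: Italian, Grandparent Language: Latin, Family Tree: Indo-European",
    "Given Word: bona, Parent Language: Italian, Grandparent Language: Latin, Family Tree: Indo-European",
    "Given Word: ajn, Parent Language: Maori, Grandparent Language: UNKNOWN, Family Tree: Austronesian",
]
for language, n in parent_language_counts(sample_lines).most_common():
    print(language, n)
```

Running it over the real file (for example, `parent_language_counts(open("assets/output.txt"))`) would list every label the model produced, most frequent first.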