Machine Learning Is Now Translating ‘Dead’ Languages

Most languages that have ever been spoken are not used anymore, so AI is now being used to help linguists translate these ‘dead’ languages.

Sritan Motati
TechTalkers
4 min read · Dec 31, 2020


Writing in ancient cuneiform (Picture Credit: Engoo)

Most languages that have ever been spoken are ‘dead’, or not spoken anymore. Each year, more and more ancient or unknown languages are lost, but to most of us, that doesn’t matter because we don’t speak them. After all, if they’re so unknown, why do we even need to know how to translate them?

The fact of the matter is that a language is not just a way of communicating with someone: it is a container of knowledge and culture unique to its speakers, and when a language is lost, so is that knowledge. Linguists, scholars who study the structure and history of languages, try to decipher these obscure languages, but the work can take decades. Languages can have drastically different grammar, vocabulary, or syntax that makes them nearly impossible to translate. Additionally, we can’t rely on translation systems like Google Translate, because they are trained on huge amounts of parallel text that simply doesn’t exist for most dead languages. So what is the solution to this problem? Artificial intelligence, according to a group of researchers at the Massachusetts Institute of Technology (MIT).

Writing in a dead language, which is undersegmented (Picture Credit: ProLingo)

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a solution to two of the main difficulties in deciphering languages:

  1. Some languages don’t separate scripts into distinct words (no spaces).
  2. The lost language’s closest known relative is often unknown.

The researchers have created a machine learning (ML) system that can automatically solve these problems without needing to know much additional information. This system can determine the closest known language to whatever language it’s translating and separate different words if the language is naturally undersegmented.

Graphic of an AI using natural language processing (Picture Credit: Aliz)

The group of researchers, led by MIT Ph.D. student Jiaming Luo and MIT Professor Regina Barzilay, created the system using techniques from natural language processing (NLP), a subset of AI that is concerned with human language. To make deciphering languages easier, the ML system makes some assumptions based on observations from the history of linguistics. One such assumption is that sounds between similar languages will be, well, similar. For example, an “e” will probably not change into a “k” due to the pronunciation difference.
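This sound-change constraint can be pictured with a toy sketch (my own illustration, not the paper’s actual model): give each sound a hand-made feature vector, and make substitutions between phonetically close sounds cheap and substitutions between distant sounds expensive. The feature values below are hypothetical.

```python
# Toy sketch of the "similar sounds change into each other" constraint.
# Hypothetical feature vectors: (vowel?, voiced?, front-of-mouth?)
FEATURES = {
    "e": (1.0, 1.0, 1.0),  # front vowel
    "i": (1.0, 1.0, 0.9),  # front vowel, very close to "e"
    "k": (0.0, 0.0, 0.0),  # voiceless back consonant
}

def substitution_cost(a: str, b: str) -> float:
    """Euclidean distance between two sounds' feature vectors."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) ** 0.5

# "e" -> "i" is a plausible sound change; "e" -> "k" is not.
print(substitution_cost("e", "i") < substitution_cost("e", "k"))  # True
```

A decipherment model can use costs like these to prefer mappings in which sounds drift only to their phonetic neighbors.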

The researchers’ system combines linguistic constraints like the one above with a decipherment algorithm they created. The algorithm takes inscriptions from the lost language and vocabulary from a known language (e.g., English or Greek) and returns a language similarity measure that helps linguists determine the closest known relative of the lost language. It also analyzes the language’s sounds by representing them in a multidimensional space, where different vectors reflect differences in pronunciation.
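As a rough intuition for such a similarity measure (an illustration only, not the paper’s algorithm), one could embed each word as a simple character-count vector and score a candidate known language by how well the lost-language words match its vocabulary. All words and vocabularies below are made up.

```python
import math

def char_vector(word: str) -> dict:
    """Bag-of-characters vector for a word."""
    v = {}
    for ch in word:
        v[ch] = v.get(ch, 0) + 1
    return v

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity(lost_words, known_vocab) -> float:
    """Average best-match cosine between lost words and a known vocabulary."""
    best = [max(cosine(char_vector(w), char_vector(k)) for k in known_vocab)
            for w in lost_words]
    return sum(best) / len(best)

# Hypothetical inscriptions compared against two candidate languages:
lost = ["kalo", "mira"]
print(similarity(lost, ["kale", "mira", "tor"]))  # close vocabulary, high score
print(similarity(lost, ["xyzz", "qqqw"]))         # distant vocabulary, low score
```

Ranking several candidate languages by such a score is the spirit of the similarity measure, though the real system works with learned sound embeddings rather than raw character counts.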

Sound wave (Picture Credit: The Science of Sound)

Using these vectors, the model can detect patterns in language and then segment words if the scripts don’t already have distinct words. These words are then mapped to their counterparts in a known language by the model, which helps linguists obtain a full translation.
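Word segmentation itself can be sketched with a deliberately simple baseline (my own illustration; the actual system learns segmentation jointly with the rest of the model): greedily match the longest known word at each position of the unsegmented text. The vocabulary and inscription are hypothetical.

```python
def segment(text: str, vocab: set) -> list:
    """Greedy longest-prefix segmentation; unknown characters stand alone."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest prefix first
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])          # no match: emit one character
            i += 1
    return words

# Hypothetical inferred vocabulary and an unsegmented inscription:
vocab = {"sol", "luna", "rex"}
print(segment("solrexluna", vocab))  # ['sol', 'rex', 'luna']
```

Greedy matching fails on ambiguous strings, which is one reason a learned, probabilistic segmenter is needed for real undersegmented scripts.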

This algorithm has been tested on several languages, and for the most part, it has been accurate. One important test involved the Iberian language. Linguists have long argued about whether Iberian’s closest known relative is Basque, so the researchers from CSAIL used their algorithm to measure the similarity of Iberian to several known languages, including Basque. It found that while Basque and Latin were the closest of the candidates to Iberian, the differences were still too great to conclude that Iberian and Basque are related.

The researchers believe that their system has potential in tasks beyond decipherment, such as inferring characters missing from a lost text. In the future, they plan to modify the system to go beyond simply connecting text in an unknown language to text in a known one. The flaw with the current approach is that it assumes a related known language exists, but that isn’t always true, as the Iberian test above shows.

Detailed diagram of how the researchers plan on finding lost characters (Taken from the original research paper)

Currently, languages like English, Chinese, and French are spoken around the world, but there are also many little-known languages spoken by only a handful of people. These languages may soon cease to exist, and with them, the culture they capture. Linguists are deciphering as many ‘dead’ languages as they can, but at the current pace, the task could take a very long time. Enter artificial intelligence, and that pace may increase greatly. AI and machine learning are the future, but right now, they’re helping us discover things of the past.

Read the official research paper for the decipherment algorithm discussed in this article.


Sritan Motati
TechTalkers

Founder of TechTalkers. Medicine and artificial intelligence enthusiast. https://medium.com/techtalkers