Radical Exploder: An online approach for helping Arabic students learn to identify word roots

David Hutchings
Jul 12 · 15 min read

As an Arabic translator with over 15 years’ experience of study and professional work with the language in various capacities, and now as a new software developer, I’m passionate about the use of technology to facilitate language learning and translation.

One idea I’ve had for a tool for Arabic students which I don’t believe exists is based on a technique I used years ago when I used to tutor brand new students of the language.

There were a number of habits I always encouraged them to take up, the most important of which was to always look up every single unfamiliar word in a Hans Wehr dictionary after capturing each of those words in a notebook. By the end of my own studies, I had over 20 of these notebooks filled with thousands of terms, each and every one of them checked and confirmed in the Hans Wehr, which is a standard dictionary used by all serious English-speaking students of Arabic.

The “catch-22” for new Arabic students is that initially, they aren’t capable of looking many terms up in the Hans Wehr. This isn’t only because of their unfamiliarity with a new script. It is also, to a very large degree, because traditional Arabic dictionaries such as the Hans Wehr are organized very differently from the way we are used to.

An image of the Hans Wehr dictionary with roots circled.

This is because the Arabic language is largely organized by a system of roots that are modified in hundreds of ways to create all of the verbs, nouns, adjectives, and other words. Most of these roots consist of three letters and are called triliteral roots; a much smaller number consist of four letters and are called quadriliteral roots. To further complicate matters, some triliteral roots have identical second and third radicals, which are usually written as a single doubled letter, making the root look as if it has only two letters. In this blog post, I may refer to these as biliteral roots (although this is not technically correct).

All beginning Arabic students with more than a week under their belts understand this, but it can sometimes take years of study before they are competent in the major rules of Arabic grammar and morphology enough to feel comfortable quickly looking up any given term in a dictionary.

I’ll use a few examples to illustrate these points:

Some of these derived words are relatively simple, using only one additional letter such as the following:

مكتب

Some of the above are rarely used or only hypothetical but all of them have the same root: ك ت ب .

Words of this simplicity are relatively easy to see the root in, but Arabic terms can very quickly get more complicated, such as the following:

فسيستخدموها

This single term consists of an entire phrase or sentence in English, translated something like “so they will use them”, depending on the context.

For any native Arabic speaker or experienced student, the root here is very obvious, but for the very early-stage student, it may be almost totally indecipherable, despite the fact that the basic verb here (“to use”) is, just as in English, a very common and mundane word.

At this point, you may be wondering what the technological angle to this blog post is. Thanks for asking!

What I want to discuss is a concept for a tool that helps early-stage students of Arabic quickly develop useful heuristics, or rules of thumb, for decoding the written form of Arabic and extracting the root of any root-based Arabic word, or for determining when an Arabic word is not derived from a root (as in the case of loan words, for example).

I call this concept the “Radical Exploder”, because you are going to be “exploding” a word, that is, taking it apart letter by letter and seeking out the 2, 3, or 4 radicals: the individual letters that make up the root.

The purpose of the Radical Exploder is not to help the user definitively arrive at the root of a given word. There are in fact a few existing online tools that already do this perfectly well. The problem with these existing tools is that they give the user a definitive listing of the root without helping the student understand how to arrive at the root themselves. Furthermore, these tools are often presented in a manner designed to satisfy advanced students of the language rather than to teach the beginner.

In my Radical Exploder, the emphasis will not be on the objective truth, but on teaching the student how they might get there themselves with only the most basic knowledge of Arabic grammar and morphology.

The interface:

In its most basic form, the Radical Exploder needs almost no interface features at all beyond a single input box and a submit button — not unlike the main Google search page. In an advanced phase of development, the interface may easily be modified to accept multiple words at a time, but for now, we’ll assume the system only handles one.

After the term is submitted, the system runs through a series of tests (to be described below) and the user is presented with a results screen with two key features:

The term itself will be broken up into its constituent letters (they will not need to be joined as most letters here will be assessed individually.) The letters will not be of uniform font size, however. They will be sized according to the likelihood of their being part of the root, based on the “scores” each letter gets as a result of the tests discussed below.

If the system works as I expect, in most cases at least two of the radicals will be presented in a very large font, suggesting a high probability that they are part of the root, while one or two other letters may be of medium size, and letters that are almost certainly not part of the root will be very small.

In addition to the sizing, this presentation of the term may also include other features such as coloring or boldness to reflect their probability of being part of the root or of other possible functions the letter may serve in the term. Initially, however, we will focus on font-size as a measure of root probability.
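The sizing idea can be sketched as a small mapping from a letter's score to a font size. This is only an illustration: it assumes each letter ends up with a score between 0 and 100 (the range discussed later in this post), and the pixel sizes and the names `fontSizeForScore` and `renderExplodedTerm` are hypothetical placeholders rather than a settled design.

```javascript
// Hypothetical sizing helper: maps a 0-100 root-likelihood score to a font size in pixels.
const MIN_FONT_PX = 14; // assumed size for letters that are almost certainly not radicals
const MAX_FONT_PX = 72; // assumed size for letters that are almost certainly radicals

function fontSizeForScore(score) {
  // Keep out-of-range scores from producing absurd sizes.
  const clamped = Math.min(100, Math.max(0, score));
  return Math.round(MIN_FONT_PX + (clamped / 100) * (MAX_FONT_PX - MIN_FONT_PX));
}

// Builds one <span> per letter so that each letter is displayed unjoined and sized on its own.
// The letters are assumed to have been validated as plain Arabic letters already.
function renderExplodedTerm(letters, scores) {
  const spans = letters.map(
    (letter, i) =>
      `<span style="font-size:${fontSizeForScore(scores[i])}px">${letter}</span>`
  );
  // dir="rtl" keeps the separated letters in their natural reading order.
  return `<div dir="rtl">${spans.join(" ")}</div>`;
}
```

A linear mapping is the simplest choice; a steeper curve might make the likely radicals stand out more, which is something to settle by experiment.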

Below this large display of the analyzed and variously sized letters will be a series of messages to the user explaining the “logic” of the tests in plain language that a beginner student should understand.

As you can see, three of these letters stand out as much larger than the rest. This is because this term, as long and complicated as it may be, is actually constructed through some very clear and standard rules, which may be “reverse engineered” by the reader to arrive at its root by following simple guidelines such as the following:

(Feel free to skip the italics if you know nothing about Arabic.)

  • An initial letter ف is often a kind of preposition meaning something like “so…”. However, elsewhere in the word, it is essentially always part of the root.
  • The letter س serves several grammatical functions. Here, near the start of the term, it can be one of the ways that Arabic indicates the future tense.
  • The letter ي serves many grammatical functions and is almost never part of the root in the form we see here. At or near the beginning of a term, it can be used to indicate one of several verb conjugations.
  • The letter س serves several grammatical functions. When near the beginning of a longer term and when followed by a ت, it can indicate that the root has been modified into a Form 10 verb.
  • The letter ت can serve dozens of functions near the start, near the end, and also in the middle of a term. Therefore it should always be assumed not to be part of the root unless there are few other viable options.
  • The letter خ serves no morphological or grammatical function and will essentially always be part of the root.
  • The letter د almost never serves a morphological or grammatical function, except in some very rare cases where it replaces the ت after the first radical of Form 8 words.
  • The letter م is commonly used to form words, but in those cases, it always precedes the first radical. It also comes at the end of some pronouns. Therefore, if you spot it near the middle of a term, there’s a good chance it is part of the root.

The concept for processing each term:

To achieve these results, the application first needs to validate the term. There are many letters that are part of the Arabic Unicode block which are not useful for this analysis and the system needs to either strip those out or invalidate the term. Most commonly what we may see are terms utilizing characters from alphabets descended from Arabic, but which are used in languages such as Dari and Pashtu. These languages’ letters are usually still part of the Arabic Unicode block so the system will need to limit which characters it will accept.

Of greater concern are the diacritical marks, the shadda, etc. As all Arabic speakers know, these are rendered irregularly in the written form of the language and pronounced inconsistently across the dialects. A more fully developed version of this application should most certainly make use of the diacritics if they are present, but they are not necessary for the initial roll-out of the tool I am envisioning, so the system will also strip these characters out of the input term.

Once the input has been converted into an array consisting solely of Arabic characters from the correct section of the Arabic Unicode block, with the diacritics and similar marks stripped out, the term is ready for analysis.
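The validation and stripping steps could be sketched as follows. This is a minimal illustration, not a final design: the function name `prepareTerm` is hypothetical, and the choice to treat U+0621 to U+063A and U+0641 to U+064A as the accepted letters (the basic Arabic alphabet, which excludes letters used only in languages such as Dari and Pashtu) is an assumption about where to draw the line.

```javascript
// Marks to strip: tanwin, short vowels, shadda and sukun (U+064B-U+0652),
// the dagger alif (U+0670), and the tatweel elongation character (U+0640).
const DIACRITICS = /[\u064B-\u0652\u0670\u0640]/g;

// Assumed set of accepted letters: hamza through ghayn, and faa through yaa.
const ARABIC_LETTERS_ONLY = /^[\u0621-\u063A\u0641-\u064A]+$/;

// Hypothetical helper: returns the cleaned letters, or marks the term as invalid.
function prepareTerm(rawInput) {
  const stripped = rawInput.trim().replace(DIACRITICS, "");
  if (!ARABIC_LETTERS_ONLY.test(stripped)) {
    // Empty input, spaces, Latin text, or letters outside the accepted range.
    return { valid: false, letters: [] };
  }
  // Array.from splits by code point, giving one array entry per letter.
  return { valid: true, letters: Array.from(stripped) };
}
```

This version rejects a term containing any unexpected character rather than silently dropping it, so that the student finds out the word may not be Arabic at all. Stripping instead would be a one-line change.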

The next step will be to establish a corresponding array to contain a score for each character of the input term. The scoring system is key to the processing of the term. Each letter from the term will start out at an arbitrary middle score, from which additions and subtractions will be made as a result of the various tests.

For the purposes of this blog post, we’ll assume that the range of scores can be between 0 and 100, where 0 represents certainty that a character is not part of the root, and 100 represents certainty that a character is part of the root. In reality, I don’t expect that any characters will reach scores at either of these extremes. For the purposes of this blog post, we’ll assume that the starting score for each character is 50, but I suspect that in the actual initial implementation of the application, I will assign each character a default score to be determined after fine-tuning the system and seeing how it works out.

You can think of each array of the input characters as having a corresponding array of scores, like this:

["م", "ك", "ت", "ب"]

has an initial score of

[50, 50, 50, 50]

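(Note that JavaScript stores a string in the order in which it is read, so splitting an Arabic term puts its first letter, here م, at index 0 even though the word is displayed right-to-left. This is the order in which we will assess the terms. The scores throughout this document are listed in that same index order, written left-to-right like in English.)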
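A minimal sketch of the paired arrays might look like this. The default of 50 is the placeholder value used in this post, and the helper names `initialScores` and `adjustScore` are hypothetical.

```javascript
const DEFAULT_SCORE = 50; // arbitrary midpoint used in this post; the real default is undecided

// Creates one score per letter, all starting at the default.
function initialScores(letters) {
  return letters.map(() => DEFAULT_SCORE);
}

// Returns a copy of the scores with one position adjusted, kept inside the 0-100 range.
function adjustScore(scores, index, delta) {
  const updated = scores.slice();
  updated[index] = Math.min(100, Math.max(0, updated[index] + delta));
  return updated;
}

// Index 0 holds the first letter of the word as it is read (م).
const letters = Array.from("مكتب");
const scores = initialScores(letters);
console.log(letters.length, scores); // 4 [ 50, 50, 50, 50 ]
```

Returning a copy from `adjustScore` instead of changing the array in place makes it easier to show the student how the scores changed after each test.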

Term properties:

In addition to the scores, each term will be issued a set of default characteristics representing the system’s guesses about what kind of word it has. These characteristics will mostly be stored in an object or as simple boolean true/false variables such as:

taaMarbutah: false /

These properties will subsequently be set to true if the tests reveal evidence of these characteristics. The characteristics will in turn be linked to the messages displayed to the user following the analysis of the term. For example, if a taa marbutah (a special kind of Arabic “T”) is present, the user will be reminded of its significance (that it indicates a feminine entity of some kind).

The presence of the taa marbutah also tells the system that the term is not a conjugated verb, which means we don’t have to run some of the tests necessary for analyzing conjugated verbs.

Similarly, the presence of an initial waaw (a kind of letter U or W), can trigger a message reminding the student that the Arabic word for “and” is always attached to the subsequent word. The presence of an initial sin can trigger a message reminding the student of that form of the future tense marker.
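As a rough sketch, the properties and their messages could be collected together like this. Only `taaMarbutah` comes from the example above; the other property names, the `skipVerbTests` flag, the function name, and the wording of the messages are hypothetical.

```javascript
const TAA_MARBUTAH = "\u0629"; // ة
const WAAW = "\u0648"; // و
const SIN = "\u0633"; // س

// Hypothetical helper: guesses some term properties and collects beginner-level messages.
function detectTermProperties(letters) {
  const properties = {
    taaMarbutah: letters.includes(TAA_MARBUTAH),
    initialWaaw: letters[0] === WAAW,
    initialSin: letters[0] === SIN,
  };
  // A taa marbutah means the term is not a conjugated verb, so the verb tests can be skipped.
  properties.skipVerbTests = properties.taaMarbutah;

  const messages = [];
  if (properties.taaMarbutah) {
    messages.push("The taa marbutah (ة) indicates a feminine entity of some kind.");
  }
  if (properties.initialWaaw) {
    messages.push('An initial و may be the word for "and", which is always attached to the following word.');
  }
  if (properties.initialSin) {
    messages.push("An initial س may be the future tense marker.");
  }
  return { properties, messages };
}
```

The messages say "may be" on purpose: an initial و or س can also be the first radical of the root, so these flags are hints for the student, not conclusions.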

The tests:

The tests in their simplest forms can consist of the following format:

if (inputTerm.length >= 3 && characterPosition1 === "م") { /* adjust the score of the first letter */ }

The logic of the test is as follows:

First, we want to know the total length of the input. An initial meem on a term of 2 characters (مد) has a different likelihood of being a radical from one on a term of 4 (مكتب) or 10 (مستشفياتها). In the first case, that meem is almost certainly part of the root; in the second case it might be 50/50; and in the third case, the meem is almost certainly not part of the root. In many cases, we may want to vary the scoring based on input string length.
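A length-sensitive version of the meem test might be sketched as below. The length thresholds and the sizes of the adjustments are invented placeholders for illustration; the real values would come out of trial and error.

```javascript
const MEEM = "\u0645"; // م

// Hypothetical test: adjusts the score of an initial meem depending on the length of the term.
function scoreInitialMeem(letters, scores) {
  if (letters[0] !== MEEM) return scores;

  let delta;
  if (letters.length <= 2) {
    delta = 30; // very short term: the meem is almost certainly a radical
  } else if (letters.length <= 5) {
    delta = 0; // medium term: could go either way, so leave the score alone
  } else {
    delta = -40; // long term: the meem is almost certainly a prefix
  }

  const updated = scores.slice();
  updated[0] = Math.min(100, Math.max(0, updated[0] + delta));
  return updated;
}
```

With a starting score of 50 for every letter, the meem in مد would rise to 80, the meem in مكتب would stay at 50, and the meem in مستشفياتها would drop to 10.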

After the test of string length, we look at the first character in the string. We can make these tests extremely long and detailed if we wish. To use the example above of فسيستخدموها, such a test could be rendered as:

if (inputTerm.length >= 8 && characterPosition1 === "ف" && characterPosition2 === "س" && characterPosition3 === "ي" && characterPosition4 === "س" && characterPosition5 === "ت") { /* lower the scores of these five letters */ }

In the example, all of the first 5 letters could have their scores knocked down because they so clearly belong to well-established Arabic grammatical patterns, which can be easily detected through this kind of test.

In addition to tests analyzing words from the start of the string, there can also be tests analyzing words starting from the end. In the example فسيستخدموها, the last three letters are clearly grammatical (reflecting part of the third-person present tense plural as well as a feminine pronoun ending.)

By assessing and deducting the scores of these characters alone, the tri-literal root of this term would be revealed, but we could go further.
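Here is one way the start-of-term and end-of-term tests could work together on this example. It is only a sketch: the penalty of 40, the minimum length of 8 for the suffix test, and the helper names are assumptions, and a real system would hold many such patterns rather than these two.

```javascript
const PENALTY = 40; // hypothetical deduction for letters matched by a grammatical pattern

// True if the term begins with the given sequence of letters.
const startsWith = (letters, pattern) =>
  pattern.every((ch, i) => letters[i] === ch);

// True if the term ends with the given sequence of letters.
const endsWith = (letters, pattern) => {
  const offset = letters.length - pattern.length;
  return offset >= 0 && pattern.every((ch, i) => letters[offset + i] === ch);
};

function applyAffixTests(letters, scores) {
  const updated = scores.slice();
  const prefix = Array.from("فسيست"); // the five grammatical letters at the start
  const suffix = Array.from("وها"); // the plural ending plus the pronoun ending

  if (letters.length >= 8 && startsWith(letters, prefix)) {
    prefix.forEach((_, i) => {
      updated[i] = Math.max(0, updated[i] - PENALTY);
    });
  }
  if (letters.length >= 8 && endsWith(letters, suffix)) {
    const offset = letters.length - suffix.length;
    suffix.forEach((_, i) => {
      updated[offset + i] = Math.max(0, updated[offset + i] - PENALTY);
    });
  }
  return updated;
}

const letters = Array.from("فسيستخدموها");
const result = applyAffixTests(letters, letters.map(() => 50));
console.log(result); // [10, 10, 10, 10, 10, 50, 50, 50, 10, 10, 10]
```

Even without boosting any letter, the three untouched scores in the middle (خ, د, م) now stand out from the rest.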

Another kind of test could look specifically for letters which never or almost never serve any morphological or grammatical function, and the scores associated with those letters could be boosted accordingly. In our example here, the خ will always be part of the root of any word it is in. This finding could be explained to the student as one of the messages received following the analysis of the term. This way, over time, students using the system could grow more confident in “seeing” the root at the heart of many terms such as فسيستخدموها, which is really very easy to do once you are trained to spot these things.
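A boosting test of this kind could be sketched as follows. The set below contains only خ and ق, the two letters this post names as never serving a grammatical function; which other letters belong in it, the size of the boost, and the wording of the message are all left as assumptions.

```javascript
// Letters this post names as never serving a grammatical function: خ and ق.
// A fuller list would need to be worked out separately.
const ROOT_ONLY_LETTERS = new Set(["\u062E", "\u0642"]);
const BOOST = 30; // hypothetical increase for a root-only letter

// Raises the score of every root-only letter and records a message for the student.
function boostRootOnlyLetters(letters, scores) {
  const updated = scores.slice();
  const messages = [];
  letters.forEach((letter, i) => {
    if (ROOT_ONLY_LETTERS.has(letter)) {
      updated[i] = Math.min(100, updated[i] + BOOST);
      messages.push(
        `The letter ${letter} serves no grammatical function, so it is almost certainly part of the root.`
      );
    }
  });
  return { scores: updated, messages };
}
```

Keeping the message next to the score change means every adjustment the system makes can be explained to the student in plain language.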

The overall body of tests that could be run could hypothetically be in the hundreds, but I suspect that not all of those are necessary. As I develop the system, trial and error and experimentation will likely guide me in determining a good minimum body of tests to effectively achieve my goal for the tool.

Furthermore, a number of the tests could be combined so that one only needs to run a test for an initial ف or و or س or م once and combine those tests as necessary.

It’s also worth pointing out that in many cases we should be able to identify the form of the word with a high degree of confidence for most of the forms. We may also be able to achieve good results with identifying verbal nouns, and active and passive participles. In all such cases, the results would include guidance to the student about why the system is making the determinations that it is.

Scoring:

As is probably clear by now, as each test is run and the characters’ scores are increased or decreased, the scores determine how the system displays the term after the analysis. Our system isn’t aimed at definitively telling the student what the root actually is, but it should show the student where they should start looking in their dictionaries.

While the interpretation of a simple word like مكتب may not be greatly aided by this system (in fact, it may be a little misleading because of the ت in the root, which to the computer could represent the ت of a Form 8 biliteral active/passive participle), longer words, like فسيستخدموها, could be made much easier to interpret. The score array for this one might look like:

[10, 10, 10, 10, 10, 80, 70, 60, 10, 10, 10]

(Note that the “scores” here are provided to the reader to understand what would be happening behind the scenes. The actual scoring system is not yet determined and the user would probably not see it. Remember that they all start with an arbitrary default value like 50 and that they read left-to-right.)

As already shown above, that could be rendered on the screen as:

A few other lengthy terms, with hypothetical scores and renderings:

مستشفياتها

In the case above, the rules would likely under-score the ي following the ف, but the purpose of the system is to help students start to see patterns, not to give them a definitive guide to what the root is. It is possible that the system would be able to handle cases like this, possibly giving the letter a neutral score, or at least warning the student that a long vowel may be part of the root when there are only two obvious options.

بالعنصرية

In the above example, the quadriliteral root would be easily revealed given that three of those four letters are always part of the root, and the ن would never fall behind an ع like that unless it was part of a root.

The system here could even be useful for picking out typos which would be obvious to an experienced Arabic speaker, but which might initially trip up even some intermediate students, as in the following example:

التيتتوقف

Imagine coming across this phrase in a newspaper about some thing(s) “which stopped”. The al-, the ت’s and the ي would all have their scores reduced because those letters serve so many grammatical and morphological functions, but the ق never does, nor does a ف at the end of a word. The و in this case would probably not have its score changed at all because of where it is in the term. Thus the user should be able to see through the mess of the merged words and spot the root.

Further developments:

As mentioned above, enhancements to the system should take into account any diacritics provided, especially the shadda and those which indicate whether or not a word is an active or passive participle, etc. All of this information is critical if a student is to advance in the language, and it is usually systematic enough to be evaluated by a simple online application such as this, without resorting to large corpora or machine learning.

A further enhancement to the system might be the implementation of a guess/testing functionality, where an instructor might provide a body of terms and the student must use their knowledge of Arabic to determine the roots, comparing their scores with the system’s.

Furthermore, it might be useful to retain a record of terms which have been searched for and allow users to add comments further explaining what they see or to flag incorrectly analyzed terms.

In the near term, I may also be able to implement some color-coding of the fonts to further help users understand the components of the terms, and I may be able to make the system live-updating in a way which encourages students to “play” with different configurations, helping them understand how the addition or removal of various letters alters the interpretation of a term.

Obviously, Arabic contains many irregularities and complexities that the beginning student need not concern themselves with, but over time as the system matures, the system could be developed enough to train even very advanced students in understanding Arabic morphology and grammar.

Written by David Hutchings. Arabic translator. Music producer. Web developer.