Towards an improved man and machine connect using Sanskrit

Mar 12, 2019

{Driving towards natural language and its possibilities in the real world using Sanskrit}

Introduction

An era of rapidly changing technology (virtualization, SDN/NFV, the advent of 4G LTE and, in the near future, 5G) is redefining businesses across the globe. The digitization wave has brought artificial intelligence and machine learning, and with it a wave of organizations trying to understand natural language, the intent of customers, or visual cues using deep neural network techniques. We are still far from machines that understand “understanding”, but the first steps towards that goal have been taken.

While these steps have moved the needle, a large population in India seems to be unaffected by this change. The “Siris”, “Cortanas” and “Alexas” of the world have captured a global market with their limited natural language capabilities and by building huge language datasets in their clouds, yet about 85% of the Indian population remains untouched by them. The reason is the language used: English.

As we move towards the middle of the 21st century, it is imperative that about 1/8th of the world’s population can converse with machines the way the rest of the world does, in its own dialect and in its own way. This is why I am researching one of the first languages spoken by man: Sanskrit.

The author’s intent here is not to belittle any language but to showcase how communication with machines can improve if alternative techniques and idioms are utilized.

Objective

The objective of this whitepaper is to apprise the reader of ongoing research at Tech Mahindra’s Makers Lab. The vision of this research is to ensure that the 26 mother tongues and 1645 dialects spoken in India are available to every Indian, so that they can communicate seamlessly with the machines of the future.

The objective is also to look through the lens of AI (artificial intelligence) and its algorithms and see which of them apply to NLP (natural language processing) in Sanskrit.

Definition: Natural language processing (NLP) is the field, at the intersection of artificial intelligence and linguistics, concerned with interaction between computers and human (natural) languages. NLP works to enable computers to make sense of human language, so that interactions with machinery and electronics are as user friendly as possible.

Sanskrit

I will not try to explain Sanskrit with my limited knowledge; instead, for the reader, I will briefly quote what Wikipedia says about Sanskrit.

Sanskrit (/ˈsænskrɪt/; Sanskrit: संस्कृतम्, translit. saṃskṛtam, pronounced [sə̃skr̩təm]) is a language of ancient India with a history going back about 3,500 years.[5][6][7] It is the primary liturgical language of Hinduism and the predominant language of most works of Hindu philosophy as well as some of the principal texts of Buddhism and Jainism. Sanskrit, in its variants and numerous dialects, was the lingua franca of ancient and medieval India.[8][9][10] In the early 1st millennium CE, along with Buddhism and Hinduism, Sanskrit migrated to Southeast Asia,[11] parts of East Asia [12] and Central Asia,[13] emerging as a language of high culture and of local ruling elites in these regions.[14][15]

Sanskrit is an Old Indo-Aryan language.[5] As one of the oldest documented members of the Indo-European family of languages,[16][note 1][note 2] Sanskrit holds a prominent position in Indo-European studies.[19] It is related to Greek and Latin,[5] as well as Hittite, Luwian, Old Avestan, and many other extinct languages with historical significance to Europe, West Asia, and Central Asia. It traces its linguistic ancestry to the Proto-Indo-Aryan language, Proto-Indo-Iranian, and the Proto-Indo-European languages.[20]

What makes Sanskrit unique is its rule set: a grammar that was formulated well before the language became widely accepted and spoken in the Indian subcontinent.

Research Evidence

My approach in writing this paper has been to understand languages from the ground up, starting with their alphabets, and then to compare them. It took me days to find a path that would lead to empirical evidence for our objective, natural language processing in Sanskrit, and to a way of recreating that evidence in software.

Since this is a comparison between two languages, it seems logical to start with how languages were first used, to speak and communicate rather than to write. So let us start with a discussion of phonetics.

Phonetics

The phonetic sounds of the letters of the English alphabet differ from those of their counterparts in the Sanskrit Varnamala, the Sanskrit corpus of letters. In English, the sounds of the letters clash with their counterparts on quite a few occasions. Let us take the English alphabet and the Sanskrit Varnamala as an example.

In Sanskrit, the letters are called varnas, and the alphabet as a whole is the Varnamala. Every word in Sanskrit is formed by combining two kinds of elements: the Swara (स्वर, vowel) and the Vyanjana (व्यंजन, consonant). Sanskrit has 14 Swaras, 33 Vyanjanas and 2 Swarashrit (स्वराश्रित) special characters, for a total of 49 varnas in the Varnamala. Sanskrit therefore has more letters and building blocks than English.

Of these Swaras, 5 are pure: अ, इ, उ, ऋ, ऌ

The remaining 9 are: आ, ई, ऊ, ॠ, ॡ, ए, ऐ, ओ, औ

For any observer, the difference is quite apparent. The varnas are arranged this way primarily because of how air is shaped within the mouth. When you open the mouth slightly and produce a sound from the glottis, the result is अ, whereas when the mouth opens wider the sound is आ. The English alphabet, in comparison, has only A, equivalent to ए, which completely misses how the glottis behaves when the mouth is opened wide.

Let us now compare the English letters to these varnas.

Pure Swaras: अ, इ, उ, ऋ, ऌ

Others: आ, ई, ऊ, ॠ, ॡ, ए, ऐ, ओ, औ

कंठव्य / ‘क’ वर्ग — क् ख् ग् घ् ङ् A B C D E
तालव्य / ‘च’ वर्ग — च् छ् ज् झ् ञ् F G H I J
मूर्धन्य / ‘ट’ वर्ग — ट् ठ् ड् ढ् ण् K L M N O
दंतव्य / ‘त’ वर्ग — त् थ् द् ध् न् P Q R S T
ओष्ठव्य / ‘प’ वर्ग — प् फ् ब् भ् म् U V W X Y Z
विशिष्ट व्यंजन — य् व् र् ल् श् ष् स् ह्

From the table above, one thing is immediately visible: the number of English letters does not compare to the number of varnas in Sanskrit. In fact there are far fewer. On close examination something more interesting appears: the range of English vocabulary also becomes smaller, because the sounds of many letters do not map to the individual sounds of the Sanskrit varnas.

The map below shows, purely phonetically, which varna each English letter corresponds to.

Pure Swaras: अ, इ, उ, ऋ, ऌ

Others: आ, ई, ऊ, ॠ, ॡ, ए, ऐ, ओ, औ

कंठव्य / ‘क’ वर्ग — क् ख् ग् घ् ङ् A(ए) B(ब्) C(क्) D(ड्) E(इ)
तालव्य / ‘च’ वर्ग — च् छ् ज् झ् ञ् F(फ्) G(ग्) H(ह्) I(इ) J(ज्)
मूर्धन्य / ‘ट’ वर्ग — ट् ठ् ड् ढ् ण् K(क्) L(ल्) M(म्) N(न्) O(ओ)
दंतव्य / ‘त’ वर्ग — त् थ् द् ध् न् P(प्) Q(क्) R(र्) S(स्) T(ट्)
ओष्ठव्य / ‘प’ वर्ग — प् फ् ब् भ् म् U(अ) V(व्) W(व्) X((क्)(ज्))

विशिष्ट व्यंजन — य् व् र् ल् श् ष् स् ह् Y(य्) Z(ज्)

Some glaring realities emerge:

1. क् is the sound of more than one letter: both C and K map to it, as does Q in the map above

2. Only 18 varnas are used to describe all 26 English alphabets

3. X((क्)(ज्)): by its nature this is a compound sound and not a single letter, since phonetically the air does not flow through the mouth in a way that produces it as one sound

4. Z(ज्): it is a compound of ज् and a dot, written ज़
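The many-to-one pattern in the map can be checked mechanically. The following is a minimal sketch in Python that transcribes the letter-to-varna map above into a dictionary and lists the varnas that more than one English letter shares. The names `LETTER_TO_VARNAS` and `shared_varnas` are illustrative choices made for this sketch and are not part of the original research code.

```python
from collections import defaultdict

# Letter -> varna(s), transcribed from the phonetic map above.
# X is kept as a pair because the map treats it as a compound sound.
LETTER_TO_VARNAS = {
    "A": ["ए"], "B": ["ब्"], "C": ["क्"], "D": ["ड्"], "E": ["इ"],
    "F": ["फ्"], "G": ["ग्"], "H": ["ह्"], "I": ["इ"], "J": ["ज्"],
    "K": ["क्"], "L": ["ल्"], "M": ["म्"], "N": ["न्"], "O": ["ओ"],
    "P": ["प्"], "Q": ["क्"], "R": ["र्"], "S": ["स्"], "T": ["ट्"],
    "U": ["अ"], "V": ["व्"], "W": ["व्"], "X": ["क्", "ज्"],
    "Y": ["य्"], "Z": ["ज्"],
}


def shared_varnas(mapping):
    """Return {varna: [letters]} for varnas used by more than one single-sound letter."""
    by_varna = defaultdict(list)
    for letter, varnas in mapping.items():
        if len(varnas) == 1:  # skip compound letters such as X
            by_varna[varnas[0]].append(letter)
    return {v: letters for v, letters in by_varna.items() if len(letters) > 1}


shared = shared_varnas(LETTER_TO_VARNAS)
for varna, letters in shared.items():
    print(varna, "<-", ", ".join(letters))
```

Run on the map as transcribed, this lists four shared varnas: क् (C, K, Q), इ (E, I), ज् (J, Z) and व् (V, W).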

Dual Case

Most languages have no dual number, which leads to ambiguity when processing dual versus plural forms. A unique feature of Sanskrit is that its rich grammar removes this ambiguity.

Taking some examples:

English :

Both of you are reading.

Sanskrit: युवां पठथः
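To make the point concrete for a machine, here is a small Python sketch showing how the verb ending alone tells dual from plural. The table holds the standard textbook present-tense forms of the root पठ् (to read); it is supplied here for illustration and does not come from the research code, and the names `PATH_PRESENT` and `number_of` are hypothetical.

```python
# Present-tense forms of the verb root पठ् (to read), keyed by (person, number).
# Standard textbook paradigm, included only as an illustration.
PATH_PRESENT = {
    ("third", "singular"): "पठति",
    ("third", "dual"): "पठतः",
    ("third", "plural"): "पठन्ति",
    ("second", "singular"): "पठसि",
    ("second", "dual"): "पठथः",
    ("second", "plural"): "पठथ",
    ("first", "singular"): "पठामि",
    ("first", "dual"): "पठावः",
    ("first", "plural"): "पठामः",
}


def number_of(verb_form):
    """Return the grammatical number encoded in a verb form, or None if unknown."""
    for (_person, number), form in PATH_PRESENT.items():
        if form == verb_form:
            return number
    return None


print(number_of("पठथः"))  # dual: "both of you are reading"
print(number_of("पठथ"))   # plural: "all of you are reading"
```

In English, "you are reading" covers all three second-person rows, so the number has to be recovered from context; in the Sanskrit forms it can be read directly off the verb.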

Categorical Classification of words

Words in Sanskrit are also classified into categories. The following is a representation of the important classifications. As we can see, the classification is much the same as in other languages.

Word (पद / pada)
  Noun Root (शब्द / shabda)
    सुबन्तपद (subantapada), तद्धितपद (taddhitapada)
    Masculine: पुल्लिङ्ग (pullinga)
    Feminine: स्त्रीलिङ्ग (strIlinga)
    Neuter: नपुंसकलिङ्ग (napunsakalinga)
  Verb Root (धातु / dhAtu)
    तिङन्तपद (tinantapada), कृदन्तपद (krudantapada)
    परस्मैपदी (parasmaipadI), आत्मनेपदी (AtmanepadI), उभयपदी (ubhayapadI)

Algorithmic structure of the language

Each word in Sanskrit can be divided into sub-words, and those in turn can be divided further, until we reach a dhatu (verb root). This recursive division resembles the structure of an algorithm. Ambiguity arising while combining or dividing words could be eliminated with fuzzy logic or fuzzy reasoning techniques.
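The recursive division described above can be sketched in a few lines of Python. This is only a toy illustration: the split table `SPLITS` is hand-written in a rough ASCII transliteration, the derivations in it are simplified assumptions, and a real system would need Panini's rules and sandhi handling rather than a lookup table. The function name `decompose` is hypothetical.

```python
# Hypothetical, hand-written split table in a rough ASCII transliteration.
# The derivations are simplified for illustration and are not authoritative.
SPLITS = {
    "vidyAlaya": ["vidyA", "Alaya"],  # "school" = learning + abode
    "vidyA": ["vid"],                 # traced to the root vid (to know)
    "Alaya": ["A", "lI"],             # prefix A + root lI
}


def decompose(word, splits=SPLITS):
    """Recursively split a word until no part has a further entry in the table.

    Assumes the table contains no cycles; a cyclic table would recurse forever.
    """
    if word not in splits:
        return [word]  # nothing left to divide: treat as a leaf
    parts = []
    for part in splits[word]:
        parts.extend(decompose(part, splits))
    return parts


print(decompose("vidyAlaya"))  # ['vid', 'A', 'lI']
```

The recursion stops exactly where the text says it should, at parts that cannot be divided further. Where a word admits more than one split, the table would need to return alternatives with scores, which is where the fuzzy reasoning mentioned above would come in.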

Solving Distributional Similarity as the first use case

Distributional similarity is a very common idea in NLP. It emerged from a large number of statistical techniques used in the 1960s and is now widely accepted as one of the best ways of finding word meaning in a given English corpus. The idea comes from linguistics and is usually summed up in J.R. Firth’s words: “You shall know a word by the company it keeps”.

A well-known software technique built on this idea is Word2Vec. Word2Vec is typically used for NLP in English and converts each word into a vector representation that captures the word’s meaning. It offers two model variants, skip-gram and continuous bag of words (CBOW). The idea is simple: given a large piece of text in a language, Word2Vec estimates the probability of a context word appearing near a centre word.

Let us explain it via an example: Let us assume there is a single paragraph as shown below

“An era of rapidly changing technology, virtualisation, SDN/NFV, advent of 4G LTE technology is redefining businesses across the globe. The digitalisation wave has brought artificial intelligence, natural language processing, analytics, and big data into the foray making it more possible for the machines to emulate humans.”

In the word2vec technique, the software runs in an unsupervised fashion: each word in turn is chosen as the centre word, a window of size m is placed around it, and the probability of the words around the centre word is estimated. The net result is a vector representation of each word, called an “embedding”.

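An era of [rapidly changing technology, virtualization, SDN/NFV], advent of 4G LTE technology ...

{2 words before: rapidly changing} {centre word: technology} {2 words after: virtualization, SDN/NFV}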

The window size (m) is something we choose. Following a sliding window principle, if “technology” is the centre word and m = 2, the two words before it and the two words after it model the distribution of “technology” in the corpus.
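The sliding window can be sketched in plain Python as follows. This shows only the step that produces (centre word, context word) training pairs, not the training of the vectors themselves, and the function name `skipgram_pairs` and the simple whitespace tokenisation are assumptions made for the illustration.

```python
def skipgram_pairs(tokens, m=2):
    """Return (centre, context) pairs for every token, using a window of m words per side."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - m)                 # window start, clipped at the text start
        hi = min(len(tokens), i + m + 1)   # window end (exclusive), clipped at the text end
        for j in range(lo, hi):
            if j != i:                     # the centre word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


tokens = "an era of rapidly changing technology virtualisation sdn/nfv".split()
context = [c for centre, c in skipgram_pairs(tokens, m=2) if centre == "technology"]
print(context)  # ['rapidly', 'changing', 'virtualisation', 'sdn/nfv']
```

Words near the start or end of the text simply get a shorter window, which is why the bounds are clipped rather than padded.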

This idea is very powerful, as something almost magical happens when it is run. Words and their vectors are arranged in a space in which position carries meaning, and that meaning is provided by the vectors. So the “king” vector minus the “man” vector plus the “woman” vector yields a vector close to “queen” automatically.
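The familiar form of this vector arithmetic is king minus man plus woman, which lands near queen. The Python sketch below reproduces it with tiny hand-picked two-dimensional vectors, where the two axes loosely stand for "royalty" and "maleness". Real word2vec embeddings are learned from a corpus and have hundreds of dimensions, so the numbers in `VECTORS` are purely illustrative assumptions, as is the function name `analogy`.

```python
import math

# Toy, hand-picked vectors: (royalty, maleness). Not learned from any corpus.
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "prince": (0.5, 1.0),
    "princess": (0.5, -1.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```

Excluding the three input words from the candidates is the usual convention, since the nearest vector to the result is otherwise often one of the inputs.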

One has to ask why this technique was needed. On a closer look, the reason is that in English the same word can represent different things in different contexts. Let us take the example of “glasses”.

Let us take some sentences to represent this

“I have glasses.”

The stress in this sentence is on the object, “glasses”. The obvious question is which glasses we are talking about: a pair of lenses used to correct vision, or glass tumblers? In Sanskrit, such meanings are carried by words that differ in their very construction.

Glass (looking glass, i.e. mirror): दर्पण {noun, masculine}. A smooth surface, usually made of glass with reflective material painted on the underside, that reflects light so as to give an image of what is in front of it.

Spectacles: उपनेत्र {noun, neuter}. A pair of lenses in a frame that are worn in front of the eyes and are used to correct faulty vision or protect the eyes.

By their construction, the words are different in order to represent different meanings. So the notion that words are islands and do not convey any meaning on their own does not really hold for Sanskrit-based NLP. Words by themselves represent their meaning clearly. This is commercially useful for the large variety of FAQ-based chatbots in use today. This is, however, just the beginning phase of the research.

More research down the line

The results of our initial research at Tech Mahindra’s Makers Lab are positive. With Sanskrit’s algorithmic base, the word2vec layer of converting words to vectors can potentially be eliminated. We plan to extend the research (not part of this paper) to the following areas, in which the team is actively involved:

1. Sequence learning: finding an alternative to recurrent neural networks (RNNs) for Sanskrit. Sequences in a language are formed according to grammatical rules; from a machine’s perspective, an RNN model is usually trained on a large corpus to learn how words have been used in the language.
2. Auto-finding Panini’s rules: Panini, the grammarian who developed the grammar of Sanskrit, gave a set of about 4000 rules so that the language is well formed. However, these rules are very difficult to construct, and very few people in the world know them. Our approach is to use a deep reinforcement learning technique to enable a machine to auto-discover those rules. This would give the machine core intelligence about how languages are formed, so that tasks like natural language understanding and generation would become trivial for the machine in the environment in which it is placed.

Conclusion

Human beings evolved at a rapid pace because of the way they could communicate with each other and pass on ideas and messages. One of the oldest well-formed languages is now being relegated to scriptures. Our intent is to ensure that this language becomes the core of machine understanding and of relaying information, not just for a large native population but for the world.

About the Author(s)

“We are reminded of the limitless-ness of Human curiosity, when we see man and machine create marvels for the future together” is the quote Nikhil Malhotra lives by

Nikhil Malhotra is the head of Makers Lab, a unique Thin-q-bator space within Tech Mahindra, and has over 17 years of experience in a variety of technology domains.

In his present avatar he is the head of Tech Mahindra’s R&D space, the Makers Lab, which he created in 2014. The lab focuses on artificial intelligence, robotics and mixed reality. Nikhil’s area of personal research has been natural language processing, enabling machines to talk the way humans do. Nikhil has also designed an indigenous robot in his lab, as a personal assistant.

He lives by a dream of creating smart machines that would wed human emotions with artificial intelligence to make lives better.

He is also a leading speaker on digital transformation, the practical use of AI and the future of AI. He holds a Master’s degree in computing with a specialization in distributed computing from the Royal Melbourne Institute of Technology, Melbourne. Nikhil currently resides in Pune with his wife Shalini and sons Angad and Rudra.

Ipsita Nanda is a solutions architect with Tech Mahindra with over 13 years of experience working on different technologies and in different domains. Ipsita is currently working on designing solutions in big data, data science, AI, search applications like Solr, and publish-subscribe applications on Kafka. Ipsita is based in California with her husband Srihari. She is currently in India, playing the role of caregiver to her father.

Written by Nikhil Malhotra