Finding an Approach for Building Indic Chatbots

Apurva Mudgal
Published in Conversation.ai · May 5, 2020

India is witnessing a new wave of internet users who prefer to access the internet in their native languages. While English is the most common language on the internet, only about 12% of India’s population is familiar with English. The complexity and variety of Indic languages present a unique challenge for natural language processing (NLP). Machine learning technologies are notorious for needing huge amounts of training data, and this problem only gets amplified for Indic languages.

For a chatbot to work end to end, we need multiple NLP components: sentence and word representations, Named Entity Recognisers (NER) and spell correctors, to name a few. If you are curious to read more about the different components that make a good chatbot system, do check out our blog on the same. All of these components have seen significant progress for English, thanks to the abundance of training data available to power deep learning and reinforcement learning based algorithms. An Indic bot would need similar components to work; however, training data is not as easily available. This training data can certainly be created using manual translators.

We wanted to evaluate the use of third-party translation APIs for building Indic chatbots and compare it with the approach of building native NLU components for each Indic language. This article summarizes our findings.

Defining the Two Approaches

Native NLU Development

In this approach, the NLU components that enable a chatbot to work end to end are built and trained specifically on training data for the target language. Since this training data is not readily available, it has to be generated manually, at a cost of about ₹2 per word for annotation.
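To make this concrete, here is a minimal sketch of what one such native component might look like: a toy Hindi intent classifier trained directly on manually annotated Hindi utterances. The data, intent labels and model choice here are illustrative assumptions, not the actual pipeline we use.

```python
# A toy "native" NLU component: a Hindi intent classifier trained on
# manually annotated Hindi utterances (illustrative data; a real bot
# would need thousands of annotated examples per language).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utterances = [
    "मुझे कल सुबह 6 बजे उठा देना",   # "wake me up tomorrow at 6 am"
    "कल सुबह का अलार्म लगा दो",      # "set an alarm for tomorrow morning"
    "मेरा ऑर्डर कहाँ है",             # "where is my order"
    "ऑर्डर कब तक आएगा",              # "when will the order arrive"
]
train_intents = ["set_alarm", "set_alarm", "order_status", "order_status"]

# Character n-grams are a simple, language-agnostic starting point for
# morphologically rich Indic text.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_utterances, train_intents)

print(model.predict(["मुझे सुबह 7 बजे उठाना"]))  # expected: ['set_alarm']
```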

Translation-based Approach

In this approach, user messages to the chatbot in Indic languages are translated to English using third-party translation APIs, and we reuse the NLU components already built for English. This approach relies heavily on the accuracy of the translation APIs, so we evaluated how well they perform.
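To make the flow concrete, here is a minimal sketch of the translation-based pipeline. The two helper functions are placeholders for a third-party translation API client and our existing English NLU stack; the names are hypothetical, not any specific vendor's API.

```python
# Illustrative sketch of the translation-based pipeline.

def translate_to_english(text: str) -> str:
    # Placeholder for a third-party translation API call.
    return text

def english_nlu(text: str) -> dict:
    # Placeholder for the existing English NLU stack
    # (intent classification, NER, spell correction, ...).
    return {"intent": "unknown", "entities": [], "text": text}

def handle_indic_message(user_message: str) -> dict:
    english_text = translate_to_english(user_message)  # Indic -> English
    return english_nlu(english_text)                   # reuse English NLU as-is

if __name__ == "__main__":
    print(handle_indic_message("mujhe kal subah 6 baje utha dena"))
```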

State of Translation APIs

Before we could compare the two approaches, we needed to understand how well translation APIs work today. To do this, we took a dataset of 200 free-form user queries in English, across 3 different bots. These user queries were then translated by native speakers (without any help from machine translation) into Spanish, French, Mandarin (simplified), Marathi, Hindi and Malayalam. Subsequently, a translation API was used to translate these queries back to English. The output was then evaluated manually, and we got the results below:

Some examples of the errors can be seen below:

This activity clearly shows that the translation APIs available off the shelf today are fairly accurate for high-resource languages like Hindi and Marathi, but not so much for low-resource languages like Bahasa and Malayalam.
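For reference, the round-trip evaluation described above can be sketched roughly as follows. Here translate_to_english stands in for whichever third-party translation API is being evaluated, and the exact-match check is a deliberately naive stand-in for the manual evaluation we actually performed.

```python
# Sketch of the back-translation evaluation (illustrative only).

def translate_to_english(text: str, source_lang: str) -> str:
    # Placeholder for the translation API under test.
    return text

def round_trip_accuracy(samples):
    """samples: list of (original_english_query, human_translation, language)."""
    correct = 0
    for original, human_translation, lang in samples:
        back_translated = translate_to_english(human_translation, lang)
        # We judged adequacy manually; exact matching is only a naive proxy.
        if back_translated.strip().lower() == original.strip().lower():
            correct += 1
    return correct / len(samples)
```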

Comparison and Challenges

  1. Cost: Generating data for native NLU is expensive and time consuming. Annotation costs start at around ₹2 per word and can go up to ₹7 per word depending on the language. In comparison, the best available translation APIs cost around 3.5 paisa per message (1 message ≈ 25 characters or 4 words): as a rough comparison, annotating a 4-word message would cost ₹8 to ₹28, while translating it costs about ₹0.035. The translation cost can be reduced further by caching frequent utterances.
  2. Latency: The native approach adds no extra latency, whereas the translation approach adds latency equal to the time taken by the translation API call. Across 1,000 API calls, we observed an average latency of 200 ms and a maximum of 350 ms.
  3. Accuracy: We found translation APIs to be fairly accurate for high-resource languages but not for low-resource ones. Another advantage of this approach is that the accuracy of translation APIs is fairly independent of the domain in which they are used: because they are trained on very broad datasets, their accuracy does not change between, say, a railway bot and a food delivery bot. In the native approach, by contrast, new domain-specific training data would be required. The advantage that stood out in native development is that it is easier to customize for the colloquial aspects of a language. For example, “mujhe kal subah 6 baje utha dena” usually means ‘wake me up tomorrow at 6’, but a translation API would render it as ‘pick me up tomorrow at 6’. Such contextual details can be preserved with native development.
  4. Time to market and effort: The native development approach requires a language speaker to generate training data, whereas the translation-based approach only requires a language speaker to add correctional data for the translations. For native development, ML and engineering work could take about 3 months per language to reach the market; development is linear and requires independent maintenance of more than 20 components (spell correctors, sentence embeddings, intent classifiers, etc.) for each language. In the translation approach, on the other hand, there is only a one-time integration effort. The bot would need a language detection API and caching to avoid inflated translation costs (see the sketch after this list), and a transliteration API could be added to better handle named entities.
  5. Ongoing engineering and ML effort: In the case of native development, the complexity of the technical stack and the maintenance effort are much higher than with a third-party translation system. Not only is the initial effort to set up a native pipeline high, it remains high over time, as each language pipeline needs its own improvements and maintenance. The difference in effort required from data scientists to benchmark and improve model accuracy for English alone, as opposed to multiple languages, should not be underestimated.
  6. Security: There can be data privacy and security concerns when using a third-party translation API, as all user dialogues are sent to the API provider. Some translation APIs allow local deployment, which might address these concerns but would also increase the cost and maintenance effort of this option.
  7. Sales and strategy: The translation-based approach creates a dependency on the third party, but it does not hinder market expansion. Research in machine translation has witnessed rapid progress and looks very promising, so the accuracy of machine translation is highly likely to improve further. That said, from a long-term business point of view, the heavy reliance on a third party carries some risk.
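As mentioned in point 4, the simplest way to keep translation costs in check is to detect the language up front and cache translations of frequent utterances. The sketch below is illustrative; detect_language and translate_to_english are hypothetical placeholders for third-party API calls, not any particular vendor's SDK.

```python
from functools import lru_cache

def detect_language(text: str) -> str:
    # Placeholder for a third-party language detection API.
    return "hi"

def translate_to_english(text: str, source_lang: str) -> str:
    # Placeholder for a third-party translation API (billed per message).
    return text

@lru_cache(maxsize=10_000)
def cached_translate(text: str, source_lang: str) -> str:
    # Frequent utterances ("hi", "thanks", menu options) hit the cache,
    # avoiding repeated paid API calls.
    return translate_to_english(text, source_lang)

def to_english(text: str) -> str:
    lang = detect_language(text)
    if lang == "en":
        return text  # skip the translation call entirely for English input
    return cached_translate(text, lang)
```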

Conclusion

For high-resource languages, a translation-based bot gives reasonable accuracy, is much quicker to deliver and is cost effective. It also provides a more scalable solution across domains, and translation APIs will only get better over time. Unless you are building for a low-resource language today, we see a lot of advantages in using translation APIs over native NLU development. A natively developed bot requires more language expertise and annotation of domain-specific data, which keeps its cost and effort high, and it makes it harder to readily benefit from the NLU development happening across the world.

Credits: I would like to thank Aarondeadly and Ashutosh Singh for all the effort to collect, tag, annotate and analyze the data that made this analysis possible. Also, a big shout-out to the awesome machine learning team at Haptik, whose perspectives helped us shape this approach.
