Harnessing the Power of Retrieval-Augmented Translation for Low-Resource Languages

Christopher Ibe
3 min read · Jan 27, 2024


Introduction: Navigating the Linguistic Diversity of Low-Resource Languages

The linguistic diversity present in low-resource languages offers a unique set of challenges and opportunities for the field of AI translation. Our project, “Retrieval-Augmented Translation” (RAT), seeks to explore this uncharted territory by focusing on languages that, while rich in cultural nuances and idiomatic expressions, often lack extensive digital resources and tools for accurate translation.

The Project: Exploring the Frontier with Retrieval-Augmented Translation

Our journey began with Microsoft's "Tiny Stories" dataset, designed to teach small language models genuine language understanding rather than mere memorization of factual data. This dataset, comprising simple yet culturally rich narratives, served as an ideal foundation for our work, given its focus on language comprehension and storytelling.

Technical Workflow: Bridging Languages with Advanced Techniques

At the heart of our project lies a sophisticated workflow that combines traditional translation methods with cutting-edge retrieval-augmented techniques. Here’s a closer look at our workflow:

  1. Data Compilation: Leveraging the Tiny Stories dataset, we compiled a rich collection of narratives translated by linguistic experts into several low-resource languages, including Igbo, Yoruba, and Hausa. This bilingual dataset, pairing English stories with their corresponding translations, forms the backbone of our translation model.
  2. Embedding and Retrieval: We precomputed embeddings for the English narratives using OpenAI's "text-embedding-3-small" model, chosen for its improved multilingual handling and cost-effectiveness. These embeddings, which capture the semantic essence of each story, were stored in our vector database as a frozen retrieval resource (a sketch of this step appears after the figure below).
  3. Contextual Retrieval and Translation: For each translation query, we used cosine similarity to identify the most contextually relevant English narratives in the database. These examples were then presented alongside the query to our base Large Language Model (LLM), GPT-4, enhancing its ability to generate accurate and culturally resonant translations.
Figure: Retrieval-based approach for augmenting machine translation
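
To make step 2 concrete, here is a minimal sketch of how the frozen retrieval resource can be precomputed: each English story is embedded once with "text-embedding-3-small" and the vectors are persisted alongside the bilingual records. The file names, record fields, and the simple NumPy store standing in for the vector database are illustrative assumptions, not a description of our exact setup.

```python
# Sketch of step 2: precompute embeddings for the English stories and
# store them as a frozen retrieval resource. File paths, record fields,
# and the NumPy-based store are illustrative assumptions.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts, model="text-embedding-3-small"):
    """Embed a batch of texts with OpenAI's embedding endpoint."""
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data], dtype=np.float32)

# stories.json is assumed to hold records like
# {"english": "...", "igbo": "...", "yoruba": "...", "hausa": "..."}
with open("stories.json", encoding="utf-8") as f:
    stories = json.load(f)

english_texts = [s["english"] for s in stories]
embeddings = embed_texts(english_texts)

# L2-normalise once so cosine similarity reduces to a dot product at query time.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Persist the frozen retrieval resource alongside the bilingual records.
np.save("story_embeddings.npy", embeddings)
```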

Frozen vs. Learned Retrieval: Our initial approach utilized a frozen retrieval database, offering stability and reliability in retrieving contextually relevant examples. Future work may explore learned retrieval techniques, such as those studied in Google's REALM or DeepMind's RETRO, in which retrieval is integrated into model training rather than treated as a fixed, external lookup. While learned retrieval promises more personalized and contextually nuanced translations, it also poses challenges in complexity and computational demands.
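
To illustrate what frozen retrieval looks like at query time, the sketch below embeds an incoming English story, ranks the precomputed story embeddings by cosine similarity, and hands the top matches, together with their expert translations, to GPT-4 as in-context examples. It builds on the store from the previous sketch; the prompt wording, the choice of k=3, and the helper names are illustrative assumptions rather than our exact implementation.

```python
# Sketch of frozen retrieval + translation: cosine-similarity lookup over
# precomputed embeddings, then a few-shot prompt to GPT-4.
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

embeddings = np.load("story_embeddings.npy")  # frozen, precomputed, unit-norm rows
with open("stories.json", encoding="utf-8") as f:
    stories = json.load(f)

def retrieve(query, k=3, model="text-embedding-3-small"):
    """Return the k stories whose English text is most similar to the query."""
    resp = client.embeddings.create(model=model, input=[query])
    q = np.array(resp.data[0].embedding, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = embeddings @ q            # cosine similarity, since rows are unit-norm
    top = np.argsort(scores)[::-1][:k]
    return [stories[i] for i in top]

def translate(query, target="igbo"):
    """Translate an English story using retrieved bilingual examples as context."""
    examples = retrieve(query)
    context = "\n\n".join(
        f"English: {ex['english']}\n{target.capitalize()}: {ex[target]}" for ex in examples
    )
    prompt = (
        f"Using the example translations below as guidance, translate the final "
        f"English story into {target.capitalize()}.\n\n{context}\n\n"
        f"English: {query}\n{target.capitalize()}:"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(translate("Once upon a time, a little girl found a shiny stone.", target="yoruba"))
```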

Comparative Analysis: We compared translations produced by the base LLM alone, by the base LLM augmented with retrieved context via our RAT technique, and by Google Translate. Expert indigenous speakers evaluated the outputs for each language, and the results overwhelmingly favored the RAT approach, highlighting its strength in capturing the linguistic and cultural essence of the narratives.

Table 1: Yoruba Translations using GPT-4, RAT, and Google Translate
Table 2: Igbo Translations using GPT-4, RAT, and Google Translate
Table 3: Hausa Translations using GPT-4, RAT, and Google Translate

Future Directions

Building on the promising outcomes of our RAT project, we aim to delve deeper into retrieval-augmented translation. Exploring learned retrieval mechanisms will be a key focus, with the goal of further improving the contextual relevance and accuracy of translations for low-resource languages. At Hypa AI, we continue to strive to democratize access to intelligence by building multicultural, multilingual, and multimodal AI systems.

Conclusion: Beyond Translation, Towards Cultural Understanding

Our exploration of retrieval-augmented translation transcends mere technical achievement; it’s a step towards bridging cultural and linguistic divides. By enhancing the depth and fidelity of translations, we endeavor to celebrate and share the rich tapestry of human narratives, fostering understanding and connection across the globe.

About Us

Chris Ibe and Okezie Okoye, from Hypa AI, spearhead this journey, dedicated to exploring the depths of machine learning’s potential. Their work at the intersection of technology and humanity reflects a commitment to ethical development and the collective well-being of society.

Hypa AI is committed to pioneering advancements in artificial intelligence guided by that same ethical vision. Our commitment extends to democratizing access to intelligence, focusing on solutions that are multicultural, multilingual, and multimodal, so that AI technologies benefit all of humanity.
