Better Language Translation Through Machine Learning: Everything I Wish I Knew 6 Months Ago

Julie Kim · Published in The Startup
8 min read · Dec 15, 2020

San José, CA is one of the most linguistically diverse cities in the United States. 57% of its residents speak a language other than English at home. About 1 out of 5 residents is considered not proficient in English. Spanish and Vietnamese are the most common languages spoken by residents with limited English proficiency.

In efforts to provide inclusive digital services for all its residents — including those with limited English proficiency — the City of San José invested in advanced technology for use in its San José 311 app.

The San José 311 app team used machine learning to build a custom, City-trained translation model. It is trained on words and phrases specific to the City context and reflects regional dialects unique to San José's Spanish- and Vietnamese-speaking communities.

Most importantly, it was designed to continuously improve over time.

The San José 311 app allows residents to report neighborhood concerns like graffiti or potholes online. Better machine translations would generate more accurate static text in the app. They would also ease communication between non-English-speaking residents and City staff through near-instantaneous, dynamic translation.

“[Free translation tools] for Vietnamese are awful.” — Vietnamese-speaking San José resident

Past research at the City revealed that free, widely used machine translation tools often produce inaccurate translations, for Vietnamese in particular. This is reflected in our final human evaluation, in which we compared the City-trained translation model against one popular free translation tool. The City-trained model showed a 22% to 51% improvement over the free tool, depending on language and translation direction.

Biggest surprises:

  1. Building and maintaining a machine learning model requires a lot of humans
  2. Certain text is easier to machine translate with accuracy
  3. Historic immigration patterns influence regional language preferences
  4. Don’t rely solely on machines to evaluate your machine translations
  5. You might need a smaller dataset than you think

Methodology for Human Evaluation

We enlisted the help of 20 human evaluators to rate a popular free translation tool against the City-trained translation model. Our evaluators rated four translation directions: Spanish to English, English to Spanish, Vietnamese to English, and English to Vietnamese.

Human evaluators rated sentences for both (1) consistency with the original meaning and (2) grammatical correctness on a three-point scale: "Good", "Acceptable", and "Needs work". A final rating required agreement among at least three of the four evaluators who rated each sentence.
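To make the consensus rule concrete, here is a minimal sketch in Python (illustrative only, not our actual evaluation tooling): a sentence receives a final rating only when at least three of its four evaluators agree.

```python
from collections import Counter

# Illustrative sketch of the consensus rule described above (not the City's
# actual tooling): four evaluators rate each sentence on a three-point scale,
# and a final rating is recorded only when at least three of them agree.

def consensus_rating(ratings, min_agreement=3):
    """Return the agreed rating, or None if fewer than `min_agreement` evaluators concur."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agreement else None

print(consensus_rating(["Good", "Good", "Good", "Acceptable"]))        # -> 'Good'
print(consensus_rating(["Good", "Good", "Acceptable", "Needs work"]))  # -> None (no consensus)
```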

For each translation direction, we selected 100 machine-translated sentences for human evaluation based on realistic usage scenarios.

For instance, to evaluate English to Vietnamese machine translations, we randomly selected sentences from current English email scripts used by San José 311's Customer Contact Center staff. We then machine-translated them into Vietnamese, and bilingual translators rated the Vietnamese machine translations.

Evaluating the opposite translation direction (from, say, Spanish into English) was more complicated. We didn't have access to San José 311 service request descriptions in Spanish, but we needed these to test our specific use case. Instead, we had certified professional translators translate 100 random English service requests into Spanish, then machine-translated that text back into English. (This round-trip translation technique is commonly used in academic machine translation evaluation.)

English-speaking Customer Contact Center staff rated the original English service request against the machine-translated version.
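Put together, the round-trip evaluation looked roughly like the sketch below, where machine_translate is a hypothetical stand-in for whichever translation service is being tested (the forward Spanish translations came from professional human translators, not a machine).

```python
# Sketch of the round-trip evaluation workflow described above.
# `machine_translate` is a hypothetical placeholder for the MT system under test.

def machine_translate(text: str, source: str, target: str) -> str:
    """Hypothetical wrapper around the translation service being evaluated."""
    raise NotImplementedError("plug in the translation service under test")

def build_round_trip_pairs(english_requests, human_spanish_translations):
    """Pair each original English request with its machine back-translation."""
    pairs = []
    for original_en, human_es in zip(english_requests, human_spanish_translations):
        back_translated_en = machine_translate(human_es, source="es", target="en")
        pairs.append({
            "original_english": original_en,
            "machine_back_translation": back_translated_en,
        })
    return pairs  # staff rate each pair side by side for meaning and grammar
```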

Lesson #1: Building and maintaining a machine learning model requires a lot of humans

Machine learning is often perceived as mostly hands-off, automated and mysteriously driven by machines. We learned that skilled humans are required almost every step of the way — from collecting high quality data to translating phrases, from evaluating machine translations to continuously retraining the model.

We were surprised that each new training session requires a human to collect translation errors, correct them and then manually retrain the model.

By far, however, the most time-consuming task was collecting high-quality sentence pairs to build our first dataset. The dataset comprises "sentence pairs": commonly used English words, phrases, and sentences mapped to accurate Spanish and Vietnamese translations.
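For illustration, a sentence pair is simply a source sentence matched with its translation, something like the toy records below (rough placeholders, not the professionally reviewed pairs we actually trained on).

```python
# Rough placeholder pairs for illustration only; the real dataset was written
# and reviewed by professional translators and bilingual City staff.
english_spanish_pairs = [
    ("Please describe the location of the pothole.",
     "Por favor describa la ubicación del bache."),
    ("Thank you for reporting graffiti in your neighborhood.",
     "Gracias por reportar el grafiti en su vecindario."),
]

english_vietnamese_pairs = [
    ("Please describe the location of the pothole.",
     "Vui lòng mô tả vị trí của ổ gà."),
]
```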

Lesson #2: Certain text is easier to machine translate with accuracy

It might be obvious that simple, plain language written in short sentences is easier to translate for both humans and machines. Beyond that, we identified a few other reasons why certain translation directions tested better than others:

Spelling errors, idioms and slang challenge machines

Service request descriptions submitted by residents include slang, spelling and grammatical errors, idioms, and figures of speech. This type of language pushes an algorithm to its limits. In addition, we learned that written Vietnamese uses diacritics that take extra effort for a user submitting a request from a mobile device.

Emails from City staff, on the other hand, might include City jargon but minimal spelling errors and slang. Our City-trained model excels at correctly translating bureaucratic language since jargon often follows rules.

Culture and dialect

Through our work with Vietnamese bilingual City staff, we learned about distinctions between pre-1975 and post-1975 written Vietnamese, along with slight differences between the northern, central, and southern dialects. Some common English words, like graffiti, fire hydrant, and yard sale, simply had no Vietnamese equivalent.

This added complexity to our Vietnamese model, which now includes five different Vietnamese phrases for graffiti. This was less relevant for our Spanish translation model.

Baseline quality of existing translation tool affected our model

Our technology used an existing free translation tool as the base for our City-trained model, so the quality of that existing tool for each language influenced our models. This explains how we collected about half as much data for Spanish as for Vietnamese (1,178 sentence pairs vs. 2,049 sentence pairs), yet our Spanish model still tested better with humans.
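As a rough open-source analogue (not our actual stack), the sketch below shows how a team might start from a pretrained English-to-Spanish model, here Helsinki-NLP's MarianMT via the Hugging Face transformers library, as the base that custom sentence pairs would later fine-tune.

```python
# Illustrative open-source analogue, not the City's actual stack.
# Requires: pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"  # pretrained English -> Spanish base model
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["There is graffiti on the fire hydrant near my house."],
                  return_tensors="pt", padding=True)
translated = model.generate(**batch)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))
```

The stronger the base model is for a given language, the less custom data the fine-tuned model needs, which matches what we saw with Spanish versus Vietnamese.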

“[The free translation tool for Spanish] is not too bad. It has gotten a lot better over the years.” — Spanish-speaking evaluator

“[Free translation tools] for Vietnamese are awful.” — Vietnamese-speaking San Jose resident

Lesson #3: Historic immigration patterns influence regional language preferences

Mexico is the number one country of origin for immigrants in Santa Clara County; almost 20% of all immigrants in the County came from Mexico. Our team concluded early on that our Spanish translation model would be oriented toward Mexican Spanish.

There was less clarity on Vietnamese — at first. Ultimately, however, Vietnamese bilingual City staff expressed a near unanimous preference for pre-1975, southern Vietnamese in written communications from the City.

“…when the communists took over the South [in 1975], more than one million Southern Vietnamese people were able to escape to the outside world… bringing with them their Cultural & Language Heritage … the pre-1975 language is the academic language, taught in schools…San Jose City should use the formal and academic language so that it can be taken seriously by the Vietnamese-American readers.” — Vietnamese community leader

“Based on my experience, working with the Vietnamese community in San Jose, they are very picky in using pre-1975 Vietnamese. I think South vocabulary is more popular.” — Vietnamese community leader

Lesson #4: Don’t rely solely on machines to evaluate your machine translations

The BLEU (Bilingual Evaluation Understudy) score is a widely accepted metric used to evaluate machine-translated text. It ranges from 0 to 100. Early on, we were discouraged by the BLEU scores automatically generated by our translation models. A machine translation expert advised us to aim for a minimum BLEU score of 85. Our BLEU score was consistently in the 50s and 60s.
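For anyone who wants to compute the metric themselves, BLEU is available in open-source tooling; the sketch below uses the sacrebleu library (purely illustrative, since our scores were generated automatically during training).

```python
# Minimal BLEU computation with the open-source sacrebleu library.
# Requires: pip install sacrebleu
import sacrebleu

hypotheses = ["The pothole on the street has been repaired."]  # machine output
references = [["The pothole on the street was repaired."]]     # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # score on the 0-100 scale described above
```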

Human evaluations of our translation models told us a different story: despite the modest BLEU scores, the City-trained model still showed the 22% to 51% improvement over the free tool described above.

We suspect the BLEU score is more accurate for larger and more diverse datasets. Our recommendations for other teams struggling with limited resources:

  • Use the BLEU score as a proxy metric but don’t rely on it too much.
  • Don’t be overly discouraged by a low BLEU score.
  • Consider evaluating with humans earlier than you planned.

Lesson #5: You might need a smaller dataset than you think

Various machine learning experts advised us that we needed a minimum of 5,000 to 10,000 training pairs to build a quality machine translation model.

We were able to meet our “definition of done” criteria with a smaller dataset than we’d anticipated. We eventually landed on 1,178 sentence pairs for our Spanish model and 2,049 sentence pairs for our Vietnamese model.

We theorized that our particular use cases limit the vocabulary; there are only so many ways a resident can describe a pothole or illegal dumping. Even communications from the City are constrained in their wording.

In some cases, adding data actually harmed the translations. We suggest careful use of the glossary function for monosyllabic words in Vietnamese. The glossary is a separate dataset that overrides translations from the primary machine learning model. With Vietnamese, we saw the translations for newspaper (báo) and housing (nhà) appearing in odd places, since those Vietnamese words are single syllables that also appear inside many other common words.
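To make that failure mode concrete, here is a deliberately naive, token-level glossary override (illustrative only; not how the real glossary feature is implemented). Because Vietnamese writes each syllable as its own token, matching on a single syllable misfires inside compound words.

```python
# Deliberately naive glossary override for illustration; not the actual
# glossary implementation. Single-syllable entries match inside compounds.

glossary_vi_en = {
    "báo": "newspaper",  # "báo" also appears in "thông báo" (announcement)
    "nhà": "housing",    # "nhà" also appears in "nhà hàng" (restaurant)
}

def naive_glossary_override(vietnamese_text: str) -> str:
    """Swap any glossary syllable for its English term, token by token."""
    return " ".join(glossary_vi_en.get(tok.lower(), tok)
                    for tok in vietnamese_text.split())

print(naive_glossary_override("thông báo về nhà hàng"))
# -> 'thông newspaper về housing hàng' (an announcement about a restaurant, mangled)
```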

We'd love to connect with you

Through our work, we learned that other government agencies are in the process of implementing or are considering applications for this technology. We found it challenging to connect with any of them in a meaningful way.

We believe there is potential for sharing knowledge, lessons learned and possibly even sentence pairs of government-specific language.

If you have experience with deploying machine learning for translation in a government context, we’d love to hear from you!

Many thanks to my current and former colleagues German Sedano, Arti Tangri, Michelle Thong and Nira Datta. None of this would be possible without thoughtful leadership from Kip Harkness, Jerry Driessen, Rob Lloyd and Dolan Beckel and our partners in language access: Sarbjeet Kaur and Stephanie Jayne.

We are also enormously indebted to our bilingual translators and evaluators: Hoang Troung, Vy Nguyen, Cuong Le, Janie Le, Chau Le, Hanh-Giao Nguyen, Oscar Delgado, Ron Echeverri, Annie Gambelino, Xochitl Montes, Cesar Arrellano, Abelardo Pantoja, Desiree Jafferies, Kia O’Hara, Sharon Smith, Jennifer Pettigrew, Denika Jenkins, Kathy Alvarado, Donna Becker, Vivian Do, and Quynh Nguyen.

Julie Kim is a civic-minded product designer at Coforma. Previously: City of San Jose, Code for America. More here: www.juliekim.work