Improving Pre-Hospital Care For Language Minorities Using Machine Learning
Pre-hospital emergency care should be of the same quality for all citizens. Similarly, the benefits of advances in machine learning should be spread equally across society. Unfortunately, neither is the case. Language barriers, for instance, limit the quality of pre-hospital care given to language minorities, and algorithmic bias has already harmed specific societal groups. In this blogpost we outline efforts at Corti.ai to make pre-hospital care more accessible for callers with limited English proficiency (LEP) through a machine learning language recognizer. We then explore limitations and biases of the model, and argue more generally that the biases of machine learning systems should be mapped through dedicated test datasets before they are used in production.
Background and current solutions
Emergency calls with language barriers occur often. In the United States, around 9% of the population speaks English ‘less than very well’, and this number is expected to grow in the future [1,2]. Studies suggest that language barriers negatively impact pre-hospital care and that dispatchers consider LEP calls particularly stressful.
Generally, Emergency Medical Services (EMS) dispatch centers provide a third-party telephonic interpreter service known as the language line. However, there is no conclusive evidence that this improves the quality of pre-hospital care given to LEP callers. Dispatchers identify delays and frustrating communication with language lines as reasons for relying on informal strategies instead, such as multilingual bystanders and co-workers.
Language recognition using machine learning
As a first approach in developing better tools to help dispatchers during LEP calls, we study the feasibility of training a simple and scalable language recognition system using deep neural networks [6,7,8]. The purpose of this is twofold. Firstly, a language recognition system deployed at EMS dispatch centers could reduce delays and mistakes associated with the language line by automatically routing to the correct interpreter. Secondly, it is a good stepping-stone for more advanced machine learning solutions such as in-call translation.
To get a qualitative idea of the difficulty of this task for a human, and how it is affected by changing the duration of the audio snippet, try to determine the languages in playlists 1 and 2. We’ll limit the options to a 6-way classification task: ‘French, German, Japanese, Arabic, Spanish, and Mandarin’. The correct answers can be found in tables 5 and 6 at the end of the blogpost. After running a contest at the office, we find that, as we might expect, the perceived level of difficulty is significantly higher for classifying 1 second snippets.
In preparation for training the machine learning model, we collect varying amounts of telephone speech for 11 languages (20–120 hours per language). The technical details of how to train deep learning models are beyond the scope of this blogpost. The main idea is that as we present batches of audio snippets to the model along with the correct language labels, it becomes incrementally better at recognizing patterns that allow it to distinguish between languages. After training, we find an accuracy of 94%, 90%, and 73% on a balanced validation dataset in predicting the language of 5, 3, and 1 second snippets respectively (see figure 1). It is somewhat surprising that the model accuracy for 1 second snippets is only 21 percentage points lower than for 5 second snippets, since humans perceive the former as a lot more difficult. This seems to indicate that the model finds discriminative features that are different from those used by humans.
We also study the performance of the language recognizer as a function of the training dataset size (figure 2), from which we conclude that for all the major languages spoken by LEP communities in the US, it is feasible to collect the necessary amount of data. In practice, it is unlikely that a language recognition system needs to recognize the language within 5 seconds, so in a final product, longer snippets or a voting scheme can be used to further increase accuracy.
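The voting scheme mentioned above is not specified further in this post. As a rough sketch of what it could look like, the snippet below combines per-snippet predictions from one call either by majority vote or by averaging class probabilities. The function names and the toy probabilities are illustrative assumptions, not part of the actual system.

```python
from collections import Counter


def majority_vote(snippet_labels):
    """Return the label predicted for the most snippets of one call."""
    if not snippet_labels:
        raise ValueError("need at least one snippet prediction")
    # Ties are resolved in favour of the label that was seen first.
    return Counter(snippet_labels).most_common(1)[0][0]


def average_probabilities(snippet_probs):
    """Average per-class probabilities over snippets.

    `snippet_probs` is a list of dicts mapping label -> probability
    (a hypothetical output format for the recognizer).
    Returns the winning label and its mean probability.
    """
    if not snippet_probs:
        raise ValueError("need at least one snippet prediction")
    totals = {}
    for probs in snippet_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(snippet_probs)
    means = {label: total / n for label, total in totals.items()}
    best = max(means, key=means.get)
    return best, means[best]


# Toy values for illustration only.
call = [
    {"Spanish": 0.6, "French": 0.4},
    {"Spanish": 0.3, "French": 0.7},
    {"Spanish": 0.4, "French": 0.6},
]
label, confidence = average_probabilities(call)
```

Averaging probabilities keeps information about how confident each snippet prediction was, whereas a plain majority vote treats a narrow and a clear-cut snippet decision the same way.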
Exploring limitations and biases
Just reporting the validation dataset accuracy does not paint the full picture since it never fully accounts for the variability of real-life data. Let us therefore explore some important aspects of the system: how it reacts to new languages and foreign accents.
When a previously unseen language is presented to the model, it will incorrectly choose one of the classes present in the training dataset. We are interested in whether the chosen class has a high phonetic similarity to the real class, as we expect intuitively. To explore this, we take samples for 2 languages that are not present in the training dataset: Italian, which is phonetically close to its Romance neighbour Spanish, and Finnish, which is a Uralic language with low phonetic similarity to any of the classes in the training dataset.
As expected, we find that when given Italian samples, the model predicts mostly Spanish (95%), while for Finnish, the model predictions are more uniform across classes, with Spanish (42%), French Creole (31%), and German (12%) as the dominant ones. Next, we explore something more subtle.
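For reference, per-class shares like the ones quoted for Italian and Finnish can be computed by counting how often each class is predicted for the snippets of an unseen language. The sketch below assumes the predictions are available as a plain list of label strings; the helper name and the toy list are hypothetical.

```python
from collections import Counter


def prediction_shares(predicted_labels):
    """Map each predicted class to its share of all predictions.

    Classes are returned from most to least frequent.
    """
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.most_common()}


# Toy predictions for snippets of an unseen language (not real results).
toy_predictions = ["Spanish", "Spanish", "Spanish", "French"]
shares = prediction_shares(toy_predictions)
```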
Given that we are eventually building a product to help LEP communities access high-quality pre-hospital care, we want it to work for sub-communities within language groups. We can get an idea of how our model will behave for different dialects by analysing how it performs on accented speech.
For this purpose, we construct out-of-domain (OOD) datasets for several English accents by taking samples from http://accent.gmu.edu/, balancing the number of male and female speakers, and cutting the files into 5 second snippets.
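The two preparation steps described here, balancing speakers and cutting recordings into fixed-length snippets, could be sketched as follows. This assumes each recording is a list of audio samples with a `gender` tag and an 8 kHz sample rate; those details, and the helper names, are assumptions for illustration and are not taken from the actual pipeline.

```python
import random

SAMPLE_RATE = 8000  # Hz; assumed value, not stated in this post


def cut_into_snippets(samples, seconds=5, sample_rate=SAMPLE_RATE):
    """Split a waveform into non-overlapping snippets, dropping a short tail."""
    size = seconds * sample_rate
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def balance_by_gender(recordings, seed=0):
    """Downsample the larger group so every gender has the same speaker count.

    `recordings` is a list of dicts with a hypothetical "gender" key.
    """
    groups = {}
    for rec in recordings:
        groups.setdefault(rec["gender"], []).append(rec)
    if not groups:
        return []
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    n = min(len(group) for group in groups.values())
    balanced = []
    for gender in sorted(groups):
        balanced.extend(rng.sample(groups[gender], n))
    return balanced
```

Dropping the incomplete tail keeps every snippet at exactly the same duration, which matters because accuracy depends strongly on snippet length.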
There have been reports suggesting speech-based machine learning models perform worse on female speech. To get an indication of whether this bias exists for our system we also make two small US accented speech datasets, one with male, and one with female voices.
You can listen to some randomly selected speech samples below:
Interesting! Firstly, the accuracy is slightly lower for the US class than expected from the validation accuracy for English samples (92%). This could be due to differences in the acoustic and/or recording conditions between the training dataset and the OOD datasets. Secondly, our model does not seem to perform worse for female speakers. And thirdly, we find that the model has difficulty classifying accented speech, and that performance varies greatly per accent.
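A comparison like this comes down to measuring, for each OOD dataset, the fraction of snippets that the model labels as English. A minimal sketch follows, assuming predictions are stored per dataset as lists of label strings. The dataset names and values below are made up for illustration and are not the results discussed above.

```python
def accuracy_per_dataset(predictions, expected_label="English"):
    """Map dataset name -> fraction of snippets predicted as `expected_label`."""
    result = {}
    for name, labels in predictions.items():
        if not labels:
            continue  # skip empty datasets rather than divide by zero
        correct = sum(1 for label in labels if label == expected_label)
        result[name] = correct / len(labels)
    return result


# Made-up predictions, for illustration only.
toy = {
    "US accent (female)": ["English"] * 9 + ["German"],
    "Spanish accent": ["English"] * 4 + ["Spanish"] * 6,
}
scores = accuracy_per_dataset(toy)
```

Reporting one number per accent or speaker group, rather than a single pooled accuracy, is what makes differences between groups visible.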
If you tried the task of classifying the speech snippets in playlists 1 and 2, you probably made more mistakes in languages you are less familiar with. Machine learning systems suffer even more from variations in the data presented to them. For instance, for humans the Spanish accent in playlist 3 should not be much more difficult to identify as English than the US one, but the model is often unsuccessful in doing so. The reason is that it merely detects patterns in the dataset it is trained on, including any biases that dataset contains. Our system is trained on English speech with a US dialect, and it has clearly developed a bias towards recognizing only US-accented speakers as speaking English.
The most straightforward way to combat such biases would be to increase the variability of the training dataset; in our case by including speech from different dialects. However, data retrieval and data cleaning are time-consuming and costly processes. A well-established and easily applicable method to increase the robustness of speech-recognition systems is speed-perturbation, in which audio snippets are slowed down or sped up, which simultaneously changes their pitch. This method increases the effective variability of the training dataset.
We test the usefulness of this approach in our setting by randomly slowing down or speeding up each sample by between 0% and 20% during training.
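As a rough illustration of the idea, speed-perturbation can be implemented by resampling the waveform and keeping the original sample rate, so that duration and pitch change together. The sketch below uses plain linear interpolation and draws the speed factor uniformly from 0.8 to 1.2. Both choices, and the function names, are assumptions made for illustration; a production pipeline would more likely rely on a dedicated audio resampler.

```python
import random


def speed_perturb(samples, factor):
    """Resample `samples` so that playback is `factor` times faster.

    factor > 1 shortens the signal (faster, higher pitch);
    factor < 1 lengthens it (slower, lower pitch).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    last = len(samples) - 1
    out_len = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(out_len):
        pos = min(i * factor, last)  # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Linear interpolation between the two neighbouring samples.
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def random_speed_perturb(samples, max_change=0.2, rng=random):
    """Apply a random speed change of up to `max_change` in either direction."""
    factor = rng.uniform(1.0 - max_change, 1.0 + max_change)
    return speed_perturb(samples, factor), factor
```

Calling `random_speed_perturb` on each snippet as it is loaded means the model rarely sees exactly the same waveform twice, which is where the extra variability comes from.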
The results suggest that this method increases the robustness of the system significantly for the US accents as well as for the Mandarin and Spanish accents the previous model performed worst on. It is interesting to mention that the accuracy on our validation dataset actually does not improve. This highlights the importance of measuring the performance of models on OOD datasets to map their limitations.
We conclude that a simple language recognition model using deep neural networks achieves an encouraging accuracy when distinguishing between 11 languages. This is the first step in providing emergency dispatchers with technology to help them with language barriers. We find that the language recognition system is sensitive to accents, and that speed-perturbation mitigates this limitation to a degree. We argue that the biases and limitations of machine learning systems need to be mapped through performance measurements on dedicated OOD datasets before they are used in production.
For more information on Corti.ai in general, or this particular blogpost, don’t hesitate to get in touch at firstname.lastname@example.org.
[1] Census: Languages Spoken at Home and Ability to Speak English for the Population 5 Years and Over: 2009–2013. Retrieved from: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html.
[2] Ortman JM. 2011. Language Projections: 2010 to 2020. Presented at the Annual Meetings of the American Sociological Association.
[3] Tate RC. 2015. The need for more prehospital research on language barriers: a narrative review. Western Journal of Emergency Medicine. 16(7): 1094.
[4] Meischke H, Chavez D, Bradley S, Rea T, and Eisenberg M. 2010. Emergency communications with limited-English-proficiency populations. Prehospital Emergency Care. 14(2): 265–271.
[5] Tate RC, Hodkinson PW, Meehan-Coussee K, and Cooperstein N. 2016. Strategies used by prehospital providers to overcome language barriers. Prehospital Emergency Care. 20(3): 404–414.
[6] Muthusamy YK, Barnard E, and Cole RA. 1994. Reviewing automatic language identification. IEEE Signal Processing Magazine. 11(4): 33–41.
[7] Lee K, Li H, Deng L, et al. 2016. The 2015 NIST language recognition evaluation: the shared view of I2R, Fantastic4 and SingaMS. Interspeech. 3211–3215.
[8] Gonzalez-Dominguez J, Lopez-Moreno I, Sak H, Gonzalez-Rodriguez J, and Moreno PJ. 2014. Automatic language identification using long short-term memory recurrent neural networks. Fifteenth Annual Conference of the International Speech Communication Association.
[9] Ko T, Peddinti V, Povey D, and Khudanpur S. 2015. Audio augmentation for speech recognition. Sixteenth Annual Conference of the International Speech Communication Association.
Appendix A — language labels