Speech Recognition Using the Azure Cognitive Speech (ACS) Service for Healthcare

Swathi Manithody
NeST Digital
Aug 8, 2023

Documentation of observations in the healthcare domain is critical for diagnosing an illness accurately and treating the patient better. Most healthcare organizations have already moved to digital software to better track and maintain their patient data, and much of this software is now moving toward cloud services for better availability. Manual documentation is error prone: important details can be missed in the heat of the moment, typing them into the system is time consuming, and in some cases the documentation happens only several minutes after the fact. This is where speech-to-text services come into the picture, reducing both the manual work and the delay in documentation.

There are plenty of open-source speech recognition tools available, such as DeepSpeech, Kaldi, and Julius. Similarly, there are many cloud-based services from AWS, Google, Azure, and others. My recent study of the Azure Cognitive speech recognition service is explained below.

Azure Cognitive Speech to text:

This service transcribes both real-time speech from a microphone and batch audio streams into text. With additional training (reference text input), it also enables real-time pronunciation assessment and gives speakers feedback on the accuracy and fluency of their spoken audio.

Prerequisites:

· Valid Microsoft Azure Subscription

· Speech resource created

· Endpoint information

o Speech resource Key

o Region where it is deployed

o Endpoint ID of the model to use

Once the Speech service is added, we have key and region values, which should be used in the configuration. Now we are good to start using the default model of the Speech service; the Azure portal shows these details once the Speech service is created.
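As a rough sketch, the configuration in Python could look like the following (assuming the azure-cognitiveservices-speech SDK is installed; the environment variable names SPEECH_KEY and SPEECH_REGION are just illustrative):

import os
import azure.cognitiveservices.speech as speechsdk

# Key and region come from the Speech resource created in the Azure portal.
# The environment variable names below are illustrative.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"])
speech_config.speech_recognition_language = "en-US"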

There are implementation guidelines and code examples for various programming languages such as Java, Python, and C++, which can be taken and used directly. The service can receive speech input from a microphone (real time) as well as read it from a file (recorded audio), and the transcribed text is provided as output.
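As a rough sketch of both input modes in Python (reusing the speech_config from above; the WAV file name is just an example):

# Real-time recognition from the default microphone
mic_input = speechsdk.audio.AudioConfig(use_default_microphone=True)
mic_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=mic_input)
result = mic_recognizer.recognize_once()  # listens for a single utterance
print("Microphone:", result.text)

# Recognition from a recorded audio file (file name is illustrative)
file_input = speechsdk.audio.AudioConfig(filename="observation.wav")
file_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=file_input)
print("File:", file_recognizer.recognize_once().text)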

Custom Speech Model:

When we go with the default model, the error rate can be high for various reasons such as accent, pronunciation, medical terms, and abbreviations. The solution to this problem is to create our own speech model, train it with plenty of data sets, deploy it, and use it. We can compare the trained model against the baseline and get the WER (word error rate). Once the custom model is deployed (as shown in Speech Studio), we get an endpoint that needs to be used in the code so that the APIs will access this model.

Model Creation and Test

· Test Data Creation

Test data consists of audio files and their transcripts. The first step is to build good test data that will help identify the accuracy of any of our models as well as of the default model provided by Microsoft (the baseline). To achieve this, we need the following material:

· Recording of each sentence in a WAV file (Recorded in Mono)

· WAV file names and their text sentences (one line for each sentence) in a text file

· The whole content regrouped into a ZIP file

Once the test data is ready, import all the data into the Speech Studio.
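As a rough illustration of how such a package could be assembled in Python (the trans.txt name and the tab-separated "WAV file name, transcript" layout follow the Custom Speech test data convention as I understand it; the file names and sentences below are made up):

import zipfile

# Illustrative mapping of WAV files to their human-labeled transcripts
samples = {
    "obs_001.wav": "Patient reports mild chest pain since morning.",
    "obs_002.wav": "Blood pressure one twenty over eighty.",
}

# One line per sentence: "<WAV file name><TAB><transcript>"
with open("trans.txt", "w", encoding="utf-8") as f:
    for wav_name, sentence in samples.items():
        f.write(f"{wav_name}\t{sentence}\n")

# Regroup the WAV files and the transcript into a single ZIP for upload
with zipfile.ZipFile("test_data.zip", "w") as z:
    z.write("trans.txt")
    for wav_name in samples:
        z.write(wav_name)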

· Training Data

Training data includes plain text and pronunciation data. The second step is to build good training data that will help train a custom model and achieve better speech recognition accuracy. To achieve that, the following materials are needed:

· A text file that includes sample sentences (a very broad set of sentences)

· A text file that includes any special pronunciations (very useful for clinical terms and abbreviations)

Once the training data is ready, import all the data into the Speech Studio.
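For illustration, the two files could look like the samples below (the sentences and abbreviations are made-up clinical examples; in the pronunciation file, the display form and the spoken form are separated by a tab character):

sentences.txt (one sentence per line):
Administer two milligrams of morphine intravenously.
Patient scheduled for an MRI of the lumbar spine.

pronunciation.txt (display form, a tab, then the spoken form):
BP	b p
SpO2	s p o two
CBC	c b c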

· Test the Baseline with our Test Data

Once the test data and the custom language training data are imported, the next step is to test against the baseline provided by Microsoft. This will help you identify where the errors are; the error rate is called WER (word error rate). WER indicates the percentage of errors found in the test data set (audio + human-labeled transcription) provided to test the model.
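For reference, WER is commonly computed against the human-labeled transcript as

WER = (S + D + I) / N

where S is the number of substituted words, D the number of deleted words, I the number of inserted words, and N the total number of words in the human-labeled transcript.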

When testing against the baseline, besides domain-specific complex words and abbreviations, we can review the samples (WAV files) together with their transcriptions to see whether any mistakes were made. We can always update the test files as we want.

· Create a new Model for Speech Recognition and Test the Custom Model with our Test Data

When creating a model, everything relies on the data sets you have uploaded. We need to select the right uploaded clinical-sentences data set and the right uploaded pronunciation data set.

Now that we have a new custom model trained, we can use the same test data sample to test it. The goal is to bring the WER under 5%. We can then see how the trained model improved (or not) over the previous model or the baseline.

· Model Interpretation

With the test performed on each of our models and on the baseline, we can do the following exercise:

· Identify where our model now transcribes accurately where it previously failed.

· Identify where the model misrecognizes words that may have worked in a previous model or in the baseline.

· Identify the WER trend for the model overall and for individual items.

Finally, the improved model can be deployed for use in the application.

· Deploy Models

Deploy the final selected model, the one with the improved error rate. With this we get the endpoint ID, which can be used in the configuration from code.
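A minimal sketch of pointing the SDK at the deployed custom model, assuming the same speech_config and mic_input as in the earlier snippets (the endpoint ID value is a placeholder):

# Route recognition requests to the deployed custom model instead of the base model
speech_config.endpoint_id = "YOUR-CUSTOM-ENDPOINT-ID"  # placeholder from the Speech Studio deployment
custom_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=mic_input)
print(custom_recognizer.recognize_once().text)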

Pros and Cons

· Pros

· Command statements can be easily understood and built into the application, especially as the words are simple; this saves time in performing manual actions.

· Clinical words and abbreviations can be learned quite easily through a custom model using additional domain data and pronunciations.

· Support for a wide variety of languages and even regional variants, such as localized French and localized English.

· Cons

· Trained models are language specific, and in practice word recognition and pronunciation can be very different from one language to another.

· An always-on internet connection to the cloud (Azure, etc.) is needed, as recognition is done in the cloud.

· The model has an expiry date, and necessary actions must be taken when the trained model expires if we do not want the default behavior of falling back to the most recent base model.

Use Cases

This solution can be integrated for generic use cases such as:

· Search for a patient (voice-enabled search).

· Open Patient Record #

· Add free-text observations.

· Admit/Transfer/Discharge a patient from a bed (Emergencies) through voice command.

· Acknowledge an emergency alarm for a patient with voice command without going to the workstation.

These can then be extended to application-specific use cases wherever applicable.
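As a rough sketch of how such voice commands might be wired up with continuous recognition (the phrase-to-action mapping and the handler functions are hypothetical; speech_config and mic_input are reused from the earlier snippets):

import time

# Hypothetical handlers for application actions
def search_patient():
    print("Opening voice-enabled patient search...")

def acknowledge_alarm():
    print("Acknowledging the emergency alarm...")

COMMANDS = {
    "search patient": search_patient,
    "acknowledge alarm": acknowledge_alarm,
}

def on_recognized(evt):
    text = evt.result.text.lower().strip(".?! ")
    action = COMMANDS.get(text)
    if action:
        action()
    else:
        print("Free-text observation:", evt.result.text)

command_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=mic_input)
command_recognizer.recognized.connect(on_recognized)
command_recognizer.start_continuous_recognition()
time.sleep(30)  # listen for 30 seconds in this sketch
command_recognizer.stop_continuous_recognition()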

Conclusion

A base model or custom model deployed to an endpoint using Custom Speech is fixed until you decide to update it. The speech recognition accuracy and quality remain consistent even when a new base model is released, which allows us to lock in the behavior of a specific model until we decide to move to a newer one. Depending on the situation, this can act either as a pro or as a con. Model training remains a manual activity. Like any other technology or cloud service, the Azure Cognitive speech service has its own advantages and limitations. Based on this study, it is a strong candidate to consider when looking for such a capability to integrate into an application.

Reference

· Speech to text documentation — Tutorials, API Reference — Azure Cognitive Services — Azure Cognitive Services | Microsoft Learn

· How to recognize speech — Speech service — Azure Cognitive Services | Microsoft Learn

· Create a Custom Speech project — Speech service — Azure Cognitive Services | Microsoft Learn

· Model lifecycle of Custom Speech — Speech service — Azure Cognitive Services | Microsoft Learn
