Empowering Human Interaction through Speech-to-Text Transformation

Published in BI3 Technologies · 8 min read · Nov 1, 2023

INTRODUCTION:

Imagine speaking into a mic and instantly seeing your words become text, a bridge between human expression and digital data. No more typing: just speak to command applications, search the web, and more. The application described in this article uses Node.js for application development, Azure Cognitive Services for speech recognition, and a portable Docker image for deployment.

PREREQUISITES:

Before diving into the code, it helps to be familiar with the following concepts and technologies:

Docker:

1. Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. A container is a lightweight, standalone, executable software package that includes everything needed to run a piece of software: the code, runtime, libraries, dependencies, and settings.

2. Node.js and React are encapsulated within a Docker container, providing a portable and consistent development and deployment environment.

Follow the steps below to install Docker.

Install Docker Desktop for Windows:

· Download Docker Desktop for Windows from the official Docker website (https://www.docker.com/products/docker-desktop).

· Follow the installation wizard’s prompts.

· After installation, Docker Desktop will run in the background, and you can manage Docker from the system tray.

Install Docker Desktop for macOS:

1. Download Docker Desktop for Mac from the official Docker website (https://www.docker.com/products/docker-desktop).

2. Follow the installation wizard’s prompts.

3. After installation, Docker Desktop will run in the background, and you can manage Docker from the macOS menu bar.

Once Docker Desktop is installed and running, start the Docker container to begin.

Node.js: Fueling the Backend

1. Node.js powers the server side of the transcription application, serving as the development server.

2. It starts the server, handles HTTP requests, and serves the React application to web browsers.

3. Node.js utilizes npm for efficient dependency management, simplifying the installation of required packages.

4. Environment variables keep sensitive information, such as API keys and service locations, out of the source code.

5. The integrated terminal in Visual Studio Code streamlines development tasks, enhancing productivity.

React: Crafting Dynamic Interfaces

1. React is the JavaScript library driving the client-side user interface of the transcription application.

2. It enables the creation of dynamic and interactive UI components.

3. React follows a component-based architecture for modularity and reusability.

4. Hooks like useState and useEffect manage application state and component lifecycles.

5. Conditional rendering in React adapts the UI based on the application state, ensuring a seamless user experience.

Azure Cognitive Services: Azure Cognitive Services is a set of cloud-based APIs and services provided by Microsoft for various AI and machine learning tasks. The "API key" and "Location" are important components used for authentication and resource access.

API Key: An API key is a unique alphanumeric code that acts as a secret authentication token. It validates the identity of the user or application accessing Azure Cognitive Services.
Location: The “location” refers to the Azure region where the Cognitive Service resource is hosted. Cognitive Services are available in various Azure regions worldwide, and the location determines where the service’s data and processing are physically hosted.

IDE Setup: Visual Studio Code (VS Code) is a popular code editor developed by Microsoft. It is highly extensible and customizable, making it a preferred choice for many developers.

Application Launch: To launch the provided React application in Visual Studio Code, open the project folder in VS Code and start the development server with "yarn start." Then access the application in a web browser at the address the server prints.

The steps below walk through launching the application in detail.

STEPS TO LAUNCH THE APPLICATION:

Step 1: Clone or Download the Code

· Clone the code repository below from GitHub, or download the code as a ZIP file and extract it to a directory of your choice.

Repository: https://github.com/krish4120/Real-Time-Transcription-Voice-Recognition-Node

Step 2: Open the Project in Visual Studio Code:

· Open Visual Studio Code.

· Click on "File" in the top menu and select "Open Folder."

· Navigate to the folder where you cloned or extracted the code and select it. Click "Open."

· When the folder opens, VS Code prompts you to reopen it in the dev container, which runs inside the Docker environment.

Note: Make sure you open the directory that contains the project's code files.

Step 3: Configure Environment Variables:

· The code reads the API key and location from environment variables, so you need to set them before starting the application.

· Create a file named “.env” in the project root directory, using “env_template” as a starting point, and add the required credentials to it.

· Replace “your_api_key_here” and “your_api_location_here” with your actual Microsoft Cognitive Services API key and location.
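As a rough illustration of how these two values might be checked at startup, the sketch below reads them from an environment-like object and fails early if either one is missing or still set to the template placeholder. The variable names are the ones the code walkthrough later in this article mentions; the function name `readSpeechSettings` and the error message are assumptions made for this example, not code from the repository.

```javascript
// Hypothetical helper: validate the two Azure Speech settings before use.
// `env` is any object shaped like process.env.
function readSpeechSettings(env) {
  const key = (env.REACT_APP_COG_SERVICE_KEY || "").trim();
  const location = (env.REACT_APP_COG_SERVICE_LOCATION || "").trim();

  // Collect every problem so the user can fix the .env file in one pass.
  const missing = [];
  if (!key || key === "your_api_key_here") {
    missing.push("REACT_APP_COG_SERVICE_KEY");
  }
  if (!location || location === "your_api_location_here") {
    missing.push("REACT_APP_COG_SERVICE_LOCATION");
  }
  if (missing.length > 0) {
    throw new Error("Missing or placeholder values in .env: " + missing.join(", "));
  }

  return { key, location };
}
```

In the application this could be called as `readSpeechSettings(process.env)`. Passing the object in as a parameter, rather than reading `process.env` inside the function, keeps the check easy to exercise on its own.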

Step 4: Start the Development Server:

· In the integrated terminal, while still in the project’s root directory, run "yarn start" to start the development server. This launches the Node.js application inside the Docker container.

"yarn start" runs the "start" script defined in the project's configuration (package.json). In JavaScript and React projects managed with Yarn or npm (Node Package Manager), this script typically starts a development server.

Step 5: Use the Application:

· You should now see the application running in your web browser.

· You can interact with it by starting and stopping speech recognition, viewing recognized text, and exporting the transcription to a text file.

Step 6: Stop the Development Server:

· To stop the development server, go to the terminal where it is running and press Ctrl+C.

That's it! You've successfully launched and run the provided React application in Visual Studio Code. You can now explore and use the application for real-time speech transcription.

Application Interface: Access the application through a designated port in a web browser, revealing a user interface with key buttons: 'Start', 'Export', and 'Stop.'

1. 'Start' Button:

When clicked, this button initiates the speech recognition process, allowing the application to listen to the user's speech and convert it into text in real-time.

2. 'Export' Button:

Clicking this button triggers the export functionality, allowing the user to save the transcribed text into a text file, providing a way to store or share the transcriptions.

3. 'Stop' Button:

This button halts the speech recognition process, stopping the application from transcribing any further speech. It provides control over when to end the recognition.

This approach lets users convert speech into written text with little effort, streamlining communication and interaction with digital systems. Together, Node.js, Azure Cognitive Services, and Docker make voice a practical way to drive digital interactions.

Now let's analyse the code provided on GitHub.

Exploring Code Implementation:

The code is a React component that implements real-time speech recognition using Microsoft Azure's Cognitive Services Speech SDK. It captures audio input from the user's microphone, processes it using the speech recognition API, and displays the recognized text in a text area.

Let's break down the code step by step:

1. Imports: The code starts by importing various modules and components from external packages. These include React, Bootstrap components, and the Microsoft Cognitive Services Speech SDK.

2. API Key and Location: The Azure API key and location are fetched from environment variables (‘REACT_APP_COG_SERVICE_KEY’ and ‘REACT_APP_COG_SERVICE_LOCATION’).

3. Initializing Variables:

· The ‘sdk’ module is imported using ‘require’ and the Azure speech configuration is set up.

· A global variable ‘recognizer’ is defined to hold the SpeechRecognizer instance.

4. State Management: The component uses React’s ‘useState’ hook to manage state variables like ‘recognisedText’, ‘recognisingText’, and ‘isRecognising’.

5. Microphone Access and Recognizer Setup:

· In the ‘useEffect’ hook, audio constraints for ‘getUserMedia’ are defined, and a media stream is acquired.

· The ‘createRecognizer’ function sets up the SpeechRecognizer instance using the acquired audio stream and speech configuration. Event handlers for recognizing and recognized results are defined here.
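To make the difference between the two events concrete, here is a minimal sketch of the state updates they might drive. It assumes that interim hypotheses arrive through the Speech SDK's `recognizing` event and final phrases through its `recognized` event, with the text available as `e.result.text`. The helper names (`onRecognizing`, `onRecognized`, `displayText`) are hypothetical, and the repository's own handlers may differ.

```javascript
// State shape mirrors the article's recognisedText / recognisingText variables.
const initialState = { recognisedText: "", recognisingText: "" };

// Interim result: replace the in-progress text, leave finalized text alone.
function onRecognizing(state, partialText) {
  return { ...state, recognisingText: partialText };
}

// Final result: append the phrase to the finalized text and clear the
// in-progress text, since the final phrase supersedes it.
function onRecognized(state, finalText) {
  if (!finalText) {
    return { ...state, recognisingText: "" };
  }
  const separator = state.recognisedText ? " " : "";
  return {
    recognisedText: state.recognisedText + separator + finalText,
    recognisingText: "",
  };
}

// What the text area would show: finalized text followed by the live guess.
function displayText(state) {
  return [state.recognisedText, state.recognisingText].filter(Boolean).join(" ");
}
```

In a component, handlers of this kind would be attached with `recognizer.recognizing = (sender, e) => ...` and `recognizer.recognized = (sender, e) => ...`, passing `e.result.text` into the helpers. Keeping the updates as pure functions makes it clear why interim text flickers and changes while finalized text only grows.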

6. Start and Stop Recognizer:

· The ‘toggleListener’ function starts or stops the speech recognizer based on the current state.

· The ‘await’ keyword pauses execution of the async function ‘fetchData’ until the fetch promise resolves. Any error raised during fetching is caught and handled in the ‘catch’ block.
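The start/stop logic and the async/await pattern can be sketched together as follows. This is an illustration under assumptions rather than the repository's code: it relies only on the Speech SDK's `startContinuousRecognitionAsync` and `stopContinuousRecognitionAsync` methods, which accept success and error callbacks, and the wrapper names `startRecognition` and `stopRecognition` are invented for the example.

```javascript
// Wrap the SDK's callback-style calls in Promises so they can be awaited.
function startRecognition(recognizer) {
  return new Promise((resolve, reject) => {
    recognizer.startContinuousRecognitionAsync(resolve, reject);
  });
}

function stopRecognition(recognizer) {
  return new Promise((resolve, reject) => {
    recognizer.stopContinuousRecognitionAsync(resolve, reject);
  });
}

// Hypothetical stand-in for toggleListener: resolves to the new value of
// isRecognising, or to the old value if the SDK call fails.
async function toggleListener(recognizer, isRecognising) {
  try {
    if (isRecognising) {
      await stopRecognition(recognizer);
      return false;
    }
    await startRecognition(recognizer);
    return true;
  } catch (err) {
    console.error("Could not toggle speech recognition:", err);
    return isRecognising;
  }
}
```

A component could then update its state with something like `setIsRecognising(await toggleListener(recognizer, isRecognising))`, so the button label only changes once the recognizer has actually started or stopped.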

7. Exporting Transcription: The ‘export2txt’ function converts recognized text into a downloadable text file.
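One common way to implement this kind of export in the browser is to wrap the text in a `Blob` and trigger a download through a temporary link. The sketch below shows that pattern; the names `transcriptFileName` and `exportToTxt`, and the timestamped file name, are assumptions for illustration and may not match the repository's ‘export2txt’.

```javascript
// Build a file name such as transcript-2023-11-01T09-30-00.txt (UTC time).
function transcriptFileName(date) {
  const stamp = date.toISOString().slice(0, 19).replace(/:/g, "-");
  return "transcript-" + stamp + ".txt";
}

// Offer `text` as a downloadable .txt file and return the file name used.
function exportToTxt(text, date = new Date()) {
  const fileName = transcriptFileName(date);

  // Outside a browser there is no DOM to attach a link to, so stop here.
  if (typeof document === "undefined") {
    return fileName;
  }

  const blob = new Blob([text], { type: "text/plain;charset=utf-8" });
  const url = URL.createObjectURL(blob);

  // A temporary <a download> element makes the browser save the Blob.
  const link = document.createElement("a");
  link.href = url;
  link.download = fileName;
  document.body.appendChild(link);
  link.click();
  link.remove();

  // Release the object URL once the download has been triggered.
  URL.revokeObjectURL(url);
  return fileName;
}
```

The 'Export' button's click handler would call `exportToTxt(recognisedText)`. Everything happens in the browser, so the transcription is not sent to a server in order to be saved.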

8. Rendering UI:

· The UI is rendered using Bootstrap components. An image of the Microsoft logo is shown, which rotates when speech recognition is active.

· The recognized and recognizing text is displayed in a text area.

· Buttons are provided to start/stop recognition and export the recognized text.

The component essentially creates an interface where users can start and stop real-time speech recognition using their microphone, and the recognized text is displayed in the UI. Note that this code is specific to utilizing the Microsoft Cognitive Services Speech SDK and Azure services for speech recognition.

CONCLUSION:

Speech-to-text recognition using Azure Cognitive Services offers a powerful and versatile solution for converting spoken language into written text. Its benefits include time savings, ease of integration, accuracy and performance, scalability, and accessibility.

About Us

Bi3 has been recognized for being one of the fastest-growing companies in Australia. Our team has delivered substantial and complex projects for some of the largest organizations around the globe, and we’re quickly building a brand that is well known for superior delivery.

Website : https://bi3technologies.com/

Follow us on,
LinkedIn : https://www.linkedin.com/company/bi3technologies
Instagram : https://www.instagram.com/bi3technologies/
Twitter : https://twitter.com/Bi3Technologies

Written by Darsini Ramu