Building a ChatGPT-Powered and Voice-Enabled Assistant using React and Express

Gustavo Cordido
Published in Microsoft Azure
Aug 15, 2023

With Large Language Models becoming more and more popular, interest in building applications with them is also increasing, though it is not always easy to figure out where to start.

In this post, we’ll walk through how to build a simple chatbot powered by the ChatGPT language model (in this case, gpt-35-turbo), as a way to understand how these models can be implemented in a real-life application. The chatbot we will develop will consist of a simple web application, written mainly in TypeScript. We will be using the Azure OpenAI Service to access the model, Azure AI Speech to enable speech-to-text and text-to-speech features, an Express server to handle the API requests and communication between the chatbot and the Azure services, and finally a React front-end.

Quick demo of the application.

GPT stands for “Generative Pre-trained Transformer”. It’s a machine learning model that is trained on a large corpus of text data to predict the likelihood of a word or phrase appearing in a given context. GPT models can be used for a variety of natural language processing tasks, including language translation, text summarization, and chatbot development. Instead of diving deeper into how these types of models work, I’d rather direct you to a great, very in-depth article written by Beatriz Stollnitz here.

The application can be found in this GitHub Repository: github.com/gcordido/VoiceEnabledGPT, which contains instructions on how to install and run it, along with tips on Prompt Engineering and on how best to prepare the model for accurate results.

To develop this application, we first have to consider its ‘architecture’, and to do that we have to answer the following questions: What does the chatbot interface look like? How does the user provide input to the bot? How does the application communicate with the model? What will the input look like? How do we get the information from the model back to the user?

Application Architecture Diagram

First, let’s sketch how the chat window would look. In this context, I opted for a typical chat interface where the window displays a list of messages, with the user’s messages on the right and the chatbot’s responses on the left. To achieve this, we utilize a React component to render the chat interface, including elements such as an input box, a speech-to-text button, and an expandable chat body.
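
As a rough sketch of what such a component might look like (prop names, CSS class names, and the message shape below are illustrative assumptions, not the repository’s actual code), the chat window boils down to a message list, an input box, a microphone button, and a “Send” button:

```tsx
import { useState } from "react";

// Illustrative message shape, mirroring the {role, content} format used with the model later on.
type Message = { role: "user" | "assistant"; content: string };

type ChatWindowProps = {
  messages: Message[];
  onSend: (text: string) => void; // called when the user presses "Send"
  onMicrophone: () => void;       // wired up to speech-to-text further below
};

export function ChatWindow({ messages, onSend, onMicrophone }: ChatWindowProps) {
  const [draft, setDraft] = useState("");

  return (
    <div className="chat-window">
      {/* Expandable chat body: user messages on the right, assistant replies on the left */}
      <div className="chat-body">
        {messages.map((m, i) => (
          <div key={i} className={m.role === "user" ? "msg-right" : "msg-left"}>
            {m.content}
          </div>
        ))}
      </div>
      <input
        value={draft}
        onChange={(e) => setDraft(e.target.value)}
        placeholder="Type a message..."
      />
      <button onClick={onMicrophone}>🎤</button>
      <button onClick={() => { onSend(draft); setDraft(""); }}>Send</button>
    </div>
  );
}
```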

The communication between the application and the model is facilitated using APIs. The Azure OpenAI Service provides access to these models by making calls through a REST API, which take a set of parameters and are accessed by using secrets in the form of an endpoint and an API key. Since the latter are, well, secrets, we need to ensure these are not exposed when we carry out the API call. Thus, we create an Express server to handle these API requests.
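
As a minimal sketch of that server (the route path, port, environment variable names, and api-version value are assumptions for illustration, and it relies on the global fetch available in Node 18+), the proxy route might look like this:

```ts
import express from "express";

const app = express();
app.use(express.json());

// Secrets stay on the server, read from environment variables (names assumed here).
const endpoint = process.env.AZURE_OPENAI_ENDPOINT;     // e.g. https://<resource>.openai.azure.com
const apiKey = process.env.AZURE_OPENAI_KEY;
const deployment = process.env.AZURE_OPENAI_DEPLOYMENT; // name of the gpt-35-turbo deployment

app.post("/api/chat", async (req, res) => {
  // Forward the client's message history to the Azure OpenAI chat completions REST API.
  const response = await fetch(
    `${endpoint}/openai/deployments/${deployment}/chat/completions?api-version=2023-05-15`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json", "api-key": apiKey! },
      body: JSON.stringify({ messages: req.body.messages, max_tokens: 800 }),
    }
  );
  const data = await response.json();
  // The generated reply lives in choices[0].message.content.
  res.json({ reply: data.choices[0].message.content });
});

app.listen(3001, () => console.log("API server listening on port 3001"));
```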

Once the user presses the “Send” button, the server receives a call from the main chatbot interface along with the list of previous messages, which it feeds to the model when making the API call. Note that I said the list of previous messages, instead of just the latest message. This is intentional, as one of the most interesting features of GPT models is their ability to retain context throughout a conversation, as long as that context is provided. This list is represented as an array of objects, each with two properties: ‘role’ and ‘content’. The ‘role’ property differentiates messages from the user and the assistant, while the ‘content’ property simply contains the message text.
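
In TypeScript, that history is just an array of role/content objects. The short conversation below is purely illustrative, but it shows why sending the whole list matters: the last question only makes sense given the messages before it.

```ts
// The conversation history the client sends to the server on every request.
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

const messages: ChatMessage[] = [
  { role: "user", content: "What is the tallest mountain on Earth?" },
  { role: "assistant", content: "Mount Everest, at roughly 8,849 meters." },
  // Without the two messages above, the model has no way to know what "it" refers to.
  { role: "user", content: "How long does it usually take to climb it?" },
];
```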

With the communication process between the application and the GPT model now defined, the final step is to add audio support. For this, we use Azure AI Speech, which provides an easy-to-use SDK for JavaScript. The SDK provides methods to detect live microphone audio, recognize speech and synthesize speech from text.

To access the Speech SDK and its methods from the client, we require an Authorization Token. This token is generated by authenticating our application with the Azure AI Speech service, using its API key and region on the server side. Once we have the token, we can use the SpeechRecognizer method from the SDK to detect and transcribe audio from the microphone. This transcription is then added to our list of messages as user input and sent to the server in the same manner as written input.
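
Sketching both halves of that flow (route paths, environment variable names, and error handling are assumptions for illustration), the server first exchanges the Speech key for a token:

```ts
// Server-side route: exchange the Speech resource key for a short-lived
// authorization token so the key itself never reaches the browser.
app.get("/api/speech-token", async (_req, res) => {
  const region = process.env.AZURE_SPEECH_REGION!;
  const key = process.env.AZURE_SPEECH_KEY!;
  const tokenResponse = await fetch(
    `https://${region}.api.cognitive.microsoft.com/sts/v1.0/issueToken`,
    { method: "POST", headers: { "Ocp-Apim-Subscription-Key": key } }
  );
  res.json({ token: await tokenResponse.text(), region });
});
```

and the client then builds a SpeechRecognizer from that token to transcribe a single utterance from the default microphone:

```ts
import {
  SpeechConfig,
  AudioConfig,
  SpeechRecognizer,
  ResultReason,
} from "microsoft-cognitiveservices-speech-sdk";

async function transcribeFromMicrophone(): Promise<string> {
  const { token, region } = await (await fetch("/api/speech-token")).json();
  const speechConfig = SpeechConfig.fromAuthorizationToken(token, region);
  speechConfig.speechRecognitionLanguage = "en-US";

  const audioConfig = AudioConfig.fromDefaultMicrophoneInput();
  const recognizer = new SpeechRecognizer(speechConfig, audioConfig);

  return new Promise((resolve, reject) => {
    recognizer.recognizeOnceAsync((result) => {
      recognizer.close();
      if (result.reason === ResultReason.RecognizedSpeech) {
        resolve(result.text); // this transcription becomes a regular user message
      } else {
        reject(new Error("Speech could not be recognized."));
      }
    }, reject);
  });
}
```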

Upon generating a response, the model’s output is sent back to the chat interface by the server. A new entry is added to the list of messages under the assistant’s role and the chat window is updated with the chatbot’s message reflecting the new entry. This new entry can then be passed through the Speech SDK’s SpeechSynthesizer method, enabling the application to read the chatbot’s response aloud through the speakers.
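
A minimal sketch of that last step (the helper signature and voice name are illustrative choices; the repository may structure this differently):

```ts
import {
  SpeechConfig,
  AudioConfig,
  SpeechSynthesizer,
} from "microsoft-cognitiveservices-speech-sdk";

// Read the assistant's latest reply aloud through the default speakers.
function speakReply(token: string, region: string, text: string) {
  const speechConfig = SpeechConfig.fromAuthorizationToken(token, region);
  speechConfig.speechSynthesisVoiceName = "en-US-JennyNeural"; // any supported neural voice

  const audioConfig = AudioConfig.fromDefaultSpeakerOutput();
  const synthesizer = new SpeechSynthesizer(speechConfig, audioConfig);

  synthesizer.speakTextAsync(
    text,
    () => synthesizer.close(), // dispose once synthesis completes
    (error) => {
      console.error(error);
      synthesizer.close();
    }
  );
}
```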

Although the application does not go much further, there are still plenty of ways to tinker with how we communicate with the model and how it responds to us.

As a firm believer in learning through practice, I highly suggest trying out the application itself by cloning the repository and experimenting with the application parameters. For example, try changing the System Prompt (initial instructions) to make the model respond in prose or haikus, change the voice synthesizer language to Spanish, or limit the number of messages kept for context and see how the model responds!
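
For instance, the System Prompt is simply the first entry in the message list, sent with the role “system”; a purely illustrative tweak could look like this:

```ts
// Illustrative only: changing these instructions changes how every reply is phrased.
const systemPrompt = {
  role: "system" as const,
  content: "You are a helpful assistant. Answer every question as a haiku.",
};

const conversationHistory = [
  { role: "user" as const, content: "Explain what an Express server does." },
];

// Prepend the system prompt before the request is sent to the model.
const messagesToSend = [systemPrompt, ...conversationHistory];
```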
