Mastering Inworld.ai on the Web: Technical Tips for Seamless Integration

Tricia Becker
5 min read · Aug 18, 2023


Inworld.ai has emerged as a powerful platform for creating interactive and dynamic virtual characters, enabling developers to craft engaging conversations and experiences. While working with Inworld on the web brings numerous advantages — particularly the ability to bring your characters to any hardware or platform — it also presents unique technical challenges that require a more nuanced approach than its game engine counterparts. This article dives into some of the top technical tips I learned while integrating Inworld into a web application, focusing on effective audio processing, file format considerations, speech pacing, optimal connection configuration, and handling responses based on your conversation goals.

1. Adding Audio Chunking and a Silence Stream for Optimal Audio Processing

When dealing with audio data, it’s essential to optimize it for processing and transcoding. Two proven techniques that came in handy in my project were chunking and adding a silence stream at the end of audio files.

Chunking involves breaking down long audio files into smaller, manageable chunks, a common practice when working with APIs handling speech streams. When employing chunking, remember to specify the highWaterMark parameter when calling createReadStream, and send the chunks sequentially using a timeout. By doing these two things, you ensure a smooth ingestion process for the API.

Adding a silence stream at the end of your audio file serves as a signal to Inworld that the audio stream has concluded. This helps to ensure accurate response generation timing by helping their system clearly identify the end of input. By incorporating both chunking and a silence stream, you improve the ingestion of your audio data and ensure better quality and more timely responses.

2. Ensuring Correct Audio Format

On the topic of audio, Inworld expects audio files to adhere to specific format standards for successful processing. Ensure that your audio files have a sample rate of 16000 Hz and are in the LINEAR16 format. Files that don't meet these specifications will cause processing errors or invalid responses.

You can employ tools like SoX (I used the npm package node-mic-record which leverages SoX) or ffmpeg for format conversion, guaranteeing that your audio files can be ingested properly.

(Code excerpt: node-mic-record, index.js)

3. Customizing the Connection Configuration

The connection configuration should be tailored to suit your specific interaction and conversation needs. The disconnectTimeout value dictates how long the connection remains active in the absence of activity. Its default value is 60,000 milliseconds, but depending on your conversation needs, you might want to increase this number. For instance, I set my disconnectTimeout to a much larger custom value (1000 * 1000000) to give myself ample time to respond after each interaction.
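The configuration shape I used looked roughly like the following. Treat it as a sketch: verify the exact field names against the Inworld SDK version you have installed before relying on it.

```javascript
// Connection configuration with an extended disconnect timeout. The
// default is 60,000 ms; raise it if your users need longer to reply.
const connectionConfig = {
  connection: {
    // Effectively "never disconnect" while I iterate on response timing.
    disconnectTimeout: 1000 * 1000000,
  },
};

// Applied roughly like this (client creation omitted):
// client.setConfiguration(connectionConfig);
```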

4. Crafting Pacing with Pauses

Depending on the character’s persona and the conversational flow, you might need to introduce pauses in the speech. While pauses weren’t essential for my character, incorporating them can enhance the naturalness of interactions. By modifying the packet.text.text responses with dashes (-) or multiple dashes (- -), you can create pauses that add natural rhythm to the dialogue.

This technique allows you to control the pacing of the conversation, creating more dynamic and lifelike exchanges.
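A tiny illustrative helper for the dash trick above; the helper names and pause lengths are my own inventions to tune by ear, not part of any SDK:

```javascript
// Join sentence fragments with dash pauses before handing the text to
// your text-to-speech step. A single dash reads as a short beat; a
// double dash ("- -") reads as a longer one.
function joinWithPauses(sentences, { long = false } = {}) {
  const pause = long ? " - - " : " - ";
  return sentences.join(pause);
}
```

For example, `joinWithPauses(["Hello there", "how are you?"])` yields `"Hello there - how are you?"`.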

5. Handling Core Conversation Functionality

The heart of your interaction lies in the setOnMessage function; this is where your core conversation functionality is implemented. By observing packet.isText and packet.text.final conditions, you can orchestrate actions based on each returned text response.

An important callout here is packet.isInteractionEnd, which signals that all of the response text has been received from Inworld. Since you're not working with a built-in voice integration, this timing will typically not align with the end of your audio conversion processes. It's important to factor this information into how you organize your actions to prevent any interaction mishaps.
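Put together, the handler logic might be sketched like this. The packet shape (`isText()`, `text.text`, `text.final`, `isInteractionEnd()`) follows the Inworld Node.js SDK as I used it, but double-check your SDK version; the callback names are my own:

```javascript
// Build a handler suitable for passing to setOnMessage. It collects
// final text fragments and fires a callback when the interaction ends.
function makePacketHandler({ onFinalText, onInteractionEnd }) {
  const finals = [];
  return function handlePacket(packet) {
    if (packet.isText() && packet.text.final) {
      // One complete utterance fragment from the character.
      finals.push(packet.text.text);
      onFinalText(packet.text.text);
    }
    if (packet.isInteractionEnd()) {
      // All response text has arrived. NOTE: with a custom voice
      // pipeline, audio conversion is usually still in flight here.
      onInteractionEnd(finals.join(" "));
    }
  };
}
```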

6. Response Management: Aggregation vs. Queuing

Managing and delivering responses is a critical component of conversational systems. You'll likely need to decide between aggregating responses or implementing a queuing system, particularly if you decide to use voice responses. Aggregating responses simplifies the process (and is what I did for the sake of time), allowing you to consolidate text before converting it to audio. It also ensures a cohesive and emotionally consistent verbal reply, since it's easier to maintain context and tone across one batch of text than across several. However, this makes for a heavy text-to-speech conversion operation and can cause an unnatural delay in your character's responses. The ideal solution for audio responses is a queuing system: with queuing you get quicker audio conversion and can strategically time playback of responses to prevent overlapping audio files.
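The queuing approach can be sketched with a simple promise chain. `convertToSpeech` and `playAudio` below are stand-ins for whatever TTS and playback layers you use; the point is that conversion starts immediately while playback stays serialized so clips never overlap:

```javascript
// Create a queue that converts text fragments to audio as they arrive
// but plays them back strictly in order.
function makeSpeechQueue(convertToSpeech, playAudio) {
  let chain = Promise.resolve();
  return function enqueue(text) {
    // Start TTS conversion right away, in parallel with the queue...
    const audioPromise = convertToSpeech(text);
    // ...but gate playback behind everything queued before it.
    chain = chain.then(() => audioPromise).then((audio) => playAudio(audio));
    return chain;
  };
}
```

Even if a later fragment finishes converting first, it still plays after the fragments queued ahead of it.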

Conclusion

In the early days of new technology, it can be easy to get discouraged by minor technical stumbling blocks. Hopefully these tips allow you to bypass common challenges, so that you can move fast and b̶r̶e̶a̶k̶ build things.

Read also “Bringing Characters to Life: Combining the Power of Inworld AI and ElevenLabs” to learn more.
