A Holiday Experiment: Developing a Real-Time Digital Human Interface for LLMs

Zion Huang
May 6, 2024 · 5 min read


Introduction

I’ve been fascinated by the rapid evolution of Large Language Models (LLMs), which are redefining our interactions with machines. What really excites me is not just what these models can do, but how we interact with them. That’s why, over a recent vacation, I decided to bring my laptop along and dive into a project that was brewing in my mind: creating a real-time digital human system that could serve as a front-end for LLMs, making our interactions with AI not just smart, but also strikingly human.

Showcase

Background and Motivation

The idea hit me while pondering the future of operating systems, something Andrej Karpathy has sketched on Twitter with his "LLM OS" concept. Imagine an OS where your interface isn't just a series of windows and icons but a digital avatar: someone you can talk to, interact with, and who understands you better than any GUI could. This vision drove me to develop a system that could make digital humans not just possible but practical for everyday computing.

Karpathy’s LLM OS

Given the short holiday I had, the challenge was thrilling: build something groundbreaking in just a few days, entirely on my own. This wasn't just about coding; it was about envisioning a future where human-computer interaction is as natural and intuitive as chatting with a friend.

Technical Implementation

System Overview

I built a system that integrates ASR (Automatic Speech Recognition), lip sync, TTS (Text-to-Speech), and streaming technologies with LLMs to animate a digital human that can talk and interact in real time. With this setup, the avatar can not only understand and respond to verbal cues but also produce lifelike lip movements that make the interaction feel real.
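To make the data flow concrete, here is a toy sketch of how the stages chain together. Every function below is a stub standing in for a real component described in the next sections (browser ASR, the LLM, Bert-VITS2, Wav2Lip, the RTMP streamer); none of this is the actual project code.

```python
import numpy as np

# Each stub stands in for a real component described below.
def asr(audio: np.ndarray) -> str:           # browser speech recognition in practice
    return "hello"

def llm_reply(prompt: str):                  # LLM reply, streamed segment by segment
    yield "Hi there,"
    yield " how can I help you today?"

def tts(text: str) -> np.ndarray:            # Bert-VITS2 in practice: text -> waveform
    return np.zeros(16000, dtype=np.float32)

def lipsync(wav: np.ndarray):                # Wav2Lip in practice: audio -> face frames
    yield np.zeros((256, 256, 3), dtype=np.uint8)

def push_to_stream(frame: np.ndarray, wav: np.ndarray) -> None:
    pass                                     # mux and push over RTMP in practice

# The real-time loop: transcribe, reply in segments, voice each segment,
# animate the lips, and push the result to the stream as soon as it is ready.
user_audio = np.zeros(16000, dtype=np.float32)
for segment in llm_reply(asr(user_audio)):
    wav = tts(segment)
    for frame in lipsync(wav):
        push_to_stream(frame, wav)
```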

Technology Choices

For speech input, I integrated Speech Recognition for WebKitGTK, which allows the system to accurately recognize user voice commands. This component is crucial for enabling the digital human to understand and respond to verbal interactions in real time, which keeps the communication feeling natural.

For lip synchronization, I opted for Wav2Lip. Despite the recent rise of NeRF-based digital humans such as ER-NeRF, which leverage 3D priors and generally perform better in real-time scenarios than Wav2Lip, I found that Wav2Lip's 2D approach offers superior stability in output quality.

For the text-to-speech (TTS) component, I selected Bert-VITS2. The project has a vibrant community, with many people creating personalized voices on top of the framework. The results are impressive: only a few minutes of training material are needed to produce a high-quality personalized voice, which made it an ideal choice for this endeavor. It even supports multiple languages.

Audio and video streaming presented significant challenges for me, especially since I had no prior experience in this area. To achieve real-time interaction, I couldn't simply save the Wav2Lip results as a video file and stream it with the ffmpeg executable. Instead, I needed to synchronize the generated audio and video frames directly in memory before streaming. This task was daunting, especially as I primarily use Python, and I nearly gave up. Then, one late night, I stumbled upon the project opencv_ffmpeg_streaming. It was exactly what I needed: a way to package ffmpeg/avcodec into a Python-callable interface. With some modifications, it met my requirements perfectly, letting me stream batches of neural network inference results effectively.
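For illustration, here is a minimal sketch of the in-memory streaming idea using PyAV (a Python binding for FFmpeg's libav libraries) as a stand-in for the opencv_ffmpeg_streaming wrapper I actually modified; the RTMP URL and frame size are placeholders, and the audio stream is omitted for brevity.

```python
import av
import numpy as np

RTMP_URL = "rtmp://localhost:1935/live/avatar"  # placeholder SRS ingest address

# Open an FLV/RTMP output and attach an H.264 video stream.
container = av.open(RTMP_URL, mode="w", format="flv")
video = container.add_stream("libx264", rate=25)
video.width, video.height = 512, 512
video.pix_fmt = "yuv420p"

def push_frame(bgr_frame: np.ndarray) -> None:
    """Encode one BGR frame (e.g. a Wav2Lip output) and mux it into the stream."""
    frame = av.VideoFrame.from_ndarray(bgr_frame, format="bgr24")
    for packet in video.encode(frame):
        container.mux(packet)

# push_frame() is called for every batch of inference results; when done,
# flush the encoder and close the container:
#   for packet in video.encode():
#       container.mux(packet)
#   container.close()
```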

For handling the streaming of audio and video frames, I ran SRS (Simple Realtime Server) in Docker to manage the flow of data. The Python (Flask) server handles inference and sends the stream to the SRS container, which forwards it to OBS (Open Broadcaster Software); OBS then distributes the stream to end-user devices. This setup keeps the streaming process robust and scalable, capable of reaching a wide audience without sacrificing real-time interaction quality.
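As a rough sketch of the server side (the endpoint name, port, and run_pipeline helper are hypothetical, not the project's actual code), the Flask app just accepts the user's text and kicks off inference plus streaming toward the SRS ingest URL in the background:

```python
import threading
from flask import Flask, request, jsonify

app = Flask(__name__)
SRS_INGEST = "rtmp://localhost:1935/live/avatar"  # assumed SRS RTMP ingest URL

def run_pipeline(text: str, rtmp_url: str) -> None:
    """Placeholder: LLM -> TTS -> Wav2Lip -> push frames/audio to rtmp_url."""
    ...

@app.route("/chat", methods=["POST"])
def chat():
    text = request.json["text"]
    # Run inference and streaming in the background so the HTTP call returns quickly.
    threading.Thread(target=run_pipeline, args=(text, SRS_INGEST), daemon=True).start()
    return jsonify({"status": "streaming"})

if __name__ == "__main__":
    app.run(port=5000)
```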

Implementation Details

Wav2Lip operates in two main stages. The first stage is face detection, where the facial region is extracted from the video frame. This is typically the most time-consuming part, but it can be handled through preprocessing: I save the face-detection results for each video frame, which streamlines the real-time processing. The second stage is lip synthesis, where the extracted face images are manipulated so the lip movements match the audio input. To speed up this stage, I implemented TensorRT acceleration.
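To illustrate the preprocessing idea, here is a small sketch that detects the face box once per avatar frame offline and caches it, so the real-time loop only needs cropping and lip synthesis. OpenCV's Haar cascade stands in for the detector Wav2Lip actually uses, and the file paths are placeholders.

```python
import cv2
import numpy as np

def precompute_face_boxes(video_path: str, out_path: str) -> None:
    """Detect one face box per frame and cache them as an (N, 4) array of x, y, w, h."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    boxes = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces):
            boxes.append(faces[0])
        elif boxes:
            boxes.append(boxes[-1])          # reuse the previous box if detection misses
        else:
            h, w = frame.shape[:2]
            boxes.append((0, 0, w, h))       # fall back to the full frame
    cap.release()
    np.save(out_path, np.asarray(boxes))

# At inference time the cached boxes replace per-frame detection:
#   boxes = np.load("avatar_face_boxes.npy")
#   x, y, w, h = boxes[frame_idx]
#   face_crop = frame[y:y+h, x:x+w]
```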

The TTS component runs inference directly on the GPU, which is already quite fast.

To further reduce the latency of the first frame, I segment the LLM's output at commas and periods. This enables stream-based generation, minimizing the delay before speech delivery starts and ensuring a smoother, more interactive user experience.
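Here is a minimal sketch of that segmentation, assuming the LLM reply arrives as a stream of text chunks; the delimiter set is just an example (it includes full-width punctuation since the TTS is multilingual).

```python
def segment_stream(chunks, delimiters=",.，。"):
    """Group streamed text chunks into clauses that end at a comma or period,
    so TTS can start on the first clause instead of waiting for the full reply."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if buffer and buffer[-1] in delimiters:
            yield buffer
            buffer = ""
    if buffer:                 # flush anything left after the stream ends
        yield buffer

# Example with chunks as they might arrive from a streaming LLM API:
chunks = ["Sure", ",", " here", " is", " the", " answer", "."]
print(list(segment_stream(chunks)))   # ['Sure,', ' here is the answer.']
```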

Future Outlook

This sprint to build a digital human interface over my holiday break is just the beginning. I see it as a first step toward a larger goal: developing a full-fledged operating system built around LLMs, where digital humans are not just assistants but the core of user interaction.

What began as a holiday sprint to create a real-time digital human interface has become the foundation for KoppieOS, my formal endeavor to redefine operating systems through the use of LLMs.

KoppieOS aims to make the interaction between humans and computers as natural and intuitive as human-to-human communication, using avatars as the core interface. This system underscores the feasibility and impact of integrating sophisticated AI technologies in everyday computing environments. By enhancing how we interact with machines, KoppieOS is set to transform our digital experiences into something more personal and engaging.

Let’s push the boundaries of what we think is possible and reimagine our interaction paradigms. Let’s not just use technology; let’s interact with it.


Zion Huang

Co-Founder of KoppieAI. Full-Stack Engineer. AI/3D/Gaming. Learning Market.