A Virtual YouTuber System Without Deep Learning
Create your VTuber system with just CPU and webcam!
In this article, I present a Virtual YouTuber (VTuber) system written in C++ that needs nothing more than an ordinary CPU and a webcam. I will also talk about the VTuber phenomenon and some basics of gaze tracking. Finally, I will share some personal thoughts after working on this system.
Introduction of VTuber
A VTuber is a person who streams or uploads to YouTube using an anime avatar that follows their movements in real time. [¹] Though VTubers originated in Japan in the mid-2010s, the trend started to boom at the beginning of 2020, helped along by the COVID-19 pandemic and the push of commercial companies such as hololive [²] and nijisanji. [³]
Motivation
One day, my friend sent me a project in which he created a VTuber system written in Python. After glancing at his code, an idea came to me: why not implement this VTuber system in C++ for a performance boost, and to show off my C++ skills? Furthermore, I wanted to strengthen my image processing knowledge by implementing this VTuber project.
Also, many VTuber systems use deep learning techniques or specialized hardware for very precise motion capture [⁴], but not everyone can afford such devices, whether specialized hardware or a GPU. What's more, they require extra setup procedures and environment configuration, which may not be feasible in some scenarios. In comparison, this VTuber system only needs an ordinary CPU and a webcam, for ease of setup and cost efficiency.
Architecture
The architecture of this system is described as follows:
Server
The server is responsible for detecting the user’s face and its landmarks. After detecting the landmarks, it will calculate the movement of the eyes and mouth. Next, the movement data will be transmitted to the client via WebSocket.
Client
The client is responsible for displaying the VTuber and its movements. To display in real time, the client uses WebSocket to receive the movement data. After receiving the data sent from the server, it displays the VTuber and its movements in the browser.
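To make the transport concrete, here is a minimal sketch of what the server side could look like when handing a movement update to websocketd. Since websocketd relays every line a program writes to stdout as a WebSocket message, "sending data via WebSocket" boils down to printing one JSON line per frame. The struct and field names below are placeholders of mine, not necessarily the exact format vface-server-cpp and vface-web agree on.

#include <cstdio>

// Hypothetical movement record; the real project may use different fields.
struct Movement {
    double eye_l;   // left eye openness, 0 (closed) to 1 (open)
    double eye_r;   // right eye openness
    double mouth;   // mouth openness
    double gaze_x;  // horizontal gaze ratio
    double gaze_y;  // vertical gaze ratio
};

// websocketd forwards every line printed to stdout to the browser client,
// so streaming an update is just printing one JSON line per frame.
void send_movement(const Movement &m) {
    std::printf("{\"eye_l\":%.3f,\"eye_r\":%.3f,\"mouth\":%.3f,"
                "\"gaze_x\":%.3f,\"gaze_y\":%.3f}\n",
                m.eye_l, m.eye_r, m.mouth, m.gaze_x, m.gaze_y);
    std::fflush(stdout); // flush so the client sees the update immediately
}

int main() {
    send_movement({0.8, 0.8, 0.2, 0.5, 0.5}); // one example frame
    return 0;
}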
Processing Procedures of Gaze Tracking and Mouth Movement
I guess you would like to know how the server detects the face and calculates the movements under the hood. So in this section, I am going to walk through the details, from the webcam frames to the face movement stream data.
- Capture the video stream from the webcam, frame by frame.
- Resize the input picture.
It is common knowledge that resizing the image speeds up image processing, since there are fewer pixels to work on. After testing, resizing the input image to half its width and height gives roughly a 2x speedup while not hugely affecting the face detection and gaze tracking steps.
- Grayscale the input picture.
Again, you should grayscale your image if you do not need the color channels for further processing. What's more, the Dlib face detector runs faster on grayscale images than on RGB ones.
- Run face landmark detection via Dlib.
(A sketch of these capture, preprocessing, and landmark steps follows this list.)
- Detect the regions of the eyes.
In this step, we detect the eyes. To speed up processing and get a more accurate result, I crop out images containing only the eyes.
- Retrieve the pupil of each eye region.
As indicated in [⁶], trying thresholds between 5 and 100 and then choosing the second-to-last contour after sorting by contour area suits most cases. Therefore, I adapted this Python package and rewrote it in C++ to fit my project (see the pupil sketch after this list).
- Calculate the face movement (gaze and mouth).
After getting the face landmarks and the pupil locations, it is time to calculate the face movement. The calculation is adapted from my friend's VTuber project in [⁷] (a rough illustration follows this list as well).
- Stream the data through WebSocket to the client.
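Below is a rough sketch of the capture, resize, grayscale, and landmark steps, assuming OpenCV for the camera work and Dlib's 68-point shape predictor (the shape_predictor_68_face_landmarks.dat model file is downloaded separately from dlib.net). The actual vface-server-cpp code may differ in its details.

#include <opencv2/opencv.hpp>
#include <dlib/opencv.h>
#include <dlib/image_processing/frontal_face_detector.h>
#include <dlib/image_processing.h>
#include <vector>

int main() {
    cv::VideoCapture cap(0);                       // open the webcam
    if (!cap.isOpened()) return 1;

    dlib::frontal_face_detector detector = dlib::get_frontal_face_detector();
    dlib::shape_predictor predictor;
    dlib::deserialize("shape_predictor_68_face_landmarks.dat") >> predictor;

    cv::Mat frame, small, gray;
    while (cap.read(frame)) {
        cv::resize(frame, small, cv::Size(), 0.5, 0.5);   // half width and height
        cv::cvtColor(small, gray, cv::COLOR_BGR2GRAY);    // drop the color channels

        dlib::cv_image<unsigned char> dlib_img(gray);     // wrap cv::Mat for Dlib
        std::vector<dlib::rectangle> faces = detector(dlib_img);
        if (faces.empty()) continue;

        dlib::full_object_detection shape = predictor(dlib_img, faces[0]);
        // shape.part(36..41) and shape.part(42..47) are the eye landmarks,
        // shape.part(48..67) the mouth; the later steps work on these.
    }
    return 0;
}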
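The pupil step can be sketched as follows, adapted from the idea in [⁶]: binarize the cropped eye region, find the contours, sort them by area, and take the second-largest one as the pupil (the largest tends to be the eye region itself). The function name and the exact blurring choices here are mine, not necessarily what the project uses.

#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

// Given a cropped grayscale eye region and a binarization threshold,
// estimate the pupil center. Returns false if no usable contour is found.
bool find_pupil(const cv::Mat &eye_gray, int threshold, cv::Point &center) {
    cv::Mat blurred, binary;
    cv::medianBlur(eye_gray, blurred, 5);                 // suppress noise
    cv::threshold(blurred, binary, threshold, 255, cv::THRESH_BINARY);

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_TREE, cv::CHAIN_APPROX_NONE);
    if (contours.size() < 2) return false;

    // Sort contours by area in ascending order and keep the second-to-last.
    std::sort(contours.begin(), contours.end(),
              [](const std::vector<cv::Point> &a, const std::vector<cv::Point> &b) {
                  return cv::contourArea(a) < cv::contourArea(b);
              });

    cv::Moments m = cv::moments(contours[contours.size() - 2]);
    if (m.m00 == 0.0) return false;
    center = cv::Point(static_cast<int>(m.m10 / m.m00),   // contour centroid
                       static_cast<int>(m.m01 / m.m00));
    return true;
}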
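The movement calculation itself follows [⁷], which I will not reproduce here. Just to illustrate the flavor, one common way to turn the 68 Dlib landmarks into eye and mouth openness ratios looks like this; the landmark indices are the standard ones of the 68-point model, but the actual formulas in calculator.py may well differ.

#include <dlib/image_processing.h>
#include <cmath>

// Euclidean distance between two Dlib landmark points.
static double dist(const dlib::point &a, const dlib::point &b) {
    return std::hypot(double(a.x() - b.x()), double(a.y() - b.y()));
}

// Rough eye openness for one eye, given the index of its first landmark
// (36 or 42 in the 68-point model): average vertical lid distance over
// horizontal eye width, similar to an eye aspect ratio.
double eye_openness(const dlib::full_object_detection &s, int i0) {
    double vertical   = (dist(s.part(i0 + 1), s.part(i0 + 5)) +
                         dist(s.part(i0 + 2), s.part(i0 + 4))) / 2.0;
    double horizontal = dist(s.part(i0), s.part(i0 + 3));
    return vertical / horizontal;
}

// Rough mouth openness: inner-lip gap over mouth width (landmarks 60-67).
double mouth_openness(const dlib::full_object_detection &s) {
    return dist(s.part(62), s.part(66)) / dist(s.part(60), s.part(64));
}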
System Setup
In this part, I will summarize the setup of this vface-server-cpp project.
Server
- Download the vface-server-cpp from here: https://github.com/Cuda-Chen/vface-server-cpp
- Install the dependencies, namely:
* OpenCV
* Dlib
* websocketd
- Compile the project by typing:
mkdir build && cd build && cmake .. && make
- Execute the program by typing:
$ websocketd --port=5566 ./vface_server_cpp
- Run the client (namely, vface-web).
Client
As for client setup, the repo is maintained by my friend, so please visit it for setup instructions: https://github.com/c910335/vface-web
Result
Thanks to common image processing techniques such as resizing, thresholding, and region of interest (ROI) cropping, my VTuber system can detect and calculate the face key points in about 10 ms per frame. After the calculation, the data is transmitted to the client, which then draws the animated character you choose.
The following image shows this adorable character being driven by my face:
You may notice that the character cannot close her eyes entirely. The reason is that I sat too far from the webcam, and the eye-closing movement needs further adjustment for each individual.
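If you want to check the roughly 10 ms per-frame figure on your own machine, wrapping the per-frame work in a std::chrono measurement is enough. This is just a generic timing sketch, not the project's actual benchmark code.

#include <chrono>
#include <cstdio>

// Measure one invocation of the per-frame work (detection + calculation)
// in milliseconds.
template <typename F>
double time_ms(F &&process_frame) {
    auto start = std::chrono::steady_clock::now();
    process_frame();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    // Stand-in workload; replace with the real face processing of one frame.
    double ms = time_ms([] {
        volatile double x = 0;
        for (int i = 0; i < 1000000; ++i) x = x + i;
    });
    std::printf("frame processed in %.2f ms\n", ms);
    return 0;
}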
Summary
In this post, I showed you my VTuber system and introduced the background of VTubers. I also listed the processing procedure, from capturing your face to detecting and calculating the face key points. Finally, I demonstrated the result and noted some points for further improvement.
References
[¹] https://www.urbandictionary.com/define.php?term=VTuber
[⁴] https://gist.github.com/emilianavt/cbf4d6de6f7fb01a42d4cce922795794
[⁵] https://github.com/DeepVTuber/DeepVTB#head-pose-estimation
[⁶] https://github.com/antoinelame/GazeTracking
[⁷] https://github.com/c910335/vface-server/blob/master/calculator.py
Originally published at https://cuda-chen.github.io on May 16, 2021.
If you have any thoughts or questions to share, please contact me at clh960524[at]gmail.com. Also, you can check my GitHub repositories for other works. If you are, like me, passionate about machine learning, image processing, and parallel computing, feel free to add me on LinkedIn.