A few insights on humanizing digital interactions using computer graphics…

Published in Armilar Blog, Dec 11, 2020

… and why Armilar invested in Didimo

By now we should all be familiar with expressions such as “software is eating the world” or “Artificial Intelligence is the new electricity”. More than just famous quotes, they illustrate the scale of the digital revolution under way. From remote working with immersive collaborative tools to socializing on social media and gaming platforms, we are in the process of “digitizing humans”, giving birth to a new era of digital interaction.

But those digital interactions should not be devoid of “real-life” elements. In fact, they tend to be more effective the closer they are to real physical experiences. For example: (i) online fashion shopping is significantly enhanced by a virtual try-on process that uses a realistic representation of the shopper and the garments; (ii) remote social interaction is more impactful when it is based on digital human representations that look and speak in a lifelike way.

What this means is that while we continue the process of digitizing humans, we are already feeling the need to “humanize” digital interactions. This humanization factor may be added through multiple dimensions according to the type of digital interaction. In an ultra-simplified model:

Human-to-human digital interactions (like social media, remote work teams …) are improved by the use of realistic representations of humans, objects and scenarios;
Human-to-machine digital interactions (like digital customer care, gaming …) benefit from using both realistic representations and intelligent, emotionally aware, computer agents.

Easier said than done … the introduction of any humanizing dimension in a computer system is a really complex technical challenge, which ends up constraining many market opportunities. Let’s try to understand why this is the case by focusing on one of the most relevant and impactful dimensions of humanization: generating realistic representations of human faces (aka lifelike avatars).

Why is it so difficult to create a lifelike avatar?

Moving from a picture or video of a real human face to a lifelike animated 3D avatar requires complex Computer Graphics (CG) techniques, which currently face 2 fundamental challenges: (i) achieving realism requires significant manual work and a complex sequence of compute-intensive algorithms; (ii) the execution of those CG algorithms is bounded by finite computational resources and an execution window, forcing trade-offs.

Challenge 1: Manual dependency and the use of complex compute-intensive algorithms

Generating and animating a lifelike looking avatar face requires the execution of a CG pipeline with 3 major steps, as summarized in the chart below.

It all starts with a piece of capture equipment (2D, depth or high-resolution camera…) acquiring initial data (photo, scan, video) of the face being digitized. With this data, a point cloud (spatial data points associated with the person’s facial anatomy) is created. As with any input to a process, the “quality” of this point cloud is of great importance … and by quality we mean resolution (number of points), detail (information per point) and the lighting conditions. Even though depth cameras (providing richer resolution and detail) are emerging fast, the market today is still dominated by standard-resolution 2D cameras, which immediately constrains the “quality” of the input.

With the point cloud generated, it’s time to move to the 2nd stage: modelling. It begins by transforming the acquired independent spatial data points into mathematical geometries such as convex polygons whose vertices are a subset of the acquired points. We can think of these polygonal surfaces as our digital skin, having associated texture maps informing about the color (and other parameters) to be displayed in different lighting and viewing scenarios.
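As a rough illustration of this step, the sketch below turns a toy point cloud into a triangle mesh. It assumes the points already sit on a regular grid, which real captured data does not, and the names `Mesh` and `mesh_from_grid` are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass

Point = tuple[float, float, float]


@dataclass
class Mesh:
    vertices: list[Point]              # the acquired 3D points
    faces: list[tuple[int, int, int]]  # each face indexes three vertices


def mesh_from_grid(points: list[list[Point]]) -> Mesh:
    """Triangulate a rows x cols grid of points (a toy stand-in for a point cloud)."""
    rows, cols = len(points), len(points[0])
    vertices = [p for row in points for p in row]
    faces = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c  # index of the top-left corner of this grid cell
            # Split each quad cell into two triangles.
            faces.append((i, i + 1, i + cols))
            faces.append((i + 1, i + cols + 1, i + cols))
    return Mesh(vertices, faces)


# A flat 3 x 3 patch of "skin" (hypothetical coordinates).
grid = [[(float(x), float(y), 0.0) for x in range(3)] for y in range(3)]
mesh = mesh_from_grid(grid)
print(len(mesh.vertices), len(mesh.faces))
```

Every face only stores indices into the shared vertex list, so the polygon vertices are literally a subset of the acquired points. A denser grid would give more, smaller triangles over the same area.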

Well, but this is where the bigger problems start … our face is extremely complex. While there are regions where large polygons can do the job (e.g. cheek surface), there are other regions (e.g. eyes, mouth, wrinkles …) that require much smaller polygons to capture the specificities of the topology. This is a very complex process to automate and it is highly dependent on the density of the acquired point cloud: a low-resolution capture device produces a sparser point cloud, so there might not be enough data to support the creation of such smaller polygons, resulting in uncanny avatars.

Once this static mesh model is generated (aka wire-frame mesh model), a final process of this 2nd stage is launched: adding animation structures to the model. There are basically 2 options:

Bone-based approach — Sculpting skeleton elements (joints, bones …) in the mesh model to trigger different movements. It goes without saying that the placement and tuning of those motion structures (aka rig) is of extreme complexity, requiring manual trial-and-error;
Blend shapes approach — Creating a set of facial expressions, (e.g. smiling), mostly by manual manipulation of the mesh polygons, which can then be deformed to generate new animations.

These 2 options may be used together to generate faster and more complete animations: on the one hand, blend shapes accelerate the process of simulating the most common facial expressions; on the other hand, the rig adds the morphology granularity to fine-tune or generate non-modelled movements.
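To make the blend shapes idea more concrete, here is a minimal sketch of the usual linear formulation, in which each expression is stored as a displaced copy of the neutral mesh and new poses are weighted mixes of those displacements. The tiny two-vertex "face", the shape names and the `blend` function are assumptions made for illustration only.

```python
def blend(neutral, targets, weights):
    """Return neutral + sum_k w_k * (target_k - neutral), vertex by vertex."""
    blended = []
    for i, (nx, ny, nz) in enumerate(neutral):
        x, y, z = nx, ny, nz
        for name, w in weights.items():
            tx, ty, tz = targets[name][i]
            # Add this expression's displacement, scaled by its weight.
            x += w * (tx - nx)
            y += w * (ty - ny)
            z += w * (tz - nz)
        blended.append((x, y, z))
    return blended


# Hypothetical data: a neutral pose and two hand-sculpted expressions.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
targets = {
    "smile": [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    "jaw_open": [(0.0, -2.0, 0.0), (1.0, 0.0, 0.0)],
}

half_smile = blend(neutral, targets, {"smile": 0.5})
mixed = blend(neutral, targets, {"smile": 1.0, "jaw_open": 0.5})
print(half_smile)
print(mixed)
```

Evaluating a pose is cheap; the expensive part is sculpting the target shapes in the first place, which is the manual work described above.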

With our ready-to-animate model completed, we can now start to create animations and display them on multiple devices. This is precisely the 3rd and last stage of this pipeline: rendering. Considering that most common displays are 2D, and we have a 3D ready-to-animate model, this final stage is based on: (i) selecting the conditions and viewing angle to be displayed; (ii) converting the model’s data structures (convex polygons) into a discrete matrix of pixels ready to be displayed. Given the need to sweep the full content of each frame, generating the pixel matrix with minimal loss of detail, this rendering process is very compute-intensive, which is why several rendering engines still struggle to run on general-purpose HW architectures, like the ones we find on standard mobile devices.
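The two rendering steps can be sketched in a few lines, under strong simplifying assumptions: a pinhole camera looking down the z axis, a single triangle, and no lighting, textures or depth handling. The functions `project`, `edge` and `rasterize` are illustrative names, not part of any particular engine.

```python
def project(vertex, focal_length, width, height):
    """Pinhole projection of a camera-space point (z > 0) to pixel coordinates."""
    x, y, z = vertex
    if z <= 0:
        raise ValueError("vertex must be in front of the camera")
    u = focal_length * x / z + width / 2
    v = height / 2 - focal_length * y / z  # image rows grow downward
    return (u, v)


def edge(a, b, p):
    """Signed area test: tells which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(triangle, width, height):
    """Sweep every pixel and mark those whose centre falls inside the 2D triangle."""
    a, b, c = triangle
    pixels = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            p = (col + 0.5, row + 0.5)
            w0, w1, w2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
            # Inside if the point is on the same side of all three edges.
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or (
                w0 <= 0 and w1 <= 0 and w2 <= 0
            )
            if inside:
                pixels[row][col] = 1
    return pixels


# One triangle of the mesh, in camera space (hypothetical coordinates).
triangle_3d = [(-1.0, 1.0, 2.0), (1.0, 1.0, 2.0), (-1.0, -1.0, 2.0)]
triangle_2d = [project(v, focal_length=4.0, width=8, height=8) for v in triangle_3d]
frame = rasterize(triangle_2d, width=8, height=8)
for row in frame:
    print("".join("#" if px else "." for px in row))
```

Even this toy version visits every pixel for every triangle, which hints at why the cost grows so quickly with resolution and mesh density.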

Moreover, in addition to the intrinsic complexity and manual dependency of these algorithms, the execution of this CG pipeline also requires the use of several commercial software packages, each specialized in a particular modelling or rendering function, which makes the handling of this process more complex and prone to integration problems.

Challenge 2: The graphical fidelity triangle trade-off

At this point we know that achieving a lifelike avatar requires executing a very complex set of algorithms. But why is this complexity a real issue? Because we have 2 finite execution resources: one is the available computational power (the HW) and the other is the available time to complete the avatar creation. These finite resources may be exhausted if the computation is highly complex. Let’s say you’re playing a game on your mobile device. If you wanted lifelike avatars in the game, it would become so compute-intensive (so many instructions and so much data to process) within a window of a few milliseconds (the time to render a new frame after an action by the gamer) that the chipset in your mobile device might be unable to deliver the result (new frame) on time.

For a given HW capacity, the way to handle this is to sacrifice some CG variables in order to make the pipeline less compute-intensive. There are 3 main CG variables affecting the quality/complexity (let’s call it fidelity as a proxy to quality) of the output: resolution (number of pixels), detail (information per pixel) and number of frames per second (FPS). The problem is that when we sacrifice one variable, we immediately lose quality on the image or video being displayed. And this is exactly the trade-off that CG engineers need to deal with: find a feasible combination of the 3 variables (and not maximize each individually) that ensures the timely delivery of results for a given execution architecture.

The Graphical Fidelity Triangle presented below illustrates this trade-off: (i) the size of the triangle is defined by the available resources (computation and time); (ii) only combinations of the 3 variables within the triangle are available. Getting back to our mobile gaming example, it might not be feasible to maximize together the detail and resolution of each frame, without sacrificing FPS. Choosing a more responsive gaming feeling, on the other hand, would result in a lower frame quality (potential pixelization).
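A back-of-the-envelope version of this trade-off can be written down with a deliberately crude cost model: work per second is taken as pixels × operations per pixel × FPS, and it must fit within what the hardware can deliver. The model, the function names and all the numbers below are assumptions for illustration, not measurements of any device.

```python
def frame_budget_ms(fps):
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def is_feasible(width, height, ops_per_pixel, fps, ops_per_second):
    """True if resolution x detail x FPS fits within the compute budget."""
    required = width * height * ops_per_pixel * fps
    return required <= ops_per_second


def max_fps(width, height, ops_per_pixel, ops_per_second):
    """Highest frame rate the budget allows for a given resolution and detail."""
    return ops_per_second / (width * height * ops_per_pixel)


BUDGET = 50_000_000_000  # hypothetical device: 50 billion operations per second

print(frame_budget_ms(60))                       # time window per frame at 60 FPS
print(is_feasible(1920, 1080, 500, 60, BUDGET))  # full detail at 60 FPS
print(is_feasible(1920, 1080, 500, 30, BUDGET))  # same detail, half the frame rate
print(is_feasible(1920, 1080, 250, 60, BUDGET))  # half the detail at 60 FPS
print(max_fps(1920, 1080, 500, BUDGET))
```

In this toy setting, full detail at full HD does not fit at 60 FPS, but it does once either the frame rate or the per-pixel detail is halved, which is the kind of compromise the triangle describes.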

Even though the last 15 to 20 years have seen the introduction of dedicated Graphics Processing Units (GPUs), which increased the available computational power in multiple devices (particularly in mobile units), the requirements for detail and resolution have also increased, so this trade-off remains very demanding.

In summary, generating a realistic avatar is currently very challenging since:

It relies on a complex pipeline of algorithms, with several components requiring manual artistic work, which results in a time-consuming and non-scalable process;
Those algorithms are very compute-intensive when dealing with the generation of lifelike avatars, forcing image and motion quality trade-offs;
The CG pipeline is highly dependent on the quality of the input captured data;
There is no end-to-end commercial platform incorporating all the process steps.

How is the market being impacted by these technical challenges?

The lack of effective and fast solutions to generate lifelike avatars is hindering the development of new applications based on digital interactions. Let’s look at a few examples:

In the gaming industry, gamers want more realistic experiences, particularly by having their lifelike avatar as the central character. To achieve this realism today, the gamer would most likely have to go to a studio, take several light-controlled pictures, and then wait (maybe a few days) for the execution of the manual-based avatar creation process. The same would happen if we wanted the gaming scenario to be populated by other lifelike characters. This is highly ineffective, constraining a market valued at +$150b (according to Mordor Intelligence);
In the industrial manufacturing market, companies want to expand their shop floor workforce training processes by using virtual training platforms. Today, these CG-based industrial immersive platforms still lack the required lifelike resolution and live motion to be considered viable HR training options, constraining the virtual training and simulation market valued at +$200b (according to ResearchAndMarkets).

While the 2 previous examples require the rendering process to be done “online”, meaning that the system has a few milliseconds to generate the next frame in response to an action taken by the user on the previous frame, there are also “offline” market applications impacted by these current technical CG challenges. The visual effects (VFX) industry is one of those cases. We might think that by having the possibility of rendering each frame offline (without a real-time execution window) and by using massive parallel-computing HW architectures, the VFX industry would not be heavily impacted by the current challenges in the generation of lifelike avatars. But that’s not the case … the problem here is that for the fidelity required in recent video formats, the existing computing algorithms become so demanding that rendering a frame may take several hours to complete, which has a huge economic impact on movie production costs. As an example, just check how much time Pixar took to render a single frame in Toy Story 4 … 60 to 160 hours!

How is the industry responding to such a market need?

Let’s take a bird’s eye view of the avatar generation space according to 2 critical dimensions: (i) the fidelity of the avatar generated; and (ii) the throughput of the underlying CG system. These 2 dimensions split the market landscape into 3 blocks, as illustrated in the chart below:

“Emoji Arena” — these are solutions that provide a “cartoonish” avatar. In essence, some CG computations and input variables (resolution, detail) are simplified, enabling results in real time. Apple’s Memoji application is an example of this class of solutions;

“VFX Arena” — solutions to deliver hyper realistic avatars, at the cost of using top-notch acquisition equipment and several days / weeks of artistic work. VFX houses working for the film industry play in this arena;

“New tech” — emerging solutions (mostly still under development) aiming to provide the best of the 2 worlds: realism, fast. In order to overcome the mentioned technical challenges, 2 approaches are being explored:

Use non-geometric processes to provide scalability while removing some processing complexity. Engineers do this by creating large datasets of input (face captured) and output (lifelike avatar) data, training machine learning models to infer the components of the modelling step. But for this promising approach to be effective it would require huge amounts of high quality data (people from different regions, different capturing lighting conditions …) which would only be accessible to a few big companies;
The other approach is geometry-based and is focused on introducing AI to optimize both manual processes and geometric brute-force algorithms (used to process images without any prior knowledge). By optimizing several geometry-based algorithms, the computational complexity is decreased while the system throughput increases.

Meet Didimo … and its revolutionary technology to unlock lifelike digital interactions

Complex technical challenges … constrained business opportunities … ineffective commercial solutions; where some see problems, others see opportunities. Let me introduce Didimo, a spin-off of the University of Porto, leveraging +15 years of R&D on computer graphics led by the company founder and CEO, Prof. Verónica Orvalho. The company has a patented geometry-based CG technology for the fast generation of lifelike digital representations of humans and objects.

Didimo’s technology is based on a very simple approach of “Create Once, Use Many”. The idea is to remove all the unscalable manual work and most “brute-force” compute-intensive algorithms used each time a model is created. To embody that idea, Didimo developed: (i) manually optimized databases (DB) containing facial mesh models and general-purpose motion structures; (ii) a set of AI-powered and geometry-based algorithms to reuse the DB data in each new avatar being created.

Let me briefly explain why Didimo solves the core technical challenges introduced above, with the help of the illustrative chart below:

Didimo has a fully automated and scalable CG pipeline. For each avatar being generated it reuses already modelled and optimized data, avoiding manual modelling steps;
By using pre-optimized mesh models and general purpose motion structures, the complexity of the modelling stage is significantly reduced, enabling the execution of Didimo’s CG pipeline in lighter and more standard HW architectures;
Didimo is more immune to the quality of the captured data, since it has the potential to interpolate some parts of the facial topology using models in the DB;
Didimo developed its own resource-light rendering engine, applicable in multiple HW architectures and in different existing CG platforms, further expanding the possibilities of animating and displaying lifelike avatars.

Leveraging this unique technology, Didimo brings a powerful value proposition to the market: (i) lifelike avatars can be generated even from standard-resolution single images, in just a few minutes (compared to days, if not weeks); (ii) an end-to-end platform is provided to execute the full 3D pipeline, or just the missing components of a client’s pre-existing CG process.

Didimo has the potential to unlock new business opportunities, taking a giant leap forward for humanizing digital interactions, with a few examples being:

Remote collaboration and socialization are going to be further boosted by the new 5G infrastructure. Didimo provides those applications with “on-the-fly” capabilities to generate lifelike avatars, which can redefine the way people communicate and work remotely;
The gaming industry will be able to further empower gamers, by providing them with tools to make their lifelike avatar the central character in the game, without having to leave their room;
E-commerce platforms will be able to revamp their virtual try-on processes, providing the shopper with lifelike interactions of how a product would fit and look;
Industrial companies can better train their workforce by using lifelike shop floor scenarios and realistic representations of the interactions between workers and the machinery.

In this scenario of revolutionizing and humanizing ever-growing digital interactions, it’s with great excitement that Armilar joins Didimo’s inspiring journey!

Article written by João Dias, Principal at Armilar Venture Partners