Written in collaboration with Andre Sato
The Vertex Animation Texture (VAT) technique, also known as Morphing Animation, consists of baking animations into a texture map for use later in the vertex shader. This technique enables us to keep the mesh completely static on the CPU side (changing the SkinnedMeshRenderer component to MeshRenderer component). And here at Wildlife, we are using this technique in our games.
If you are a Tennis Clash player, you probably have seen the crowds moving when you score. Yep, this is the technique being used. It’s nice, isn’t it? Let’s check how the technique works.
We ran a test on a low-end device (Samsung S6) with a scene containing more than 2000 instances of 800 vertices each. There is no frame drop, even with many other systems running at the same time:
Detailing the technique
By now, you should be wondering how this technique works. We run the animation and, for each frame, read the local-space position (x, y, z) of every vertex. We store this information in a matrix of size (Num_Verts x Num_Frames), where each cell is a Vector3. Because normals also change on every frame of the animation, we store them in a second matrix with the same layout.
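The baking step can be sketched on the CPU side as follows. This is a minimal Python illustration, not the Unity baking code: `bake_positions`, `sample_pose` and `toy_pose` are hypothetical names, and the sketch assumes that some callback can return the local-space vertex positions for a given frame. Each row of the result corresponds to one frame and each column to one vertex, matching a texture whose width is the vertex count.

```python
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def bake_positions(sample_pose: Callable[[int], List[Vec3]],
                   num_frames: int) -> List[List[Vec3]]:
    """Return a matrix with one row per frame and one column per vertex."""
    rows = []
    for frame in range(num_frames):
        # Local-space positions of every vertex at this frame (assumed callback).
        rows.append(list(sample_pose(frame)))
    return rows


def toy_pose(frame: int) -> List[Vec3]:
    """Hypothetical two-vertex 'animation': both vertices rise 0.5 units per frame."""
    return [(0.0, 0.5 * frame, 0.0), (1.0, 0.5 * frame, 0.0)]


baked = bake_positions(toy_pose, num_frames=3)
num_rows, num_columns = len(baked), len(baked[0])
```

The normals matrix would be filled the same way, with a callback that returns per-vertex normals instead of positions.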
The information in these matrices will be accessed in the vertex shader. The most common way to read a matrix in the vertex shader is through textures. Although textures are most commonly used to store color information (albedo, roughness map, specular map, tint map, etc.), they can store any kind of information (e.g. a normal map).
To store the vertex position and normal information we chose the RGBAHalf texture format, which gives us 16 bits per channel. We then map our matrix in such a way that we write:
Vector3 (x, y, z) -> Vector3 (r, g, b)
In the vertex shader, we sample the texture using the tuple [vertexId, animFrame]. Below is an example of a simple vertex shader using vertexId and reading the first frame of the animation, the first line of the texture:
struct appdata
{
    uint vertexId : SV_VertexID;
};
v2f vert (appdata v)
{
    v2f o;
    float vertCoords = v.vertexId;
    float animCoords = 0;
    float4 texCoords = float4(vertCoords, animCoords, 0, 0);
    float4 position = tex2Dlod(_AnimVertexTex, texCoords);
    // Use position values as standard local-space coordinates
    o.vertex = UnityObjectToClipPos(position);
    return o;
}
Note: Why tex2Dlod? tex2D lets the hardware choose the appropriate mip level, which it derives from screen-space derivatives that only exist in the fragment shader. No such choice can be made in the vertex shader, so we use tex2Dlod and explicitly set the LOD to 0 (the fourth component of the coordinate).
The fetched vertex position is in local space and must be transformed to world space and then to clip space. With this, the character is rendered in the first frame of the animation:
To animate the character, we just need to vary the sampled line based on the current time. We have modified the value of animCoords in the above code to:
float animCoords = frac (_Time.y * _AnimVertexTex_TexelSize.y);
_AnimVertexTex_TexelSize = Vector4 (1/width, 1/height, width, height)
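We’re just normalizing time to the texture height: the v coordinate advances by one texel row (one stored frame) per second and wraps back to the first row after the last one. Multiplying _Time.y by the desired number of frames per second changes the playback speed.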
In this case, our character has two animations. As they are vertically concatenated, they will execute both in sequence.
The orange mesh outline does not match the animation because, on the CPU, the character is static. Unity is not aware on the CPU side that the vertices will be moved by the GPU.
Using a bilinear filter is very helpful to the technique as it will automatically interpolate between two frames making the animation smoother.
This can drastically reduce the number of frames needed to be stored in the texture. Low-frequency animations can use very few frames without noticeable quality loss due to the bilinear interpolation.
However, we must be very careful when using bilinear filtering, as we only want to interpolate vertically (between frames). We can never interpolate horizontally: neighboring texels in a row belong to different vertices, and blending them results in terrible artifacts.
Bilinear filtering interpolates across half a texel in each direction, so we need to ensure that we always sample the horizontal center of the texel. Let (u, v) be the coordinates of the sample: the value of u must always fall in the center of a texel. In practice we shift the normalized vertexId by half a texel:
float mapCurentVertexToTextureCoords(int vertexId, float invTextureWidth)
{
    // Normalize texture coords
    float normalizedVertexId = vertexId * invTextureWidth;
    // Get half of the x-coord texel size
    float halfTextelCoord = 0.5 * invTextureWidth;
    // Add half of the x-coord texel size to sample
    // the middle of the texel (uv snapping)
    return normalizedVertexId + halfTextelCoord;
}
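To see why the half-texel shift matters, here is a small CPU-side sketch in Python that mimics linear filtering along one texture row. It is only an illustration under the usual assumption that texel centers sit at (i + 0.5) / width. `vertex_to_u` and `sample_row_linear` are hypothetical helper names, and the row holds one scalar per vertex instead of a full position.

```python
import math


def vertex_to_u(vertex_id: int, texture_width: int) -> float:
    """Normalized u coordinate at the center of the texel for vertex_id."""
    return (vertex_id + 0.5) / texture_width


def sample_row_linear(row, u: float) -> float:
    """Linear filtering along a row, with texel centers at (i + 0.5) / width."""
    width = len(row)
    x = u * width - 0.5          # position measured in texel-center units
    i0 = math.floor(x)
    t = x - i0                   # blend weight toward the right neighbor
    left = row[min(max(i0, 0), width - 1)]
    right = row[min(max(i0 + 1, 0), width - 1)]
    return left * (1.0 - t) + right * t


row = [10.0, 20.0, 30.0, 40.0]                            # one value per vertex
centered = sample_row_linear(row, vertex_to_u(2, 4))      # exactly vertex 2
uncentered = sample_row_linear(row, 2 / 4)                # mixes vertices 1 and 2
```

Sampling at the texel center returns the stored value of vertex 2 untouched, while sampling at the unshifted coordinate blends it half-and-half with its neighbor.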
2048x2048 Limit (OpenGLES 2.0)
Since the texture has the width equal to the number of mesh vertices, we have the limitation of not being able to use meshes larger than 2048 vertices.
Because we use bilinear filtering over the animation frames, the texture height can be very small.
We can then divide the texture in half and concatenate it below making it as close to a square as possible. This can be repeated resulting in the image below using four splits.
This operation can be applied recursively in order to use most of the 2048x2048 space. We just need to send a uniform to the shader that tells how many times the texture has been divided.
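One possible way to address such a folded texture is sketched below in Python. The exact layout is an assumption for illustration: the vertex range is cut into `num_blocks` equal-width blocks that are stacked vertically, each block containing all frames. `folded_size` and `folded_texel` are hypothetical names, and `num_blocks` plays the role of the uniform mentioned above.

```python
import math


def folded_size(num_vertices: int, num_frames: int, num_blocks: int):
    """Texture (width, height) after stacking num_blocks vertex blocks vertically."""
    block_width = math.ceil(num_vertices / num_blocks)
    return block_width, num_frames * num_blocks


def folded_texel(vertex_id: int, frame: int, num_vertices: int,
                 num_frames: int, num_blocks: int):
    """Integer (column, row) of the texel that stores vertex_id at the given frame."""
    block_width = math.ceil(num_vertices / num_blocks)
    block = vertex_id // block_width      # which stacked block holds this vertex
    column = vertex_id % block_width      # horizontal position inside the block
    row = block * num_frames + frame      # blocks are stacked top to bottom
    return column, row


# Hypothetical example: a 4000-vertex mesh with 32 stored frames, split in two.
size = folded_size(4000, 32, 2)
texel = folded_texel(2500, 3, 4000, 32, 2)
```

In this example the 4000x32 strip, which would exceed the 2048 limit, becomes a 2000x64 texture.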
To avoid generating one texture for each animation and consequently having a controller change this texture reference at runtime, we vertically concatenate multiple animations. We can send another uniform to the shader informing the current animation to play.
Below is shown a texture with 9 concatenated animations. We also use the technique described in the previous section to make better use of the 2048x2048 space:
Since we have several concatenated animations, we need to simulate a “Clamp Wrap Mode” vertically: we cannot sample the first half of the first texel nor the last half of the last texel since bilinear filtering will interpolate between animations.
Solution: We clamp the y coordinate so that we never sample at the edges of the animation.
clamp (yCoord, first_anim_texel + half_texel, last_anim_texel - half_texel)
Thus, we simulate a “wrap mode clamp” for any region of the texture.
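The clamp can be sketched in Python as follows. This assumes each animation occupies a contiguous, inclusive range of texel rows. `clamp_v_to_animation`, `first_row` and `last_row` are hypothetical names for values that would reach the shader as uniforms.

```python
def clamp_v_to_animation(v: float, first_row: int, last_row: int,
                         texture_height: int) -> float:
    """Clamp v to the row centers of one animation so filtering never leaves it."""
    half_texel = 0.5 / texture_height
    v_min = first_row / texture_height + half_texel        # center of first row
    v_max = (last_row + 1) / texture_height - half_texel   # center of last row
    return min(max(v, v_min), v_max)


# Hypothetical 8-row texture where the current animation occupies rows 2..5.
below = clamp_v_to_animation(0.26, 2, 5, 8)    # would blend with row 1
inside = clamp_v_to_animation(0.50, 2, 5, 8)   # already safe, left unchanged
above = clamp_v_to_animation(0.74, 2, 5, 8)    # would blend with row 6
```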
Blending Different Animations
Transitioning animations is a big problem for this technique: if the character is in the middle of a standing animation and we have him perform a sitting animation he will “pop” (teleport from standing up).
One way to mitigate this problem would be to sample both animations (the current and the next) and interpolate between them.
This has two problems: it doubles the number of samples, and the interpolation is always linear. If the character has his arm down in the current animation and up in the next one, he will not perform the desired shoulder rotation: the vertices of his hand and arm will follow a straight line through the body.
An alternative and inexpensive solution is to use the technique only when there is no need to blend animations.
In Tennis Clash we ask the animators to ensure the animations have the same final and initial frames so they can be switched without popping.
The crowd starts by playing a loop idle animation. When an animation event occurs (e.g. clapping) each character waits for its current idle animation to finish and then starts the clap animation. At the end of the clap, it rolls back to the idle loop. That is, we guarantee that the character will never start an animation while in the middle of another one and since the first and last frames are the same, no popping is visible.
Since the idle animations are short and out of sync with each other (each character uses an individual start time offset), the resulting effect is quite organic.
There are some problems using RGBAHalf (64 bits/pixel) textures:
- Slower to read compared to RGBA (32 bits/pixel) textures.
- Some devices do not support float textures (e.g. Mali 400 GPU devices crash the application if you try to load an RGBAHalf texture). A fallback solution is required for OpenGLES 2.0.
- No support for RGBHalf (48 bits/pixel): we only need three channels and the alpha is unused.
The ideal solution would be to use only RGB24 (24 bits/pixel) textures supported by every device.
Encoding Normal to RGB24 texture
Normals can be easily encoded into an RGB24 texture as they are normalized vectors in the range [-1, 1].
We just need to encode them to the range [0, 1]:
normal.xyz = normal.xyz * 0.5f + 0.5f;
And in the shader, after reading the texture, we go back to the range [-1, 1]:
normal.xyz = (normal.xyz - 0.5h) * 2.0h;
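As a quick check of the precision this gives, the following Python sketch round-trips a normal through 8-bit channels. `encode_normal` and `decode_normal` are hypothetical names. The rounding to the nearest byte is an assumption about how the texture is written, and the real decode runs in the shader as shown above.

```python
def encode_normal(normal):
    """Map each component from [-1, 1] to [0, 1], then quantize to a byte."""
    return tuple(int(round((c * 0.5 + 0.5) * 255.0)) for c in normal)


def decode_normal(rgb):
    """Map each byte back to [0, 1], then to [-1, 1]."""
    return tuple((b / 255.0 - 0.5) * 2.0 for b in rgb)


original = (0.0, 0.6, -0.8)                  # unit-length example normal
stored = encode_normal(original)             # three bytes, one per channel
restored = decode_normal(stored)
max_error = max(abs(a - b) for a, b in zip(original, restored))
```

Under these assumptions the error per component stays within 1/255, which is usually acceptable for lighting. Re-normalizing the vector after decoding is a common extra step.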
Encoding Vertices to RGB24 Texture
To be able to use only RGB24 textures to encode the vertices, we need to encode each RGBAHalf value to RGB24.
- RGBAHalf uses 16bits per channel representing a float.
- RGB24 uses 8 bits per channel representing an int.
One way to encode/decode a Float into an RGBA texture:
// Encoding / decoding [0..1) floats into 8 bit / channel RGBA.
// Note that 1.0 will not be encoded properly.
inline float4 EncodeFloatRGBA (float v)
{
    float4 kEncodeMul = float4 (1.0, 255.0, 65025.0, 16581375.0);
    float kEncodeBit = 1.0 / 255.0;
    float4 enc = kEncodeMul * v;
    enc = frac (enc);
    enc -= enc.yzww * kEncodeBit;
    return enc;
}
inline float DecodeFloatRGBA (float4 enc)
{
    float4 kDecodeDot = float4 (1.0, 1 / 255.0, 1 / 65025.0, 1 / 16581375.0);
    return dot (enc, kDecodeDot);
}
As our floats use only 16-bits, we could optimize the above encoding to use only two 8 bit channels instead of four:
// Encoding / decoding [0..1) floats into 8 bit / channel RG.
// Note that 1.0 will not be encoded properly.
inline float2 EncodeFloatRG (float v)
{
    float2 kEncodeMul = float2 (1.0, 255.0);
    float kEncodeBit = 1.0 / 255.0;
    float2 enc = kEncodeMul * v;
    enc = frac (enc);
    enc.x -= enc.y * kEncodeBit;
    return enc;
}
inline float DecodeFloatRG (float2 enc)
{
    float2 kDecodeDot = float2 (1.0, 1 / 255.0);
    return dot (enc, kDecodeDot);
}
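To get a feeling for the precision of the two-channel encoding, here is a Python port of the same arithmetic with an explicit 8-bit quantization step in between. The port and the helpers `to_byte` and `from_byte` are illustrative assumptions about how the values end up in an RGB24 texture. They are not part of the shader code.

```python
import math


def frac(x: float) -> float:
    return x - math.floor(x)


def encode_float_rg(v: float):
    """Python port of EncodeFloatRG for v in [0, 1)."""
    y = frac(v * 255.0)
    x = frac(v) - y / 255.0
    return x, y


def decode_float_rg(x: float, y: float) -> float:
    """Python port of DecodeFloatRG."""
    return x + y / 255.0


def to_byte(channel: float) -> int:
    """Quantize a [0, 1] channel to 8 bits (assumed round-to-nearest)."""
    return max(0, min(255, int(round(channel * 255.0))))


def from_byte(byte: int) -> float:
    return byte / 255.0


def round_trip(v: float) -> float:
    x, y = encode_float_rg(v)
    return decode_float_rg(from_byte(to_byte(x)), from_byte(to_byte(y)))


worst_error = max(abs(round_trip(i / 1000.0) - i / 1000.0) for i in range(1000))
```

With this quantization the coarse channel is stored exactly and only the fine channel is rounded, so the error stays below 1/65025 across the sampled values.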
Thus we can use only 2 channels to encode 1 float, so we will need 6 channels (2 RGB24 textures) to encode 3 floats.
A possible distribution of channels would be:
X float encoded in (R_texture_1 + G_texture_1) -> (X1, X2)
Y float encoded in (B_texture_1 + R_texture_2) -> (Y1, Y2)
Z float encoded in (G_texture_2 + B_texture_2) -> (Z1, Z2)
Texture 1 (RGB24) = (X1, X2, Y1)
Texture 2 (RGB24) = (Y2, Z1, Z2)
Then, to retrieve the original floats in the shader we use the DecodeFloatRG defined above:
float3 tex1 = tex2Dlod (_AnimVertexTex1, texCoords).rgb;
float3 tex2 = tex2Dlod (_AnimVertexTex2, texCoords).rgb;
float positionX = DecodeFloatRG (tex1.rg);
float positionY = DecodeFloatRG (float2 (tex1.b, tex2.r));
float positionZ = DecodeFloatRG (tex2.gb);
float3 position = float3 (positionX, positionY, positionZ);
However, this technique only works if the floats are in the range [0, 1) (http://marcodiiga.github.io/encoding-normalized-floats-to-rgba8-vectors). We therefore generate a bounding box for our animation and normalize the vertices within that box. At runtime, after sampling the textures, we scale the normalized vertices back to object space.
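The normalization against the bounding box can be sketched in Python as below. `compute_bounds`, `normalize` and `denormalize` are hypothetical names. Clamping the normalized value just under 1.0 is an assumption added here because 1.0 itself cannot be encoded, and the scale back to object space would run in the shader with the bounds passed in as uniforms.

```python
def compute_bounds(frames):
    """Axis-aligned bounding box over every vertex of every frame."""
    points = [p for frame in frames for p in frame]
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi


def normalize(p, lo, hi, max_value=1.0 - 1.0 / 65025.0):
    """Map a position into [0, 1) per axis; flat axes map to 0."""
    out = []
    for i in range(3):
        size = hi[i] - lo[i]
        t = (p[i] - lo[i]) / size if size > 0.0 else 0.0
        out.append(min(t, max_value))   # keep strictly below 1.0 (assumption)
    return tuple(out)


def denormalize(n, lo, hi):
    """Scale a normalized position back to object space."""
    return tuple(lo[i] + n[i] * (hi[i] - lo[i]) for i in range(3))


# Hypothetical two-frame, two-vertex animation.
frames = [[(-1.0, 0.0, 2.0), (1.0, 4.0, 2.0)],
          [(0.0, 2.0, 2.0), (3.0, 1.0, 2.0)]]
lo, hi = compute_bounds(frames)
normalized = normalize((1.0, 2.0, 2.0), lo, hi)
restored = denormalize(normalized, lo, hi)
```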
Although we have mitigated many of the compatibility problems, the technique still requires a specific hardware capability: sampling textures in the vertex shader (the tex2Dlod instruction).
If the hardware doesn’t have this capability, we have no option but to fall back to another SubShader that just converts the input vertex to clip space without animation.
In Unity, we can check this capability using:
#pragma target 3.0
#pragma require samplelod
Mind that target 3.0 covers OpenGL ES 2.0 with extensions. (In particular, the technique will be supported if the device has the extension EXT_shader_texture_lod.) Devices with OpenGL ES 3.0 or above support it natively.
If the device supports instancing it is possible to send different parameters for each instance, then each instance can run a different animation at a different time.
#pragma target 3.5
#pragma require instancing samplelod
If not, in the case of Tennis Clash, where the audience doesn’t move (the transform matrix is static), we can still use static batching and send the animation parameters as uniforms. However, everyone will then animate in sync (which is not a big problem in the case of audiences).
The Vertex Animation Texture technique is highly optimized and well suited to scenes that need a huge number of animated objects without animation blending, where you can wait for one animation to finish before starting another (as is the case with the audience in Tennis Clash).
With this post, we hope to have contributed to explaining and spreading the technique through examples and details on how vertex animation works.