Decompiling and optimizing Nvidia shaders

Nvidia instructions for Shadertoy New shader, envydis output


  1. Build nvcachetools.
  2. Build envytools.
    Closed-source alternative: Nvidia's proprietary nvdisasm. Install CUDA, or download the cuda-nvdisasm package and extract the nvdisasm file.

Short instructions

  1. ./nvcachedec nv_bin/*.toc objs
  2. ./nvucdump objs/object00000.nvuc sections
    or object00001.nvuc, or any other number.
    See "Advanced usage", "File names to shader names" below.
  3. ./envydis -i -mgm107 sections/section4_0001.bin
    or /usr/local/cuda-11.8/bin/nvdisasm --binary SM87 objs/object00000.nvuc
nvdisasm output for Shadertoy New shader


Get compiled shaders from shader cache.

Example: the compiled shaders of a single application, collected after launching that app.

Advanced usage

Instruction Set Reference:

To count the number of instructions:
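A minimal sketch of such a count, in Python. It assumes the disassembly was saved as text in the usual nvdisasm layout, where every instruction line starts with an address comment such as `/*0010*/`, optionally followed by a predicate like `@!P0`. The function name `count_instructions` is mine, not part of any tool, and envydis output uses a different layout that this pattern does not cover.

```python
import re

# Assumed nvdisasm line shape (not verified against every SM version):
#     /*0010*/              @!P0 BRA `(.L_1) ;
# Encoding-only lines look like "/* 0x000fc4... */" (note the space after "/*")
# and are skipped because the address pattern allows no spaces.
INSTR_RE = re.compile(
    r"^\s*/\*[0-9a-fA-F]{4,}\*/\s+"   # address comment, e.g. /*0010*/
    r"(?:@!?U?P\w+\s+)?"              # optional predicate, e.g. @!P0
    r"([A-Z][A-Z0-9_.]*)"             # opcode with modifiers, e.g. MUFU.SIN
)


def count_instructions(disasm_text: str) -> int:
    """Count the lines of a disassembly listing that hold an instruction."""
    return sum(1 for line in disasm_text.splitlines() if INSTR_RE.match(line))
```

To use it, redirect the nvdisasm output to a file, read that file as text, and pass the text to `count_instructions`.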

File names to shader names

This mapping holds only when the code in each shader is unique:

  • object00000.nvuc is shaders/shadertoy/buf0.glsl
  • object00001.nvuc is shaders/src/buf.vert
  • object00002.nvuc is shaders/shadertoy/buf1.glsl
  • object00003.nvuc is shaders/shadertoy/buf2.glsl
  • object00004.nvuc is shaders/shadertoy/buf3.glsl
  • object00005.nvuc is shaders/shadertoy/main_image.glsl
  • object00006.nvuc is shaders/src/main.vert
The empty-template gives only 3 different decompiled *.nvuc files: the 4 buffers (each one is just discard), the vertex shader, and the image shader (the simple New Shader code).

Example usage

To find the source of a major slowdown in shaders:

Original vs optimized shader

Comparing the same GLSL shader code compiled in OpenGL and in Vulkan:

Comparing two shaders compiled in Vulkan and OpenGL

STL (store to local memory) is always bad! Smaller arrays and fewer array reads/writes are always better.

OpenGL statistics for the shader in BufA
Vulkan statistics for the same shader

Optimization of my GLSL Auto Tetris shader:

Analyzing and optimizing neural (ML) shaders:

Instruction statistics for each shader

From this I can assume:

  • The slowdown is not caused by sin/cos instructions: both shaders contain the same number of them.
  • FFMA (FP32 Fused Multiply and Add) instructions are used about 2x more in the first shader.
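A comparison like the one above can be sketched in Python by building a per-opcode histogram for each disassembly and sorting by the largest difference. This is only an illustration under the same assumption about the nvdisasm text layout (instruction lines start with an address comment such as `/*0010*/`); `opcode_histogram` and `compare` are hypothetical helper names, and modifiers are stripped so that `MUFU.SIN` and `MUFU.COS` both count as `MUFU`.

```python
import re
from collections import Counter

# Assumed nvdisasm instruction line: "/*0010*/   @!P0 FFMA.FTZ R0, R1, R2, R3 ;"
INSTR_RE = re.compile(
    r"^\s*/\*[0-9a-fA-F]{4,}\*/\s+(?:@!?U?P\w+\s+)?([A-Z][A-Z0-9_.]*)"
)


def opcode_histogram(disasm_text: str) -> Counter:
    """Count instructions per base opcode (modifiers after '.' are dropped)."""
    counts = Counter()
    for line in disasm_text.splitlines():
        match = INSTR_RE.match(line)
        if match:
            counts[match.group(1).split(".")[0]] += 1
    return counts


def compare(first: Counter, second: Counter) -> list:
    """Return (opcode, count_first, count_second) rows, biggest difference first."""
    opcodes = set(first) | set(second)
    rows = [(op, first[op], second[op]) for op in opcodes]
    # Sort by descending absolute difference, then by opcode name for stability.
    return sorted(rows, key=lambda row: (-abs(row[1] - row[2]), row[0]))
```

Opcodes such as FFMA, MUFU, STL and LDL then show up at the top of the list when their counts differ most between the two shaders.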
Rename the original mainImage to mainImage0 and run it twice: now I have two monkeys.
2x of everything
Section CONST in nvdisasm output

My conclusion from all this: around one or two kilobytes of CONST is the limit for “good performance” on my Nvidia GPU.
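To get a feeling for how quickly a neural shader reaches that size, here is a rough back-of-the-envelope estimate in Python. It assumes every hard-coded weight ends up as a separate 4-byte float in CONST, with no deduplication or padding by the compiler, which may not match the real layout. The helper names and the default budget of 2 KB are illustrative only.

```python
FLOAT_BYTES = 4  # one 32-bit float constant


def const_bytes(floats: int = 0, vec4: int = 0, mat4: int = 0) -> int:
    """Rough CONST size in bytes for hard-coded float, vec4 and mat4 literals.

    Assumption: each component is stored once as a 4-byte float.
    """
    return FLOAT_BYTES * (floats + 4 * vec4 + 16 * mat4)


def fits_budget(n_bytes: int, budget_kib: float = 2.0) -> bool:
    """Check an estimate against the 'one or two kilobytes' rule of thumb."""
    return n_bytes <= budget_kib * 1024


# 16 mat4 weight matrices are already a full kilobyte of constants.
one_kib = const_bytes(mat4=16)
```

Under these assumptions, 32 mat4 matrices alone are exactly 2 KB, so adding even a few vec4 bias vectors goes past the budget.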

My attempt to fix this:


Optimized ML/Neural shader

Final optimized shader — Optimized ML/Neural shader.

Yes, there is some quality loss:

Optimized ML/Neural shader



