Published in Geek Culture

Decompiling Nvidia shaders, and optimizing

Nvidia instructions for Shadertoy New shader, envydis output

Tools

  1. Build nvcachetools.
  2. Build envytools.
    Closed-source alternative: nvdisasm, proprietary software made by Nvidia. Install CUDA, or download cuda-nvdisasm and extract the nvdisasm file.

Short instruction

  1. ./nvcachedec nv_bin/*.toc objs
  2. ./nvucdump objs/object00000.nvuc sections
    or object00001.nvuc, or another number.
    To find which number belongs to which shader, see "File names to shader names" under Advanced usage below.
  3. ./envydis -i -mgm107 sections/section4_0001.bin
    or /usr/local/cuda-11.8/bin/nvdisasm --binary SM87 objs/object00000.nvuc
nvdisasm output for Shadertoy New shader
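The three steps above can be chained from a small script. This is only a sketch: the helper name `decompile_commands` and the file `nv_bin/example.toc` are hypothetical, and the tool paths, the `objs` and `sections` directories, the section file name and the `gm107` target are copied from the commands listed above. It only builds and prints the command lines; running them needs nvcachetools and envytools built in the current directory.

```python
from pathlib import PurePosixPath


def decompile_commands(toc_file, out_dir="objs", index=0,
                       section="section4_0001.bin", arch="gm107"):
    """Return the three steps of the short instruction as argv lists."""
    # nvcachedec writes object00000.nvuc, object00001.nvuc, ... into out_dir.
    obj = PurePosixPath(out_dir) / f"object{index:05d}.nvuc"
    return [
        ["./nvcachedec", str(toc_file), out_dir],
        ["./nvucdump", str(obj), "sections"],
        ["./envydis", "-i", f"-m{arch}", str(PurePosixPath("sections") / section)],
    ]


if __name__ == "__main__":
    # "nv_bin/example.toc" is a placeholder, not a real cache file name.
    for argv in decompile_commands("nv_bin/example.toc"):
        print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` to execute the step; changing `index` selects another object file.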

Instruction

Get compiled shaders from shader cache.
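One way to collect the cache of a single application is to point the driver at an empty directory before launching it. The sketch below assumes the Nvidia Linux driver's `__GL_SHADER_DISK_CACHE` and `__GL_SHADER_DISK_CACHE_PATH` environment variables; the directory name `nv_bin` is taken from the commands above, and the helper names are hypothetical. The driver may create its own subdirectories below that path, so the `.toc` files are searched recursively.

```python
import os
from pathlib import Path


def shader_cache_env(cache_dir):
    """Copy the current environment and redirect the shader disk cache."""
    env = dict(os.environ)
    env["__GL_SHADER_DISK_CACHE"] = "1"
    env["__GL_SHADER_DISK_CACHE_PATH"] = str(Path(cache_dir).resolve())
    return env


def find_toc_files(cache_dir):
    """Return every *.toc file below cache_dir, sorted by path."""
    return sorted(Path(cache_dir).rglob("*.toc"))
```

A launch could then look like `subprocess.run(["./my_app"], env=shader_cache_env("nv_bin"))`, where `./my_app` stands for whatever application compiles the shaders.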

Example of the compiled shaders of a single application, after launching it.

Advanced usage

Instruction Set Reference:

To calculate number of instructions:
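A minimal sketch of such a count, assuming nvdisasm-style listing lines in which each instruction starts with an address comment such as `/*0010*/`, optionally followed by a predicate such as `@!P0`. The sample listing is made up for illustration, and the regular expression may need adjusting for envydis output or other GPU generations.

```python
import re
from collections import Counter

# Assumed line shape: /*0020*/   @!P0 FFMA R5, R2, R3, R0 ;   /* 0x... */
_INSTR = re.compile(
    r"/\*[0-9a-fA-F]{4,}\*/\s+"   # address comment
    r"(?:\{\s*)?"                  # optional opening brace of a dual-issue pair
    r"(?:@!?U?P[0-9T]+\s+)?"       # optional predicate
    r"([A-Z][A-Z0-9_]*)"           # base opcode, without .SUFFIX modifiers
)

# Made-up sample, not real shader output.
SAMPLE = r"""
        /*0000*/                   MOV R1, c[0x0][0x28] ;     /* 0x00000a0000017a02 */
                                                               /* 0x000fe40000000f00 */
        /*0010*/                   FFMA R0, R2, R3, R4 ;
        /*0020*/              @!P0 FFMA R5, R2, R3, R0 ;
        /*0030*/                   MUFU.SIN R6, R5 ;
        /*0040*/                   STL [R1], R6 ;
        /*0050*/                   EXIT ;
"""


def count_opcodes(listing):
    """Count instructions per base opcode in a disassembly listing."""
    counts = Counter()
    for line in listing.splitlines():
        match = _INSTR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    counts = count_opcodes(SAMPLE)
    print("total instructions:", sum(counts.values()))
    for opcode, n in sorted(counts.items()):
        print(opcode, n)
```

Lines that hold only an encoding comment are skipped because the pattern requires a hex address directly after `/*`.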

File names to shader names

This mapping holds only when the code in each shader is unique:

  • object00000.nvuc is shaders/shadertoy/buf0.glsl
  • object00001.nvuc is shaders/src/buf.vert
  • object00002.nvuc is shaders/shadertoy/buf1.glsl
  • object00003.nvuc is shaders/shadertoy/buf2.glsl
  • object00004.nvuc is shaders/shadertoy/buf3.glsl
  • object00005.nvuc is shaders/shadertoy/main_image.glsl
  • object00006.nvuc is shaders/src/main.vert
Three decompiled *.nvuc files from the empty template: a buffer shader (all 4 buffers just discard), the vertex shader, and the image shader, which is the simple New Shader code.

Example usage

To find the source of a major slowdown in a shader:

Original vs optimized shader

Comparing the same GLSL shader code compiled in OpenGL and in Vulkan:

Comparing two shaders compiled in Vulkan and OpenGL

STL (store to local memory) is always bad! Smaller arrays and fewer array reads/writes are always better.
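A quick way to check a listing for this is to count the local-memory instructions. The sketch below looks for STL and also for LDL, its load counterpart, which is an addition not mentioned above. It assumes the mnemonic appears as a separate word before the `;` that ends each instruction; the helper name is hypothetical.

```python
import re

_LOCAL_OP = re.compile(r"\b(STL|LDL)\b", re.IGNORECASE)


def local_memory_ops(listing):
    """Count local-memory stores (STL) and loads (LDL) in a listing."""
    counts = {"STL": 0, "LDL": 0}
    for line in listing.splitlines():
        code = line.split(";")[0]  # ignore anything after the instruction
        for name in _LOCAL_OP.findall(code):
            counts[name.upper()] += 1
    return counts
```

Comparing these two numbers before and after shrinking an array shows whether the change actually removed local-memory traffic.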

OpenGL statistics for the shader in BufA
Vulkan statistics for the same shader

Optimization of my GLSL Auto Tetris shader:

Analyzing and optimizing neural (ML) shaders:

Instruction statistics for each shader

From this I can assume:

  • The cause is not the sin/cos instructions: both shaders contain the same number of them.
  • The first shader uses about 2x more FFMA (FP32 fused multiply-add) instructions.
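This kind of comparison can be sketched as a per-opcode ratio between two instruction counts. The function name is hypothetical, and the numbers in the usage note are invented to illustrate the shape of the result, not measured from the shaders discussed here.

```python
def opcode_ratios(first, second):
    """Map each opcode to (count_first, count_second, ratio).

    The ratio is None when the opcode does not occur in the second shader.
    """
    result = {}
    for opcode in sorted(set(first) | set(second)):
        a = first.get(opcode, 0)
        b = second.get(opcode, 0)
        result[opcode] = (a, b, a / b if b else None)
    return result
```

For example, `opcode_ratios({"FFMA": 200, "MUFU": 8}, {"FFMA": 100, "MUFU": 8})` reports a ratio of 2.0 for FFMA and 1.0 for MUFU, which is the pattern described in the two points above.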
Rename the original mainImage to mainImage0 and run it twice: I have two monkeys.
2x of everything
CONST section in the nvdisasm output

My conclusion from all this: around one or two KB of CONST is the limit for “good performance” on my Nvidia GPU.
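As rough arithmetic for that limit, the sketch below estimates the constant data of a shader from its number of float constants. It assumes each constant is stored as a 4-byte float, and it uses 2048 bytes as the threshold only because that is the upper end of the range observed above on one GPU; both function names are hypothetical.

```python
def const_bytes(num_floats, bytes_per_float=4):
    """Estimated size of the constant data in bytes (assumes 32-bit floats)."""
    return num_floats * bytes_per_float


def within_const_budget(num_floats, limit_bytes=2048):
    """True if the estimated constant data fits the assumed budget."""
    return const_bytes(num_floats) <= limit_bytes
```

Under these assumptions 512 float weights come to exactly 2048 bytes, so a network with more weights than that would need to be split or reduced to stay within the budget.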

My attempt to fix this:

Result:

Optimized ML/Neural shader

Final optimized shader: Optimized ML/Neural shader.

Yes, there is some quality loss:

Optimized ML/Neural shader
