Fix your RTX 4090’s poor performance in Stable Diffusion with new PyTorch 2.0 and Cuda 11.8

J Night
5 min read · Mar 20, 2023


UPDATE: October 2023

All the issues described below are now resolved in the latest versions of the stable-diffusion webui, and the guide below is mostly obsolete, except for the options you can (but don’t have to) put into webui-user.bat. I will keep the guide online for people who are stuck with old versions of the webui, or who need a reference for the relevant options.

So, if you are starting fresh, just make sure you have the latest version of AUTOMATIC1111's stable-diffusion webui installed. Installation instructions are on his github: https://github.com/AUTOMATIC1111/stable-diffusion-webui

ORIGINAL ARTICLE:

TLDR;

Upgrade PyTorch in your AUTOMATIC1111’s webui installation to 2.0 with CUDA 11.8 using the following command

pip install torch==2.0.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu118

then disable xformers and add

--opt-sdp-attention 

to COMMANDLINE_ARGS instead.

For troubleshooting and more context read on.

For the last few months the Stable Diffusion community has been fighting the RTX 40xx series’ poor performance, and by now everyone is familiar with the magic cuDNN DLLs swap fix. While it works and can more than double your 4090’s performance, there is now a…

New fix in the house

This fix squeezes even more juice from your Nvidia graphics card than the old cuDNN DLLs swap (and the extra performance jump now applies not only to the 40xx series; 30xx and 20xx cards can benefit from it too). I saw an additional 25% jump on an RTX 4090 over the pure cuDNN DLLs swap.

PyTorch finally released the official version of their 2.0 library, together with official support for CUDA 11.8. Installing it brings performance improvements on two levels: PyTorch 2.0 is generally faster than version 1.13, and CUDA 11.8 is required to squeeze all the potential out of the new 40xx graphics card series.

The Fix

The below is written for Windows environments, but I reckon Linux folks won’t have trouble translating it to theirs.

In the simplest terms, all you have to do is upgrade your AUTOMATIC1111’s webui to use PyTorch 2.0 with cu118 and disable --xformers if you had it enabled.

To upgrade PyTorch, run the command

pip install torch==2.0.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu118

from your stable-diffusion-webui folder.

If you ever enabled xformers, you now need to disable it. Remove --xformers and related switches from COMMANDLINE_ARGS in your webui-user.bat file, as PyTorch 2.0 has its own optimisations in place.

To enable the new PyTorch optimisations and maximise your it/s, add --opt-sdp-attention to COMMANDLINE_ARGS.

Note: If you forget to remove --xformers from your command args, the webui will downgrade your PyTorch back to the old version.

That’s it. Start your webui and enjoy the performance boost.

To verify that the update was successful, check the bottom of your webui page; it should say something like this:

python: 3.10.6 • torch: 2.0.0+cu118 • xformers: N/A

If it says torch: 1.13 (or anything less than 2.0) or cu117 (or anything less than cu118), then something went wrong. Read on below.
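If you want to sanity-check the string programmatically, here is a small sketch of my own (not part of the webui; it assumes the `release+cuXXX` format that the footer and pip show):

```python
def check_versions(torch_version: str, min_release=(2, 0), min_cuda=118):
    """Parse a torch version string like '2.0.0+cu118' and verify both parts.

    Returns True only when the release is >= 2.0 AND the CUDA tag is >= cu118.
    """
    release, _, local = torch_version.partition("+")
    rel = tuple(int(x) for x in release.split(".")[:2])
    cuda = int(local[2:]) if local.startswith("cu") else 0
    return rel >= min_release and cuda >= min_cuda

print(check_versions("2.0.0+cu118"))   # True
print(check_versions("1.13.1+cu117"))  # False
```

Anything that prints False there is the "something went wrong" case above.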

Caveats, troubleshooting and known workarounds

The above fix won’t work for everyone out of the box, and some extensions might be incompatible.

The fix might be blocked by an incompatible Python version or by already-installed incompatible Python dependencies (most likely added when installing some Stable Diffusion or AUTOMATIC1111 gui extensions).

So, the more extensions you have installed in your webui, the more likely the above PyTorch upgrade will fail.

If that happens, your best bet is to pull a fresh clone of the webui repo from github and try again. After that, install your favourite extensions one by one. Some might fail, unfortunately, and it’s up to you to decide whether that is a showstopper.
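To see what is actually installed in your venv while diagnosing a conflict, a stdlib-only helper can help (the function name and package list are my own illustration):

```python
from importlib import metadata

def report_versions(packages=("torch", "torchvision", "xformers")):
    # Quick diagnostic: report what pip actually installed in the active venv,
    # to spot extension-pinned packages that may be blocking the upgrade.
    out = {}
    for name in packages:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            out[name] = "not installed"
    return out
```

Run it with the venv’s Python interpreter (e.g. `venv\Scripts\python.exe` on Windows) so it inspects the right environment.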

You might run out of VRAM much faster than before

Some people with weaker cards (read: with lower VRAM) reported that after the fix they now fail to render with Highres fix enabled, or that they have to reduce the upscaling amount significantly. This is probably due to the different optimisation applied by sdp compared to what xformers were doing. There is no known workaround for that yet, and xformers won’t work with PyTorch 2.0 for now. Perhaps in the coming weeks we will see an optimisation fix for that, or perhaps we will get an official wheel for compatible xformers.
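To see why Highres fix hits VRAM so hard, a back-of-the-envelope sketch: in the UNet’s self-attention the token count scales with latent pixels, so a naive attention score matrix grows with the 4th power of the upscale factor (the head count below is an illustrative guess, not SD’s exact configuration):

```python
def attention_scores_bytes(latent_h, latent_w, heads=8, dtype_bytes=2):
    # Naive attention materialises a (heads, tokens, tokens) score matrix,
    # where tokens ~= latent_h * latent_w; fp16 is 2 bytes per element.
    tokens = latent_h * latent_w
    return heads * tokens * tokens * dtype_bytes

base = attention_scores_bytes(64, 64)     # 512x512 image -> 64x64 latent
hires = attention_scores_bytes(128, 128)  # 2x Highres fix -> 128x128 latent
print(hires // base)  # 16
```

A 2x upscale means 16x the score-matrix memory, which is why a memory-efficient attention path (like the one xformers provided) matters so much on low-VRAM cards.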

TIP 1: If you are starting from scratch.

If you just cloned the webgui repo, you can prepare it with PyTorch before running it for the first time. Modify your webui-user.bat to have the following lines:

set COMMANDLINE_ARGS= --opt-sdp-attention 
set TORCH_COMMAND=pip install torch==2.0.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu118

This will ensure that when you start the gui for the first time, or any time you wipe the /venv folder, the correct version of PyTorch will get installed automatically for you. Notice there is no --xformers flag, and you shouldn’t be adding it.

TIP 2: If you are applying the fix to an existing installation

If you already started webui-user.bat before updating it as in TIP 1, then you already have the old PyTorch installed. Modify your webui-user.bat to have the following lines:

set COMMANDLINE_ARGS= --opt-sdp-attention --reinstall-torch
set TORCH_COMMAND=pip install torch==2.0.0 torchvision --extra-index-url https://download.pytorch.org/whl/cu118

Notice the --reinstall-torch flag. This will force pip to re-install PyTorch to the specified version.

Run your modified webui-user.bat, let it do its thing, and when done remove the --reinstall-torch flag from it; otherwise you will end up reinstalling PyTorch (a good few gigs) every time you start the webui.
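If you script your setup, removing the flag afterwards can be automated; a minimal sketch (the function name and the whitespace-splitting approach are my own, not part of the webui):

```python
def strip_flag(line: str, flag: str = "--reinstall-torch") -> str:
    """Remove one flag from a 'set COMMANDLINE_ARGS=...' .bat line.

    Splits on whitespace, so it assumes flags are space-separated tokens.
    """
    return " ".join(tok for tok in line.split() if tok != flag)

line = "set COMMANDLINE_ARGS= --opt-sdp-attention --reinstall-torch"
print(strip_flag(line))  # set COMMANDLINE_ARGS= --opt-sdp-attention
```

Editing webui-user.bat by hand is just as quick, of course; this only matters if you provision machines repeatedly.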

TIP 3: The surest way to get this working

The surest, conflict-free, foolproof way to get this fix working is to apply TIP 1 from scratch on a fresh clone of the latest version of AUTOMATIC1111’s gui repo from github.

TIP 4: Deterministic optimisations

The optimisation option

--opt-sdp-attention

is a replacement for the --xformers one and as such comes with the same caveats: it produces non-deterministic results (i.e. different images for the same seed). To get deterministic results at the price of a slight performance hit, use

--opt-sdp-no-mem-attention

instead.
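For context, both flags route the same mathematical op, scaled dot-product attention, to PyTorch 2.0’s fused kernels; only the kernel choice differs. Here is a naive, pure-Python reference of that op (an illustration of the math, not the webui’s or PyTorch’s actual implementation):

```python
import math

def sdp_attention(q, k, v):
    # Computes softmax(q @ k^T / sqrt(d)) @ v over plain nested lists.
    # PyTorch's fused memory-efficient kernel for this op is faster but
    # not bit-exact from run to run, hence --opt-sdp-no-mem-attention.
    d = len(q[0])
    out = []
    for qrow in q:
        scores = [sum(a * b for a, b in zip(qrow, krow)) / math.sqrt(d)
                  for krow in k]
        m = max(scores)                          # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]          # softmax over key positions
        out.append([sum(w * vrow[j] for w, vrow in zip(weights, v))
                    for j in range(len(v[0]))])
    return out
```

With a query equidistant from both keys, the output is the average of the values, as expected from the softmax weighting.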

Original CUDNN dlls swap fix

If the PyTorch 2.0 fix creates issues for you that you can’t overcome (you might need your SD working in low VRAM, or some incompatible extension is a must for you), then you can use the original fix.

The original fix is discussed here: 4090 cuDNN Performance/Speed Fix (AUTOMATIC1111). Scroll down to this digested-pill comment there for the breakdown that is most relevant five months after the original post was created.

Resources

The new PyTorch 2.0 fix is discussed on GitHub here: PyTorch 2.0.0 is now GA, and on Reddit: Torch 2.0 just went GA. All the credit goes to the hard-working folks there who came up with all the answers. They cover both Windows and Linux scenarios and troubleshoot more edge cases.
