TensorFlow šŸ¤CondašŸ¤NVIDIA GPU on Ubuntu 22.04.3 LTS

Anna
Dec 26, 2023

šŸŽµā€Make friends, make friends, never, never break friendsā€¦ā€šŸŽµ

my meme

Personally, I cannot stand a barrage of TensorFlow warning messages (W) or, even worse, error alerts (E) related to GPU devices every time I run my AI projects with this library. This article is about an issue many Data Scientists encounter: the challenging compatibility between TensorFlow and the CUDA tools. In detail, I want to address and resolve the following errors/warning messages thrown during TensorFlow's initialization:

###-------------------------------1------------------------------------
E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
###-------------------------------2------------------------------------
W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
###-------------------------------3------------------------------------
I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

In my last Deep Learning project, I worked with TensorFlow 2.10. After navigating the dense forest of NVIDIA's official documentation for the CUDA Toolkit and cuDNN installations, TensorFlow recognized a GPU device, but I kept receiving warning/info messages about the absence of TensorRT and an unidentified NUMA node. However, I was working on a remote server, and these warnings did not appear to affect the overall process of DL model training, so I decided to ignore them.

Recently, I upgraded my personal PC with the NVIDIA GeForce RTX 3060. My PC runs on Ubuntu 22.04.3 LTS, and I have the official NVIDIA drivers for my GPU installed. When I check the status of my GPU using the nvidia-smi command, here's what I observe:

Status of my GPU using the nvidia-smi command
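The screenshot isn't reproduced here; the check is simply the command below. Two fields in its header are worth noting: Driver Version and CUDA Version (the latter is the highest CUDA version the installed driver supports, not an installed toolkit):

#full status table (what the screenshot shows)
nvidia-smi
#or query just the fields of interest
nvidia-smi --query-gpu=name,driver_version --format=csv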

Here is a useful command to verify that you have a CUDA-capable GPU. If the output shows that your graphics card is from NVIDIA and it is listed among NVIDIA's CUDA GPUs, your GPU is CUDA-capable.

lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 228e (rev a1)

Installing the latest TensorFlow version following the official guide

The official TensorFlow installation guide had changed significantly since my last visit. The familiar Anaconda-based instructions were no longer there. Having become accustomed to Conda, I decided to proceed with it nonetheless. Here is what I found in the "Install TensorFlow with pip" guide:

Source: https://www.tensorflow.org/install/pip

At this stage, I took a misstep: I read the requirements for GPU use and was about to install CUDA Toolkit 11.8 and cuDNN 8.6.0, along with a compatible version of TensorRT. However, starting from TensorFlow version 2.14, there's a noteworthy change in the installation process for Linux users:

The tensorflow pip package has a new, optional installation method for Linux that installs necessary Nvidia CUDA libraries through pip. As long as the Nvidia driver is already installed on the system, you may now run pip install tensorflow[and-cuda] to install TensorFlow's Nvidia CUDA library dependencies in the Python environment. Aside from the Nvidia driver, no other pre-existing Nvidia CUDA packages are necessary.

I wanted to install the latest version of TensorFlow (currently 2.15), and I decided to proceed without manually pre-installing the CUDA Toolkit and cuDNN on my PC.

Moving on to the TensorFlow installation: I prefer using Anaconda for my Python projects due to its convenience, so I began by creating a Conda environment based on Python 3.10. Following that, I executed the command from TensorFlow's official guide to install the latest version, currently 2.15, along with its CUDA dependencies.

conda create -n tf-test-1 python=3.10
conda activate tf-test-1
python -m pip install tensorflow[and-cuda]
conda list

Upon inspecting the list of packages installed in my Conda environment with the conda list command from the snippet above, I noticed that CUDA 12.2, cuDNN 8.9.4, and TensorRT 8.6.1 had been installed automatically.

Output of conda list command (tensorflow-related installed libraries)
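The full listing isn't reproduced here; a quick way to filter out just the relevant packages is:

#list only the CUDA-related packages in the active environment
conda list | grep -i -E "cuda|cudnn|tensorrt"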

Here I want to mention one thing: the CUDA version displayed in the nvidia-smi output happened to match the version installed from the PyPI repository. But it is not always like this! The CUDA version shown in nvidia-smi and the CUDA Toolkit version are two distinct entities. For further clarity, refer to the diagram below and read more about CUDA compatibility in NVIDIA's documentation.

The CUDA user mode driver (displayed in nvidia-smi output) and the CUDA Toolkit are different things. Source: https://docs.nvidia.com/deploy/cuda-compatibility/

NB! Installing CUDA and cuDNN from the pip package manager makes them reside inside the Conda environment only, so when I run the nvcc -V command globally, I see this: command 'nvcc' not found, but can be installed with sudo apt install nvidia-cuda-toolkit.
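A quick way to see where those pip-installed CUDA libraries actually live is to peek into the environment's site-packages (a sketch, assuming the and-cuda extras pulled in the nvidia-* wheels; the Python version in the path may differ):

#the CUDA libraries installed by pip reside inside the Conda environment
ls $CONDA_PREFIX/lib/python3.10/site-packages/nvidia/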

Finally, to assess how well TensorFlow 2.15 functions with these pre-installed libraries, I ran the test code suggested in TensorFlow's official pip installation guide. Let's see the output:

(tf-test-1) alpony@alpony:~$ python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-12-24 14:10:34.169720: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-24 14:10:34.190123: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-24 14:10:34.190150: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-24 14:10:34.190697: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-24 14:10:34.193989: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-24 14:10:34.541756: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-12-24 14:10:34.771530: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-24 14:10:34.791450: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-24 14:10:34.791543: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Bam! The issues with registering cuDNN, cuFFT, and cuBLAS, and a glaring absence of TensorRT! However, it turns out that I am not alone in facing these problems.

Honestly, this very StackOverflow post is what motivated me to write this article. After spending countless hours debugging and scouring the TensorFlow and NVIDIA forums and StackOverflow, I felt compelled to share my findings and a viable solution.

Source: TensorFlow GitHub Issues: cuDNN, cuFFT, and cuBLAS Errors

One interesting point raised in the discussion in TensorFlow GitHub's Issues section was that, with these kinds of errors (unregistered cuDNN, cuFFT, cuBLAS), TensorFlow might be running without cuDNN. To put this to the test, I ran commands from TensorFlow's Test module:

Checking whether tensorflow in the tf-test-1 conda env is built with CUDA and GPU support; these tests passed (returned True)
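The screenshot isn't reproduced here; the checks come from the tf.test module and look like this (both printed True in my tf-test-1 environment):

python -c "import tensorflow as tf; print(tf.test.is_built_with_cuda())"
python -c "import tensorflow as tf; print(tf.test.is_built_with_gpu_support())"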

The outputs of these test commands confirmed that TensorFlow was indeed utilizing CUDA, but the warnings about missing TensorRT were accurate: when I tried to check the version of the loaded tensorrt library, the Python session got aborted:

Output of a command that checks the loaded tensorrt version
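The exact command from the screenshot isn't reproduced here; a typical check of this kind (whose import is what aborted my session) looks like:

python -c "import tensorrt; print(tensorrt.__version__)"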

Installing TensorFlow 2.13 and CUDA libraries manually inside a Conda environment

Determined to resolve all the issues that TensorFlow throws upon each initialization, I turned to manual installation via Anaconda's Conda repositories, the approach I had used for earlier TensorFlow versions.

1. First, I manually installed the CUDA Toolkit and cuDNN from the conda-forge repository, ensuring compatibility as per NVIDIA's guidelines.

###############################################################################
# The available versions in the conda-forge repository
# (https://conda.anaconda.org/conda-forge/linux-64/):
# cudnn-(7.6.5, 8.0.5, 8.1.0, 8.2.0, 8.2.1, 8.3.2, 8.4.0, 8.4.1, 8.8.0)
# cudatoolkit-(9.2, 10.0, 10.1, 10.2, 11.0, 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8)
###############################################################################
conda create -n tf-env-ultimate python=3.10
conda activate tf-env-ultimate
conda install -c conda-forge cudatoolkit=11.8 cudnn=8.8

NB! Setting LD_LIBRARY_PATH correctly is a crucial step in this process:

mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
conda deactivate

2. Next, I installed TensorRT via the pip package manager. After reviewing the available versions, I opted for version 8.5.3.1, aligning with the CUDA package versions mentioned in TensorFlow's code.

##############################################################################
# Available versions of the tensorrt library:
# python -m pip index versions tensorrt
# Available versions: 8.6.1.post1, 8.6.1, 8.6.0, 8.5.3.1, 8.5.2.2, 8.5.1.7
##############################################################################

conda activate tf-env-ultimate
#checking that LD_LIBRARY_PATH is correct
echo $LD_LIBRARY_PATH

python -m pip install tensorrt==8.5.3.1
#locating the directory where pip placed the TensorRT shared libraries
TENSORRT_PATH=$(dirname $(python -c "import tensorrt; print(tensorrt.__file__)"))
echo $TENSORRT_PATH
#linking tensorrt library files to LD_LIBRARY_PATH
#(on my machine this resolves to /home/alpony/anaconda3/envs/tf-env-ultimate/lib/python3.10/site-packages/tensorrt)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$TENSORRT_PATH
conda deactivate
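One caveat: the export above lasts only for the current shell session and is lost after conda deactivate. A sketch of how to persist it, reusing the activate.d hook from step 1 (run it before the conda deactivate above, while $TENSORRT_PATH and $CONDA_PREFIX are still set):

#append the TensorRT directory to the environment's activation script
echo "export LD_LIBRARY_PATH=\$LD_LIBRARY_PATH:$TENSORRT_PATH" >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh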

3. Finally, I installed the TensorFlow library itself. I chose TensorFlow 2.13, as versions starting from 2.14 seemed to have the aforementioned issues with the CUDA libraries.

conda activate tf-env-ultimate
echo $LD_LIBRARY_PATH
python -m pip install tensorflow==2.13
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

šŸ£Voila! šŸ£ The issues with registering cuDNN, cuFFT, cuBLAS, and not-found TensorRT are finally resolved:

(tf-env-ultimate) alpony@alpony:~$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-12-24 15:40:53.509643: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-24 15:40:53.531641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-24 15:40:54.171272: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-24 15:40:54.186033: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-24 15:40:54.186132: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

But what about the NUMA node information? I found a partial solution detailed in the article "Fixing the "successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero" problem" by Zukhriddin, which I followed.
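The gist of that fix is to tell the kernel which NUMA node the GPU belongs to via sysfs. A sketch using the PCI address from the lspci output above (note: the value resets on every reboot):

#-1 means no NUMA node is assigned to the device
cat /sys/bus/pci/devices/0000:01:00.0/numa_node
#assign the device to NUMA node 0
echo 0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/numa_node

After implementing these steps, here's the outcome: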

Importing the tensorflow library in Python and testing it
(tf-env-ultimate) alpony@alpony:~$ python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
2023-12-24 16:45:10.414244: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-24 16:45:10.436806: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Voila! A couple of informational prints (I) and that's all.

Testing the performance of both Conda environments

Reaching the conclusion of this experience, I reflect on the hours invested in addressing those warnings and ask myself: was it truly worth the effort? To find an answer, I ran a test using a piece of Python code that builds a simple Neural Network and trains it on random data. The aim was to compare the performance (in training time) of the two TensorFlow setups: one from the official guide, and the other a product of my blood-and-tears manual installation.

Python code (inspired by this code published on StackOverflow):

import numpy as np
import tensorflow as tf
from timeit import default_timer as timer

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(2048, activation='relu'))
model.add(tf.keras.layers.Dense(2048, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(optimizer=tf.compat.v1.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

def load_data():
    data = np.load('data.npy')
    labels = np.load('labels.npy')
    return data, labels

# I used this code separately to generate the random dataset:
# import numpy as np
# seed_value = 42
#
# def random_one_hot_labels(shape, seed=None):
#     n, n_class = shape
#     np.random.seed(seed)
#     classes = np.random.randint(0, n_class, n)
#     tmp_labels = np.zeros((n, n_class))
#     tmp_labels[np.arange(n), classes] = 1
#     return tmp_labels
#
# def generate_and_save_data():
#     data = np.random.random((10000, 32))
#     labels = random_one_hot_labels((10000, 10), seed=seed_value)
#     np.save('data.npy', data)
#     np.save('labels.npy', labels)
#
# generate_and_save_data()

data, labels = load_data()

#time three full training runs of 100 epochs each
durations = []
for i in range(3):
    start = timer()
    model.fit(data, labels, epochs=100, batch_size=32)
    durations.append(timer() - start)

print(f"model.fit durations: {durations}")

The results, measured in terms of execution time (seconds), were revealing:

  • Conda environment (tf-test-1) with TensorFlow 2.15 (automatic installation of CUDA libraries):
model.fit durations: [48.48324873100137, 48.309404147999885, 48.39296498600015]
(tf-test-1) alpony@alpony:~$
  • Manually built Conda environment (tf-env-ultimate) with TensorFlow 2.13 and manually selected CUDA libraries:
model.fit durations: [46.51688312400074, 47.9981527480013, 49.01901237399943]
(tf-env-ultimate) alpony@alpony:~$

The outcomes were closer than I anticipated. It seems that sometimes, allowing minor warnings to pass without deep analysis can be the more prudent approach. Of course, TensorFlow provides some testing tools to verify that it is functioning properly (for example, the check of whether TensorFlow is built with CUDA shown earlier).

A friend of mine often cautions me against getting too absorbed in every single warning, jokingly suggesting it could lead to spending every waking hour in front of the PC.

By sharing my story, I hope to aid others who may find themselves in a similar situation, demonstrating that sometimes an easy-going approach can be equally effective.
