Using the nVidia GT 1030 for CUDA workloads on Ubuntu 16.04

Recently nVidia released a new low-end card, the GT 1030. Its specs are so low that it is not even listed on the official CUDA supported cards page! The thing is, it is also the cheapest card you can find on the market, around 70 bucks on Amazon, and the spec page lists 384 CUDA cores.

There are a fair number of people asking if this card can be used for CUDA workloads, and the answers are not always straightforward.

Luckily I got my hands on one of these. So… Can you or can you not run anything on it?

Short answer: yes, but not much.

Installation hurdles…

The problem

Most guides for CUDA-based systems will tell you to use the nvidia-375 driver package available from the official repos. It is the driver actually bundled with the repository version of CUDA 8.0.
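
Concretely, that usually boils down to something along these lines (shown here only as the baseline those guides suggest):

$ sudo apt update
$ sudo apt install --no-install-recommends nvidia-375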

If you do so with the GT 1030, you will fail, as that driver does not support the card. When calling nvidia-smi, the kernel logs will say:

Aug  3 10:26:32 ubuntu kernel: [  682.768627] NVRM: RmInitAdapter failed! (0x26:0xffff:1097)
Aug 3 10:26:32 ubuntu kernel: [ 682.768696] NVRM: rm_init_adapter failed for device bearing minor number 0

and deviceQuery will report

deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

The solution

To solve the issue, we need to install CUDA from the standalone runfile, then install the latest drivers from the 381 branch along with libcuda. On a fresh install, that looks like:

$ wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda_8.0.61_375.26_linux-run
$ chmod +x cuda_8.0.61_375.26_linux-run
$ sudo ./cuda_8.0.61_375.26_linux-run --silent --toolkit --samples

This installs the CUDA toolkit and samples without the bundled driver. Then

$ sudo add-apt-repository -y ppa:graphics-drivers/ppa
$ sudo apt update
$ sudo apt install --no-install-recommends nvidia-381 nvidia-381-dev libcuda1-381 docker.io

will install drivers that actually work for this card. We will use the Docker package later in the post.
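
Depending on your setup, you may also need to put the toolkit on your paths before building the samples; the runfile installs to /usr/local/cuda-8.0 by default, so something like this (adapt to your shell profile) does the trick:

$ export PATH=/usr/local/cuda-8.0/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH
$ nvcc --version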

The validation

All we have to do now is run nvidia-smi and deviceQuery to check everything:

$ nvidia-smi
Fri Aug 4 09:26:54 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 381.22                 Driver Version: 381.22                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     Off  | 0000:03:00.0     Off |                  N/A |
| 30%   35C    P0    12W /  30W |      0MiB /  2000MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

and

$ cd NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GT 1030"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 2000 MBytes (2097217536 bytes)
( 3) Multiprocessors, (128) CUDA Cores/MP: 384 CUDA Cores
GPU Max Clock rate: 1468 MHz (1.47 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GT 1030
Result = PASS

So YES, we do have CUDA on the board (as expected, but not officially recognized).
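
If you want one more sanity check beyond deviceQuery, the bandwidthTest sample from the same samples tree is a quick one (build and run it the same way; output omitted here):

$ cd NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest
$ make
$ ./bandwidthTest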

CUDA Workloads

Crypto Currency Mining #1: Ethereum (fails)

A while ago I dockerized the Claymore Miner, and I keep updating the image from time to time. I usually use it for scale-out tests on Kubernetes, but it works fine with Docker alone.

Note that I am not using nvidia-docker, because it expects the upstream Docker packages while I am using the version from the Ubuntu archive.
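
For reference, with the upstream packages and nvidia-docker (v1) installed, the equivalent would roughly be a one-liner such as:

$ nvidia-docker run -it samnco/claymore-miner:9.7-nvidia /entrypoint.sh

Since I am not, I pass the devices and the driver libraries in by hand: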

$ docker run -it \
-v /usr/lib/nvidia-381/bin:/usr/local/nvidia/bin \
-v /usr/lib/nvidia-381:/usr/lib/nvidia \
-v /usr/lib/x86_64-linux-gnu:/usr/lib/cuda \
-e LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib/nvidia:/usr/lib/cuda" \
-e POD_NAME=1nvidia \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
samnco/claymore-miner:9.7-nvidia \
/entrypoint.sh
-di detect is not supported anymore (obsolete)
****************************************************************
* Claymore's Dual ETH + DCR/SC/LBC/PASC GPU Miner v9.7 *
****************************************************************
ETH: 1 pool is specified
Main Ethereum pool is eu1.ethermine.org:4444
AMD OpenCL platform not found
Driver 368.81 is recommended for best performance and compatibility
Be careful with overclocking, use default clocks for first tests
Press "s" for current statistics, "0".."9" to turn on/off cards, "r" to reload pools, "e" or "d" to select current pool
CUDA initializing...
NVIDIA Cards available: 1 
CUDA Driver Version/Runtime Version: 8.0/8.0
GPU #0: GeForce GT 1030, 2000 MB available, 3 compute units, capability: 6.1
Total cards: 1 
ETH: Stratum - connecting to 'eu1.ethermine.org' <46.105.121.53> port 4444
ETH: Stratum - Connected (eu1.ethermine.org:4444)
ETHEREUM-ONLY MINING MODE ENABLED (-mode 1)
ETH: eth-proxy stratum mode
Watchdog enabled
Remote management (READ-ONLY MODE) is enabled on port 3333
ETH: Authorized
Setting DAG epoch #137...
ETH: 08/04/17-08:36:04 - New job from eu1.ethermine.org:4444
ETH - Total Speed: 0.000 Mh/s, Total Shares: 0, Rejected: 0, Time: 00:00
ETH: GPU0 0.000 Mh/s
Setting DAG epoch #137 for GPU0
Create GPU buffer for GPU0
CUDA error - cannot allocate big buffer for DAG. Check readme.txt for possible solutions.
GPU 0 failed
Setting DAG epoch #137 for GPU0
GPU 0, CUDA error 11 - cannot write buffer for DAG
GPU 0 failed
ETH: 08/04/17-08:36:26 - New job from eu1.ethermine.org:4444
ETH - Total Speed: 0.000 Mh/s, Total Shares: 0, Rejected: 0, Time: 00:00
ETH: GPU0 0.000 Mh/s
Quit signal received...
GPU0 t=34C fan=36%
WATCHDOG: GPU error, you need to restart miner :(

OK, that is a fail. It was expected though: the ETH DAG at this point is bigger than 2GB, so it does not fit in the GPU memory. No ETH mining with this card!
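
A rough back-of-the-envelope check confirms it: the DAG starts around 1 GiB and grows by roughly 8 MiB per epoch (the real figure is slightly smaller due to prime adjustment), so at epoch #137:

$ echo "$(( (1073741824 + 137 * 8388608) / 1024 / 1024 )) MiB"
2120 MiB

That is more than the ~2000 MiB nvidia-smi reports for the card, so the DAG simply cannot fit.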

Crypto Currency Mining #2: Monero (works)

For this I have an image of xmrminer which I use to showcase why handling SIGINT in Docker images is a necessity. Monero is definitely a CPU-oriented currency: you do not get much more from a GPU than you would from a high-end CPU. A single hyperthreaded Xeon core will give you 10 to 20 H/s, while a high-end Radeon or nVidia Pascal card will give about 500 to 1000 H/s.

Here, with a 96x8 or a 48x16 threads/blocks configuration (see the config sketch below), I get:

docker run -it \
-v /usr/lib/nvidia-381/bin:/usr/local/nvidia/bin \
-v /usr/lib/nvidia-381:/usr/lib/nvidia \
-v /usr/lib/x86_64-linux-gnu:/usr/lib/cuda \
-e LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib/nvidia:/usr/lib/cuda" \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
-v $PWD/config.nvidia:/config/config.txt \
samnco/xmrminer:0.1.5-nvidia \
/usr/bin/xmr-stak-nvidia
-------------------------------------------------------------------
XMR-Stak-NVIDIA mining software, NVIDIA Version.
NVIDIA mining code was written by KlausT and psychocrypt.
Brought to you by fireice_uk under GPLv3.
Configurable dev donation level is set to 1.0 %
You can use following keys to display reports:
'h' - hashrate
'r' - results
'c' - connection
-------------------------------------------------------------------
[2017-08-04 08:46:11] : Connecting to pool xmr-eu.dwarfpool.com:8050 ...
[2017-08-04 08:46:11] : Connected. Logging in...
[2017-08-04 08:46:11] : Difficulty changed. Now: 50000.
[2017-08-04 08:46:11] : New block detected.
HASHRATE REPORT
| ID | 10s | 60s | 15m |
| 0 | 117.4 | (na) | (na) |
---------------------------
Totals: 117.4 (na) (na) H/s
Highest: 117.4 H/s

These settings consume about 75% of the card's memory and 100% of its compute capacity. I spent only a short time tweaking this, but do not expect to get much more than 200 H/s out of this card. Not too bad, but not great either.
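
For reference, a config.nvidia along these lines is what the container expects at /config/config.txt (trimmed to the relevant fields, the rest of the stock xmr-stak-nvidia config.txt left at its defaults, and the wallet address being yours):

$ cat config.nvidia
"pool_address" : "xmr-eu.dwarfpool.com:8050",
"wallet_address" : "<your XMR address>",
"pool_password" : "x",
"gpu_threads_conf" :
[
    { "index" : 0, "threads" : 96, "blocks" : 8, "bfactor" : 0, "bsleep" : 0, "affine_to_cpu" : false },
],

The 96x8 and 48x16 mentioned above are simply different threads/blocks combinations in gpu_threads_conf.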

TensorFlow

For TensorFlow, we will just run the default examples:

docker run -it -p 8888:8888 \
-v /usr/lib/nvidia-381/bin:/usr/local/nvidia/bin \
-v /usr/lib/nvidia-381:/usr/lib/nvidia \
-v /usr/lib/x86_64-linux-gnu:/usr/lib/cuda \
-e LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib/nvidia:/usr/lib/cuda" \
-e PASSWORD=admin \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm \
tensorflow/tensorflow:latest-gpu

Then we connect to the Jupyter UI on port 8888 (password "admin", as set above) and run the Hello World notebook:

[Screenshot: TensorFlow Hello World notebook]

Here we use the Kernel menu to do a "restart and run all cells", so that Jupyter plays the whole notebook. The result is that it consumes just about 1800MB of RAM on the card:

[Screenshot: memory consumption from TensorFlow]
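
If you prefer following the memory usage from a terminal rather than a screenshot, nvidia-smi can poll it for you:

$ nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1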

So now, if we open another notebook and do the same, we will end up with two notebooks loaded at the same time, which results in a failure:

[Screenshot: second notebook coming in]

From the TensorFlow logs, when we start the second notebook:

2017-08-04 09:04:59.467813: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GT 1030
major: 6 minor: 1 memoryClockRate (GHz) 1.468
pciBusID 0000:03:00.0
Total memory: 1.95GiB
Free memory: 138.06MiB
2017-08-04 09:04:59.467856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-08-04 09:04:59.467866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-08-04 09:04:59.467885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 1030, pci bus id: 0000:03:00.0)
2017-08-04 09:04:59.468853: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 138.06M (144769024 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-08-04 09:04:59.783213: E tensorflow/stream_executor/cuda/cuda_blas.cc:365] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-08-04 09:04:59.783283: W tensorflow/stream_executor/stream.cc:1601] attempting to perform BLAS operation using StreamExecutor without BLAS support
[I 09:06:31.837 NotebookApp] Saving file at /2_getting_started.ipynb

Totally uncool.
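
This is largely TensorFlow's default allocator at work: the first session grabs nearly all of the GPU memory, leaving only the ~138MiB you can see in the logs for anybody else. If you really want two notebooks to share such a small card, each of them has to create its session with allow_growth. A minimal sketch of the idea, run against the container from the host (TF 1.x API; <container> is a placeholder for the name or ID of the running TensorFlow container):

$ docker ps   # grab the TensorFlow container ID to use as <container> below
$ docker exec -i <container> python - <<'EOF'
import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing almost all of it up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant('Hello, TensorFlow!')))
EOF

Even then, two notebooks would be fighting over less than 2GB, so do not expect miracles.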

Conclusion

You can’t do much more than watching movies with this board unfortunately. If you want to do some deep learning but have little money to spend, then even this is not the best option.

You can find the GTX 1050 Ti, which doubles pretty much all the specs for double the money, at around $150. Not too expensive, considering that 2GB of VRAM really seems to be the bare minimum today.

If you can spend up to $300, then the latest GTX 1060 6GB, with its extended memory bandwidth, will be your best option. It's what I used for recent posts.

That being said, the GT 1030 still has some appeal if:

  • You require passive cooling
  • You need a single-slot card in high-density environments

Any questions, let me know :)