Driver for GTX 1080 Ti

I have purchased a 1080 Ti which I intend to use for deep learning. The problem is that none of the frameworks I use (Theano, TensorFlow, etc.) currently detects it. I am using CUDA 8.0 with cuDNN 5.1.

nvidia-smi lists it as a 'Graphics Device' (using driver version 378.13).

I want to start testing my models using this card. Any pointers are appreciated.

What does the deviceQuery output for it look like?

Here’s the output of deviceQuery:

watts@Magnus:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Graphics Device"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 11169 MBytes (11711938560 bytes)
(28) Multiprocessors, (128) CUDA Cores/MP: 3584 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5505 Mhz
Memory Bus Width: 352-bit
L2 Cache Size: 2883584 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Graphics Device
Result = PASS
watts@Magnus:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$

It appears to be working correctly.

As far as I know, TensorFlow is not keyed to individual GPUs. Whatever problem you’re having with TensorFlow may have nothing to do with the specific GPU you have. Current versions of TensorFlow should work with any GPU of compute capability (cc) 3.0 or greater that has an appropriate CUDA 8 install.
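
If you want to check that requirement programmatically rather than reading it off the deviceQuery listing above, here is a minimal sketch using the CUDA runtime API (this is only an illustration of the compute-capability check, not how TensorFlow itself probes devices):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // TensorFlow builds of this era require compute capability (cc) >= 3.0
        bool ok = (prop.major > 3) || (prop.major == 3 && prop.minor >= 0);
        printf("Device %d: %s, cc %d.%d -> %s\n",
               dev, prop.name, prop.major, prop.minor, ok ? "usable" : "too old");
    }
    return 0;
}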

How do I know if CUDA compute is enabled for this device?

It’s enabled for every GPU. In particular this output:

Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

The Default compute mode means the GPU is capable of running compute tasks.

And the fact that deviceQuery (and probably any other sample code) runs correctly on it means that it is certainly capable of running compute tasks.
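
For completeness, the compute mode can also be queried directly from code; a minimal sketch (nothing GPU-specific here, just the standard runtime attribute query):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int mode = 0;
    // This is the same value deviceQuery prints under "Compute Mode" for device 0
    cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    if (mode == cudaComputeModeDefault)
        printf("Default: multiple host threads can use the device\n");
    else if (mode == cudaComputeModeProhibited)
        printf("Prohibited: no compute contexts can be created\n");
    else
        printf("Exclusive mode (%d): one thread/process at a time\n", mode);
    return 0;
}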

OP,

Like txbob said, if the samples run then you should be able to use CUDA with the GTX 1080ti.

I would be interested to see how the GTX 1080 Ti’s memory bandwidth compares to the Pascal Titan X’s.

You could run the ‘bandwidthTest’ application from the CUDA samples. While that test’s numbers tend to be lower than those from an optimized vectorized kernel, they still give a good general idea of the global memory bandwidth.
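
For anyone curious what that test is measuring, here is a stripped-down sketch of the device-to-device portion (this is not the actual sample code; the 32 MB transfer size just mirrors the quick-mode run shown below, and the repetition count is arbitrary):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 32 << 20;        // 32 MB, the quick-mode transfer size shown below
    void *src = 0, *dst = 0;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int reps = 100;
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each copy reads and writes 'bytes', hence the factor of 2
    double gbs = 2.0 * bytes * reps / (ms * 1e6);
    printf("Device-to-device bandwidth: %.1f GB/s\n", gbs);

    cudaFree(src);
    cudaFree(dst);
    return 0;
}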

For example, here is a Maxwell Titan X running bandwidthTest from the CUDA 8.0 samples:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GTX TITAN X
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     11884.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     12674.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     267232.8

Result = PASS

So that shows 267 GB/s, but with more optimized reduction code that number for the Maxwell Titan X gets up to 306 GB/s.

If I remember correctly, the Pascal Titan X maxed out at 389 GB/s for the same optimized reduction code, but I cannot remember the results of bandwidthTest from the CUDA 8.0 samples. There does seem to be an issue with Pascal GDDR5X memory bandwidth, and I am curious whether that issue is still present for the GTX 1080 Ti.
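
For anyone who wants to try something similar, here is a minimal sketch of the kind of read-heavy, grid-stride reduction being compared against bandwidthTest (this is a generic illustration, not the optimized code referred to above; the launch configuration and input size are arbitrary choices):

#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride sum reduction: almost all traffic is reads, so the achieved
// bandwidth comes closer to the hardware limit than a copy test does.
__global__ void reduceSum(const float *in, float *out, size_t n) {
    float sum = 0.0f;
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (size_t)gridDim.x * blockDim.x)
        sum += in[i];
    // Warp-level reduction (CUDA 8-era intrinsic; newer toolkits use __shfl_down_sync)
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down(sum, offset);
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, sum);
}

int main() {
    const size_t n = 1 << 26;                  // large power-of-two input
    float *in, *out;
    cudaMalloc((void**)&in, n * sizeof(float));
    cudaMalloc((void**)&out, sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    reduceSum<<<1024, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Effective read bandwidth: %.1f GB/s\n", n * sizeof(float) / (ms * 1e6));
    return 0;
}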

When I ran bandwidthTest on the Pascal Titan Xs that I used to have access to, I was getting numbers right around 350 GB/s. I didn’t fiddle with extreme clock boosting efforts or anything like that, just an out-of-the-box run.

Did you succeed in running a TF script or benchmark on 1080 Ti?

I did not run a TF script or benchmark on the 1080 Ti. I was able to run some basic programs after installing tensorflow-gpu (as opposed to just tensorflow). I am getting ready to finally run some convnets. All good so far.

Thanks for the reply. Please let us know if you encounter issues.

Yes, you need to use tensorflow-gpu if you want to take advantage of GPUs.

I had the same problem when using a GTX 1080 Ti with TensorFlow. The default driver bundled with CUDA 8.0 is 375.26, which does not support the GTX 1080 Ti. So I took the following steps to solve the problem. First, I installed NVIDIA driver 378.13, which supports the GTX 1080 Ti. Second, I installed CUDA 8.0 using the runfile (local) installer, but chose not to install the driver bundled with CUDA 8.0. Then I installed cuDNN and TensorFlow. The 1080 Ti runs faster than the 1080 on big models, but on small models their speed is almost the same.
It’s very weird that ‘nvidia-smi’ lists the GTX 1080 Ti as ‘Graphics Device’, while ‘lspci | grep -i nvidia’ identifies it correctly as a GTX 1080 Ti.
Anyway, the card works.
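
If anyone wants to confirm from code that the driver and runtime line up after that kind of reinstall (a driver that is too old for the runtime usually fails much earlier with an 'insufficient driver' error), here is a minimal sketch using the standard version queries:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    // Both report a CUDA version encoded as 1000*major + 10*minor (8.0 -> 8000)
    cudaDriverGetVersion(&driverVersion);     // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion);   // CUDA version of the runtime library in use
    printf("CUDA driver version:  %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
    printf("CUDA runtime version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    return 0;
}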

FWIW, I installed a GTX 1080 Ti and ran the bandwidth test. Here are the results:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Graphics Device
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12505.4

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12840.6

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			346078.1

Result = PASS

gevermann, thanks for that! The ~338 GB/s your test reports means the GTX 1080 Ti’s bandwidth (in CUDA, at least) is only about 70% of the theoretical 484 GB/s (11 Gbps effective memory clock × 352-bit bus ÷ 8). This roughly matches Cudaaduc’s tests with the Titan X Pascal.

I came to this thread because I was having similar odd results, with deviceQuery calling my 1080 Ti card “Graphics Device”. Huge thanks to Chong666 for the note about installing NVIDIA driver 378.13 and then choosing NOT to install the bundled driver during the CUDA 8.0 install.

FWIW, here are my GTX 1080 Ti results - very similar to gevermann’s.

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Graphics Device
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12816.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12862.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			345588.1

Result = PASS

Honestly, bandwidthTest has always been pretty far from optimal in my experience…

Yes, your reduction code from 3 years ago is still a better test than the sample in the CUDA SDK. Your implementation gets over 90% of the theoretical maximum for Maxwell using a large power-of-two input array for the reduction (2^26, for example). The best number I got out of the Pascal Titan X was about 80%.

Looks like the GTX 1080 Ti has the same GDDR5X bandwidth issue, not that I should have expected anything different.
I would really like to see a bandwidth test from a Tesla P100 or the GP100 with HBM memory.

If you are referring to bandwidthTest, it wouldn’t make you feel any better. It reports about 450 GB/s against a theoretical 732 GB/s (assuming you are not running shmoo mode).

https://devtalk.nvidia.com/default/topic/979182/cuda-setup-and-installation/ibm-power8-cuda-driver-version-is-insufficient-for-cuda-runtime-version/post/5028506/#5028506

As Jimmy Petterson’s code shows, there are ways to exceed this number. I haven’t ever witnessed anything above 550 GB/s, though.

I don’t consider bandwidthTest to be a perfect measurement (obviously), but for me it serves two useful purposes:

  1. Use it as a relative yardstick. If you use the same yardstick, you can still tell which things are longer than others, even if the calibration on the yardstick is off.
  2. Use it as a conservative estimate of what is achievable, for tuning efforts. I would say, based on my experience, that the numbers reported by Jimmy’s code are only achievable with specific coding patterns. If your code indicates an achieved bandwidth that is comparable to bandwidthTest (a sketch of that calculation follows below), then you are in pretty good shape, IMO. I’m sure that ninjas/freakish optimizers will disagree. If your code has a built-in autotuner, then this line of thinking is clearly not for you.
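
To make the comparison in point 2 concrete, here is one way to compute an achieved-bandwidth number for your own kernel that you can hold up against the bandwidthTest yardstick (the timing wrapper, the example kernel, and the byte accounting are my own illustration, not anything from the samples; compile with nvcc -std=c++11 for the lambda):

#include <cstdio>
#include <cuda_runtime.h>

// Time an arbitrary kernel launch with CUDA events and report effective bandwidth.
// 'bytesMoved' must count every byte the kernel reads plus every byte it writes.
template <typename Launch>
double effectiveBandwidthGBs(Launch launchKernel, size_t bytesMoved) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launchKernel();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return bytesMoved / (ms * 1e6);            // bytes / (ms * 1e6) = GB/s
}

// Example kernel: 1 read + 1 write per element
__global__ void scaleInPlace(float *a, float q, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] *= q;
}

int main() {
    const size_t n = 1 << 26;
    float *a;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    double gbs = effectiveBandwidthGBs(
        [&] { scaleInPlace<<<(unsigned)((n + 255) / 256), 256>>>(a, 2.0f, n); },
        2 * n * sizeof(float));                // reads + writes
    printf("Achieved bandwidth: %.1f GB/s\n", gbs);
    cudaFree(a);
    return 0;
}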

Unless I am missing something, there is a specific difference between bandwidthTest and “competing” tests: the read-to-write ratio. The well-known STREAM benchmark partially addresses this by offering two families of tests: COPY and SCALE, which use equal amounts of read and write bandwidth, and SUM and TRIAD, whose read bandwidth is twice the write bandwidth:

COPY:       a(i) = b(i)
SCALE:      a(i) = q*b(i)
SUM:        a(i) = b(i) + c(i)
TRIAD:      a(i) = b(i) + q*c(i)

As I recall, when just one STREAM result is cited, it is the TRIAD number. bandwidthTest, on the other hand, basically implements COPY.
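
For reference, those kernels are one-liners in CUDA; the only thing that changes between them is how many bytes you count per element when converting elapsed time into bandwidth (a generic sketch, not the actual STREAM source):

#include <cstddef>

// COPY (what bandwidthTest essentially measures): 1 read + 1 write per element
__global__ void streamCopy(double *a, const double *b, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i];
}

// TRIAD (the usually-cited STREAM number): 2 reads + 1 write per element
__global__ void streamTriad(double *a, const double *b, const double *c, double q, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + q * c[i];
}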

Benchmarks using an even higher read-to-write ratio than TRIAD, such as reductions, should achieve higher sustained bandwidth, for example by reducing the time lost to read/write turnaround when accessing DRAM.

I don’t know whether bandwidthTest uses a best-of-ten-runs filter as STREAM does; that could also make a difference, as performance measurements for large DRAM systems are notoriously noisy.