Nvidia-smi not recognizing Titan V

Hello,

I recently replaced a Titan X board with a Titan V board in a computer running Ubuntu 16.04. Upon installing the latest CUDA Toolkit v9.1 with display driver 387.26, nvidia-smi returns “No devices were found”. (CUDA 9.0, which was installed on the machine with the Titan X and worked, gave the same result once I installed the Titan V.)

In case it’s relevant, the machine has an AST2400 BMC on it, and the primary display is set up to go out the VGA port on the BMC and not through the Nvidia GPU. The GPU is for compute only.

I found another thread from some time ago describing a similar situation, and the resolution there was a driver update. ("RmInitAdapter failed" with 370.23 but 367.35 works fine - Linux - NVIDIA Developer Forums)

Any ideas on how to proceed?

Thanks,
Aaron

Relevant output from dmesg includes:
[ 6.755454] nvidia: module license 'NVIDIA' taints kernel.
[ 6.755455] Disabling lock debugging due to kernel taint
[ 6.761032] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 6.765634] ipmi_si IPI0001:00: Found new BMC (man_id: 0x000000, prod_id: 0xaabb, dev_id: 0x20)
[ 6.766213] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[ 6.766381] nvidia 0000:04:00.0: enabling device (0100 -> 0103)
[ 6.766448] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=none:owns=none
[ 6.766508] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017 (using threaded interrupts)
[ 7.175641] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input2
[ 7.175685] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input3
[ 7.175735] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input4
[ 7.175769] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input5
[ 8.229133] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 242
[ 8.653254] NVRM: RmInitAdapter failed! (0x30:0x56:685)
[ 8.653280] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 17.811810] NVRM: RmInitAdapter failed! (0x30:0x56:685)
[ 17.811839] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 252.420234] NVRM: RmInitAdapter failed! (0x30:0x56:685)
[ 252.420254] NVRM: rm_init_adapter failed for device bearing minor number 0

Also relevant:
cat /proc/driver/nvidia/gpus/0000:02:00.0/information
Model: Graphics Device
IRQ: 57
GPU UUID: GPU-???-???-???-???-???
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:02:00.0
Device Minor: 0

Also:
uname -r
4.4.0-98-generic

Also:
sudo dmidecode

# dmidecode 3.0

Getting SMBIOS data from sysfs.
SMBIOS 3.0 present.
36 structures occupying 2136 bytes.
Table at 0x000ED9B0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: P2.10
Release Date: 06/17/2016
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 8192 kB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.11

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: To Be Filled By O.E.M.
Product Name: To Be Filled By O.E.M.
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
UUID: 00000000-0000-0000-0000-D05099C16889
Wake-up Type: Power Switch
SKU Number: To Be Filled By O.E.M.
Family: To Be Filled By O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASRockRack
Product Name: EPC612D8
Version:
Serial Number:
Asset Tag:
Features:
Board is a hosting board
Board is replaceable
Location In Chassis:
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

Handle 0x0003, DMI type 3, 22 bytes
Chassis Information
Manufacturer: To Be Filled By O.E.M.
Type: Desktop
Lock: Not Present
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
Asset Tag: To Be Filled By O.E.M.
Boot-up State: Safe
Power Supply State: Safe
Thermal State: Safe
Security Status: None
OEM Information: 0x00000000
Height: Unspecified
Number Of Power Cords: 1
Contained Elements: 0
SKU Number: To Be Filled By O.E.M.

Handle 0x0004, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE1
Type: x8 PCI Express
Current Usage: In Use
Length: Long
ID: 17
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:04:1f.7

Handle 0x0005, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE3
Type: x16 PCI Express
Current Usage: Available
Length: Long
ID: 19
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:03:1f.7

Handle 0x0006, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE5
Type: x8 PCI Express
Current Usage: Available
Length: Long
ID: 21
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported

Handle 0x0007, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE6
Type: x8 PCI Express
Current Usage: Available
Length: Long
ID: 22
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:01:1f.7

Handle 0x0008, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE7
Type: x16 PCI Express
Current Usage: In Use
Length: Long
ID: 23
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:02:1f.7

Handle 0x0009, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE8
Type: x4 PCI Express
Current Usage: Available
Length: Long
ID: 33
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported

Handle 0x000A, DMI type 11, 5 bytes
OEM Strings
String 1: To Be Filled By O.E.M.

Handle 0x000B, DMI type 32, 20 bytes
System Boot Information
Status: No errors detected

Handle 0x000C, DMI type 15, 73 bytes
System Event Log
Area Length: 65535 bytes
Header Start Offset: 0x0000
Header Length: 16 bytes
Data Start Offset: 0x0010
Access Method: Memory-mapped physical 32-bit address
Access Address: 0xFF850000
Status: Valid, Not Full
Change Token: 0x00000203
Header Format: Type 1
Supported Log Type Descriptors: 25
Descriptor 1: Single-bit ECC memory error
Data Format 1: Multiple-event handle
Descriptor 2: Multi-bit ECC memory error
Data Format 2: Multiple-event handle
Descriptor 3: Parity memory error
Data Format 3: None
Descriptor 4: Bus timeout
Data Format 4: None
Descriptor 5: I/O channel block
Data Format 5: None
Descriptor 6: Software NMI
Data Format 6: None
Descriptor 7: POST memory resize
Data Format 7: None
Descriptor 8: POST error
Data Format 8: POST results bitmap
Descriptor 9: PCI parity error
Data Format 9: Multiple-event handle
Descriptor 10: PCI system error
Data Format 10: Multiple-event handle
Descriptor 11: CPU failure
Data Format 11: None
Descriptor 12: EISA failsafe timer timeout
Data Format 12: None
Descriptor 13: Correctable memory log disabled
Data Format 13: None
Descriptor 14: Logging disabled
Data Format 14: None
Descriptor 15: System limit exceeded
Data Format 15: None
Descriptor 16: Asynchronous hardware timer expired
Data Format 16: None
Descriptor 17: System configuration information
Data Format 17: None
Descriptor 18: Hard disk information
Data Format 18: None
Descriptor 19: System reconfigured
Data Format 19: None
Descriptor 20: Uncorrectable CPU-complex error
Data Format 20: None
Descriptor 21: Log area reset/cleared
Data Format 21: None
Descriptor 22: System boot
Data Format 22: None
Descriptor 23: End of log
Data Format 23: None
Descriptor 24: OEM-specific
Data Format 24: OEM-specific
Descriptor 25: OEM-specific
Data Format 25: OEM-specific

Handle 0x000D, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 256 GB
Error Information Handle: Not Provided
Number Of Devices: 4

Handle 0x000E, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Array Handle: 0x000D
Partition Width: 2

Handle 0x000F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_A1
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: EE0A7016
Asset Tag: DIMM_A1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0010, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x007FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x000F
Memory Array Mapped Address Handle: 0x000E
Partition Row Position: 1

Handle 0x0011, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_A2
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0012, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_B1
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: EF087482
Asset Tag: DIMM_B1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0013, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00800000000
Ending Address: 0x00FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0012
Memory Array Mapped Address Handle: 0x000E
Partition Row Position: 1

Handle 0x0014, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_B2
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0015, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 256 GB
Error Information Handle: Not Provided
Number Of Devices: 4

Handle 0x0016, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x01000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 64 GB
Physical Array Handle: 0x0015
Partition Width: 2

Handle 0x0017, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_C1
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: EB084E82
Asset Tag: DIMM_C1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0018, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01000000000
Ending Address: 0x017FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0017
Memory Array Mapped Address Handle: 0x0016
Partition Row Position: 1

Handle 0x0019, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_C2
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x001A, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_D1
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: E819480C
Asset Tag: DIMM_D1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x001B, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01800000000
Ending Address: 0x01FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x001A
Memory Array Mapped Address Handle: 0x0016
Partition Row Position: 1

Handle 0x001C, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_D2
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x001D, DMI type 7, 19 bytes
Cache Information
Socket Designation: CPU Internal L1
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Location: Internal
Installed Size: 896 kB
Maximum Size: 896 kB
Supported SRAM Types:
Unknown
Installed SRAM Type: Unknown
Speed: Unknown
Error Correction Type: Parity
System Type: Other
Associativity: 8-way Set-associative

Handle 0x001E, DMI type 7, 19 bytes
Cache Information
Socket Designation: CPU Internal L2
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Back
Location: Internal
Installed Size: 3584 kB
Maximum Size: 3584 kB
Supported SRAM Types:
Unknown
Installed SRAM Type: Unknown
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 8-way Set-associative

Handle 0x001F, DMI type 7, 19 bytes
Cache Information
Socket Designation: CPU Internal L3
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Write Back
Location: Internal
Installed Size: 35840 kB
Maximum Size: 35840 kB
Supported SRAM Types:
Unknown
Installed SRAM Type: Unknown
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 20-way Set-associative

Handle 0x0020, DMI type 4, 42 bytes
Processor Information
Socket Designation: CPUSocket
Type: Central Processor
Family: Xeon
Manufacturer: Intel
ID: F1 06 04 00 FF FB EB BF
Signature: Type 0, Family 6, Model 79, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
DS (Debug store)
ACPI (ACPI supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
SS (Self-snoop)
HTT (Multi-threading)
TM (Thermal monitor supported)
PBE (Pending break enabled)
Version: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Voltage: 0.0 V
External Clock: 100 MHz
Max Speed: 4000 MHz
Current Speed: 2400 MHz
Status: Populated, Enabled
Upgrade: Socket LGA2011-3
L1 Cache Handle: 0x001D
L2 Cache Handle: 0x001E
L3 Cache Handle: 0x001F
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 14
Core Enabled: 14
Thread Count: 28
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Power/Performance Control

Handle 0x0021, DMI type 130, 20 bytes
OEM-specific Type
Header and Data:
82 14 21 00 24 41 4D 54 01 01 01 01 01 A5 2F 02
00 00 00 00

Handle 0x0022, DMI type 131, 64 bytes
OEM-specific Type
Header and Data:
83 40 22 00 35 00 00 00 09 00 00 00 00 00 1D 00
F8 00 44 8D 00 00 00 00 09 80 00 00 01 00 09 00
EA 03 25 00 00 00 00 00 C8 00 3A 15 00 00 00 00
00 00 00 00 22 00 00 00 76 50 72 6F 00 00 00 00

Handle 0x0023, DMI type 127, 4 bytes
End Of Table

Additional information:

I moved the Titan V board to a different system. The first system has a Xeon E5-2680 v4 CPU, while the second box is a Threadripper 1950X, running CUDA 9.0 on Ubuntu 16.04 with the same kernel version. Same issues as described above.

Thanks for your help.

Best,
Aaron

Upgrade to the 387.34 driver, which fully supports the Titan V.

NVRM: RmInitAdapter failed! (0x30:0x56:685)
Doesn’t look good. Seeing it across different systems would point to the Titan being broken; check the power connectors, try the latest driver, then try to RMA.
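If you go the runfile route, the usual sequence on 16.04 is roughly the following (the filename and the lightdm stop are what I'd expect for the 387.34 download on a default install, so treat them as assumptions to adjust; since your display runs off the BMC you may not even need to stop a display manager):

# stop the display manager so any old kernel module can be unloaded
sudo service lightdm stop
# run the installer downloaded from nvidia.com; --dkms rebuilds the module on kernel updates
chmod +x NVIDIA-Linux-x86_64-387.34.run
sudo ./NVIDIA-Linux-x86_64-387.34.run --dkms
# check that the adapter now initializes
dmesg | grep -i nvrm
nvidia-smi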

Hello again,

Thank you!

Downloading the latest driver fixes this. I did not realize that the CUDA Toolkit v9.1 does not include the latest driver.

Using Ubuntu 16.04.3 and driver 387.34, nvidia-smi still prints “Graphics Device” instead of Titan V. Is that normal?

Here’s the output I get from nvidia-smi, which also has “Graphics Device.” The 100% GPU use is because I’m running a simulation.

(CUDA programs seem to run fine, so I’m not too worried.)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 00000000:41:00.0 Off |                  N/A |
| 52%   72C    P2   143W / 250W |   2755MiB / 12057MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
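In case it helps anyone compare, the same fields can be pulled through nvidia-smi's query interface; the fields below are standard query-gpu fields, but double-check them with nvidia-smi --help-query-gpu on your driver version:

nvidia-smi --query-gpu=name,driver_version,vbios_version,pci.bus_id --format=csv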

I am having a similar issue. I am using the newest driver (387.34), I have CUDA 9 installed, and I am running TensorFlow via an nvidia-docker container, all on an Ubuntu 16.04 machine. Just like marton, when I run nvidia-smi, “Graphics Device” is printed instead of “Titan V”. The card seems to be working but isn’t as fast as the V100s I use at work. I’m not sure whether that is because the V100s are simply faster or because the Titan V isn’t utilizing its tensor cores. Is there a way to determine if the card is using its tensor cores?
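One thing I’m planning to try (no idea yet whether it’s conclusive): run a single training step under nvprof and look at which GEMM/convolution kernels actually execute. My understanding - treat it as an assumption - is that the Volta tensor-core cuBLAS/cuDNN kernels have “884” in their names (e.g. h884gemm), so something like this should hint at whether they are being picked:

# profile one short run and dump the per-kernel summary (train.py is just a placeholder for your own script)
nvprof --print-gpu-summary python train.py 2>&1 | grep -i 884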

I’m having the same issue with 387.34 + CUDA 9.1 + nvidia-smi on a fresh Ubuntu 16.04 LTS. The 1080 Ti in my system is reported correctly, but the Titan V is reported as “Graphics Device” as others have noted above.

Running https://github.com/salesforce/awd-lstm-lm, a language modeling codebase I made, the 1080 Ti gets ~26 secs / epoch (similar to the P100) and the Titan V hits ~20 secs / epoch, so it certainly works even if it’s not properly recognized. Note that the codebase runs on PyTorch and is not optimized for the Titan V. Off topic but interesting: the Titan V is pulling about 160 watts vs the 1080 Ti’s 230 watts - +1 for power efficiency :)

My cuda-drivers package is 387.26-1 from the repository, but I realized it’s a pseudo-package whose version only states the minimum version of the packages it depends on, and those are all actually at 387.34. I’ll keep investigating.
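For reference, these are the checks I’m using to see what is actually installed versus what the meta-package claims (package names will differ if you installed from the runfile or the graphics-drivers PPA):

# version the loaded kernel module reports
cat /proc/driver/nvidia/version
# what the repo meta-package resolves to
apt-cache policy cuda-drivers
# every nvidia-related package actually installed
dpkg -l | grep -i nvidia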

Using the latest driver from the website instead of the repo, nvidia-smi still prints the generic label.

Using the latest Windows drivers, the cards show up as Titan V.

Getting the same here - updated to the latest driver (387.34) - but my shiny new Titan V is only recognised as a lowly ‘Graphics Device’. Where’s the respect? nvidia-smi output:

Fri Dec 29 10:15:42 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 00000000:01:00.0  On |                  N/A |
| 28%   37C    P2    26W / 250W |    327MiB / 12057MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

on a fully updated Ubuntu 16.04 LTS install.

If anyone at Nvidia is reading this, I’d like to re-emphasize a point made above: It would be helpful if there were an easy way to confirm that the tensor cores are actually being used. This functionality could be added to, for example, the nvvp program.

(I was using the tensor cores with Theano and ended up getting much less of a performance increase than I expected. In my use case, it turns out that float16 storage with float32 compute actually runs faster than float16 storage with the tensor cores. I believe the tensor core routines in cuDNN show up in nvvp as kernels ending with _tn_v1, and intentionally disabling the tensor cores slowed my simulations down further, so I’m fairly sure my convolutions were in fact being performed on the tensor cores. When running float32 compute, I could use an FFT convolution algorithm, resulting in a substantial speedup.)
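In the meantime, the closest thing I’ve found to a direct check is the profiler’s metric list: on Volta I believe nvprof exposes a tensor-core utilization metric, though the exact metric name below is an assumption - verify it against whatever --query-metrics prints on your system:

# list the metrics nvprof knows about for this GPU and look for tensor-core ones
nvprof --query-metrics | grep -i tensor
# profile a representative run with that metric (my_sim.py is a placeholder for your own workload)
nvprof --metrics tensor_precision_fu_utilization python my_sim.py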

I second agreenblatt’s request - I would also love a tool to help monitor the utilization of the tensor cores. I’ve just been rewriting a couple of networks to cast to FP16, but I need some concrete tensor core diagnostics; otherwise I feel like I am wandering in the dark.

I’ve had some very bad experiences with the 387.34 driver and Ubuntu 16.04. I did one install using all disk drives in Ubuntu. Had similar results as you guys had (though I’m largely using Matlab Parallel computing toolbox). But Matlab choked on the gpuDevice command. I found a workaround using Matlab system objects. I thought all was well, until I installed VMWare - which I need because I need a couple of Windows pieces of software. And VMWare croaked the Xorg server. Hung it completely. If I didn’t load the NVidia driver, VMWare worked. Okay - so, I ended up installing Windows, which works. Then I set aside a single drive for experimenting with Linux distros. And found that none of them worked once I installed the driver. They all had one piece of software or another that would cause Xorg to hang. In some cases, it was even Firefox causing the hang! Ugh.

This is normal for the first instantiation of a GPU under a MATLAB version whose PCT (Parallel Computing Toolbox) has no fat-binary support for the GPU’s compute capability. See this: gpuDevice command very slow - MATLAB Answers - MATLAB Central

The answer is talking about a different MATLAB version / GPU compute capability, but the same idea applies.

Same here: GPU naming (TITAN V) was working before with the older driver (384.111); now with 387.34 it still displays Graphics Device. Is this a symptom that something is not working properly?

384.111 is actually newer than 387.34, since 384 is a long-lived branch.

… so shall I downgrade the drivers to make it work properly? I don’t get it.

Buy a second Titan V, run some NN on each and let them talk it out… lol.
Joking aside, it’s like Ubuntu 16.04.3 (LTS) being newer than 16.10 even though 16.04 is the older release. So by going from 384.111 to 387.34 you actually downgraded; 384 is the long-lived (LTS-like) branch.
Apart from that, the ‘Graphics Device’ issue is purely cosmetic. They just forgot to fill in the proper name.

Yeah, I do have 2 of them! :)

OK, so I can keep it as it is. I didn’t know 384 was a long-lived branch; from the NVIDIA website (http://www.nvidia.com/download/driverResults.aspx/128010/en-us) I understood that I should upgrade the driver in order to get the Titan V supported.

Apart from the naming problem it seems to be working just fine so far (I have not done a full comparison with the Titan Xp yet, though).