nvidia-smi not recognizing Titan V
Hello,

I recently replaced a Titan X board with a Titan V board in a computer running Ubuntu 16.04. After installing the latest CUDA Toolkit (v9.1) with display driver 387.26, nvidia-smi reports "No devices were found". (CUDA 9.0, which worked on this machine with the Titan X, gave the same result once the Titan V was installed.)

In case it's relevant, the machine has an AST2400 BMC on it, and the primary display is set up to go out the VGA port on the BMC and not through the Nvidia GPU. The GPU is for compute only.

I found another thread with a similar situation some time ago, and the resolution was a driver update. (https://devtalk.nvidia.com/default/topic/959156/linux/-quot-rminitadapter-failed-quot-with-370-23-but-367-35-works-fine/1)

Any ideas on how to proceed?

Thanks,
Aaron

Relevant output from dmesg includes:
[ 6.755454] nvidia: module license 'NVIDIA' taints kernel.
[ 6.755455] Disabling lock debugging due to kernel taint
[ 6.761032] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 6.765634] ipmi_si IPI0001:00: Found new BMC (man_id: 0x000000, prod_id: 0xaabb, dev_id: 0x20)
[ 6.766213] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[ 6.766381] nvidia 0000:04:00.0: enabling device (0100 -> 0103)
[ 6.766448] vgaarb: device changed decodes: PCI:0000:04:00.0,olddecodes=io+mem,decodes=none:owns=none
[ 6.766508] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 387.26 Thu Nov 2 21:20:16 PDT 2017 (using threaded interrupts)
[ 7.175641] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input2
[ 7.175685] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input3
[ 7.175735] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input4
[ 7.175769] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:02.2/0000:04:00.1/sound/card0/input5
[ 8.229133] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 242
[ 8.653254] NVRM: RmInitAdapter failed! (0x30:0x56:685)
[ 8.653280] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 17.811810] NVRM: RmInitAdapter failed! (0x30:0x56:685)
[ 17.811839] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 252.420234] NVRM: RmInitAdapter failed! (0x30:0x56:685)
[ 252.420254] NVRM: rm_init_adapter failed for device bearing minor number 0
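
For anyone who wants to pull just the failures out of a longer dmesg capture, here is a minimal Python sketch. It assumes the log lines have the same shape as the ones pasted above. The function name find_rminit_failures and the SAMPLE text are made up for illustration and are not part of any NVIDIA tool.

```python
import re

# Matches lines shaped like the ones above, e.g.
# [ 8.653254] NVRM: RmInitAdapter failed! (0x30:0x56:685)
FAILURE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def find_rminit_failures(dmesg_text):
    """Return (timestamp_seconds, error_code) for each RmInitAdapter failure."""
    return [
        (float(m.group("ts")), m.group("code"))
        for m in FAILURE_RE.finditer(dmesg_text)
    ]


# Hypothetical sample, copied from the excerpt above.
SAMPLE = """\
[ 8.229133] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 242
[ 8.653254] NVRM: RmInitAdapter failed! (0x30:0x56:685)
[ 8.653280] NVRM: rm_init_adapter failed for device bearing minor number 0
[ 17.811810] NVRM: RmInitAdapter failed! (0x30:0x56:685)
"""

if __name__ == "__main__":
    for ts, code in find_rminit_failures(SAMPLE):
        print(f"{ts:.6f}s  code={code}")
```

The same code appearing on every attempt, as it does here, suggests the failure is consistent rather than intermittent.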


Also relevant:
cat /proc/driver/nvidia/gpus/0000\:02\:00.0/information
Model: Graphics Device
IRQ: 57
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:02:00.0
Device Minor: 0
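
A small Python sketch for spotting the fields that still hold "?" placeholders in that file. It assumes the "Key: value" layout shown above. Reading "?" as "the driver could not fill this field in" is my assumption, not something the driver documents here. The helper names and SAMPLE text are hypothetical.

```python
def parse_gpu_information(text):
    """Parse 'Key: value' lines into a dict, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def unpopulated_fields(fields):
    """Return the sorted names of fields whose value contains '?' placeholders."""
    return sorted(k for k, v in fields.items() if "?" in v)


# Hypothetical sample, copied from the output above.
SAMPLE = """\
Model: Graphics Device
IRQ: 57
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
Bus Location: 0000:02:00.0
Device Minor: 0
"""

if __name__ == "__main__":
    info = parse_gpu_information(SAMPLE)
    print("Unpopulated:", ", ".join(unpopulated_fields(info)))
```

On a real system the text would come from reading the information file under /proc/driver/nvidia/gpus/ instead of SAMPLE.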

Also:
uname -r
4.4.0-98-generic


Also:
sudo dmidecode
[sudo] password for agreenblatt:
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0 present.
36 structures occupying 2136 bytes.
Table at 0x000ED9B0.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
Vendor: American Megatrends Inc.
Version: P2.10
Release Date: 06/17/2016
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 8192 kB
Characteristics:
PCI is supported
BIOS is upgradeable
BIOS shadowing is allowed
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
5.25"/1.2 MB floppy services are supported (int 13h)
3.5"/720 kB floppy services are supported (int 13h)
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
Printer services are supported (int 17h)
ACPI is supported
USB legacy is supported
BIOS boot specification is supported
Targeted content distribution is supported
UEFI is supported
BIOS Revision: 5.11

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: To Be Filled By O.E.M.
Product Name: To Be Filled By O.E.M.
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
UUID: 00000000-0000-0000-0000-D05099C16889
Wake-up Type: Power Switch
SKU Number: To Be Filled By O.E.M.
Family: To Be Filled By O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: ASRockRack
Product Name: EPC612D8
Version:
Serial Number:
Asset Tag:
Features:
Board is a hosting board
Board is replaceable
Location In Chassis:
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

Handle 0x0003, DMI type 3, 22 bytes
Chassis Information
Manufacturer: To Be Filled By O.E.M.
Type: Desktop
Lock: Not Present
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
Asset Tag: To Be Filled By O.E.M.
Boot-up State: Safe
Power Supply State: Safe
Thermal State: Safe
Security Status: None
OEM Information: 0x00000000
Height: Unspecified
Number Of Power Cords: 1
Contained Elements: 0
SKU Number: To Be Filled By O.E.M.

Handle 0x0004, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE1
Type: x8 PCI Express
Current Usage: In Use
Length: Long
ID: 17
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:04:1f.7

Handle 0x0005, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE3
Type: x16 PCI Express
Current Usage: Available
Length: Long
ID: 19
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:03:1f.7

Handle 0x0006, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE5
Type: x8 PCI Express
Current Usage: Available
Length: Long
ID: 21
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported

Handle 0x0007, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE6
Type: x8 PCI Express
Current Usage: Available
Length: Long
ID: 22
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:01:1f.7

Handle 0x0008, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE7
Type: x16 PCI Express
Current Usage: In Use
Length: Long
ID: 23
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: ffff:02:1f.7

Handle 0x0009, DMI type 9, 17 bytes
System Slot Information
Designation: PCIE8
Type: x4 PCI Express
Current Usage: Available
Length: Long
ID: 33
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported

Handle 0x000A, DMI type 11, 5 bytes
OEM Strings
String 1: To Be Filled By O.E.M.

Handle 0x000B, DMI type 32, 20 bytes
System Boot Information
Status: No errors detected

Handle 0x000C, DMI type 15, 73 bytes
System Event Log
Area Length: 65535 bytes
Header Start Offset: 0x0000
Header Length: 16 bytes
Data Start Offset: 0x0010
Access Method: Memory-mapped physical 32-bit address
Access Address: 0xFF850000
Status: Valid, Not Full
Change Token: 0x00000203
Header Format: Type 1
Supported Log Type Descriptors: 25
Descriptor 1: Single-bit ECC memory error
Data Format 1: Multiple-event handle
Descriptor 2: Multi-bit ECC memory error
Data Format 2: Multiple-event handle
Descriptor 3: Parity memory error
Data Format 3: None
Descriptor 4: Bus timeout
Data Format 4: None
Descriptor 5: I/O channel block
Data Format 5: None
Descriptor 6: Software NMI
Data Format 6: None
Descriptor 7: POST memory resize
Data Format 7: None
Descriptor 8: POST error
Data Format 8: POST results bitmap
Descriptor 9: PCI parity error
Data Format 9: Multiple-event handle
Descriptor 10: PCI system error
Data Format 10: Multiple-event handle
Descriptor 11: CPU failure
Data Format 11: None
Descriptor 12: EISA failsafe timer timeout
Data Format 12: None
Descriptor 13: Correctable memory log disabled
Data Format 13: None
Descriptor 14: Logging disabled
Data Format 14: None
Descriptor 15: System limit exceeded
Data Format 15: None
Descriptor 16: Asynchronous hardware timer expired
Data Format 16: None
Descriptor 17: System configuration information
Data Format 17: None
Descriptor 18: Hard disk information
Data Format 18: None
Descriptor 19: System reconfigured
Data Format 19: None
Descriptor 20: Uncorrectable CPU-complex error
Data Format 20: None
Descriptor 21: Log area reset/cleared
Data Format 21: None
Descriptor 22: System boot
Data Format 22: None
Descriptor 23: End of log
Data Format 23: None
Descriptor 24: OEM-specific
Data Format 24: OEM-specific
Descriptor 25: OEM-specific
Data Format 25: OEM-specific

Handle 0x000D, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 256 GB
Error Information Handle: Not Provided
Number Of Devices: 4

Handle 0x000E, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x00FFFFFFFFF
Range Size: 64 GB
Physical Array Handle: 0x000D
Partition Width: 2

Handle 0x000F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_A1
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: EE0A7016
Asset Tag: DIMM_A1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0010, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00000000000
Ending Address: 0x007FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x000F
Memory Array Mapped Address Handle: 0x000E
Partition Row Position: 1

Handle 0x0011, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_A2
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0012, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_B1
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: EF087482
Asset Tag: DIMM_B1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0013, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x00800000000
Ending Address: 0x00FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0012
Memory Array Mapped Address Handle: 0x000E
Partition Row Position: 1

Handle 0x0014, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x000D
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_B2
Bank Locator: NODE 1
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0015, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 256 GB
Error Information Handle: Not Provided
Number Of Devices: 4

Handle 0x0016, DMI type 19, 31 bytes
Memory Array Mapped Address
Starting Address: 0x01000000000
Ending Address: 0x01FFFFFFFFF
Range Size: 64 GB
Physical Array Handle: 0x0015
Partition Width: 2

Handle 0x0017, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_C1
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: EB084E82
Asset Tag: DIMM_C1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x0018, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01000000000
Ending Address: 0x017FFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x0017
Memory Array Mapped Address Handle: 0x0016
Partition Row Position: 1

Handle 0x0019, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_C2
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x001A, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 72 bits
Size: 32 GB
Form Factor: RIMM
Set: None
Locator: DIMM_D1
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Undefined
Serial Number: E819480C
Asset Tag: DIMM_D1_AssetTag
Part Number: 9965640-006.A01G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x001B, DMI type 20, 35 bytes
Memory Device Mapped Address
Starting Address: 0x01800000000
Ending Address: 0x01FFFFFFFFF
Range Size: 32 GB
Physical Device Handle: 0x001A
Memory Array Mapped Address Handle: 0x0016
Partition Row Position: 1

Handle 0x001C, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0015
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: RIMM
Set: None
Locator: DIMM_D2
Bank Locator: NODE 2
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown

Handle 0x001D, DMI type 7, 19 bytes
Cache Information
Socket Designation: CPU Internal L1
Configuration: Enabled, Not Socketed, Level 1
Operational Mode: Write Back
Location: Internal
Installed Size: 896 kB
Maximum Size: 896 kB
Supported SRAM Types:
Unknown
Installed SRAM Type: Unknown
Speed: Unknown
Error Correction Type: Parity
System Type: Other
Associativity: 8-way Set-associative

Handle 0x001E, DMI type 7, 19 bytes
Cache Information
Socket Designation: CPU Internal L2
Configuration: Enabled, Not Socketed, Level 2
Operational Mode: Write Back
Location: Internal
Installed Size: 3584 kB
Maximum Size: 3584 kB
Supported SRAM Types:
Unknown
Installed SRAM Type: Unknown
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 8-way Set-associative

Handle 0x001F, DMI type 7, 19 bytes
Cache Information
Socket Designation: CPU Internal L3
Configuration: Enabled, Not Socketed, Level 3
Operational Mode: Write Back
Location: Internal
Installed Size: 35840 kB
Maximum Size: 35840 kB
Supported SRAM Types:
Unknown
Installed SRAM Type: Unknown
Speed: Unknown
Error Correction Type: Single-bit ECC
System Type: Unified
Associativity: 20-way Set-associative

Handle 0x0020, DMI type 4, 42 bytes
Processor Information
Socket Designation: CPUSocket
Type: Central Processor
Family: Xeon
Manufacturer: Intel
ID: F1 06 04 00 FF FB EB BF
Signature: Type 0, Family 6, Model 79, Stepping 1
Flags:
FPU (Floating-point unit on-chip)
VME (Virtual mode extension)
DE (Debugging extension)
PSE (Page size extension)
TSC (Time stamp counter)
MSR (Model specific registers)
PAE (Physical address extension)
MCE (Machine check exception)
CX8 (CMPXCHG8 instruction supported)
APIC (On-chip APIC hardware supported)
SEP (Fast system call)
MTRR (Memory type range registers)
PGE (Page global enable)
MCA (Machine check architecture)
CMOV (Conditional move instruction supported)
PAT (Page attribute table)
PSE-36 (36-bit page size extension)
CLFSH (CLFLUSH instruction supported)
DS (Debug store)
ACPI (ACPI supported)
MMX (MMX technology supported)
FXSR (FXSAVE and FXSTOR instructions supported)
SSE (Streaming SIMD extensions)
SSE2 (Streaming SIMD extensions 2)
SS (Self-snoop)
HTT (Multi-threading)
TM (Thermal monitor supported)
PBE (Pending break enabled)
Version: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Voltage: 0.0 V
External Clock: 100 MHz
Max Speed: 4000 MHz
Current Speed: 2400 MHz
Status: Populated, Enabled
Upgrade: Socket LGA2011-3
L1 Cache Handle: 0x001D
L2 Cache Handle: 0x001E
L3 Cache Handle: 0x001F
Serial Number: Not Specified
Asset Tag: Not Specified
Part Number: Not Specified
Core Count: 14
Core Enabled: 14
Thread Count: 28
Characteristics:
64-bit capable
Multi-Core
Hardware Thread
Execute Protection
Enhanced Virtualization
Power/Performance Control

Handle 0x0021, DMI type 130, 20 bytes
OEM-specific Type
Header and Data:
82 14 21 00 24 41 4D 54 01 01 01 01 01 A5 2F 02
00 00 00 00

Handle 0x0022, DMI type 131, 64 bytes
OEM-specific Type
Header and Data:
83 40 22 00 35 00 00 00 09 00 00 00 00 00 1D 00
F8 00 44 8D 00 00 00 00 09 80 00 00 01 00 09 00
EA 03 25 00 00 00 00 00 C8 00 3A 15 00 00 00 00
00 00 00 00 22 00 00 00 76 50 72 6F 00 00 00 00

Handle 0x0023, DMI type 127, 4 bytes
End Of Table

#1
Posted 12/16/2017 05:28 AM   
Additional information:

I moved the Titan X board to a different system. The first system was a Xeon E5 2680 v4 CPU, while the second box is a Threadripper 1950X, running CUDA 9.0 on Ubuntu 16.04 with the same kernel version. Same issues as described above.

Thanks for your help.

Best,
Aaron

#2
Posted 12/16/2017 05:53 AM   
Answer Accepted by Original Poster
Upgrade to 387.34 drivers which fully support Titan V.

Artem S. Tashkinov
Linux and Open Source advocate

#3
Posted 12/16/2017 08:04 AM   
NVRM: RmInitAdapter failed! (0x30:0x56:685)
This doesn't look good. Seeing it across different systems would point to the Titan being broken. Check the power connectors, try the latest driver, then try to RMA.
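
To check whether a kernel log contains this failure, one option is to scan the `dmesg` text for the message quoted above. The following is a minimal sketch: the function name is made up for illustration, and the three values in parentheses are captured as opaque strings without any attempt to interpret them.

```python
import re

# Matches lines such as "NVRM: RmInitAdapter failed! (0x30:0x56:685)".
# The three fields are kept as plain strings; their meaning is not decoded here.
RM_INIT_FAILED = re.compile(
    r"NVRM: RmInitAdapter failed! \((0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(\d+)\)"
)


def find_rm_init_failures(log_text):
    """Return one (field1, field2, field3) tuple per RmInitAdapter failure."""
    return [match.groups() for match in RM_INIT_FAILED.finditer(log_text)]
```

The log text could come from, for example, `subprocess.run(["dmesg"], stdout=subprocess.PIPE, universal_newlines=True).stdout`, which may need root privileges on some systems.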

#4
Posted 12/16/2017 03:24 PM   
Hello again,

Thank you!

Downloading the latest driver fixed this. I did not realize that the CUDA Toolkit v9.1 does not include the latest driver.
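
Since the toolkit installer can ship an older driver than the card needs, it may help to compare the loaded driver version against the 387.34 minimum mentioned in this thread. The sketch below parses the version from text like the `NVRM: loading NVIDIA UNIX x86_64 Kernel Module 387.26 ...` line in the original post. The function names are illustrative, and reading `/proc/driver/nvidia/version` assumes the NVIDIA kernel module is loaded.

```python
import re

# Assumed minimum for the Titan V, taken from the accepted answer in this thread.
MIN_TITAN_V_DRIVER = (387, 34)


def parse_driver_version(text):
    """Return the version after "Kernel Module" as a tuple of ints, or None."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def read_installed_driver(path="/proc/driver/nvidia/version"):
    """Read the loaded module's version; None if the file cannot be read."""
    try:
        with open(path) as handle:
            return parse_driver_version(handle.read())
    except OSError:
        return None


def supports_titan_v(version, minimum=MIN_TITAN_V_DRIVER):
    """True when `version` is at least `minimum` (element-wise tuple comparison)."""
    return version is not None and version >= minimum
```

Calling `supports_titan_v(read_installed_driver())` on the machine in question gives a quick yes/no before reaching for nvidia-smi.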

#5
Posted 12/16/2017 03:27 PM   
Quote: "Upgrade to 387.34 drivers which fully support Titan V."

Using Ubuntu 16.04.3 with driver 387.34, nvidia-smi still prints "Graphics Device" instead of Titan V. Is that normal?

#6
Posted 12/20/2017 07:39 AM   
Here's the output I get from nvidia-smi, which also has "Graphics Device." The 100% GPU use is because I'm running a simulation.

(CUDA programs seem to run fine, so I'm not too worried.)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 00000000:41:00.0 Off |                  N/A |
| 52%   72C    P2   143W / 250W |   2755MiB / 12057MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
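
For scripted checks, the same fields can be read in CSV form rather than from the table. This is a rough sketch using nvidia-smi's `--query-gpu` and `--format=csv,noheader` options; the helper names are made up for illustration, and `query_gpus` only works on a machine with the NVIDIA driver installed.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in output order.
FIELDS = ["name", "driver_version", "pci.bus_id"]


def parse_gpu_csv(text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]


def has_generic_name(gpu):
    """True when the driver reports the placeholder name seen in this thread."""
    return gpu["name"] == "Graphics Device"


def query_gpus():
    """Run nvidia-smi and return the parsed rows (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_gpu_csv(result.stdout)
```

A call such as `[g for g in query_gpus() if has_generic_name(g)]` would list the cards that are still showing the generic label.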

#7
Posted 12/20/2017 03:52 PM   
I am having a similar issue. I am using the newest driver (387.34), have CUDA 9 installed, and am running TensorFlow in an nvidia-docker container, all on an Ubuntu 16.04 machine. Just like marton, when I run nvidia-smi, "Graphics Device" is printed instead of "Titan V". The card seems to be working but isn't as fast as the V100s I use at work. I'm not sure whether that is because V100s are simply faster or because the Titan V isn't using its tensor cores. Is there a way to determine whether the card is using its tensor cores?

#8
Posted 12/21/2017 07:46 PM   
I'm having the same issue with 387.34 + CUDA 9.1 + nvidia-smi on a fresh Ubuntu 16.04 LTS. The 1080 Ti in my system is reported correctly but the Titan V receives "Graphics Device" as others have noted above.

Running https://github.com/salesforce/awd-lstm-lm, a language modeling codebase I made, the 1080 Ti gets ~26 secs / epoch (similar to the P100) and the Titan V hits ~20 secs / epoch, so it certainly works even if it's not properly recognized. Note that the codebase is running PyTorch but is not optimized for the Titan V. Off topic but interesting: the Titan V is pulling about 160 watts vs 1080 Ti pulling 230 watts - +1 for power efficiency :)
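
The efficiency remark can be made concrete with a quick calculation of energy per epoch. This is only a back-of-the-envelope sketch: it uses the approximate figures quoted above and assumes the power draw stays constant at those values for the whole epoch, which real workloads will not do exactly.

```python
def energy_per_epoch_joules(power_watts, seconds_per_epoch):
    """Energy = power x time, assuming constant power draw over the epoch."""
    return power_watts * seconds_per_epoch


# Approximate figures quoted in the post above.
gtx_1080_ti = energy_per_epoch_joules(230, 26)  # 5980 J
titan_v = energy_per_epoch_joules(160, 20)      # 3200 J

speedup = 26 / 20                      # 1.3x faster per epoch
energy_ratio = gtx_1080_ti / titan_v   # about 1.87x less energy per epoch
```

Under those assumptions the Titan V finishes an epoch 1.3 times faster while using a little over half the energy.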

My cuda-driver package is 387.26-1 from the repository, but I realized it's a pseudo-package that only states the minimum version of the other packages, all of which are 387.34 too. I'll keep investigating.

#9
Posted 12/21/2017 11:31 PM   
Using the latest drivers from the website instead of the repo, nvidia-smi still prints the generic label.

With the latest Windows drivers, the cards show up as Titan V.

#10
Posted 12/27/2017 04:47 PM   
Getting the same - I updated to the latest driver (387.34), but my shiny new Titan V is only recognised as a lowly 'Graphics Device'. Where's the respect? nvidia-smi output:

Fri Dec 29 10:15:42 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.34                 Driver Version: 387.34                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 00000000:01:00.0  On |                  N/A |
| 28%   37C    P2    26W / 250W |    327MiB / 12057MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

This is on a fully updated Ubuntu 16.04 LTS install.

#11
Posted 12/29/2017 10:24 AM   
If anyone at Nvidia is reading this, I'd like to re-emphasize a point made above: It would be helpful if there were an easy way to confirm that the tensor cores are actually being used. This functionality could be added to, for example, the nvvp program.

(I was using the tensor cores with Theano and got a much smaller performance increase than I expected. In my use case, float16 storage with float32 compute actually runs faster than float16 storage with the tensor cores. I think the tensor core routines in cuDNN have kernel names ending in _tn_v1 (as shown in nvvp), and intentionally disabling the tensor cores slowed my simulations down further. Therefore, I'm pretty sure that my convolutions were being performed on the tensor cores. When running float32 compute, I could use an FFT convolution algorithm, which gave a substantial speedup.)
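
Until a dedicated tool exists, one crude check along these lines is to take the kernel names shown by the profiler and count how many carry a marker thought to indicate tensor core use. The sketch below is only that: the `_tn_v1` marker is the guess made in this post, not a documented convention, and the function names and example kernel names are hypothetical.

```python
def kernels_matching(kernel_names, markers):
    """Return the kernel names that contain any of the marker substrings."""
    return [
        name for name in kernel_names
        if any(marker in name for marker in markers)
    ]


def matching_fraction(kernel_names, markers):
    """Fraction of kernels whose name contains a marker (0.0 if none given)."""
    names = list(kernel_names)
    if not names:
        return 0.0
    return len(kernels_matching(names, markers)) / len(names)


# Hypothetical kernel names, standing in for a list copied from the profiler.
example_names = ["hypothetical_conv_fp16_tn_v1", "hypothetical_fft_conv_fp32"]
suspected = kernels_matching(example_names, ["_tn_v1"])
```

Counting names says nothing about how much time each kernel takes, so weighting by kernel duration would be a natural next step.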

#12
Posted 12/29/2017 07:56 PM   
I second agreenblatt's request - I would also love a tool to help monitor the utilization of the tensor cores. I've just been rewriting a couple of networks to cast to FP16, but I need some concrete tensor core diagnostics; otherwise I feel like I am wandering in the dark.

#13
Posted 12/30/2017 07:02 AM   
I've had some very bad experiences with the 387.34 driver and Ubuntu 16.04. I did one install that gave all disk drives to Ubuntu, and had similar results to what you guys had (though I'm largely using the Matlab Parallel Computing Toolbox). But Matlab choked on the gpuDevice command. I found a workaround using Matlab system objects. I thought all was well until I installed VMware, which I need for a couple of pieces of Windows software. VMware hung the Xorg server completely. If I didn't load the NVIDIA driver, VMware worked.

Okay - so I ended up installing Windows, which works. Then I set aside a single drive for experimenting with Linux distros, and found that none of them worked once I installed the driver. They all had one piece of software or another that would cause Xorg to hang. In some cases, it was even Firefox causing the hang! Ugh.

#14
Posted 01/02/2018 10:22 PM   
njudell said: But Matlab choked on the gpuDevice command.

This is normal for the first instantiation of a GPU in a MATLAB version whose PCT (Parallel Computing Toolbox) version has no fat binary support for the GPU's compute capability. See this: https://www.mathworks.com/matlabcentral/answers/79275-gpudevice-command-very-slow

The answer there discusses a different MATLAB version / GPU compute capability, but the same idea applies.
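
The underlying rule can be sketched as follows: a fat binary carries machine code for a fixed list of compute capabilities, and when none of them fits the installed GPU, the driver has to JIT-compile the embedded PTX on first use, which is what makes that first call slow. The code below is a simplified model of that decision, not MATLAB's or the driver's actual logic; the function names and the embedded list in the usage note are hypothetical.

```python
def has_native_binary(device_cc, embedded_ccs):
    """True if some embedded binary can run natively on the device.

    Simplified rule: code built for compute capability (M, m) runs on a
    device (M, n) with n >= m, and never across major versions.
    """
    dev_major, dev_minor = device_cc
    return any(
        major == dev_major and minor <= dev_minor
        for major, minor in embedded_ccs
    )


def first_use_path(device_cc, embedded_ccs, has_ptx=True):
    """Classify what happens the first time the GPU is used."""
    if has_native_binary(device_cc, embedded_ccs):
        return "native"
    # No matching machine code: fall back to JIT-compiling PTX, if present.
    return "jit" if has_ptx else "unsupported"
```

For a Titan V (compute capability 7.0) and a hypothetical toolbox built only for `[(3, 0), (3, 5), (5, 0), (6, 0)]`, `first_use_path((7, 0), ...)` returns `"jit"`, matching the slow first `gpuDevice` call described above.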

#15
Posted 01/05/2018 12:55 AM   