Frequent catastrophic crashes on a multiple GPU machine

Hi,

We have a 6 GPU (Ubuntu 16.01) machine with 2 Titans and 4 GTX 780s. Ever since the update to CUDA_8.0.61 we’ve been experiencing periodic crashes. The kernel module becomes unavailable and all GPU-related processes stall. Calling nvidia-smi hangs and cannot be killed. I’ve upgraded the drivers twice and now am running 381.22. I’ve again experienced a crash. On reboot I tried just reloading the kernel modules but the system couldn’t find them. I had to reinstall the driver to get everything running again. The syslog output on crash is this:

Jun 20 14:14:57 server_name kernel: [699655.484089] INFO: task kworker/37:1:37384 blocked for more than 120 seconds.
Jun 20 14:14:57 server_name kernel: [699655.484104] Tainted: P OE 4.4.0-79-generic #100-Ubuntu
Jun 20 14:14:57 server_name kernel: [699655.484115] “echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
Jun 20 14:14:57 server_name kernel: [699655.484127] kworker/37:1 D ffff885f9535fb78 0 37384 2 0x00000000
Jun 20 14:14:57 server_name kernel: [699655.484228] Workqueue: events os_execute_work_item [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484230] ffff885f9535fb78 00000000c92994dc ffff886006020e00 ffff885fac65aa00
Jun 20 14:14:57 server_name kernel: [699655.484232] ffff885f95360000 ffff883f81975ca8 ffff885fac65aa00 ffff885f9535fe10
Jun 20 14:14:57 server_name kernel: [699655.484234] ffff883eaa4e41c8 ffff885f9535fb90 ffffffff8183c955 7fffffffffffffff
Jun 20 14:14:57 server_name kernel: [699655.484236] Call Trace:
Jun 20 14:14:57 server_name kernel: [699655.484241] [] schedule+0x35/0x80
Jun 20 14:14:57 server_name kernel: [699655.484243] [] schedule_timeout+0x1b5/0x270
Jun 20 14:14:57 server_name kernel: [699655.484321] [] ? os_acquire_spinlock+0x12/0x20 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484384] [] ? os_acquire_spinlock+0x12/0x20 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484488] [] ? _nv019565rm+0xc/0x20 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484490] [] __down+0x7f/0xd0
Jun 20 14:14:57 server_name kernel: [699655.484494] [] ? do_gettimeofday+0x29/0x90
Jun 20 14:14:57 server_name kernel: [699655.484496] [] down+0x41/0x50
Jun 20 14:14:57 server_name kernel: [699655.484559] [] os_acquire_semaphore+0x37/0x40 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484622] [] os_acquire_mutex+0xe/0x10 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484720] [] _nv020054rm+0x5c/0x120 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484843] [] ? _nv021658rm+0x3ac/0x8a0 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.484936] [] ? _nv000807rm+0x22b/0xcd0 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.485028] [] ? rm_execute_work_item+0x49/0xc0 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.485092] [] ? os_execute_work_item+0x11/0x70 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.485155] [] ? os_execute_work_item+0x46/0x70 [nvidia]
Jun 20 14:14:57 server_name kernel: [699655.485158] [] ? process_one_work+0x165/0x480
Jun 20 14:14:57 server_name kernel: [699655.485160] [] ? worker_thread+0x4b/0x4c0
Jun 20 14:14:57 server_name kernel: [699655.485161] [] ? process_one_work+0x480/0x480
Jun 20 14:14:57 server_name kernel: [699655.485163] [] ? kthread+0xe5/0x100
Jun 20 14:14:57 server_name kernel: [699655.485165] [] ? kthread_create_on_node+0x1e0/0x1e0
Jun 20 14:14:57 server_name kernel: [699655.485167] [] ? ret_from_fork+0x3f/0x70
Jun 20 14:14:57 server_name kernel: [699655.485169] [] ? kthread_create_on_node+0x1e0/0x1e0

==============NVSMI LOG==============

Timestamp : Tue Jun 20 14:40:11 2017
Driver Version : 381.22

Attached GPUs : 6
GPU 0000:04:00.0
Product Name : GeForce GTX 780
Product Brand : GeForce
Display Mode : N/A
Display Active : N/A
Persistence Mode : Disabled
Accounting Mode : N/A
Accounting Mode Buffer Size : N/A
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-1dee2eb9-43de-be84-1759-c69480d674d8
Minor Number : 0
VBIOS Version : 80.80.21.00.53
MultiGPU Board : N/A
Board ID : N/A
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : N/A
PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x100410DE
Bus Id : 0000:04:00.0
Sub System Id : 0x104B196E
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Link Width
Max : N/A
Current : N/A
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : 32 %
Performance State : P0
Clocks Throttle Reasons : N/A
FB Memory Usage
Total : 3019 MiB
Used : 0 MiB
Free : 3019 MiB
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 27 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Default Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Max Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : N/A

GPU 0000:05:00.0
Product Name : GeForce GTX TITAN Black
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322414012083
GPU UUID : GPU-4f8989f6-366e-9b98-c5d2-9471418cc036
Minor Number : 1
VBIOS Version : 80.80.4E.00.90
MultiGPU Board : No
Board ID : 0x500
GPU Part Number : N/A
Inforom Version
Image Version : 2083.0031.00.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : Low Double Precision
Pending : Low Double Precision
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x05
Device : 0x00
Domain : 0x0000
Device Id : 0x100C10DE
Bus Id : 0000:05:00.0
Sub System Id : 0x37903842
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : 26 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 6082 MiB
Used : 0 MiB
Free : 6082 MiB
BAR1 Memory Usage
Total : 128 MiB
Used : 2 MiB
Free : 126 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 42 C
GPU Shutdown Temp : 100 C
GPU Slowdown Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 84.36 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 150.00 W
Max Power Limit : 265.00 W
Clocks
Graphics : 888 MHz
SM : 888 MHz
Memory : 3500 MHz
Video : 540 MHz
Applications Clocks
Graphics : 888 MHz
Memory : 3500 MHz
Default Applications Clocks
Graphics : 888 MHz
Memory : 3500 MHz
Max Clocks
Graphics : 1202 MHz
SM : 1202 MHz
Memory : 3500 MHz
Video : 540 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

GPU 0000:09:00.0
Product Name : GeForce GTX 780
Product Brand : GeForce
Display Mode : N/A
Display Active : N/A
Persistence Mode : Disabled
Accounting Mode : N/A
Accounting Mode Buffer Size : N/A
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-f89b5e8d-1e64-b5be-e9e8-f1b1157b824f
Minor Number : 2
VBIOS Version : 80.80.21.00.53
MultiGPU Board : N/A
Board ID : N/A
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : N/A
PCI
Bus : 0x09
Device : 0x00
Domain : 0x0000
Device Id : 0x100410DE
Bus Id : 0000:09:00.0
Sub System Id : 0x104B196E
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Link Width
Max : N/A
Current : N/A
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : 32 %
Performance State : P0
Clocks Throttle Reasons : N/A
FB Memory Usage
Total : 3020 MiB
Used : 0 MiB
Free : 3020 MiB
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Default Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Max Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : N/A

GPU 0000:81:00.0
Product Name : GeForce GTX 780
Product Brand : GeForce
Display Mode : N/A
Display Active : N/A
Persistence Mode : Disabled
Accounting Mode : N/A
Accounting Mode Buffer Size : N/A
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-68e67fe1-a1aa-d9b0-a3c6-f0a135e08a7d
Minor Number : 3
VBIOS Version : 80.80.21.00.53
MultiGPU Board : N/A
Board ID : N/A
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : N/A
PCI
Bus : 0x81
Device : 0x00
Domain : 0x0000
Device Id : 0x100410DE
Bus Id : 0000:81:00.0
Sub System Id : 0x104B196E
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Link Width
Max : N/A
Current : N/A
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : 32 %
Performance State : P0
Clocks Throttle Reasons : N/A
FB Memory Usage
Total : 3020 MiB
Used : 0 MiB
Free : 3020 MiB
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 27 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Default Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Max Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : N/A

GPU 0000:84:00.0
Product Name : GeForce GTX TITAN Black
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322514046003
GPU UUID : GPU-5cba0884-e167-ff09-42e6-563c5a2ad2e5
Minor Number : 4
VBIOS Version : 80.80.4E.00.90
MultiGPU Board : No
Board ID : 0x8400
GPU Part Number : N/A
Inforom Version
Image Version : 2083.0031.00.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : Low Double Precision
Pending : Low Double Precision
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x84
Device : 0x00
Domain : 0x0000
Device Id : 0x100C10DE
Bus Id : 0000:84:00.0
Sub System Id : 0x37903842
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : 26 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 6082 MiB
Used : 0 MiB
Free : 6082 MiB
BAR1 Memory Usage
Total : 128 MiB
Used : 2 MiB
Free : 126 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 33 C
GPU Shutdown Temp : 100 C
GPU Slowdown Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 82.09 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 150.00 W
Max Power Limit : 265.00 W
Clocks
Graphics : 888 MHz
SM : 888 MHz
Memory : 3500 MHz
Video : 540 MHz
Applications Clocks
Graphics : 888 MHz
Memory : 3500 MHz
Default Applications Clocks
Graphics : 888 MHz
Memory : 3500 MHz
Max Clocks
Graphics : 1202 MHz
SM : 1202 MHz
Memory : 3500 MHz
Video : 540 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

GPU 0000:89:00.0
Product Name : GeForce GTX 780
Product Brand : GeForce
Display Mode : N/A
Display Active : N/A
Persistence Mode : Disabled
Accounting Mode : N/A
Accounting Mode Buffer Size : N/A
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-bd5a9444-8730-8b56-ac9e-04b0f7b70e12
Minor Number : 5
VBIOS Version : 80.80.21.00.53
MultiGPU Board : N/A
Board ID : N/A
GPU Part Number : N/A
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : N/A
PCI
Bus : 0x89
Device : 0x00
Domain : 0x0000
Device Id : 0x100410DE
Bus Id : 0000:89:00.0
Sub System Id : 0x104B196E
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Link Width
Max : N/A
Current : N/A
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : 32 %
Performance State : P0
Clocks Throttle Reasons : N/A
FB Memory Usage
Total : 3020 MiB
Used : 0 MiB
Free : 3020 MiB
BAR1 Memory Usage
Total : N/A
Used : N/A
Free : N/A
Compute Mode : Default
Utilization
Gpu : N/A
Memory : N/A
Encoder : N/A
Decoder : N/A
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 26 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Default Applications Clocks
Graphics : 1006 MHz
Memory : 3104 MHz
Max Clocks
Graphics : N/A
SM : N/A
Memory : N/A
Video : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : N/A

These kinds of situations are hard to diagnose remotely, but the first thing you would want to check is the physical setup, specifically: power supply, cooling, and PCIe connectivity. The switch to CUDA 8 is probably a red herring here. For what it is worth, 95% of all “weird” issues with Linux systems I see reported in these forums seem to involve Ubuntu.

(1) Power supply: The rated output of your PSU should be at least 1.5x the total specified wattage of your components, in this case about 1.5 x (6 x 250 + 150) = 1.5 x 1650 = 2475 watts. For robustness and efficiency, I recommend a PSU compliant with the 80 PLUS Platinum specification. Make sure all PCIe power connectors on the GPUs are firmly inserted (there should be a tab that engages). Don’t use Y-splitters or 6-pin to 8-pin converters in any of the cables for the PCIe power connector.

(2) Thermals: When you check with nvidia-smi during heavy use, do you see any indications of thermal shutdown or clock throttling due to hitting the temperature limit? Generally, it is best for GPUs to operate at no more than 85 degrees Celsius. Make sure airflow is unobstructed, by moving cabling and other obstacles out of the way. Do you have an issue with elevated ambient air temperature, where the GPUs suck in air heated by other system components?

(3) PCIe: Does this setup use riser cards? They can lead to poor PCIe signal quality. Unplug and replug all PCIe connectors, making sure they are fully engaged. Make sure there is no mechanical stress on the connectors, e.g. bending. Vibrations (external or from hard drives, fans) can negatively affect the connectors, so make sure all GPUs are firmly attached, using whatever screws, latching mechanisms, etc. are offered by the enclosure.
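
For the sizing rule in (1), here is a minimal Python sketch of the arithmetic. It assumes 250 W per GPU and 150 W for the rest of the system, as in the estimate above; the function name and the `headroom` parameter are made up for illustration, and real component wattages should be taken from the vendor specifications.

```python
def recommended_psu_watts(gpu_count, gpu_watts, other_watts, headroom=1.5):
    """Return a minimum PSU rating: total component load times a headroom factor."""
    total_load = gpu_count * gpu_watts + other_watts
    return total_load * headroom


# Six GPUs at an assumed 250 W each, plus an assumed 150 W for CPU, drives, fans.
load = 6 * 250 + 150
print(load)                               # prints 1650
print(recommended_psu_watts(6, 250, 150))  # prints 2475.0
```

With these assumed figures the component load is 1650 W, so the 1.5x rule points at a PSU rated for roughly 2475 W or more.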

Hi,

Thanks for getting back to me so quickly. The remote admin difficulty is why I’ve waited for a while before trying to get help, but it’s hard to diagnose in person, and updates have not helped.

I am almost certain that it’s not a physical issue because

  1. it only affects the GPUs and not the CPU-based processes
  2. it started right after upgrade to CUDA 8
  3. the server is in a dedicated cooled server room and was set up and is looked after by pros (however I will have them look at some of the issues just in case)
  4. crashes are not really correlated with heavy usage

The severity of the crashes seemed to suggest to me that more than a simple software incompatibility is at work here. Plus I have seen so-called “IT experts” doing questionable things to hardware, in particular as it pertains to power supply configurations. This could also be a case of a defective GPU (GTX 780 is an older model), which happens rarely, however. That would probably require removing GPUs to identify the GPU at fault with certainty.

Do you see any messages in relevant system logs (e.g. dmesg) that say something like “GPU has fallen off the bus”?
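
If it helps, something like the following Python sketch could scan a saved kernel log for both kinds of message: the hung-task warning from the syslog excerpt above and the “fallen off the bus” text. The patterns are assumptions based on those two strings, `scan_kernel_log` is a hypothetical helper, and the second sample line is illustrative rather than taken from a real log.

```python
import re

# Pattern modelled on the hung-task line in the syslog excerpt.
HUNG_TASK = re.compile(r"INFO: task (\S+) blocked for more than (\d+) seconds")
# Assumed wording of the driver message; matched loosely and case-insensitively.
OFF_BUS = re.compile(r"fallen off the bus", re.IGNORECASE)


def scan_kernel_log(lines):
    """Return (kind, detail) tuples for log lines that look suspicious."""
    findings = []
    for line in lines:
        hung = HUNG_TASK.search(line)
        if hung:
            findings.append(("hung_task", f"{hung.group(1)} blocked {hung.group(2)}s"))
        elif OFF_BUS.search(line):
            findings.append(("off_bus", line.strip()))
    return findings


sample = [
    "Jun 20 14:14:57 server_name kernel: [699655.484089] INFO: task kworker/37:1:37384 blocked for more than 120 seconds.",
    # Illustrative line only, not copied from a real log:
    "kernel: NVRM: GPU at 0000:04:00.0 has fallen off the bus.",
    "Jun 20 14:29:04 server_name kernel: [ 194.731669] nvidia: module license 'NVIDIA' taints kernel.",
]

for kind, detail in scan_kernel_log(sample):
    print(kind, detail)
```

On a live system the lines could come from saved `dmesg` output or from the syslog file instead of the hard-coded sample.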

I notice that Ubuntu 16.01 is not listed in the software compatibility list for CUDA 8, only 16.04 is, so you might want to investigate that angle as well:

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#axzz4kZeiYSVa

Sorry my bad. Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-81-generic x86_64)

I’d have to wait for another crash to see something in dmesg, but I don’t remember seeing that last time.

The only thing I can see is on re-install of the drivers.

Jun 20 14:29:04 server_name kernel: [ 194.731669] nvidia: module license ‘NVIDIA’ taints kernel.
Jun 20 14:29:04 server_name kernel: [ 194.736671] nvidia: module verification failed: signature and/or required key missing - tainting kernel
Jun 20 14:29:04 server_name kernel: [ 194.743807] nvidia-nvlink: Nvlink Core is being initialized, major device number 242

Sorry to piggyback on this thread. We have been experiencing the same issue since we started working with Linux+Nvidia+CUDA, around the end of 2016.

We had crashes every hour and we had to reboot the boxes, causing a lot of trouble for the users. We opened a tkt and provided as much info as we could, but it looks like Nvidia has other higher priority issues to solve.

Now we are using v 384.59 and we have “only” daily hangs. As per Customer Support:

"I have an update from Engineering. They have identified two problems that are causing the issue and are reviewing a fix to the 375 driver for the first problem. In Engineering’s own works, this fix “will not fix their issue completely, but may improve the situation.”

Unfortunately we don’t yet have a timeframe for fixing the second problem and I will provide one as soon as possible. "

So the situation is slightly better but still unacceptable. It’s a pity there is no other vendor offering a computational solution (hw and sw) at similar cost.

I’m really interested to see what response you get.

Have a good day!

Well, with the recent drivers I use on Gentoo Linux, it’s a temperature issue that seems to be related to fan control problems. The temperature goes up to 65-68 C while the fan stays at its minimum of 32% and won’t go any higher. I have the stock BIOS on my GTX 970; it’s not one of the special ones with ACX cooling. On Windows, fan control works fine. This is the only issue I’ve had so far under Linux, and Blender runs great for me with no crashing. But if I play DOOM under Wine + Vulkan for a few hours, my system restarts when it hits that temperature, because the fan wants to stay at 32% speed. If I manually set it to 60%, temperatures stay at 60-61 C with no restarts or crashes.

A GPU temperature of 68 deg C should be nothing to worry about. You can check the throttle and shutdown limits on your specific GPU with nvidia-smi. Traditionally those values have been quite high. E.g. here is output from my system:

C:\Users\Norbert\My Programs>nvidia-smi -q | grep Temp
    Temperature
        GPU Current Temp            : 82 C
        GPU Shutdown Temp           : 101 C
        GPU Slowdown Temp           : 96 C

As long as the temperature is below the “Slowdown Temp”, there should be no adverse impact of temperature on GPU operation, except possibly a lowering of maximum boost clocks.
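
To compare those values across several GPUs, here is a rough Python sketch that parses the Temperature section of `nvidia-smi -q` text output and reports how far each GPU is below its slowdown limit. It assumes the field labels shown above; `parse_temps` and `slowdown_margin` are hypothetical names, and cards that report N/A for the limits get None.

```python
import re

# Field labels as they appear in the nvidia-smi -q output quoted in this thread.
TEMP_LINE = re.compile(r"GPU (Current|Shutdown|Slowdown) Temp\s*:\s*(\d+|N/A)")


def parse_temps(text):
    """Return one dict per GPU with 'current', 'shutdown', 'slowdown' in deg C (or None)."""
    gpus = []
    for line in text.splitlines():
        match = TEMP_LINE.search(line)
        if not match:
            continue
        kind, value = match.group(1).lower(), match.group(2)
        if kind == "current":
            gpus.append({})  # a "Current Temp" line starts a new GPU record
        if gpus:
            gpus[-1][kind] = None if value == "N/A" else int(value)
    return gpus


def slowdown_margin(gpu):
    """Degrees C below the slowdown limit, or None if the limit is not reported."""
    if gpu.get("slowdown") is None or gpu.get("current") is None:
        return None
    return gpu["slowdown"] - gpu["current"]


sample = """\
        GPU Current Temp            : 82 C
        GPU Shutdown Temp           : 101 C
        GPU Slowdown Temp           : 96 C
GPU Current Temp : 27 C
GPU Shutdown Temp : N/A
GPU Slowdown Temp : N/A
"""

for index, gpu in enumerate(parse_temps(sample)):
    print(index, gpu["current"], slowdown_margin(gpu))
```

For the first sample GPU the margin is 96 - 82 = 14 degrees. The second reports no limits, so the margin is None.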

Sudden restarts of the system are more an indication of insufficient power supply. As a GPU heats up under heavy load, power consumption increases over the first 10 minutes or so: (1) ohmic resistance in electronic components increases, driving up power consumption (2) The fan must turn faster, requiring more power to turn the fan. In addition, games tend to have bursts of heavy load rather than a continuous heavy load as many compute tasks like Blender. This can cause “power spikes”, which put strain on the power supply (PSU). I suspect your PSU may be sized incorrectly.

(1) The total combined wattage for all system components should be <= 60% of the power rating of your PSU

(2) PCIe power connectors on the GPU must be plugged in properly, and their power cables should contain no converters or splitters.

Hi all,

After we updated the driver to version 384.90, we haven’t experienced any hangs (so far).

We had some other problems, like “PCI card fallen from the bus”, but those are more low level PCI issues.
Also, on our Tyan servers we had a lot of PCI error recovery messages, which disappeared after we updated the motherboard BIOS.