Kernel crash on SuperMicro X10DRG-HT, 2x Xeon(R) CPU E5-2620 v4, 6x P5000

Hello NVIDIA devs!
I’m currently struggling with 6 NVIDIA P5000 cards on a SuperMicro X10DRG-HT (SYS-2028GR-TRHT).

I’ve successfully installed the whole system, the NVIDIA kernel driver and CUDA, and everything works, but it is quite unstable. The crash is reproducible on both Ubuntu 16.04 and Debian stretch. I’m now running Debian with only 4 cards (the remaining two are in another board for further tests) and was able to gather some information that could help find what’s wrong.

Unfortunately, I cannot provide nvidia-bug-report output from the crashed state, because the crash leads to a complete machine freeze or an immediate reboot.

I’ve worked out steps to reproduce the issue.
In three separate SSH connections, run these three commands:

$ while true; do ./gpu_burn 20; sleep 3; done
$ while true; do nvidia-smi & nvidia-smi & nvidia-smi & nvidia-smi & nvidia-smi & nvidia-smi ; done
$ nvidia-smi dmon -s pucvmet

(gpu_burn was compiled from Microway/gpu-burn on GitHub, Microway's improved version of GPU Burn.)

After a few minutes the kernel crashes.
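
The three commands can also be started from a single shell, which is handy when only one session to the machine is available. This is only a rough sketch: the variable and function names (GPU_BURN, NVIDIA_SMI, burn_loop, smi_burst, smi_loop, start_repro) are illustrative and were not part of the original test, and discarding the nvidia-smi output is an added convenience.

```shell
#!/usr/bin/env bash
# Sketch: run the three reproduction workloads from one shell.
# Paths and names below are assumptions; adjust them to your setup.

GPU_BURN=${GPU_BURN:-./gpu_burn}
NVIDIA_SMI=${NVIDIA_SMI:-nvidia-smi}

# Workload 1: 20-second burn, 3-second pause, repeated forever.
burn_loop() {
    while true; do "$GPU_BURN" 20; sleep 3; done
}

# One round of workload 2: five nvidia-smi queries in the background
# plus one in the foreground, i.e. six running at roughly the same time.
smi_burst() {
    local i
    for i in 1 2 3 4 5; do
        "$NVIDIA_SMI" >/dev/null &
    done
    "$NVIDIA_SMI" >/dev/null
}

# Workload 2: repeat the burst forever.
smi_loop() {
    while true; do smi_burst; done
}

# Start workloads 1 and 2 in the background and keep the dmon monitor
# (workload 3) in the foreground.
start_repro() {
    burn_loop &
    burn_pid=$!
    smi_loop &
    smi_pid=$!
    # When the script exits, stop the two background loops. A gpu_burn
    # run that is already in progress may still finish its 20 seconds.
    trap 'kill "$burn_pid" "$smi_pid" 2>/dev/null' EXIT
    "$NVIDIA_SMI" dmon -s pucvmet
}

# Nothing runs until start_repro is called.
```

Saved as a script with a final `start_repro` line, this starts all three workloads at once; pressing Ctrl-C ends the monitor and the script, and the exit trap then stops the two loops.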

I’ve managed to set up netconsole with MCE logging, and this is what I caught:

===
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: 0x2c800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Rebooting in 30 seconds…
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: Machine check events logged
INFO: rcu_sched detected stalls on CPUs/tasks:
6-…: (0 ticks this GP) idle=453/140000000000000/0 softirq=48357/48359 fqs=382
(detected by 7, t=5252 jiffies, g=18619, c=18618, q=397725)
Task dump for CPU 6:
gpu_burn R running task 0 4604 4588 0x00000008
ffff93df455938e8 ffffffffc10f721e ffff93df45593888 ffffffffc16569ec
0000000000000000 ffffffffc142e103 ffff93df45e94008 0000000000000001
0000000000000000 0000000000000000 00000000c1d001d0 0000000000000001
Call Trace:
[] ? os_acquire_spinlock+0xe/0x20 [nvidia]
[] ? _nv030712rm+0xc/0x20 [nvidia]
[] ? _nv019730rm+0xf3/0x130 [nvidia]
[] ? _nv012030rm+0x60/0x60 [nvidia]
[] ? pci_conf1_write+0x57/0xe0
[] ? pci_bus_write_config_word.part.7+0x44/0x60
[] ? nv_check_pci_config_space+0x16e/0x320 [nvidia]
[] ? _nv031399rm+0x158/0x190 [nvidia]
[] ? _nv028452rm+0x58/0x70 [nvidia]
[] ? _nv033408rm+0x34/0x1d0 [nvidia]
[] ? _nv007621rm+0x118/0x180 [nvidia]
[] ? _nv007599rm+0x28c/0x2a0 [nvidia]
[] ? _nv001091rm+0x12/0x20 [nvidia]
[] ? _nv006820rm+0x64/0xa0 [nvidia]
[] ? _nv001193rm+0x5e8/0x880 [nvidia]
[] ? rm_ioctl+0x73/0x100 [nvidia]
[] ? nvidia_ioctl+0x19a/0x5a0 [nvidia]
[] ? handle_mm_fault+0xefe/0x12d0
[] ? nvidia_frontend_compat_ioctl+0x3c/0x40 [nvidia]
[] ? do_vfs_ioctl+0x9f/0x600
[] ? SyS_ioctl+0x74/0x80
[] ? system_call_fast_compare_end+0xc/0x9b
rcu_sched kthread starved for 4433 jiffies! g18619 c18618 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x0
rcu_sched R running task 0 8 2 0x00000000
ffff93df5737d400 0000000000000000 ffff93d75c368080 ffff93d75fad8240
ffff93d75bd930c0 ffffba3d062d3db0 ffffffff8d2038e3 ffffba3d062d3de0
0000000100003778 ffff93d75fad8240 0000000000000003 ffff93d75c368080
Call Trace:
[] ? __schedule+0x233/0x6d0
[] ? schedule+0x32/0x80
[] ? schedule_timeout+0x16b/0x350
[] ? del_timer_sync+0x50/0x50
[] ? rcu_gp_kthread+0x505/0x850
[] ? __wake_up_common+0x49/0x80
[] ? rcu_note_context_switch+0xe0/0xe0
[] ? kthread+0xd7/0xf0
[] ? kthread_park+0x60/0x60
[] ? ret_from_fork+0x25/0x30
MCE records pool full!
MCE records pool full!
MCE records pool full!
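
For anyone who wants to capture a similar log, a generic netconsole setup looks roughly like the sketch below. Every value in it (ports, IP addresses, interface name, MAC address) is a placeholder and not taken from this machine, and netconsole_param is a hypothetical helper that only assembles the module parameter string.

```shell
#!/usr/bin/env bash
# Sketch of a netconsole setup. All values are placeholders.

SRC_PORT=6665              # UDP source port on the crashing machine
SRC_IP=192.0.2.10          # IP of the crashing machine
DEV=eth0                   # network interface to send from
TGT_PORT=6666              # UDP port the log receiver listens on
TGT_IP=192.0.2.20          # IP of the log receiver
TGT_MAC=00:11:22:33:44:55  # MAC of the receiver (or of the gateway)

# Build the module parameter in the form
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$SRC_PORT" "$SRC_IP" "$DEV" "$TGT_PORT" "$TGT_IP" "$TGT_MAC"
}

# On the crashing machine (as root), something like:
#   dmesg -n 8        # send all kernel messages to the console
#   modprobe netconsole netconsole="$(netconsole_param)"
#
# On the receiving machine, listen for the UDP stream, e.g.:
#   nc -u -l 6666     # some netcat variants need: nc -u -l -p 6666

netconsole_param
```

Run as-is, the script only prints the parameter string; the modprobe and nc lines are left as comments so that nothing is changed on the system by accident.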

I’m now running the official Debian kernel 4.9.0-4-amd64 on Debian stretch 9.2, fully updated.
I’ve got the nvidia-persistenced daemon up and running (without it I hit many “GPU fallen off the bus” errors). I’m attaching an nvidia-bug-report log taken from the system in a NOT CRASHED state, so that you have all the information about my system that you need.

I hope this helps with finding the issue I’m facing.

Best Regards,

Martin
nvidia-bug-report.log.gz (337 KB)