GPU server crashes while running deep learning frameworks like Caffe

Our GPU server crashes from time to time while running deep learning frameworks like Caffe.
In these cases only a hard reset of the server helps.

The crashes appear random and are not reproducible. Sometimes the whole training run completes, sometimes it crashes at varying iterations. It is not the program that crashes: the whole server freezes and no longer reacts to any input, even after hours.

I logged the nvidia-smi output as well as CPU load and memory usage, but nothing out of the ordinary happens around the time of the server freeze.

The server is a Supermicro SYS-4028GR-TR with Intel® C612 Chipset, 2x Intel Xeon E5-2640 v4, 8x16GB RAM, 4x NVIDIA GeForce GTX 1080 Ti

I currently have no idea how to find the cause of the freezes. Can you help me find the issue? Please let me know if I should provide more information.

nvidia-bug-report.log.gz (350 KB)

Did you monitor the GPU and system temperatures?
Can you ssh to the server when the freeze occurs, to generate a better report with nvidia-bug-report.sh?

I monitored the GPU temperature. These are the last few lines of my nvidia-smi log:

index, timestamp, name, driver_version, temperature.gpu, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB], vbios_version, persistence_mode, pstate
0, 2017/10/23 15:39:09.194, GeForce GTX 1080 Ti, 384.90, 75, 66 %, 39 %, 11172 MiB, 1646 MiB, 9526 MiB, 86.02.39.00.01, Enabled, P2
1, 2017/10/23 15:39:09.198, GeForce GTX 1080 Ti, 384.90, 29, 0 %, 0 %, 11172 MiB, 11161 MiB, 11 MiB, 86.02.39.00.01, Enabled, P8
2, 2017/10/23 15:39:09.200, GeForce GTX 1080 Ti, 384.90, 27, 0 %, 0 %, 11172 MiB, 11161 MiB, 11 MiB, 86.02.39.00.01, Enabled, P8
3, 2017/10/23 15:39:09.201, GeForce GTX 1080 Ti, 384.90, 26, 0 %, 0 %, 11172 MiB, 11161 MiB, 11 MiB, 86.02.39.00.01, Enabled, P8
0, 2017/10/23 15:39:10.217, GeForce GTX 1080 Ti, 384.90, 75, 69 %, 21 %, 11172 MiB, 1646 MiB, 9526 MiB, 86.02.39.00.01, Enabled, P2
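
In case it helps anyone reading such a log after a freeze, here is a minimal sketch that summarizes it per GPU: the last timestamp seen, the highest temperature, and the number of samples. It assumes the log is nvidia-smi's CSV query output with a header line like the one above. The function name `summarize_smi_log` is made up for this example, and incomplete lines (for instance one cut off by the freeze) are simply skipped.

```python
import csv
import io


def summarize_smi_log(text):
    """Summarize an nvidia-smi CSV log per GPU index.

    Returns {gpu_index: {"last_seen": str, "max_temp_c": int, "samples": int}}.
    Assumes a header line starting with "index" and containing the columns
    "timestamp" and "temperature.gpu", as in the log shown in this thread.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = None
    summary = {}
    for row in reader:
        if not row:
            continue  # blank line
        if row[0] == "index":
            header = row  # (re)read the header line
            continue
        if header is None or len(row) != len(header):
            continue  # no header yet, or a truncated line
        record = dict(zip(header, row))
        gpu = int(record["index"])
        temp = int(record["temperature.gpu"])
        entry = summary.setdefault(
            gpu, {"last_seen": record["timestamp"], "max_temp_c": temp, "samples": 0}
        )
        entry["last_seen"] = record["timestamp"]
        entry["max_temp_c"] = max(entry["max_temp_c"], temp)
        entry["samples"] += 1
    return summary
```

Calling it on the whole log file, e.g. `summarize_smi_log(open("smi.log").read())` (the file name is only a placeholder), gives the time of the last complete sample, which can then be compared with the last entry in the system journal.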

I cannot ssh to the server when the freeze occurs. I also cannot use the shell locally. I have to press the reset button to restart the server.

Looks perfectly fine.
After a crash and reboot, what’s the output of
sudo journalctl -n20 -b1 --no-pager
Maybe there are some last words logged.

sudo journalctl -n20 -b1 --no-pager

-- Logs begin at Tue 2017-10-24 14:02:06 CEST, end at Tue 2017-10-24 19:10:39 CEST. --
Oct 24 15:17:01 delisv0120 CRON[3813]: pam_unix(cron:session): session closed for user root
Oct 24 15:17:56 delisv0120 sshd[3896]: Connection closed by 172.20.73.70 port 60481 [preauth]
Oct 24 15:22:24 delisv0120 smartd[1595]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 76 to 77
Oct 24 15:22:24 delisv0120 smartd[1595]: Device: /dev/sdc [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 76 to 77
Oct 24 15:22:56 delisv0120 sshd[4334]: Connection closed by 172.20.73.70 port 60601 [preauth]
Oct 24 15:25:01 delisv0120 CRON[4516]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 24 15:25:01 delisv0120 CRON[4517]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 24 15:25:01 delisv0120 CRON[4516]: pam_unix(cron:session): session closed for user root
Oct 24 15:27:57 delisv0120 sshd[4773]: Connection closed by 172.20.73.70 port 60641 [preauth]
Oct 24 15:32:56 delisv0120 sshd[5210]: Connection closed by 172.20.73.70 port 60761 [preauth]
Oct 24 15:35:01 delisv0120 CRON[5395]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 24 15:35:01 delisv0120 CRON[5396]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 24 15:35:01 delisv0120 CRON[5395]: pam_unix(cron:session): session closed for user root
Oct 24 15:37:56 delisv0120 sshd[5651]: Connection closed by 172.20.73.70 port 60800 [preauth]
Oct 24 15:42:56 delisv0120 sshd[6089]: Connection closed by 172.20.73.70 port 60921 [preauth]
Oct 24 15:45:01 delisv0120 CRON[6272]: pam_unix(cron:session): session opened for user root by (uid=0)
Oct 24 15:45:01 delisv0120 CRON[6273]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Oct 24 15:45:01 delisv0120 CRON[6272]: pam_unix(cron:session): session closed for user root
Oct 24 15:47:56 delisv0120 sshd[6527]: Connection closed by 172.20.73.70 port 60961 [preauth]
Oct 24 15:52:56 delisv0120 sshd[6964]: Connection closed by 172.20.73.70 port 32848 [preauth]

The system seems to run more stably after switching from NVIDIA driver version 384 to 375. So far there have been no crashes since the switch, so it might be an issue with the most recent drivers.