We’ve hit an issue on multiple identical systems recently.
Configuration is as follows:
SuperMicro SYS-2028GR-TRHT
Intel E5-2699V4 CPUs x2
Hynix HMA84GR7AFR4N-UH 32GB RAM x8 (256GB total)
2x Quadro P6000 installed in rear ports
Samsung 850/860 Pro SSDs.
RHEL 7.4
At least 6 of these systems are doing the same thing.
When the GPUs are under load, the system freezes for a moment, then eventually resets.
Nothing we can find in any of the logs indicates what happened.
We’ve made sure SSD firmware is up to date, as well as NVIDIA drivers and system BIOS/firmware.
We ran memtest on 3 of them for 30+ hours and all passed.
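Since the logs show nothing before the reset, one thing that might help is recording GPU power, temperature and clocks to disk while the load runs, so the last samples before the freeze survive. Below is a minimal sketch of that idea, assuming Python 3 is available on the box. The names `parse_sample` and `log_samples` and the log file name are made up for illustration; the `nvidia-smi` query options are the standard `--query-gpu` / `--format=csv` ones.

```python
import os
import subprocess

# Fields requested from nvidia-smi; all are standard --query-gpu properties.
FIELDS = ["timestamp", "index", "power.draw",
          "temperature.gpu", "clocks.sm", "utilization.gpu"]


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected sample: %r" % line)
    return dict(zip(FIELDS, values))


def log_samples(path="gpu_samples.log", interval_s=1):
    """Append one line per GPU per interval and force each line to disk.

    The fsync after every write is the point of the exercise: without it,
    the last few samples would likely be lost in the page cache on a reset.
    """
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits",
           "-l", str(interval_s)]
    with open(path, "a") as out:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                universal_newlines=True)
        for line in proc.stdout:
            out.write(line)
            out.flush()
            os.fsync(out.fileno())
```

Calling `log_samples()` in a separate terminal before starting the GPU load, then reading the tail of `gpu_samples.log` after the reset, should show whether power draw or temperature was climbing right before the freeze. `parse_sample` is only there to make the lines easier to compare across machines.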
nvidia-bug-report(4).log.gz (146 KB)