nvidia-smi periodically crashes system on Ubuntu 16.04 LTS

I. Description of issue

I recently took delivery of a new GPU server, which has been crashing periodically since I installed the Nvidia drivers. After a few days of frustration, I’ve found that the problem crops up when I run nvidia-smi, but only intermittently. For instance, simply running

nvidia-smi

can sometimes cause a crash, and

nvidia-smi -l 1

is pretty much guaranteed to cause a crash within an hour or so. I’ve tested with two different drivers (installed via apt-get install nvidia-XXX), both of which exhibit this issue:

  • 367.44
  • 370.23
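
If it helps anyone reproduce this, a loop along these lines triggers the crash for me while keeping a timestamped log on disk (the log path is just an example):

#!/bin/bash
# Query the GPUs once a second and log each pass, so the last successful
# query before the machine goes down is preserved on disk.
LOG=/tmp/nvidia-smi-repro.log        # example path, adjust as needed
while true; do
    date >> "$LOG"
    nvidia-smi >> "$LOG" 2>&1
    sync                             # flush the log in case of a hard crash
    sleep 1
done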

In case they are useful, details about the software and hardware, as well as the output of nvidia-bug-report.sh, follow.

II. Software details

  • OS: Ubuntu 16.04 LTS
  • Driver version: 367.44
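
For what it’s worth, the reported driver version can be double-checked against the loaded kernel module with something like:

cat /proc/driver/nvidia/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader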

III. Hardware details

  • CPU: 2 x Intel Xeon E5-2680 v4
  • Motherboard: SuperMicro X10DRG-O+-CPU
  • GPU: 6 x Nvidia GTX 1080
  • Memory: 24 x 16GB DDR4-2400 ECC operating at 1600 MT/s

IV. Log file

nvidia-bug-report.sh was run upon reboot after the most recent crash and the output is available at [url]https://drive.google.com/file/d/0B2tuXP9BWQtnNXBoZ3JGZkZBejA/view?usp=sharing[/url].
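
If anyone wants the short version without downloading the full report, the relevant kernel messages can usually be pulled out with something like the following (the log paths are the Ubuntu 16.04 defaults and may differ on other setups):

# Look for NVRM / Xid messages left by the driver around the time of the crash.
grep -i "NVRM" /var/log/kern.log | tail -n 50
grep -i "Xid" /var/log/kern.log /var/log/kern.log.1 | tail -n 50   # include the rotated log
# Regenerate the full report after a reboot; it writes nvidia-bug-report.log.gz
# in the current directory.
sudo nvidia-bug-report.sh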

The issue is not restricted to nvidia-smi, but appears to be more general. After purging all Nvidia drivers from the system, I reinstalled 370.23 and the latest version of CUDA (8.0.27), then compiled the CUDA examples. Both the driver and CUDA were installed via the .run scripts available on NVIDIA’s website and not via apt-get. I then looped a subset of the compiled CUDA examples one after another until the system crashed. So far, I’ve observed crashes with the deviceQuery, matrixMulCUBLAS, and bandwidthTest examples; an example may run once, twice, or even 50 times without a problem, but at some point the system goes down. Moreover, it doesn’t seem to matter which GPU I execute the code on.
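
For reference, a script roughly like the following is enough to trigger the crash; the samples path is an example and depends on where the toolkit samples were built:

#!/bin/bash
# Run a few compiled CUDA samples in a loop until one of them fails,
# cycling through the six GPUs via CUDA_VISIBLE_DEVICES.
SAMPLES=$HOME/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release   # example path
i=0
while true; do
    gpu=$((i % 6))
    for test in deviceQuery matrixMulCUBLAS bandwidthTest; do
        echo "pass $i, GPU $gpu, $test"
        CUDA_VISIBLE_DEVICES=$gpu "$SAMPLES/$test" > /dev/null || exit 1
    done
    i=$((i + 1))
done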

I have the same issue. Have you found a solution?