GPU is lost, all GPU card fans on, 1080 Ti, Ubuntu 16.04.

Hello,

I am using a GTX 1080 Ti on Ubuntu 16.04.3.

The GPU driver/subsystem sometimes crashes on its own (i.e. while the machine is just left running), and sometimes while I am running TensorFlow code.

When this happens, all the GPU card fans start running at high speed, and nvidia-smi and the kernel log report something like this:

Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost.  Reboot the system to recover this GPU
...
[  211.957262] NVRM: GPU at PCI:0000:01:00: GPU-f0a4ec3b-aa15-2398-6fe2-ea529751b19d
[  211.957272] NVRM: GPU Board Serial Number:
[  211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[  211.957282] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[  211.957285] NVRM: GPU is on Board .
[  211.957298] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.
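
For anyone who wants to check their own kernel log for the same symptom, here is a minimal Python sketch that pulls Xid events out of dmesg-style text. It only assumes the line layout shown in the log above; the names XID_RE and find_xid_events are made up for illustration and are not part of any NVIDIA tool.

```python
import re

# Assumption: Xid lines look like the one quoted in this post, e.g.
# [  211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<dev>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<msg>.*)"
)


def find_xid_events(log_text):
    """Return (timestamp, device, xid_code, message) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append((
                float(match.group("ts")),
                match.group("dev"),
                int(match.group("code")),
                match.group("msg").strip(),
            ))
    return events


if __name__ == "__main__":
    # Sample lines copied from the log in this post.
    sample = (
        "[  211.957272] NVRM: GPU Board Serial Number:\n"
        "[  211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.\n"
    )
    for event in find_xid_events(sample):
        print(event)
```

To use it on a live system, you could feed it the text of `dmesg` instead of the embedded sample.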

System: Ubuntu 16.04.3 LTS, x86_64 (4.4.0-87-generic)
Hardware: GTX 1080 Ti
Driver: 384.98

Please see the attached nvidia-bug-report.log.gz.

Thank you in advance for your help.

nvidia-bug-report.log.gz (275 KB)

As always:

  • Check/replace your PSU
  • Update your system BIOS
  • Reset your BIOS settings
  • Remove/disable any overclocking
  • Reseat your GPU in a PCI-E slot

Also check whether your problem is reproducible under Windows (a Windows 10 trial can easily be downloaded).

Thank you for the suggestions. After reseating the GPU card and unplugging and replugging the PSU cables, everything seems to be working again.
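
In case the problem comes back, a small watcher can make it easier to notice. The following is only a rough sketch: it runs nvidia-smi and looks for the "GPU is lost" text quoted in the first post. Whether that message goes to stdout or stderr is an assumption here, so both streams are captured together; gpu_is_lost and check_once are hypothetical helper names.

```python
import subprocess

# Text taken from the nvidia-smi error quoted earlier in this thread.
LOST_MARKER = "GPU is lost"


def gpu_is_lost(smi_output):
    """Return True if the given nvidia-smi output contains the lost-GPU message."""
    return LOST_MARKER in smi_output


def check_once(timeout_s=30):
    """Run nvidia-smi once.

    Returns True if the GPU is reported lost, False if not,
    and None if nvidia-smi is missing or did not finish in time.
    """
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # assumption: merge both streams
            universal_newlines=True,
            timeout=timeout_s,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return None
    return gpu_is_lost(result.stdout)
```

Calling check_once() from a cron job or a simple loop and logging the result with a timestamp would show whether the failures line up with load, idle periods, or something else.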