GPU is lost, all GPU card fans on, 1080 Ti, Ubuntu 16.04.
Hello,

I am using a GTX 1080 Ti on Ubuntu 16.04.3.

The GPU driver/subsystem sometimes crashes on its own (i.e. when I just leave the machine running), and sometimes while I am running TensorFlow code.

When this happens, nvidia-smi reports something like the following, and all the GPU card fans start running at high speed.

Unable to determine the device handle for GPU 0000:01:00.0: GPU is lost.  Reboot the system to recover this GPU
...
[ 211.957262] NVRM: GPU at PCI:0000:01:00: GPU-f0a4ec3b-aa15-2398-6fe2-ea529751b19d
[ 211.957272] NVRM: GPU Board Serial Number:
[ 211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 211.957282] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 211.957285] NVRM: GPU is on Board .
[ 211.957298] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
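
For anyone who wants to check how often this happens, here is a minimal sketch that scans kernel log text for NVRM Xid lines like the ones above. The function name `find_xid_events` and the regular expression are illustrative assumptions based only on the log format shown in this post; other driver versions may format these lines differently.

```python
import re

# Assumed line format, taken from the log excerpt in this post:
# [ 211.957278] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NVRM: Xid "
    r"\((?P<device>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+), (?P<message>.*)"
)


def find_xid_events(log_text):
    """Return one dict per NVRM Xid line found in the given log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append({
                "time": float(match.group("time")),   # seconds since boot
                "device": match.group("device"),      # PCI address of the GPU
                "code": int(match.group("code")),     # Xid number, e.g. 79
                "message": match.group("message").strip(),
            })
    return events


if __name__ == "__main__":
    sample = (
        "[ 211.957272] NVRM: GPU Board Serial Number:\n"
        "[ 211.957278] NVRM: Xid (PCI:0000:01:00): 79, "
        "GPU has fallen off the bus.\n"
    )
    for event in find_xid_events(sample):
        print(event)
```

The text passed in could come from saved `dmesg` output or from a kernel log file; how you collect it is up to your setup.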


System: Ubuntu 16.04.3 LTS, x86_64 (4.4.0-87-generic)
Hardware: GTX 1080 Ti
Driver: 384.98

Please see attached, nvidia-bug-report.log.gz.

Thank you in advance for your help.

#1
Posted 12/24/2017 07:37 PM   
As always:

  • Check/replace your PSU
  • Update your system BIOS
  • Reset your BIOS settings
  • Remove/disable any overclocking
  • Reseat your GPU in a PCI-E slot

Also check whether your problem is reproducible under Windows (a Windows 10 trial can easily be downloaded).

Artem S. Tashkinov
Linux and Open Source advocate

#2
Posted 12/27/2017 07:22 AM   
Thank you for the suggestions. After reseating the GPU card and unplugging and replugging the PSU cables, everything seems to be working again.

#3
Posted 01/03/2018 03:13 PM   