GTX 1080 crashes in Windows 10; a reboot does not recover it, a full power-off is required

I am not sure whether a Windows application is putting the graphics card into a bad state.
When I play some recent 3D games on Windows, the screen turns off and the graphics card's fan runs loudly (at full speed). I have to use the remote console (an SSH server on Windows 10) to restart the system.
I don't have any problem at boot time; the screen and the fan are normal. But the problem appears when the X server starts.
A reboot won't fix it; I have to power off and then do a cold start. Even rebooting from Linux won't work. I have to issue a poweroff (the power adapter itself is not cut off).
Here is the bug report archive:

https://drive.google.com/open?id=1zOGiho0s9I3DtUEfY5ptxuMfio50abY2

Hi,

Have you tried a clean driver install via the driver installer? Custom install with clean install checked?

Also what is the wattage on your power supply?

-Josh

Yes, I have tried uninstalling and doing a clean install again. It doesn't help.
My power supply is rated at 650 W, with a 700 W maximum.

I found a way to reproduce the problem in Linux by running CUDA. I get the following messages:
[50100.142935] NVRM: GPU at PCI:0000:02:00: GPU-0f624448-93a1-9681-1224-3fe93d7e42f1
[50100.142947] NVRM: GPU Board Serial Number:
[50100.142953] NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
[50100.142958] NVRM: GPU at 0000:02:00.0 has fallen off the bus.
[50100.142961] NVRM: GPU is on Board .
[50100.142976] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

I have uploaded the full nvidia-bug-report.sh output here: https://drive.google.com/open?id=18_jEHe9tF-c9AQ2AS_derw5oW1qX8Bjr
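
For anyone who wants to check their own kernel log for the same error, here is a minimal Python sketch. It assumes the Xid line looks like the one quoted above (timestamp, "NVRM: Xid", PCI address, Xid number, message); other driver versions may format it differently. The names XID_RE and find_xid_events are made up for this example and are not part of any NVIDIA tool.

```python
import re

# Pattern assumed from the log line quoted in this thread, e.g.
# "[50100.142953] NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus."
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<xid>\d+), (?P<msg>.*)"
)


def find_xid_events(log_text):
    """Return one dict per Xid line found in the given kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_RE.search(line)
        if match:
            events.append({
                "timestamp": float(match.group("ts")),  # seconds since boot
                "pci": match.group("pci"),
                "xid": int(match.group("xid")),
                "message": match.group("msg").strip(),
            })
    return events
```

To use it, save the output of dmesg to a file (this may need root on some systems), read the file into a string, and pass that string to find_xid_events. Each returned entry carries the Xid number, so an Xid 79 like the one above is easy to filter for.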

Hi ayaka, which CUDA app are you running? Which CUDA SDK version are you using? How long does it take to reproduce this issue? What is the GPU temperature when you see the Xid 79 error in dmesg? You can check it with nvidia-smi or nvidia-settings on Linux.

It looks like you have the issue on both Linux and Windows. Reading through the thread, this looks like a GPU hardware, system power supply, or thermal issue. Do you have another GPU of the same model that you can test with? Also, I think you are using a Supermicro X10DAL-i motherboard. Does the issue reproduce on any other system with the same GPU?

After the hardware crashed, I tried to run nvidia-smi, but it reported that the hardware was not available.
The CUDA version is 9.1, with the matching cuDNN version. The application I run can be found here:
https://github.com/BoyuanJiang/Age-Gender-Estimate-TF
It fails while training a model from tfrecords.
The only other card I have is a Quadro FX 380.

Hi ayaka, before running your app, you can start nvidia-smi -l in a loop in another terminal to check the temperature. Also, we have never used the Age-Gender-Estimate-TF app. It would be good if you could provide detailed step-by-step instructions (compile, build, use case, model used) so we can reproduce the same issue in-house and investigate further.
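
Since the GPU drops off the bus and nvidia-smi stops answering afterwards, it can help to write the readings to a file so the last values before the crash survive. Below is a rough Python sketch of that idea. It assumes nvidia-smi is on the PATH and uses its --query-gpu and --format=csv,noheader options; the function names, the choice of fields and the log path are only for illustration.

```python
import csv
import io
import os
import subprocess
import time

# Fields requested from nvidia-smi --query-gpu (chosen for this example).
QUERY_FIELDS = ["timestamp", "temperature.gpu", "power.draw", "fan.speed"]


def parse_smi_csv(text):
    """Parse 'nvidia-smi --format=csv,noheader' output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) == len(QUERY_FIELDS):  # skips blank or odd lines
            rows.append(dict(zip(QUERY_FIELDS, record)))
    return rows


def sample_gpu():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_smi_csv(result.stdout)


def log_until_failure(path, interval_s=1.0):
    """Append readings to 'path' until nvidia-smi fails or is interrupted."""
    with open(path, "a") as log_file:
        while True:
            for row in sample_gpu():  # raises CalledProcessError on failure
                log_file.write(", ".join(row[f] for f in QUERY_FIELDS) + "\n")
            log_file.flush()
            os.fsync(log_file.fileno())  # push data to disk before a crash
            time.sleep(interval_s)
```

Calling log_until_failure("gpu_log.csv") in one terminal before starting the training run should leave the last temperature, power and fan readings in the file when the Xid 79 error occurs. Running nvidia-smi directly with the same --query-gpu fields plus -l 1 gives similar output without the Python wrapper.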

Our engineers think this sounds more likely to be a hardware problem (either a hardware defect or a configuration problem, such as an insufficient PSU) than a software problem. Please contact the GPU vendor for hardware support and test with the same model of GPU.

Hi ayaka, Is this issue resolved for you?

I have bought a new power adapter along with a UPS. It doesn't solve the problem.
I am still in contact with the vendor.

Thanks ayaka. Who is your GPU vendor? Please keep us posted.

ASUS, a computer vendor from Taiwan (Republic of China).

Are you running the NVIDIA sound drivers alongside the Realtek sound drivers? If so, remove the Realtek driver so they can't clash. It is a weird issue, but sometimes that resolves the crashing.

By sound driver I mean audio driver :)