I am not sure whether a Windows application is putting the graphics card into a bad state.
When I play some recent 3D games on Windows, the screen turns off and the graphics card fan runs loudly (at full speed). I have to use a remote console (the SSH server on Windows 10) to restart the system.
I don’t see any problem at boot time; the screen and the fan are normal. The problem appears when the X server starts.
A reboot doesn’t help, even from Linux. I have to issue a poweroff and then cold-start the machine (without disconnecting the power adapter).
Here is the bug report archive:
I found a way to reproduce the problem in Linux by running a CUDA workload. I get the following messages:
[50100.142935] NVRM: GPU at PCI:0000:02:00: GPU-0f624448-93a1-9681-1224-3fe93d7e42f1
[50100.142947] NVRM: GPU Board Serial Number:
[50100.142953] NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
[50100.142958] NVRM: GPU at 0000:02:00.0 has fallen off the bus.
[50100.142961] NVRM: GPU is on Board .
[50100.142976] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
I have uploaded the full nvidia-bug-report output here: https://drive.google.com/open?id=18_jEHe9tF-c9AQ2AS_derw5oW1qX8Bjr
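For anyone who wants to check their own kernel log for the same event, here is a minimal Python sketch that pulls Xid reports out of dmesg-style text. The regular expression is only inferred from the log lines quoted above, and the names `parse_xid`, `find_xid_events`, and `XidEvent` are made up for this example; other driver versions may format these lines differently.

```python
import re
from typing import List, NamedTuple, Optional

# Pattern inferred from the line quoted in this thread, e.g.
#   [50100.142953] NVRM: Xid (PCI:0000:02:00): 79, GPU has fallen off the bus.
# It is an assumption, not an official format description.
XID_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+),\s*(?P<msg>.*)$"
)


class XidEvent(NamedTuple):
    timestamp: float  # seconds since boot, as printed by the kernel
    pci: str          # PCI address string as it appears in the log
    code: int         # Xid number, e.g. 79
    message: str      # free-text description following the code


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for an 'NVRM: Xid' line, or None for any other line."""
    m = XID_RE.match(line.strip())
    if m is None:
        return None
    return XidEvent(float(m["ts"]), m["pci"], int(m["code"]), m["msg"])


def find_xid_events(log_text: str) -> List[XidEvent]:
    """Collect every Xid event found in a block of kernel log text."""
    events = []
    for line in log_text.splitlines():
        event = parse_xid(line)
        if event is not None:
            events.append(event)
    return events
```

One way to use it is to save the output of dmesg to a file, read the file into a string, and pass that string to `find_xid_events`. Each returned event carries the timestamp, PCI address, and Xid code, which makes it easier to line the crash up with temperature or power readings taken at the same time.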
Hi ayaka, which CUDA app are you running? Which CUDA SDK version are you using? How long does it take to reproduce this issue? What is the GPU temperature when you see the Xid 79 error in dmesg? You can check it with nvidia-smi or nvidia-settings on Linux.
It looks like you have the issue on both Linux and Windows. Reading through the thread, this looks like a GPU hardware, system power supply, or thermal issue. Do you have another GPU of the same model to test with? Also, I think you are using a Supermicro X10DAL-i motherboard. Does the issue reproduce on any other system with the same GPU?
After the hardware crashed, I tried to run nvidia-smi, but it reported that the hardware is not available.
The CUDA version is 9.1, with the matching cuDNN version. The application I run can be found here: https://github.com/BoyuanJiang/Age-Gender-Estimate-TF
It fails while training a model from TFRecords.
The only other card I have is a Quadro FX 380.
Hi ayaka, before running your app, you can start nvidia-smi -l in another terminal to check the temperature in a loop. Also, we have never used the Age-Gender-Estimate-TF app, so it would be good if you could provide detailed step-by-step instructions (compile, build, use case, model used) so that we can reproduce the same issue in-house and investigate further.
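Since the machine becomes unusable once the GPU drops off the bus, it can help to write the readings to disk instead of only watching them in a terminal. Below is a rough Python sketch of that idea. It assumes nvidia-smi is on the PATH and that the driver supports the `--query-gpu` fields listed in `FIELDS`; the function names and the log file path are hypothetical, and only the first GPU reported is logged.

```python
import os
import subprocess
import time
from typing import Dict, Optional

# Fields requested from nvidia-smi; unsupported ones may come back as "[N/A]".
FIELDS = ["timestamp", "temperature.gpu", "power.draw", "fan.speed"]


def parse_sample(csv_line: str) -> Dict[str, str]:
    """Split one 'csv,noheader,nounits' line into a field-name -> value mapping."""
    values = [v.strip() for v in csv_line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("unexpected nvidia-smi output: %r" % csv_line)
    return dict(zip(FIELDS, values))


def query_gpu() -> Optional[Dict[str, str]]:
    """Run nvidia-smi once; return None if it fails, hangs, or prints nothing."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=10, check=True
        )
    except (OSError, subprocess.SubprocessError):
        return None
    lines = result.stdout.strip().splitlines()
    return parse_sample(lines[0]) if lines else None


def log_until_failure(path: str, interval_s: float = 1.0) -> None:
    """Append one reading per interval until nvidia-smi stops answering."""
    with open(path, "a") as log:
        while True:
            sample = query_gpu()
            if sample is None:
                log.write("nvidia-smi failed; the GPU may no longer be available\n")
                log.flush()
                os.fsync(log.fileno())
                break
            log.write(",".join(sample[name] for name in FIELDS) + "\n")
            # Force each line to disk so the last readings survive a hard hang.
            log.flush()
            os.fsync(log.fileno())
            time.sleep(interval_s)
```

To try it, call `log_until_failure("gpu_log.csv")` in one terminal and start the training job in another. The last lines of the file then show the temperature, power draw, and fan speed just before the crash, which should help tell a thermal problem apart from a power supply problem.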
Our engineers think this sounds more likely to be a hardware problem (either a hardware defect or a configuration problem, such as an insufficient PSU) than a software problem. Please contact the GPU vendor for hardware support and test with another GPU of the same model.
Are you running the NVIDIA sound drivers alongside the Realtek sound drivers? If so, remove the Realtek driver so they can’t clash. It is a weird issue, but sometimes this resolves the crashing.