nvidia-smi suddenly loses one of three cards

I have three CUDA cards in my Ubuntu 14.04 machine. The three cards run well most of the time,
but sometimes nvidia-smi loses one of them and shows only two.
This has happened many times; every time, restarting Ubuntu solves the problem.
Each of the three cards has been the one to disappear, so it is not related to a particular slot or power connector.
I want to know how to solve the problem for good, because restarting always interrupts my work.

lspci shows three cards:
lspci | grep NVIDIA
03:00.0 VGA compatible controller: NVIDIA Corporation Device 1b81 (rev a1)
03:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation Device 1b06 (rev a1)
05:00.1 Audio device: NVIDIA Corporation Device 10ef (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation Device 1b81 (rev a1)
06:00.1 Audio device: NVIDIA Corporation Device 10f0 (rev a1)

but nvidia-smi only shows two:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13                 Driver Version: 378.13                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:03:00.0     Off |                  N/A |
| 45%   34C    P8     7W / 160W |    169MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 0000:06:00.0     Off |                  N/A |
|  0%   35C    P0    33W / 160W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3627    C   python                                          81MiB |
+-----------------------------------------------------------------------------+

If this happens under load, it might be an airflow problem: the card overheats and shuts down. Monitor the temperatures and power draw while it's working, up to the point where it shuts down.
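A minimal way to log this, assuming a reasonably recent nvidia-smi (the field list is just an example; see nvidia-smi --help-query-gpu for the full set):

# log temperature, power draw and utilization of all GPUs every 5 seconds
nvidia-smi --query-gpu=timestamp,index,name,temperature.gpu,power.draw,utilization.gpu --format=csv -l 5 >> gpu-monitor.csv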

This always happens while the cards are idle, so it is not related to overheating.
While a card is working, it never shuts down.

Please run nvidia-bug-report.sh when this happens and attach the output file to your post.
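For reference, it is normally run as root and writes a compressed log into the current directory:

# run as root; produces nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh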

How do I download nvidia-bug-report.sh?

It comes bundled with the driver, so it should already be installed. The icon for attaching files is next to the edit button on your post.
Do you have the persistence daemon started? If not, the GPUs will go offline when idle.
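For example (exact setup depends on the driver packaging, so treat this as a sketch):

# legacy method: enable persistence mode on all GPUs
sudo nvidia-smi -pm 1
# or start the persistence daemon instead
sudo nvidia-persistenced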

Just now it happened again!
This is the log content.

In the log I see:

[184611.868266] NVRM: RmInitAdapter failed! (0x26:0x65:1097)
[184611.868293] NVRM: rm_init_adapter failed for device bearing minor number 0
[184669.238992] NVRM: RmInitAdapter failed! (0x24:0x65:1060)
[184669.239024] NVRM: rm_init_adapter failed for device bearing minor number 0
[184725.003731] NVRM: RmInitAdapter failed! (0x24:0x65:1060)
[184725.003821] NVRM: rm_init_adapter failed for device bearing minor number 0
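(These are kernel messages, so they can also be pulled straight from dmesg with something like the following.)

# look for NVIDIA driver init failures and Xid errors in the kernel log
dmesg | grep -iE "NVRM|Xid"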

If the same GPU is hitting this issue again and again, it may be a faulty GPU. Please remove or replace that GPU and check.

Are you running any task on these GPUs, and does that task hit this issue? Did an earlier driver work on your system without this issue? Any recent changes to the system? Also, please test with the latest NVIDIA driver from Unix Drivers | NVIDIA.
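For reference, the currently loaded driver version can be checked before upgrading with, for example:

# show the currently loaded NVIDIA kernel driver version
cat /proc/driver/nvidia/version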

Also check why you are getting the errors below in dmesg:

[ 0.290379] pci 0000:09:01.0: BAR 14: no space for [mem size 0x00200000]
[ 0.290380] pci 0000:09:01.0: BAR 14: failed to assign [mem size 0x00200000]
[ 0.290381] pci 0000:09:01.0: BAR 15: assigned [mem 0x5d000000-0x5d1fffff 64bit pref]
[ 0.290381] pci 0000:09:04.0: BAR 14: no space for [mem size 0x00200000]
[ 0.290382] pci 0000:09:04.0: BAR 14: failed to assign [mem size 0x00200000]
[ 0.290383] pci 0000:09:04.0: BAR 15: assigned [mem 0x5d200000-0x5d3fffff 64bit pref]
[ 0.290384] pci 0000:09:01.0: BAR 13: assigned [io 0x2000-0x2fff]
[ 0.290385] pci 0000:09:04.0: BAR 13: assigned [io 0x3000-0x3fff]
[ 0.290386] pci 0000:09:04.0: BAR 14: no space for [mem size 0x00200000]
[ 0.290387] pci 0000:09:04.0: BAR 14: failed to assign [mem size 0x00200000]
[ 0.290388] pci 0000:09:01.0: BAR 14: no space for [mem size 0x00200000]
[ 0.290388] pci 0000:09:01.0: BAR 14: failed to assign [mem size 0x00200000]
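To see what sits behind those bridges and how their resources ended up assigned, something like this may help (09:01.0 and 09:04.0 are the addresses from the dmesg output above):

# show the PCI topology and the two bridges from the dmesg output
lspci -tv
sudo lspci -vv -s 09:01.0
sudo lspci -vv -s 09:04.0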

This issue happens on all of the cards, not just one in particular, so there is no single faulty GPU.
I always run long training tasks on these cards. A card never hits this issue while it is running;
the issue always happens on the other, idle card.

I have not changed anything on the system recently. I can try the newest driver,
but as far as I remember this problem is not tied to a specific driver version.

I have no idea about the errors in dmesg right now. I will try to find out what causes them.

Again! The issue hit me again!
One of my 1070 cards is missing, and I have to restart the system again.
The log content:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 378.13                 Driver Version: 378.13                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Graphics Device     Off  | 0000:05:00.0     Off |                  N/A |
| 56%   85C    P2   210W / 250W |  10745MiB / 11172MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1070    Off  | 0000:06:00.0     Off |                  N/A |
|  0%   40C    P0    33W / 160W |      0MiB /  8114MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     10137    C   python                                       10733MiB |
+-----------------------------------------------------------------------------+

Please connect first one card and then two cards in the system, and check whether the issue still reproduces. Does the issue reproduce with only one card? Does it reproduce with only two cards?

>>I always run long training tasks on these cards.
May I know how you run this task so we can reproduce the issue here internally and investigate? Please share code or a sample application that can trigger the issue, and let me know which apps or packages are needed on the OS.

I am quite sure this happened with two cards, because I ran with two cards for some time.
But I am not sure about the situation with one card.
As far as I remember, it never appeared during the first months, when I had only one card.

Sorry, I have no time to test these two cases right now, because I have important work running on the cards.

>>I always run long training tasks on these cards.
>>May I know how you run this task so we can reproduce the issue here internally and investigate? Please share code or a sample application that can trigger the issue, and let me know which apps or packages are needed on the OS.

Sorry, although I have hit this issue many times, I have no idea about the trigger condition.
It just appears like magic, without any sign at all.

From now on, I will try to pay more attention to the trigger condition,
and I will update this topic when I find useful information.

Is the PSU OK?

Hi,

I have the same problem. I have 4x Zotac NVIDIA GTX 1060 3GB. I can run multiple heavy tasks (mining, video encoding/recoding) on the cards for months without any problem. But when I stop the tasks for reorganising/updating, I 'lose' 1 or 2 cards.
lspci shows all of them, but nvidia-smi doesn't.

The temps and the PSU are okay.
All cards together under load only use 280W. The system has 550W on its 12V rail, and the cards never 'get lost' while each of them is working at its 70W. They only disappear when the load goes away.

After a few minutes the cards come back online again.

Here is my log. It was captured a little after the problem (all cards were back by then), but maybe you can see something in it.
ftp://nvidia:nvidia20171214@78.94.151.94/nvidia-bug-report.log.gz

Just a wild guess: see if the kernel parameter
pcie_port_pm=off
helps.
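On Ubuntu this would typically go onto the kernel command line via GRUB, roughly like this (the "quiet splash" part is only the usual default; keep whatever is already there):

# edit /etc/default/grub and append the parameter, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_port_pm=off"
# then regenerate the GRUB config and reboot:
sudo update-grub
sudo reboot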

I have met this issue too. First, I hit it after running a CNN model on the GPUs. After that, I was just running the "watch -n 1 nvidia-smi" command line to monitor the devices, and I hit the issue again.