I have an HPE DL580 Gen8 server with four Quadro RTX 6000 cards, and it frequently powers off with a hardware power fault.
The message is:
System Power Fault Detected (XR: 10 A2 MID: FF 0F F0 00 00…
This is apparently an emergency protection shutdown.
The system has 6000 watts of power supply capacity (four 1500 watt supplies), and the cards are fed with the standard HPE 8+6 cables from the power distribution panel. It has 3 TiB of RAM and four 18C/36T Xeons. The system is about three years old, and it has been fine with four GTX 1080 Ti cards, and with Maxwell Titans before that.
The OS is Ubuntu Server 18.04 LTS, and all Ubuntu and HPE patches are applied.
Looking at 'nvidia-smi -q -d POWER', I see samples like these (note the Max sample values, which are far above the 260 W power limit):
==============NVSMI LOG==============

Timestamp                           : Tue Mar 26 15:45:12 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 4
GPU 00000000:41:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 252.35 W
        Power Limit                 : 260.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 260.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
        Power Samples
            Duration                : 2.37 sec
            Number of Samples       : 119
            Max                     : 392.15 W
            Min                     : 62.49 W
            Avg                     : 144.21 W

GPU 00000000:81:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 270.69 W
        Power Limit                 : 260.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 260.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
        Power Samples
            Duration                : 2.38 sec
            Number of Samples       : 119
            Max                     : 362.39 W
            Min                     : 60.93 W
            Avg                     : 132.90 W
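To make the gap between the configured limit and the sampled peaks easier to see, here is a minimal sketch that pulls "Power Limit" and the "Power Samples" Max out of this kind of report for each GPU. The function name `sampled_peaks` and the trimmed `SAMPLE` text are illustrative assumptions; the parsing only relies on the field names visible in the log above.

```python
# Trimmed copy of the report above; only the fields the parser reads are kept.
SAMPLE = """\
Attached GPUs                       : 4
GPU 00000000:41:00.0
    Power Readings
        Power Draw                  : 252.35 W
        Power Limit                 : 260.00 W
        Max Power Limit             : 260.00 W
        Power Samples
            Max                     : 392.15 W
GPU 00000000:81:00.0
    Power Readings
        Power Draw                  : 270.69 W
        Power Limit                 : 260.00 W
        Max Power Limit             : 260.00 W
        Power Samples
            Max                     : 362.39 W
"""


def sampled_peaks(report):
    """Map each GPU bus ID to its 'Power Limit' and sampled 'Max', in watts."""
    gpus = {}
    current = None
    for line in report.splitlines():
        text = line.strip()
        if text.startswith("GPU "):
            # Section header, e.g. "GPU 00000000:41:00.0"
            current = gpus.setdefault(text.split()[1], {})
            continue
        if current is None:
            continue  # still in the report header, before the first GPU
        key, sep, value = text.partition(":")
        key = key.strip()
        # Exact key match, so "Max Power Limit" is not confused with "Max".
        if sep and key in ("Power Limit", "Max"):
            try:
                current[key] = float(value.split()[0])
            except (ValueError, IndexError):
                pass  # non-numeric value such as "N/A"
    return gpus


for bus_id, r in sampled_peaks(SAMPLE).items():
    ratio = r["Max"] / r["Power Limit"]
    print(f"{bus_id}: limit {r['Power Limit']:.2f} W, "
          f"sampled max {r['Max']:.2f} W ({ratio:.2f}x the limit)")
```

On a live system the same function could be fed the text of `nvidia-smi -q -d POWER` captured with `subprocess.run(..., capture_output=True, text=True)`, assuming the report layout matches the one shown here.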
Lowering the power limit to 150 W makes it better, but it only reduces the probability of a shutdown; it doesn't eliminate the problem. The sampled peaks are still well above 300 W:
==============NVSMI LOG==============

Timestamp                           : Sat Mar 23 11:07:43 2019
Driver Version                      : 418.56
CUDA Version                        : 10.1

Attached GPUs                       : 4
GPU 00000000:41:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 90.29 W
        Power Limit                 : 150.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 150.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
        Power Samples
            Duration                : 2.38 sec
            Number of Samples       : 119
            Max                     : 338.29 W
            Min                     : 65.20 W
            Avg                     : 124.28 W

GPU 00000000:81:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 95.68 W
        Power Limit                 : 150.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 150.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 260.00 W
        Power Samples
            Duration                : 2.38 sec
            Number of Samples       : 119
            Max                     : 375.68 W
            Min                     : 62.28 W
            Avg                     : 124.25 W
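For reference, a power limit like the 150 W one in this log is normally set with `nvidia-smi -pl`. Below is a minimal sketch that builds those commands for every card. The helper names `power_limit_commands` and `apply_power_limit`, and the assumption that the four cards are GPU indices 0 to 3, are illustrative; the 100 W and 260 W bounds are the Min and Max Power Limit values reported above. Changing the limit usually needs root.

```python
import subprocess


def power_limit_commands(gpu_indices, watts, min_w=100.0, max_w=260.0):
    """Build nvidia-smi commands that cap each listed GPU at `watts`.

    min_w/max_w default to the Min/Max Power Limit reported in the log;
    they are checked here so an out-of-range value fails before anything runs.
    """
    if not min_w <= watts <= max_w:
        raise ValueError(f"{watts} W is outside the supported {min_w}-{max_w} W range")
    # Persistence mode keeps the driver loaded so the setting is not lost
    # when the last client detaches.
    commands = [["nvidia-smi", "-pm", "1"]]
    for index in gpu_indices:
        commands.append(["nvidia-smi", "-i", str(index), "-pl", f"{watts:g}"])
    return commands


def apply_power_limit(gpu_indices, watts, dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    for cmd in power_limit_commands(gpu_indices, watts):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: only shows what would be executed for GPUs 0..3 at 150 W.
apply_power_limit(range(4), 150)
```

The sketch defaults to a dry run so it can be tried on a machine without the cards installed. As the log shows, the limit governs the averaged draw, and individual samples can still land far above it.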
Any ideas on how to tame the power consumption so that the Quadros don’t frighten the server?