I have several servers (Dell T630), each with 4 Tesla P40s, running Red Hat 7.4 with kernel 3.10 and NVIDIA drivers 384.81 or 384.125. One of the servers behaves abnormally after I install all the drivers and tools: its power usage is much higher than the other servers'. Where the others idle in the 9~11W range, this one always sits at 49W with nothing running on the GPUs.
# nvidia-smi
Tue Apr 3 21:06:08 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.125                Driver Version: 384.125                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:02:00.0 Off |                    0 |
| N/A   20C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:04:00.0 Off |                    0 |
| N/A   22C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:83:00.0 Off |                    0 |
| N/A   18C    P0    49W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:84:00.0 Off |                    0 |
| N/A   17C    P0    48W / 250W |      0MiB / 22912MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
I also found that the server responds slowly when I type dmesg, and there are warning messages saying:
[ 916.278598] nvidia 0000:02:00.0: irq 212 for MSI/MSI-X
[ 916.280934] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[ 916.280970] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[ 916.280980] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[ 916.280990] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[ 916.280999] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[ 916.281009] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[ 916.281018] ACPI Warning: \_SB_.PCI0.BR3A.NDX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130517/nsarguments-95)
[ 917.071290] nvidia 0000:04:00.0: irq 213 for MSI/MSI-X
[ 917.690861] nvidia 0000:83:00.0: irq 214 for MSI/MSI-X
[ 918.330270] nvidia 0000:84:00.0: irq 215 for MSI/MSI-X
If you're running nothing on the GPUs and not using the persistence daemon, the driver will de-initialize them. If you then run nvidia-smi, it initializes the GPUs again, which puts them into the P0 state at start.
With driver persistence, the GPUs stay initialized, so when you run nvidia-smi they stay idle.
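For reference, a minimal sketch of enabling persistence, assuming the driver package installed the nvidia-persistenced service unit (the exact service name can vary by install method):
# nvidia-smi -pm 1
or, preferably on newer drivers, run the daemon instead:
# systemctl enable --now nvidia-persistenced
You can then verify per-GPU persistence state, P-state and power draw with:
# nvidia-smi --query-gpu=index,persistence_mode,pstate,power.draw --format=csv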
This server is the only one that has the latency problem; the others, running cuda-8.0 with 384.66 or cuda-9.0 with 384.81, work well. As you can see, the other machines sit in P8 without persistence mode, and the Idle state is active.
In addition, before installing cuda-9.0 with 384.66, I had installed cuda-9.1 using yum and found it could not work with tensorflow 1.7. I then uninstalled it with yum history undo. Could this strange problem be related to that? Do I need to flush the driver, xorg, or anything else manually?
I can only tell you that this is the expected behaviour and the reason for the persistence mode/daemon. Without logs from the other systems, I can't tell why they are behaving differently.
For TF with CUDA 9.1, the minimum required driver version is 387, and you'll have to build TF from source, since the prebuilt binary is built against CUDA 9.0.
Edit: build instructions [url]https://medium.com/@xinh3ng/install-cuda-9-1-and-cudnn-7-for-tensorflow-1-5-0-cda36239bc68[/url]
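Roughly, the source build looks like this (a sketch only; the branch and output paths here are assumptions, follow the linked guide for the exact configure answers):
$ git clone https://github.com/tensorflow/tensorflow
$ cd tensorflow && git checkout r1.7
$ ./configure    (answer yes to CUDA support, point it at CUDA 9.1 and cuDNN 7)
$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
$ bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
$ pip install /tmp/tensorflow_pkg/tensorflow-*.whl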