K20 with high utilization, but no compute processes.

I’ve been looking at this issue for the last week or so and cannot determine the answer. We have a cluster (RHEL 6.3) with GPU nodes (some have M2090s and some have K20s). On the K20 nodes, the GPUs show utilization even though there are no processes running on the boards. Using the NVIDIA SMI tool, I get the following output:

+------------------------------------------------------+                       
| NVIDIA-SMI 4.310.32   Driver Version: 310.32         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m               | 0000:2A:00.0     Off |                    0 |
| N/A   29C    P0    47W / 225W |   0%   11MB / 4799MB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20m               | 0000:90:00.0     Off |                    0 |
| N/A   30C    P0    45W / 225W |   0%   11MB / 4799MB |     78%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

For the M2090, we see a similar result, but GPU utilization is 0%:

+------------------------------------------------------+                       
| NVIDIA-SMI 4.310.32   Driver Version: 310.32         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M2090              | 0000:2A:00.0     Off |                    0 |
| N/A   N/A    P0    75W / 225W |   0%    9MB / 5375MB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M2090              | 0000:90:00.0     Off |                    0 |
| N/A   N/A    P0    77W / 225W |   0%    9MB / 5375MB |      0%    E. Thread |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

I’m curious whether anybody has suggestions for tracing GPU utilization beyond SMI so that I can remove the culprit. We have also reset the nodes and reset the cards; it simply didn’t help. I appreciate any help on this, as these K20s are actually significantly slower to compute on than the K20 I have in my workstation. Thanks!

Excuse the single code block; the website apparently does not like two code blocks.

Hi jbaksta,

I see that you have ECC Enabled. Do you happen to have Persistence Mode Disabled?

During driver initialization, when ECC is enabled, one can see high GPU and memory utilization readings. This is caused by the ECC memory scrubbing mechanism that runs as part of driver initialization.

When Persistence Mode is disabled, the driver deinitializes when there are no clients running (CUDA apps, nvidia-smi, or the X server) and needs to initialize again before any GPU application (like nvidia-smi) can query its state, thus triggering ECC scrubbing.

As a rule of thumb, always run with Persistence Mode enabled. Just run nvidia-smi -pm 1 as root. This will also speed up application launching by keeping the driver loaded at all times.
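For example, a minimal sketch of enabling and then verifying the setting (run as root; the “Persistence Mode” line appears in the full nvidia-smi -q report):

nvidia-smi -pm 1                             # enable persistence mode on all GPUs
nvidia-smi -q | grep -i "Persistence Mode"   # should now report "Enabled" for each GPU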

Let me know if that explains the results you’re seeing.

Regards,
Przemyslaw Zych


I just logged into one of the nodes, and persistence mode has done the trick. Thank you!

I vaguely remember hearing about it at GTC 2013 and noting that we should enable it; another sysadmin did the driver install. The nodes all use the same OS image, so they all have essentially identical settings, which brings me to two more questions.

Does the M2090 not require persistence mode? They seem not to be affected.

And I have read that this is not a permanent change, so we need to add this to our start-up scripts so that persistence mode is always on at boot?

cheers,

Jared

Does the M2090 not require persistence mode? They seem not to be affected.

By the time NVSMI finishes initialising, ECC scrubbing is done as well, so there’s a race between the query and whether the utilisation reading still reflects the scrubbing.
Some queries, or initialisation itself, might take a bit longer on the M2090 than on the K20. Hope this explains your concerns.
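If you want to see the race for yourself, a rough sketch (with persistence mode still disabled) is to query twice a few seconds apart; the second reading should have settled back toward 0% once scrubbing has finished:

nvidia-smi -q -d UTILIZATION   # first query after idle loads the driver and triggers scrubbing
sleep 10
nvidia-smi -q -d UTILIZATION   # by now utilization should be back near 0%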

The M2090, like the K20, is best used with Persistence Mode enabled.

And I have read that this is not a permanent change, so we need to add this to our start-up scripts so that persistence mode is always on at boot?

Correct. You need to enable persistence mode after every reboot.
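One common approach on RHEL 6 is to append the command to rc.local; a sketch (this assumes the nvidia kernel module is already loaded by the time rc.local runs, and that nvidia-smi is installed in /usr/bin):

# appended to /etc/rc.d/rc.local
/usr/bin/nvidia-smi -pm 1   # enable persistence mode on all GPUs at boot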

Hi,
I have also encountered a high-utilization issue with a K20m on CentOS 6.3:

# nvidia-smi 
Tue Apr 23 13:40:00 2013       
+------------------------------------------------------+                       
| NVIDIA-SMI 4.304.54   Driver Version: 304.54         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20m               | 0000:03:00.0     Off |                    0 |
| N/A   42C    P0    50W / 225W |   0%   11MB / 4799MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

I have set persistence mode to enabled (also in a startup script), yet the GPU starts at 99% utilization after a reboot or GPU reset:

nvidia-smi -a

==============NVSMI LOG==============

Timestamp                       : Tue Apr 23 13:53:31 2013
Driver Version                  : 304.54

Attached GPUs                   : 1
GPU 0000:03:00.0
    Product Name                : Tesla K20m
    Display Mode                : Disabled
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : 0325112069301
    GPU UUID                    : GPU-96e8d3f9-d622-f02c-4cca-1da19e0b6a8b
    VBIOS Version               : 80.10.11.00.0B
    Inforom Version
        Image Version           : 2081.0208.01.07
        OEM Object              : 1.1
        ECC Object              : 3.0
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : Compute
        Pending                 : Compute
    PCI
        Bus                     : 0x03
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x102810DE
        Bus Id                  : 0000:03:00.0
        Sub System Id           : 0x101510DE
        GPU Link Info
            PCIe Generation
                Max             : 2
                Current         : 2
            Link Width
                Max             : 16x
                Current         : 16x
    Fan Speed                   : N/A
    Performance State           : P0
    Clocks Throttle Reasons
        Idle                    : Not Active
        User Defined Clocks     : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
        Unknown                 : Not Active
    Memory Usage
        Total                   : 4799 MB
        Used                    : 11 MB
        Free                    : 4788 MB
    Compute Mode                : Default
    Utilization
        Gpu                     : 99 %
        Memory                  : 0 %
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
            Double Bit            
                Device Memory   : 0
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 0
        Aggregate
            Single Bit            
                Device Memory   : 36
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 36
            Double Bit            
                Device Memory   : 25
                Register File   : 0
                L1 Cache        : 0
                L2 Cache        : 0
                Texture Memory  : 0
                Total           : 25
    Temperature
        Gpu                     : 41 C
    Power Readings
        Power Management        : Supported
        Power Draw              : 50.61 W
        Power Limit             : 225.00 W
        Default Power Limit     : 225.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 225.00 W
    Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Applications Clocks
        Graphics                : 705 MHz
        Memory                  : 2600 MHz
    Max Clocks
        Graphics                : 758 MHz
        SM                      : 758 MHz
        Memory                  : 2600 MHz
    Compute Processes           : None

Can anyone help please?

Hey checkpalm,

The first time the driver is loaded the driver needs to initialize the ECC check bits in the device memory. Rebooting or performing a GPU reset will force this initialization to occur (even if persistence mode is enabled). During the initialization process the GPU will report high utilization.

yet the GPU starts at 99% utilization after a reboot or GPU reset

My expectation is that the ECC check bit initialization will complete after a few seconds, and the GPU utilization will fall to 0%. Can you confirm if the GPU utilization drops after a few seconds?
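For example, something along these lines (just a sketch) would poll the utilization section every few seconds right after a reboot or reset:

while true; do
    nvidia-smi -q -d UTILIZATION | grep -A 2 "Utilization"
    sleep 5
done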

-Robert

No, the GPU remains at 99% utilization as long as the machine is up. I rebooted because I thought some process might have “hung” on the GPU (this happened after trying out some CUDA-enabled NAMD; I rebooted after ~24 hours). After the reboot, utilization has stayed at 99% for a few hours now.

I have disabled ECC (nvidia-smi -e 0) and rebooted the computer. Now GPU utilization is 0%, but ECC is disabled. Is there a different solution, or is the memory infallible, and ECC not really required?

One more thing to note: up to now the performance state was P0; since disabling ECC it has been P8. I suspect the ECC scrub never completed correctly. Is there a way to verify that the hardware is working correctly?

@checkpalm

I’ve only seen incorrect results with a GTX Titan when I overclock it somewhere above 1200 MHz while performing CUDA calculations at 99-100% GPU usage (same GK110 chipset). Of course YMMV, but bench-test your card against some sort of numerically verifiable results, if possible, to determine that it is stable with ECC disabled. Also, you might want to seek support from your K20 vendor; they should have the pull to elevate your concerns about a possible issue to a knowledgeable NVIDIA rep who should be able to assist. After all, that’s part of the $$$$ you pay for this product.
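As a rough first pass (only a sketch), you could also re-enable ECC, run a workload with verifiable output, and then check the volatile ECC error counters, which should stay at zero on healthy memory:

nvidia-smi -e 1        # re-enable ECC; the change takes effect after the next reboot
# ... reboot, run the workload, then:
nvidia-smi -q -d ECC   # volatile single-/double-bit counters should remain 0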

In addition to following up with your system vendor as suggested by vacaloca, it would make sense to file a bug using the bug reporting form linked from the registered developer website. A GPU running at P0 with full utilization while no identifiable compute process is running is not expected behavior.

Hi all,
Have you looked at the power draw of the K20c?
I used the SHOC benchmark to test the K20c, and its maximum power is only about 150 W; it never exceeds 160 W, even though GPU utilization reaches 99%.
I have tried the SHOC benchmark on a Quadro 4000, Tesla M2090, and GRID K2, and those cards’ power can reach 95% of TDP, so I don’t think the benchmark tool is the problem.
I suspect something is wrong with my K20c.
By the way, I have checked that the power limit is 225 W.

*******************************************************************************
Mon Jul 15 19:43:51 2013       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.319.32   Driver Version: 319.32         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20c          Off  | 0000:06:00.0     Off |                  Off |
| 35%   47C    P0   146W / 225W |      283MB /  5119MB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0      6316  ./SGEMM                                              267MB  |
+-----------------------------------------------------------------------------+
****************************************************************************************

Could anybody help me to confirm this problem?

I am having a similar problem. Has this been solved? I am running on CentOS 6.6.

I also noticed that persistence mode has been deprecated in favour of a persistence daemon; is this also recommended for the K20c and K40c?

Cheers!

Thu Mar 19 12:51:50 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.29     Driver Version: 340.29         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 0000:03:00.0     Off |                    0 |
| 23%   39C    P0    63W / 235W |     23MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20c          Off  | 0000:83:00.0     Off |                    0 |
| 30%   36C    P0    48W / 225W |     11MiB /  4799MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K20c          Off  | 0000:84:00.0     Off |                    0 |
| 30%   39C    P0    52W / 225W |     11MiB /  4799MiB |     68%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

The act of running the nvidia-smi tool can itself generate utilization on a GPU; this is not a matter of concern. Regarding power utilization, a benchmark like Rodinia is not sufficient to draw full power from a GPU, even though the reported “utilization” may be 99%.
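If you want to see the actual draw during a run, one option (a minimal sketch) is to sample the power readings from another shell while the benchmark executes:

while true; do
    nvidia-smi -q -d POWER | grep "Power Draw"
    sleep 1
done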