Processes hang trying to ioctl /dev/nvidiactl

I have a system running RHEL 7.1 with CUDA 7 and two Tesla K40m GPUs installed. It was working normally, but then a user noticed that no job could make progress, although the reported utilization is always at 99%, even after resetting both units 0 and 1 (nvidia-smi -r). We have not rebooted the server, and I’d prefer not to.
Any ideas?

Trying to run a simple hello world CUDA program under strace shows it’s stuck in a loop of calls:

ioctl(3, 0xc020462a, 0x7fff01a2d180)    = 0
nanosleep({1, 0}, NULL)                 = 0

Here fd 3 is /dev/nvidiactl, according to /proc.

$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 346.89     Driver Version: 346.89         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          On   | 0000:1B:00.0     Off |                   0* |
| N/A   29C    P0    65W / 235W |     55MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          On   | 0000:86:00.0     Off |                   0* |
| N/A   19C    P8    19W / 235W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                           : Thu Sep 24 15:24:19 2015
Driver Version                      : 346.89

Attached GPUs                       : 2
GPU 0000:1B:00.0
    Product Name                    : Tesla K40m
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 128
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714012955
    GPU UUID                        : GPU-5a79ca7e-4139-eae2-7854-081e69374741
    Minor Number                    : 0
    VBIOS Version                   : 80.80.3E.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x1b00
    Inforom Version
        Image Version               : 2081.0202.01.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x1B
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102310DE
        Bus Id                      : 0000:1B:00.0
        Sub System Id               : 0x097E10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 55 MiB
        Free                        : 11464 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 99 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Disabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 30 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 65.60 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 180.00 W
        Max Power Limit             : 235.00 W
    Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 3004 MHz
    Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 3004 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

GPU 0000:86:00.0
    Product Name                    : Tesla K40m
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 128
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0322714013634
    GPU UUID                        : GPU-816c1f5d-8a85-f8e1-e15b-da3f75f40f72
    Minor Number                    : 1
    VBIOS Version                   : 80.80.3E.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x8600
    Inforom Version
        Image Version               : 2081.0202.01.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x86
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102310DE
        Bus Id                      : 0000:86:00.0
        Sub System Id               : 0x097E10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : N/A
        Rx Throughput               : N/A
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 55 MiB
        Free                        : 11464 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Disabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 19 C
        GPU Shutdown Temp           : 95 C
        GPU Slowdown Temp           : 90 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 19.42 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 180.00 W
        Max Power Limit             : 235.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 3004 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

Running nvidia-healthmon ends with “Healthmon timed out”, but it reports no specific failure:

$ nvidia-healthmon

Using config file path: /etc/nvidia-healthmon/nvidia-healthmon.conf

Loading Config: SUCCESS
Global Tests
   Black-Listed Modules: SKIPPED
   Black-Listed Drivers: SUCCESS
   Load NVML: SUCCESS
   NVML Sanity: SUCCESS
   Tesla Devices Count: SKIPPED
   GPUDirect Comm Matrix
      		GPU0	GPU1	mlx4_0	CPU Affinity
      GPU0	 X 	SOC	PHB	0-7,16-23
      GPU1	SOC	 X 	SOC	8-15,24-31
      mlx4_0	PHB	SOC	 X 	
      
      Legend:
      
        X   = Self
        SOC = Path traverses a socket-level link (e.g. QPI)
        PHB = Path traverses a PCIe host bridge
        PXB = Path traverses multiple PCIe internal switches
        PIX = Path traverses a PCIe internal switch
        CPU Affinity = The cores that are most ideal for NUMA
      
      Result: SUCCESS
   Global Test Results: 13 success, 0 errors, 0 warnings, 8 did not run

-----------------------------------------------------------

0000:1B:00.0
   NVML Sanity: SUCCESS
   InfoROM: SKIPPED
   Multi-GPU InfoROM: SKIPPED
   ECC DBE: SUCCESS
   ECC Enabled Check: SKIPPED
   PCIe Maximum Link Generation: SKIPPED
   PCIe Maximum Link Width: SUCCESS
   CUDA Sanity: SUCCESS
   PCI Bandwidth: SKIPPED
   Memory: SKIPPED


Healthmon timed out.

Disable persistence mode on both GPUs (using nvidia-smi; this requires root privileges).

Then do:

sudo rmmod nvidia

Then run nvidia-smi as root again.

Then try your hello world app.

Re-enable persistence mode afterward, if you wish.

Thanks for the quick reply. However, I did the following and the problem persists: nvidia-smi runs OK but still reports 99% utilization, and the hello world program hangs on the same operation according to strace.

[bin]# nvidia-persistenced --no-persistence-mode
[bin]# nvidia-smi -pm 0
Persistence mode is already Disabled for GPU 0000:1B:00.0.
Persistence mode is already Disabled for GPU 0000:86:00.0.
All done.
[bin]# rmmod nvidia
rmmod: ERROR: Module nvidia is in use by: nvidia_uvm
[bin]# rmmod nvidia_uvm
[bin]# rmmod nvidia
[bin]# nvidia-smi

The 99% utilization is showing on only one GPU, and that is “normal” for a run of nvidia-smi. It is a red herring; I don’t believe it is related to whatever issue you are having.

You may have to reboot the server. It’s possible that the kernel is messed up in some way that has nothing to do with CUDA, but that is preventing CUDA driver operations that depend on the kernel.

What is the hello world program?
Which GPU are you running it on?

Does it behave the same way (hang) if you run it on the other GPU?

I was running it on device 0.
I have some other machines on which it runs fine. With cudaSetDevice(0) it hangs on the problem machine. With cudaSetDevice(1) it runs very slowly: it takes about a minute to print “Hello World” and then several more seconds before it actually exits. But it does complete.

Here’s the hello world program:

// This is the REAL "hello world" for CUDA!
// It takes the string "Hello ", prints it, then passes it to CUDA with an array
// of offsets. Then the offsets are added in parallel to produce the string "World!"
// By Ingemar Ragnemalm 2010
 
#include <stdio.h>
#include <stdlib.h>  // EXIT_SUCCESS

const int N = 7;
const int blocksize = 7;

__global__
void hello(char *a, int *b)
{
 a[threadIdx.x] += b[threadIdx.x];
}

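int main()
{
 cudaSetDevice(0);  // select GPU 0; use 1 to target the second GPU
 char a[N] = "Hello ";
 int b[N] = {15, 10, 6, 0, -11, 1, 0};  // per-character offsets: "Hello " + b = "World!"

 char *ad;  // device copy of a
 int *bd;   // device copy of b
 const int csize = N*sizeof(char);
 const int isize = N*sizeof(int);

 printf("%s", a);

 // Allocate device buffers and copy both arrays to the GPU.
 cudaMalloc( (void**)&ad, csize );
 cudaMalloc( (void**)&bd, isize );
 cudaMemcpy( ad, a, csize, cudaMemcpyHostToDevice );
 cudaMemcpy( bd, b, isize, cudaMemcpyHostToDevice );

 // Launch one block of N threads; each thread adds one offset.
 dim3 dimBlock( blocksize, 1 );
 dim3 dimGrid( 1, 1 );
 hello<<<dimGrid, dimBlock>>>(ad, bd);
 // Copy the result back; this blocking copy also waits for the kernel to finish.
 cudaMemcpy( a, ad, csize, cudaMemcpyDeviceToHost );
 cudaFree( ad );
 cudaFree( bd );  // free the offsets buffer as well

 printf("%s\n", a);
 return EXIT_SUCCESS;
}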

So to sum up, it seems the CUDA “Hello world” runs to completion (slowly) on GPU device 1 but hangs on device 0.
nvidia-smi -r --id 0 does not fix the problem, and “healthmon” times out without reporting any errors.
“rmmod nvidia” did not seem to help either.

Update - rebooting the machine appears to have solved the issue.