[SOLVED] Problems with nvidia-persistenced

sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; static; vendor preset: enabled)
Active: inactive (dead)

and when I run a program, this is what happens:
Sep 06 13:34:29 node2 systemd[1]: Starting NVIDIA Persistence Daemon…
Sep 06 13:34:29 node2 nvidia-persistenced[21134]: Verbose syslog connection opened
Sep 06 13:34:29 node2 nvidia-persistenced[21134]: Now running with user ID 116 and group ID 124
Sep 06 13:34:29 node2 systemd[1]: Started NVIDIA Persistence Daemon.
Sep 06 13:34:29 node2 nvidia-persistenced[21134]: Started (21134)
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: device 0000:03:00.0 - registered
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: Local RPC service initialized
Sep 06 13:34:30 node2 systemd[1]: Stopping NVIDIA Persistence Daemon…
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: Received signal 15
Sep 06 13:34:30 node2 systemd[1]: Stopped NVIDIA Persistence Daemon.

– Subject: Unit nvidia-persistenced.service has begun shutting down
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: Received signal 15
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: Socket closed.
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: PID file unlocked.
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: PID file closed.
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: Shutdown (21134)
Sep 06 13:34:30 node2 sudo[21069]: pam_unix(sudo:session): session closed for user root
Sep 06 13:34:30 node2 systemd[1]: Stopped NVIDIA Persistence Daemon.
– Subject: Unit nvidia-persistenced.service has finished shutting down

Does anyone know about this?

You need to use journalctl to determine why this happened:

Sep 06 13:34:30 node2 systemd[1]: Stopping NVIDIA Persistence Daemon…
Sep 06 13:34:30 node2 nvidia-persistenced[21134]: Received signal 15

Everything prior to that point in the log is normal.

https://www.digitalocean.com/community/tutorials/how-to-use-journalctl-to-view-and-manipulate-systemd-logs
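
For example, something along these lines (unit name taken from the status output above) should isolate the daemon's own messages:

# show only this boot's messages for the persistence daemon unit
journalctl -b -u nvidia-persistenced --no-pager

# -x adds the explanatory "-- Subject:" lines like the ones quoted above
journalctl -x -b -u nvidia-persistenced --no-pager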

So this is the full journalctl log:

Sep 06 14:50:33 node2 nvidia-persistenced[21328]: Local RPC service initialized
Sep 06 14:50:33 node2 kernel: traps: ont_core_cpp_te[21265] general protection ip:7f7a4545834a sp:7fff1e187c08 error:0 in libcuda.so.384.130[7f7a45290000+b20000]
Sep 06 14:50:33 node2 systemd[1]: Stopping NVIDIA Persistence Daemon…
Sep 06 14:50:33 node2 nvidia-persistenced[21328]: Received signal 15
Sep 06 14:50:33 node2 nvidia-persistenced[21328]: Socket closed.
Sep 06 14:50:33 node2 nvidia-persistenced[21328]: PID file unlocked.
Sep 06 14:50:33 node2 nvidia-persistenced[21328]: PID file closed.
Sep 06 14:50:33 node2 nvidia-persistenced[21328]: The daemon no longer has permission to remove its runtime data directory /var/run/nvidia-persistenced
Sep 06 14:50:33 node2 nvidia-persistenced[21328]: Shutdown (21328)
Sep 06 14:50:33 node2 sudo[21263]: pam_unix(sudo:session): session closed for user root
Sep 06 14:50:33 node2 systemd[1]: Stopped NVIDIA Persistence Daemon.

Can you help me understand?

The main problem with the program was a segmentation fault. Have you ever had this happen?

thanks
Gabriel

It appears that you are running driver 384.130, is that correct?

what does nvidia-smi show?

My guess would be some sort of corrupted CUDA software install (e.g. mismatched driver components, from different driver installs), or else running on a platform that is not supported for CUDA.
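
One quick way to check for the mismatched-components case (just a rough sketch, not a definitive diagnosis) is to compare what the kernel module and the userspace tools each report:

# driver version of the loaded kernel module
cat /proc/driver/nvidia/version

# driver version the userspace stack reports
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# which libcuda.so the loader would pick up
ldconfig -p | grep libcuda

If those versions disagree, there are probably leftovers from a previous driver install.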

Thu Sep  6 15:26:25 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130                Driver Version: 384.130                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla C2050         Off  | 00000000:03:00.0 Off |                    0 |
| 30%   55C    P0    N/A /  N/A |      0MiB /  2621MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The deviceQuery output:

Detected 1 CUDA Capable device(s)

Device 0: "Tesla C2050"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2621 MBytes (2748448768 bytes)
MapSMtoCores for SM 2.0 is undefined.  Default to use 64 Cores/SM
MapSMtoCores for SM 2.0 is undefined.  Default to use 64 Cores/SM
  (14) Multiprocessors, ( 64) CUDA Cores/MP:     896 CUDA Cores
  GPU Max Clock rate:                            1147 MHz (1.15 GHz)
  Memory Clock rate:                             1500 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 786432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (65535, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 3 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

Tesla C2050 is a Fermi device.

Fermi devices are not supported by CUDA 9

The last CUDA support for Fermi was in CUDA 8.
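
You can see the same limitation from the toolkit side. Assuming the default install locations (adjust the paths to your setup) and any small test kernel such as a hypothetical vectorAdd.cu, building for compute capability 2.0 still works with CUDA 8's nvcc but is rejected by CUDA 9's:

# CUDA 8: sm_20 is deprecated but still accepted
/usr/local/cuda-8.0/bin/nvcc -arch=sm_20 vectorAdd.cu -o vectorAdd

# CUDA 9: fails with an "Unsupported gpu architecture 'compute_20'" error
/usr/local/cuda-9.0/bin/nvcc -arch=sm_20 vectorAdd.cu -o vectorAdd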

Oh, that's too bad… the program I am trying to use requires CUDA 9.0 or above!

well thanks anyway!

I’ve encountered an issue with my machine where 1 of my 4 1080 Ti’s disappears from nvidia-smi (under no load whatsoever). I’m trying to get to the root of the problem, and I found the following behaviour for the NVIDIA Persistence Daemon. It seems to be stuck in a constant loop: activating → active → inactive → activating → active → activating …

That’s the behaviour I see when running:

watch -n 2 systemctl status nvidia-persistenced

Is this normal?
(I’m using driver 396.54)

No, it's not normal, but I doubt it has anything to do with the root cause of your problem. I think it could be what happens when a GPU that the persistence daemon was previously configured on disappears.
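
If you want to dig into why the board drops off, a rough starting point (generic diagnostics, not a guaranteed fix) is to follow the daemon's journal instead of polling its status, and to check whether the kernel logged anything when the card disappeared:

# follow the persistence daemon's messages live instead of polling with watch
journalctl -f -u nvidia-persistenced

# look for Xid errors or "GPU has fallen off the bus" messages from the kernel driver
dmesg | grep -iE 'xid|nvrm'

# confirm all four boards are still visible on the PCI bus
lspci | grep -i nvidia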