nvidia-docker based host hangs when GPU memory exceeded with ffmpeg transcodes

Hi all,

I have several Ubuntu 16.04 docker nodes, running nvidia-docker. Each node runs several instances of an Emby container, and uses a Quadro P4000 for NVENC media transcoding with ffmpeg (bundled with the Emby container).

I’ve observed that when transcoding enough concurrent streams to exhaust the GPU RAM (8GB), the host itself will hang and, in some cases, cause the NICs (Intel igbxe) to reset, requiring a hard reset to restore.

(I can supply nvidia-bug-report gathered at the time of the crash, if this helps)

I realize that I’m over-subscribing my GPU RAM under these conditions, but my user load is unpredictable, and I’d prefer that the entire system not fail as a result. I’m a noob - is there anything I should/could be doing to limit the impact of oversubscribing RAM, such that attempting more transcodes than I have RAM to support will simply result in an error, but not a catastrophic system failure?

Many thanks!
D
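
One way to get "an error, but not a catastrophic failure" is an admission check that refuses to start another transcode when free GPU memory is low. The following is only a sketch under assumptions: `per_stream_mib` and `reserve_mib` are hypothetical numbers that would have to be measured for the actual streams, the function names are made up for illustration, and nothing here is part of Emby or ffmpeg. It also only guards against oversubscription; it would not help if the hang comes from a driver or decoder fault.

```python
import subprocess
from typing import List


def parse_free_mib(smi_output: str) -> List[int]:
    """Parse nvidia-smi CSV output (one free-memory value in MiB per GPU)."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def query_free_mib() -> List[int]:
    """Ask nvidia-smi for the free memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_free_mib(result.stdout)


def can_start_transcode(free_mib: int,
                        per_stream_mib: int = 600,   # hypothetical per-stream cost
                        reserve_mib: int = 512) -> bool:  # hypothetical safety margin
    """Return True if one more stream still leaves the reserve untouched."""
    return free_mib - per_stream_mib >= reserve_mib
```

A wrapper around the transcoder could call `can_start_transcode(query_free_mib()[0])` and return an error to the client instead of launching ffmpeg when the result is `False`.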

What driver/cuda versions are you using?
Allocating video memory shouldn’t take down the host. Please attach the nvidia-bug-report.log, or send it with your bug description to linux-bugs@nvidia.com

I’m using driver 396.54…

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54                 Driver Version: 396.54                    |
|-------------------------------+----------------------+----------------------+

With CUDA v9.2.88:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88

I’ll attach nvidia-bug-report.log as soon as I figure out how ;)
nvidia-bug-report.log (804 KB)

Figured out how to attach files ;)

Ok, it’s spilling XIDs 31 and 68.
Did you try using the 410 driver?
What kind of codecs are involved?

I didn’t try the 410 driver. The latest one that shows up when searching for 410 is this: https://www.nvidia.com/drivers/results/120911 - do you think an older driver would help?

Re codecs, I suspect it’s either HEVC or H264 decodes which cause the issue (sorry if my terminology is not on-point, I’m a codec-noob)

Here’s the process table at the time of a crash: bin 1809 30567 0 13:32 ? 00:00:00 /bin/ffmpeg -c:v h264_cuvid -res - Pastebin.com

D

Looks like input is hevc and h.264, output is h.264 only.
The preferred way to install the driver on Ubuntu is via the PPA:
[url]https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa[/url]
If, and only if, you previously installed the driver via the .run file, the archive is here:
[url]https://http.download.nvidia.com/XFree86/Linux-x86_64/[/url]

Ah, right. You mean driver 410. I didn’t know it was available, I’ll give it a try ;)

My drivers are installed via ppa:

root@docker1:/etc/apt/sources.list.d# grep graphics *
graphics-drivers-ubuntu-ppa-xenial.list:deb http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu xenial main

Haven’t tested 410 yet (waiting for downtime), but the symptoms here seem to match FFmpeg bug #7012 (H264_cuvid Decoding Causes NVRM XID 31).

Sad to report that driver 410 has made no difference. I had a crash today under 410:

root@docker1:# cat nvidia-bug-report.log  | grep Xid
Oct 16 17:14:02 node2 kernel: NVRM: Xid (PCI:0000:03:00): 31, Ch 00000030, engmask 00008100, intr 10000000
Oct 16 17:14:02 node2 kernel: NVRM: Xid (PCI:0000:03:00): 68, CCMDs 00000030 0000c2b0
[17264.379746] NVRM: Xid (PCI:0000:03:00): 31, Ch 00000030, engmask 00008100, intr 10000000
[17264.535490] NVRM: Xid (PCI:0000:03:00): 68, CCMDs 00000030 0000c2b0
root@docker1:#

I have a similar issue with Ubuntu 14.04 and Ubuntu 16.04 and NVIDIA driver 410.

I’m still experiencing this issue daily.

Since the update to 418.46 our description of the error is more detailed.

Apr 22 06:59:52 node1 kernel: [226709.620345] NVRM: Xid (PCI:0000:03:00): 31, Ch 00000158, intr 10000000. MMU Fault: ENGINE NVDEC HUBCLIENT_NVDEC faulted @ 0xff_fffff000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
Apr 22 14:15:33 node1 kernel: [252852.169067] NVRM: Xid (PCI:0000:03:00): 31, Ch 00000068, intr 10000000. MMU Fault: ENGINE NVDEC HUBCLIENT_NVDEC faulted @ 0xff_fffff000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
Apr 22 14:15:33 node1 kernel: [252852.328891] NVRM: Xid (PCI:0000:03:00): 68, CCMDs 00000068 0000c2b0
Apr 27 19:34:45 node1 kernel: NVRM: Xid (PCI:0000:03:00): 31, Ch 00000040, intr 10000000. MMU Fault: ENGINE NVDEC HUBCLIENT_NVDEC faulted @ 0xff_fffff000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
Apr 27 19:34:45 node1 kernel: NVRM: Xid (PCI:0000:03:00): 68, CCMDs 00000040 0000c2b0
[ 5012.173829] NVRM: Xid (PCI:0000:03:00): 31, Ch 00000040, intr 10000000. MMU Fault: ENGINE NVDEC HUBCLIENT_NVDEC faulted @ 0xff_fffff000. Fault is of type FAULT_PDE ACCESS_TYPE_READ
[ 5012.340438] NVRM: Xid (PCI:0000:03:00): 68, CCMDs 00000040 0000c2b0
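
For anyone tracking how often this happens, the Xid lines above can be pulled out of syslog or dmesg output with a small parser. This is a minimal sketch that assumes the line layout shown in this thread (`NVRM: Xid (PCI:<address>): <code>, <detail>`); the names `XidEvent`, `parse_xid` and `count_xids` are made up for illustration. As far as I know, NVIDIA's Xid documentation lists 31 as a GPU memory page fault and 68 as a video processor (NVDEC) exception, which fits the "MMU Fault: ENGINE NVDEC" text in the newer logs.

```python
import re
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# Matches both the syslog form ("Oct 16 ... kernel: NVRM: Xid ...")
# and the dmesg form ("[17264.379746] NVRM: Xid ...").
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<code>\d+), (?P<detail>.*)"
)


class XidEvent(NamedTuple):
    pci: str     # PCI address of the GPU, e.g. "0000:03:00"
    code: int    # Xid number, e.g. 31 or 68
    detail: str  # remainder of the message


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in a log line, or None if there is none."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(match.group("pci"), int(match.group("code")),
                    match.group("detail").strip())


def count_xids(lines: Iterable[str]) -> Counter:
    """Count how many times each Xid code appears in the given lines."""
    events = (parse_xid(line) for line in lines)
    return Counter(event.code for event in events if event is not None)
```

Feeding it the lines of `nvidia-bug-report.log` or `dmesg` output gives a per-code count that is easy to compare before and after a driver change.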

Please check the new beta driver; it seems to fix the bug in a similar case:
[url]https://devtalk.nvidia.com/default/topic/1064531/linux/decklink-mini-recorder-4k-streaming-via-nvidia-geforce-gtx-1660-/[/url]