I have several Ubuntu 16.04 Docker nodes running nvidia-docker. Each node runs several instances of an Emby container and uses a Quadro P4000 for NVENC media transcoding via ffmpeg (bundled with the Emby container).
I’ve observed that transcoding enough concurrent streams to exhaust the GPU RAM (8 GB) will hang the host itself and, in some cases, cause the NICs (Intel ixgbe) to reset, requiring a hard reset to recover.
(I can supply nvidia-bug-report gathered at the time of the crash, if this helps)
I realize that I’m oversubscribing my GPU RAM under these conditions, but my user load is unpredictable, and I’d prefer that the entire system not fail as a result. I’m a noob - is there anything I should or could be doing to limit the impact of oversubscribing RAM, so that attempting more transcodes than I have RAM to support simply results in an error rather than a catastrophic system failure?
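To make the question concrete, this is roughly what I’m picturing - just a sketch, not anything Emby or ffmpeg provides, and the 1 GiB-per-stream budget is a guess - a small wrapper that checks free VRAM with nvidia-smi before letting another NVENC session start, so a full GPU becomes a refused request instead of a hung host:

import subprocess

# Rough per-stream VRAM budget in MiB -- an assumption, tune to what your
# NVENC sessions actually use on the P4000.
MIN_FREE_MIB = 1024

def gpu_free_mib(gpu_index=0):
    # Ask nvidia-smi for free memory on one GPU; output is a bare number in MiB.
    out = subprocess.check_output([
        "nvidia-smi",
        "--id=%d" % gpu_index,
        "--query-gpu=memory.free",
        "--format=csv,noheader,nounits",
    ])
    return int(out.decode().strip())

def can_start_transcode():
    # Gate new ffmpeg/NVENC launches instead of letting them oversubscribe VRAM.
    return gpu_free_mib() >= MIN_FREE_MIB

if __name__ == "__main__":
    if can_start_transcode():
        print("enough free VRAM, start the transcode")
    else:
        print("GPU memory low, reject this request instead of risking the host")

(In practice I’d call can_start_transcode() from whatever launches the ffmpeg process inside the container.)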
What driver/CUDA versions are you using?
Allocating vmem shouldn’t take down the host. Please attach the nvidia-bug-report.log or send it with your bug description to linux-bugs@nvidia.com.
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Wed_Apr_11_23:16:29_CDT_2018
Cuda compilation tools, release 9.2, V9.2.88
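(That nvcc output is just the CUDA toolkit version; for the driver side of the question I’m assuming nvidia-smi is the right thing to query, e.g.:)

import subprocess

# Print the installed NVIDIA driver version as reported by nvidia-smi.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip())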
I’ll attach nvidia-bug-report.log as soon as I figure out how ;)

nvidia-bug-report.log (804 KB)