root@bakunin /home/eyalroz # cat /etc/apt/sources.list.d/graphics-drivers-ppa-xenial.list
deb http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu xenial main
deb-src http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu xenial main
I only use the Intel on-board graphics for driving my display. Now, I can run CUDA code just fine, but if I try to debug anything (using Nsight), I get the CUDBG_ERROR_ALL_DEVICES_WATCHDOGGED error. The contents of ‘xorg.conf’ are:
… but if I try to replace “intel” with “nvidia” as the Inactive screen, bad things happen (= Cinnamon starts in fallback mode). If I remove the nVIDIA entries altogether, the file gets magically rewritten when I log out and log in again.
Why is this happening? And what can I do to be able to debug in peace?
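(For reference, a minimal xorg.conf Device section that pins the display to the Intel adapter might look like the sketch below. This is only an illustration, not my actual file; the Driver and BusID values are assumptions and must match your own lspci output.)

```
Section "Device"
    Identifier "intel"
    Driver     "modesetting"
    BusID      "PCI:0:2:0"    # assumption: verify with lspci
EndSection
```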
Can you tell me which CUDA toolkit you are using?
Can you also build the SDK samples under /usr/local/cuda/samples/1_Utilities/deviceQuery, then run deviceQuery and paste the output?
Thanks!
P.S.: Does the app you want to debug have a GUI display?
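(For reference, building and running that sample typically looks like the sketch below, assuming a default CUDA install under /usr/local/cuda; the samples tree may require root, or you can copy it somewhere writable first.)

```shell
# Build the deviceQuery sample in place (may need root for /usr/local):
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make

# Run it and inspect the reported device properties:
./deviceQuery
```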
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 650 Ti BOOST"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1999 MBytes (2095775744 bytes)
( 4) Multiprocessors, (192) CUDA Cores/MP: 768 CUDA Cores
GPU Clock rate: 1058 MHz (1.06 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 650 Ti BOOST
Result = PASS
And - the app I was debugging does not involve any GUI.
Thanks for the info!
As far as I know, the CUDBG_ERROR_ALL_DEVICES_WATCHDOGGED error code is reported when the GPU is also used for display.
I can reproduce this if I connect a display to my local GK106 and then try to debug on it.
Based on the info you gave below, this seems to explain it:
Display Server: X.Org 1.18.4 driver: nvidia Resolution: 1920x1080@60.00hz
GLX Renderer: GeForce GTX 650 Ti BOOST/PCIe/SSE2 GLX Version: 4.5.0 NVIDIA 375.26
Also, since GK106 does not support the software preemption feature, you cannot do software debugging on it.
In conclusion, if you want to debug on your system, you could kill the X server and then debug from the cuda-gdb command line.
You can run nvidia-smi to check whether an X server is running on the NVIDIA GPU.
If so, that would explain your problem.
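(The check and workaround above can be sketched as follows; "lightdm" and "./my_app" are placeholders for your display manager and your own binary.)

```shell
# 1. Look for Xorg among the processes nvidia-smi reports on the GPU:
nvidia-smi | grep -i xorg

# 2. If it shows up, switch to a text console (e.g. Ctrl+Alt+F1) and stop
#    the display manager so the GPU is freed from the watchdog:
sudo service lightdm stop

# 3. Debug from the command line:
cuda-gdb ./my_app
```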
Actually, using the Intel and NVIDIA GPUs together involves many problems, and you must configure things correctly to make it work. There should be plenty of material you can find by searching Google; I have never tried this myself, so I have little to say about it.
In our environment, we usually disable the Intel graphics in the BIOS and just use the NVIDIA GPU.
veraj: How do I invoke nvidia-smi to make that check? Also, how can an X server run on a GPU if the physical monitor is not connected to that GPU?
Now, while it’s true that I have Intel graphics and an nVIDIA GPU together on the same system, I am not actually using them “together” - I’ve done nothing to link them in any way. And, after all, every PC system with an Intel rather than an AMD CPU now has some kind of Intel graphics controller, so it’s not clear to me how using an nVIDIA GPU in what is perhaps the most common configuration should involve many problems…
Double-checked. My monitor is connected to the Motherboard’s DVI port, not the nVIDIA card’s. You can have a peek at my Xorg.0.log though. It’s full of copies of the following:
I used to be able to debug apps with this exact same setup with my previous Linux distribution installation (I was using Kubuntu 16.04 with lightdm and lxdm).
Re nvidia-smi: It works, but it’s not clear what you suggest I do with it. Just running it yields: