Running CUDA programs without starting X server
Hello, I have an Ubuntu 9.10 machine with a GTX 480 card. Since I have only one video card in the machine, when I run a kernel that takes longer than about 10 seconds, the watchdog seems to kill it and I get a launch timeout.

I have booted Ubuntu into text mode, so there is no X server and therefore no watchdog. The problem is that even though the nvidia driver seems to be loaded (lsmod | grep nvidia shows it), CUDA programs do not work: they cannot find any CUDA-capable device.

Do I need to load an additional driver or something?


Thanks!
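(For anyone hitting the same symptom, a minimal, hypothetical check from the text console — the kernel module being loaded does not by itself mean the /dev entries exist:)

```shell
# Hypothetical check: the driver module can be loaded while the /dev
# entries are still missing, which makes CUDA report no devices.
if [ -c /dev/nvidiactl ]; then
    status="device nodes present"
else
    status="device nodes missing"
fi
echo "$status"
```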

#1
Posted 06/11/2010 10:50 AM   
This is explained in the CUDA Toolkit release notes:

[quote]In order to run CUDA applications, the CUDA module must be loaded and the entries in /dev created. This may be achieved by initializing X Windows, or by creating a script to load the kernel module and create the entries.

An example script (to be run at boot time):

#!/bin/bash

modprobe nvidia

if [ "$?" -eq 0 ]; then

  # Count the number of NVIDIA controllers found.
  N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
  NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255

else
  exit 1
fi[/quote]

N.
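(A side note on the quoted script: its counting and numbering logic can be sketched in isolation against a canned lspci sample — the device lines below are made up for illustration, only the grep patterns come from the script:)

```shell
# Canned lspci output standing in for real hardware: one VGA controller
# plus two 3D controllers, i.e. three NVIDIA devices in total.
sample='01:00.0 VGA compatible controller: NVIDIA Corporation Device
02:00.0 3D controller: NVIDIA Corporation Device
03:00.0 3D controller: NVIDIA Corporation Device'

# Same counting logic as the release-notes script, applied to the sample.
N3D=$(echo "$sample" | grep -i NVIDIA | grep -c "3D controller")
NVGA=$(echo "$sample" | grep -i NVIDIA | grep -c "VGA compatible controller")

# One /dev/nvidiaN node per controller, minor numbers starting at 0;
# /dev/nvidiactl always gets minor 255.
N=$(expr $N3D + $NVGA - 1)
for i in $(seq 0 $N); do
    echo "would create /dev/nvidia$i (char device, major 195, minor $i)"
done
echo "would create /dev/nvidiactl (char device, major 195, minor 255)"
```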

#2
Posted 06/11/2010 11:26 AM   
Thank you, that was it. I should have RTFM-ed more :">

#3
Posted 06/15/2010 09:00 AM   
That is fine, but every run of CUDA code takes about 5 seconds to start! Something is missing here! X loads something... but it is not a module!
I tried lsmod > modules_1.log while idle and lsmod > modules_2.log while a CUDA program was running, and diff modules_1.log modules_2.log gave me only:
diff modules_1.log modules_2.log
14c14
< nvidia 11201625 0
---
> nvidia 11201625 56
What could be missing? I suppose it is some initialization of the device. Maybe I need a permanently running "CUDA kick starter", i.e. some code that runs quite frequently and does nothing except keep the device active...
(I do not mean the performance level - it could be minimal)
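(The before/after module comparison described above can be reproduced with canned snapshot files — the file names and use counts are taken from the post, but the snapshots themselves are fabricated stand-ins for real lsmod captures:)

```shell
# Canned lsmod snapshots: the nvidia module's use count is 0 when idle
# and rises (here to 56) while a CUDA program holds the device open.
printf 'nvidia 11201625 0\n'  > modules_1.log
printf 'nvidia 11201625 56\n' > modules_2.log

# diff exits non-zero when files differ, so guard it to keep going.
diff modules_1.log modules_2.log || true
```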

#4
Posted 09/18/2013 02:07 PM   
Under Debian I just press Ctrl-Alt-F1 to go to the console. There, I launch my CUDA program without it being killed by the watchdog after a few seconds. When the program has finished, I go back to my X desktop by pressing Ctrl-Alt-F7.

#5
Posted 09/18/2013 03:38 PM   
The watchdog kills an individual kernel when it takes more than 5 seconds. I am running CUDA programs for days without having them killed. A cuFFT call, for example, consists of quite a few kernel launches, so it has little chance of being killed even for very large matrices.

Regarding the original question: at my workplace we have 2 computers without a running X server which are used for CUDA.

#6
Posted 09/19/2013 01:06 AM   
No! I mean the hang-up before running the kernel. It takes 5-7 seconds to start my program, or nvprof, or nvidia-smi (any device-related program). After that, kernels inside my program run normally: there is no hang-up before each kernel launch.
Moreover, while my program is running, nvidia-smi also runs smoothly. So some initialization happens before the kernel runs.

I would be very grateful if you could advise something... I tried lsmod during the runtime of device-related programs, but nothing except the nvidia module changed... it was used by 0 before the run and by 56 while my program was running.

#7
Posted 09/19/2013 09:05 AM   
I'd like to add some new results:
If I use the script which creates the CUDA device nodes (the same one provided by Nico above) and after that start X, then I have no hang-up for nvidia-smi, but still a 6-second hang-up for cudaSetDevice(<any number, including the card used by X>). I have 4 physical 690 cards -> 8 logical devices in my system.

#8
Posted 09/19/2013 10:04 AM   