deviceQuery OK, everything else hangs: CUDA SDK 4.1 examples simply hang, no errors, no warnings
Hi,

I recently installed a Tesla C2075 on Ubuntu 10.04. Driver and runtime are both CUDA 4.1. Compiling the software SDK runs through without errors. deviceQuery returns:
[deviceQuery] starting...

bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "Tesla C2075"
CUDA Driver Version / Runtime Version 4.1 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636554752 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1566.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 6 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.1, CUDA Runtime Version = 4.1, NumDevs = 1, Device = Tesla C2075
[deviceQuery] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done!

So far, no problem. I can also see the card using nvidia-smi:
Tue Apr 3 16:25:39 2012
+------------------------------------------------------+
| NVIDIA-SMI 2.285.05 Driver Version: 285.05.33 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla C2075 | 0000:06:00.0 Off | 0 0 |
| 30% 52 C P0 80W / 225W | 0% 10MB / 5375MB | 99% Default |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+

Now if I try to run an application, say vectorAdd, the status is like this:
Tue Apr 3 16:26:29 2012
+------------------------------------------------------+
| NVIDIA-SMI 2.285.05 Driver Version: 285.05.33 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla C2075 | 0000:06:00.0 Off | 0 0 |
| 30% 52 C P12 32W / 225W | 1% 59MB / 5375MB | 0% Default |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0. 9361 ...A_GPU_Computing_SDK/C/bin/linux/release/vectorAdd 47MB |
+-----------------------------------------------------------------------------+

Nothing seems to happen on the card: GPU Util (utilisation, I assume) is stuck at 0% and the code just hangs, with no error at all. The only thing I see is that one of the CPU cores is at 100%.

Any ideas what might be wrong?

Thanks a lot, MW

#1
Posted 04/03/2012 02:29 PM   
Have you connected both power connectors?

#2
Posted 04/03/2012 03:18 PM   
[quote name='mfatica' date='03 April 2012 - 03:18 PM' timestamp='1333466298' post='1391430']
Have you connected both power connectors?
[/quote]

Thanks for the hint. Yes, I actually used one 8-pin connector instead of two 6-pin connectors, which the installation manual describes as an alternative. I also checked that this connector is seated really tightly. But maybe two 6-pin connectors work better?

[Edit] Also tried running the card with two 6-pin connectors - same result: deviceQuery is OK, everything else from the SDK hangs, the card shows no utilization of compute resources, and top shows that the process uses 100% of one CPU 'instead'.

[Edit]
I checked a couple of samples, even simplePrintf; all show the same behaviour. They print a couple of lines such as:

[simplePrintf] starting...

GPU Device 0: "Tesla C2075" with compute capability 2.0

Device 0: "Tesla C2075" with Compute 2.0 capability
printf() is called. Output:

... and then nothing happens any more: no error, no warning, one CPU at 100% for the process (all other cores on the machine idle).

What really puzzles me is the absence of error messages....
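For reference, the kind of explicit check I would expect to surface a failure looks roughly like this (a minimal sketch of my own, not the SDK's cutil macros; checkCuda and dummyKernel are just placeholder names):

[code]
// Minimal error-checking sketch for the CUDA 4.x runtime API.
// checkCuda() and dummyKernel are placeholders, not SDK code.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define checkCuda(call)                                                \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void dummyKernel() { }

int main()
{
    checkCuda(cudaSetDevice(0));
    dummyKernel<<<1, 1>>>();
    checkCuda(cudaGetLastError());       // reports launch errors
    checkCuda(cudaThreadSynchronize());  // blocks until the kernel finishes, reports execution errors
    printf("kernel completed\n");
    return 0;
}
[/code]

If the runtime were returning an error, a check like this should print it; a hang inside the synchronize call would of course still produce nothing.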

MW

#3
Posted 04/03/2012 05:17 PM   
Do you have write permissions in the directory where you are running the SDK samples? Most of those try and write a log file to disk, and they hang if they can't write to disk....

#4
Posted 04/03/2012 06:54 PM   
[quote name='avidday' date='03 April 2012 - 06:54 PM' timestamp='1333479256' post='1391524']
Do you have write permissions in the directory where you are running the SDK samples? Most of those try and write a log file to disk, and they hang if they can't write to disk....
[/quote]

Interesting point. I installed and compiled the SDK in my home directory, so read/write permissions should be OK there. Do the system-wide CUDA libraries need any special permissions?

[Edit] Tried running everything as root, same problem. We're making the CUDA libraries visible by adding their paths to /etc/ld.so.conf.d/CUDA.conf and running ldconfig afterwards. The really strange thing is that deviceQuery runs (it was compiled on this very system too, after all), while all the other programs don't.
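For reference, such a conf file just lists the CUDA library directories, e.g. (assuming the default /usr/local/cuda install location; adjust if the toolkit sits elsewhere):

[code]
# /etc/ld.so.conf.d/CUDA.conf -- paths assume a default /usr/local/cuda install
/usr/local/cuda/lib64
/usr/local/cuda/lib
[/code]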

[Edit] Resetting the card with nvidia-smi seems initially successful:
$> nvidia-smi -r --id=0
GPU 0000:06:00.0 was successfully reset.

But after this the card is gone:
$> ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./deviceQuery
[deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
-> invalid device ordinal
[deviceQuery] test results...
FAILED

> exiting in 3 seconds: 3...2...1...done!

dmesg returns:
[ 305.559141] NVRM: Xid (0000:06:00): 31, Ch 00000000, engmask 00000101, intr 10000000
[ 314.338656] NVRM: Xid (0000:06:00): 44, 0000 00000000 00000000 00000000 00000000 00000000
[ 316.990208] NVRM: Xid (0000:06:00): 31, Ch 00000000, engmask 00000101, intr 10000000
[ 317.155736] NVRM: Xid (0000:06:00): 31, Ch 00000001, engmask 00000101, intr 10000000
[ 317.166933] NVRM: Xid (0000:06:00): 31, Ch 00000002, engmask 00000101, intr 10000000
[ 317.181420] NVRM: Xid (0000:06:00): 31, Ch 00000003, engmask 00000101, intr 10000000
[ 343.705342] NVRM: Xid (0000:06:00): 31, Ch 00000000, engmask 00000101, intr 10000000
[ 352.309550] NVRM: Xid (0000:06:00): 44, 0000 00000000 00000000 00000000 00000000 00000000
[64551.752616] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1193)
[64551.752640] NVRM: rm_init_adapter(0) failed
[64582.002530] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1193)
[64582.002569] NVRM: rm_init_adapter(0) failed
[64584.468063] NVRM: RmInitAdapter failed! (0x27:0xffffffff:1193)
[64584.468104] NVRM: rm_init_adapter(0) failed


Is this normal behaviour??

MW

#5
Posted 04/04/2012 12:14 PM   
A related question: is there any way of telling whether the card is not getting enough power?
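nvidia-smi already shows a draw/cap figure above; the only other thing I can think of is reading the board power draw programmatically through NVML, roughly like the sketch below (nvml.h comes from the Tesla Deployment Kit and libnvidia-ml from the driver; the function names are standard NVML, but whether a low reading would actually indicate an under-powered card is just my assumption):

[code]
/* Rough sketch: query the board power draw via NVML.
   Build with something like: gcc power.c -o power -lnvidia-ml
   (with nvml.h on the include path). */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) {
        fprintf(stderr, "nvmlInit failed: %s\n", nvmlErrorString(rc));
        return 1;
    }

    nvmlDevice_t dev;
    unsigned int milliwatts = 0;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetPowerUsage(dev, &milliwatts) == NVML_SUCCESS) {
        printf("Current board power draw: %.1f W\n", milliwatts / 1000.0);
    } else {
        fprintf(stderr, "Power reading failed or not supported on this device\n");
    }

    nvmlShutdown();
    return 0;
}
[/code]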

MW

#6
Posted 04/04/2012 04:44 PM   
Have you tried the card in a different machine? I looked at the big list of error codes, and those suggest that something is seriously wrong with your hardware somewhere.

#7
Posted 04/04/2012 05:03 PM   
[quote name='tmurray' date='04 April 2012 - 05:03 PM' timestamp='1333559014' post='1391967']
Have you tried the card in a different machine? I looked at the big list of error codes, and those suggest that something is seriously wrong with your hardware somewhere.
[/quote]

I do not actually get any error messages if I do not try to reset the card. My biggest problem is that the programs start but hang, without any error messages. The machine itself is a server that has been running fine for three years, no glitch whatsoever. Trying a different machine would mean taking it off the network and installing all the libraries and stuff again. Do you think the card is broken?

MW

#8
Posted 04/05/2012 02:17 PM   
No, I think your card is doing just fine; it has to do with the Ubuntu software stack.

I am having the same problem. On CentOS 6.2 it works fine, but with Ubuntu 12.04 beta 2 I can only execute deviceQuery successfully; nbody -benchmark just hangs.

So if switching to CentOS 6.2 is a possible avenue for you, then you have your solution. I unfortunately have to make it work on Ubuntu and wonder what the next steps would be to figure out what is going wrong. An strace shows that [b]nbody -benchmark[/b] is waiting on a futex that keeps returning 'Resource temporarily unavailable':

open("/proc/driver/nvidia/params", O_RDONLY) = 15
fstat(15, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9d65427000
read(15, "EnableVia4x: 0\nEnableALiAGP: 0\nN"..., 1024) = 456
close(15) = 0
munmap(0x7f9d65427000, 4096) = 0
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
open("/dev/nvidiactl", O_RDWR) = 15
ioctl(15, 0xc01446ce, 0x7fff4da34400) = 0
ioctl(15, 0xc020462b, 0x7fff4da343f0) = 0
write(12, "\253", 1) = 1
futex(0x7fff4da34430, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {1334807419, 187367000}, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
pipe([16, 17]) = 0
fcntl(16, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
write(12, "\253", 1) = 1
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
ioctl(4, 0xc020462a, 0x7fff4da34500) = 0
futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0


Looks like the previous nbody run that I aborted is also still registered on the card:
ssh n6 nvidia-smi
Wed Apr 18 21:04:52 2012
+------------------------------------------------------+
| NVIDIA-SMI 2.285.05 Driver Version: 285.05.33 |
|-------------------------------+----------------------+----------------------+
| Nb. Name | Bus Id Disp. | Volatile ECC SB / DB |
| Fan Temp Power Usage /Cap | Memory Usage | GPU Util. Compute M. |
|===============================+======================+======================|
| 0. Tesla M2075 | 0000:05:00.0 Off | 0 0 |
| N/A N/A P12 28W / 225W | 1% 59MB / 5375MB | 0% Default |
|-------------------------------+----------------------+----------------------|
| 1. Tesla M2075 | 0000:03:00.0 Off | 0 0 |
| N/A N/A P12 31W / 225W | 1% 59MB / 5375MB | 0% Default |
|-------------------------------+----------------------+----------------------|
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0. 11978 /home/hybrid/nbody 47MB |
| 1. 2373 ./nbody 47MB |
+-----------------------------------------------------------------------------+


Michael

#9
Posted 04/18/2012 07:31 PM   
[quote name='mrmichaelwill' date='18 April 2012 - 09:31 PM' timestamp='1334777473' post='1397983']
No, I think your card is doing just fine, it has to do with the ubuntu software stack.

futex(0x1e936f8, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x1e936f8, FUTEX_WAKE_PRIVATE, 1) = 0
[/quote]
I'm having the same problem: Ubuntu 11.10 with Linux 3.0.0-17-generic, CUDA 4.1, NVIDIA driver 285.05.33, and four GeForce GTX 295 cards.

Sometimes a combination of enabling/disabling persistence mode, setting compute-exclusive mode and reloading the nvidia module helps. dmesg has this to say:
[code][286314.871216] NVRM: Xid (0000:15:00): 13, 0001 00000000 000050c0 00000368 00000000 00000100[/code]

#10
Posted 04/19/2012 04:52 PM   
[quote name='heipei' date='19 April 2012 - 04:52 PM' timestamp='1334854365' post='1398377']
I'm having the same problem, Ubuntu 11.10 with Linux 3.0.0-17-generic, CUDA 4.1, nvidia-driver 285.05.33, four GeForce GTX 295.

Sometimes a combination of enabling/disabling persistence mode, compute-exclusive mode and reloading the nvidia module help. dmesg has this to say:
[code][286314.871216] NVRM: Xid (0000:15:00): 13, 0001 00000000 000050c0 00000368 00000000 00000100[/code]
[/quote]

I tried again with the brand-new CUDA 4.2 and NVIDIA dev driver 295.41, and it still fails on Ubuntu Server 12.04 beta 2 with kernel 3.2.0-23-generic, but it works fine on Ubuntu Server 11.10 with kernel 3.0.0-17-generic.

The only change I needed to make for all examples to compile was in NVIDIA_GPU_Computing_SDK/C/common/common.mk: moving the OPENGLLIB linking behind the RENDERCHECKGLLIB linking (presumably because the newer Ubuntu toolchain links with --as-needed by default, so the GL libraries have to come after the libraries that reference them):

*** common.mk 2012-04-20 21:25:05.497193895 -0700
--- /home/hybrid/nvidia/common.mk 2012-04-20 13:05:19.672992402 -0700
***************
*** 268,285 ****

# If dynamically linking to CUDA and CUDART, we exclude the libraries from the LIB
ifeq ($(USECUDADYNLIB),1)
! LIB += ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB} -ldl -rdynamic
else
# static linking, we will statically link against CUDA and CUDART
ifeq ($(USEDRVAPI),1)
! LIB += -lcuda ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB}
else
ifeq ($(emu),1)
LIB += -lcudartemu
else
LIB += -lcudart
endif
! LIB += ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB}
endif
endif

--- 268,285 ----

# If dynamically linking to CUDA and CUDART, we exclude the libraries from the LIB
ifeq ($(USECUDADYNLIB),1)
! LIB += $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB} -ldl -rdynamic
else
# static linking, we will statically link against CUDA and CUDART
ifeq ($(USEDRVAPI),1)
! LIB += -lcuda $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB}
else
ifeq ($(emu),1)
LIB += -lcudartemu
else
LIB += -lcudart
endif
! LIB += $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB}
endif
endif

So for now I seem to have to stick to Ubuntu 11.10, which works with CUDA 4.2, as I could not get it to work on the 12.04 beta.

Michael

#11
Posted 04/20/2012 08:47 PM   
I had the same thing. First update to the very newest driver (295.41), then update the toolkit to 4.2 and install/build the 4.2 SDK.
Then go into SDK/C/.. and run ./deviceQuery. It prints a few lines and hangs. This is your solution: WAIT 5 MINUTES. I'm serious. After that, it works like a charm; otherwise, at least try running things as root.

#12
Posted 04/21/2012 08:14 PM   
The SOLUTION: disable the IOMMU feature in the BIOS. After that it works on 12.04 the same as it works under 11.10, no more hangs.

Michael

[quote name='mrmichaelwill' date='20 April 2012 - 08:47 PM' timestamp='1334954822' post='1398859']
I tried again with the spanking new cuda 4.2, nvidia devdriver 295.41, and it fails still on ubuntu server 12.04 beta 2 with kernel 3.2.0-23-generic, but it works fine on ubuntu server 11.10 with kernel 3.0.0-17-generic.

The only change I needed to make for all examples to compile was to NVIDIA_GPU_Computing_SDK/C/common/common.mk moving the OPENGLLIB linking behind the RENDERCHECKGLLIB linking:

*** common.mk 2012-04-20 21:25:05.497193895 -0700
--- /home/hybrid/nvidia/common.mk 2012-04-20 13:05:19.672992402 -0700
***************
*** 268,285 ****

# If dynamically linking to CUDA and CUDART, we exclude the libraries from the LIB
ifeq ($(USECUDADYNLIB),1)
! LIB += ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB} -ldl -rdynamic
else
# static linking, we will statically link against CUDA and CUDART
ifeq ($(USEDRVAPI),1)
! LIB += -lcuda ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB}
else
ifeq ($(emu),1)
LIB += -lcudartemu
else
LIB += -lcudart
endif
! LIB += ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${LIB}
endif
endif

--- 268,285 ----

# If dynamically linking to CUDA and CUDART, we exclude the libraries from the LIB
ifeq ($(USECUDADYNLIB),1)
! LIB += $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB} -ldl -rdynamic
else
# static linking, we will statically link against CUDA and CUDART
ifeq ($(USEDRVAPI),1)
! LIB += -lcuda $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB}
else
ifeq ($(emu),1)
LIB += -lcudartemu
else
LIB += -lcudart
endif
! LIB += $(PARAMGLLIB) $(RENDERCHECKGLLIB) ${OPENGLLIB} ${LIB}
endif
endif

So for now I seem to have to stick to ubuntu 11.10 which works with cuda 4.2 as I could not get it to work on 12.04 beta.

Michael
[/quote]

#13
Posted 04/23/2012 08:40 PM   