CUDA accessing ALL devices, even those which are blacklisted
Hi all,

I'm using Linux cgroups (through SLURM, http://www.schedmd.com/) to control CUDA and OpenCL jobs on machines with multiple NVIDIA GPUs. Cgroups are nice since they isolate processes and prevent them from accessing unallocated resources.
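
For reference, the relevant parts of the setup look roughly like this (a minimal sketch; exact parameters depend on the SLURM version and the site's device layout, and the job binary name is made up):

[code]# cgroup.conf -- confine each job to the devices it was allocated
ConstrainDevices=yes

# gres.conf -- declare the GPUs and the device files they map to
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2

# jobs then request GPUs explicitly, e.g.
# srun --gres=gpu:1 ./my_cuda_app[/code]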

I can effectively prevent code from running on unallocated GPUs, but it seems that just initializing CUDA requires visiting every NVIDIA GPU in the system. Because cgroups prevents this, every run fails. For example, running /opt/cuda/sdk/C/bin/linux/release/deviceQuery through this setup fails:
standard output:
[code]/opt/cuda/sdk/C/bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
-> invalid device ordinal[/code]
error output:
[code][deviceQuery] starting...
[deviceQuery] test results...
FAILED
> exiting in 3 seconds: 3...2...1...done!
srun: error: shockwave: task 0: Exited with exit code 1[/code]

Running deviceQuery outside of the queueing system works just fine:
[code]$ /opt/cuda/sdk/C/bin/linux/release/deviceQuery
[deviceQuery] starting...

/opt/cuda/sdk/C/bin/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 3 CUDA Capable device(s)

Device 0: "GeForce GTX 580"
CUDA Driver Version / Runtime Version 4.1 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 3072 MBytes (3220897792 bytes)
(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores
GPU Clock Speed: 1.54 GHz
Memory Clock rate: 2004.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce 210"
CUDA Driver Version / Runtime Version 4.1 / 4.1
CUDA Capability Major/Minor version number: 1.2
Total amount of global memory: 512 MBytes (536674304 bytes)
( 2) Multiprocessors x ( 8) CUDA Cores/MP: 16 CUDA Cores
GPU Clock Speed: 1.23 GHz
Memory Clock rate: 600.00 Mhz
Memory Bus Width: 64-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: No
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 2: "GeForce GTX 580"
CUDA Driver Version / Runtime Version 4.1 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 3072 MBytes (3220897792 bytes)
(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores
GPU Clock Speed: 1.54 GHz
Memory Clock rate: 2004.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 65 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.1, CUDA Runtime Version = 4.1, NumDevs = 3, Device = GeForce GTX 580, Device = GeForce 210
[deviceQuery] test results...
PASSED

> exiting in 3 seconds: 3...2...1...done![/code]

Is there a way to prevent CUDA from trying to access every possible device on the system? Or at least to not fail completely when it cannot access one device (which it shouldn't access anyway)?

#1
Posted 02/01/2012 12:33 AM   
You can use the CUDA_VISIBLE_DEVICES environment variable to control visibility. If you run:

[code]CUDA_VISIBLE_DEVICES=0,2 /opt/cuda/sdk/C/bin/linux/release/deviceQuery[/code]

you will see output only for the GTX 580s. The driver will not try to initialize the GeForce 210.

(See http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Best_Practices_Guide.pdf Section 12.5 for a complete explanation of this variable and its effects)
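
If it is more convenient to pin visibility from inside a program rather than from the shell, the same effect can in principle be had by setting the variable with setenv() before the first CUDA runtime call. A minimal sketch (not part of deviceQuery; the ordinals "0,2" are just an example):

[code]#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Must be set before the CUDA runtime initializes, i.e. before the
       first CUDA API call made by this process. */
    setenv("CUDA_VISIBLE_DEVICES", "0,2", 1);

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Visible CUDA devices: %d\n", count);  /* expect 2 here */
    return 0;
}[/code]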

#2
Posted 04/25/2012 04:05 AM   
Hi,

This is an old issue, but I believe it's still there, even in CUDA 6.5. CUDA_VISIBLE_DEVICES doesn't seem to have any effect on the cudaGetDeviceCount() call.

I have the exact same setup as the original poster, and I can demonstrate this. On a server with 8 GPUs, I run a Slurm job in which I request only 1 GPU, so my job runs in a cgroup where access to all /dev/nvidiaX devices is forbidden except for one. nvidia-smi -L works fine:

[code]$ nvidia-smi -L
GPU 0: Tesla K20Xm (UUID: GPU-88032b92-4cc2-0c14-0182-c3ccf6daba67)
Unable to determine the device handle for gpu 0000:05:00.0: Unknown Error
Unable to determine the device handle for gpu 0000:08:00.0: Unknown Error
Unable to determine the device handle for gpu 0000:09:00.0: Unknown Error
Unable to determine the device handle for gpu 0000:85:00.0: Unknown Error
Unable to determine the device handle for gpu 0000:86:00.0: Unknown Error
Unable to determine the device handle for gpu 0000:89:00.0: Unknown Error
Unable to determine the device handle for gpu 0000:8A:00.0: Unknown Error[/code]

Only access to GPU 0 is allowed.

But then, deviceQuery fails because it tries to access all the GPUs, even when CUDA_VISIBLE_DEVICES is set:

[code]$ CUDA_VISIBLE_DEVICES=0 ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 10
-> invalid device ordinal
Result = FAIL[/code]


It's obvious when running it through strace:

[code]$ CUDA_VISIBLE_DEVICES=0 strace ./deviceQuery
[...]
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 0), ...}) = 0
open("/dev/nvidia0", O_RDWR) = 4
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
ioctl(3, 0xc020462a, 0x7fffb0964550) = 0
open("/proc/driver/nvidia/params", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feaba2fa000
read(5, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 413
close(5) = 0
munmap(0x7feaba2fa000, 4096) = 0
stat("/dev/nvidia1", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 1), ...}) = 0
open("/dev/nvidia1", O_RDWR) = -1 EPERM (Operation not permitted)
ioctl(3, 0xc0104629, 0x7fffb0964720) = 0
close(3) = 0
close(4) = 0
munmap(0x3231000000, 16149104) = 0
write(1, "cudaGetDeviceCount returned 10\n", 31cudaGetDeviceCount returned 10
) = 31
write(1, "-> invalid device ordinal\n", 26-> invalid device ordinal
) = 26
write(1, "Result = FAIL\n", 14Result = FAIL
) = 14
exit_group(1)[/code]
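
For what it's worth, the same permission pattern can be reproduced without CUDA at all by probing the device files directly. A minimal sketch that only mirrors what the strace output above already shows:

[code]#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char path[32];
    for (int i = 0; i < 8; i++) {
        snprintf(path, sizeof(path), "/dev/nvidia%d", i);
        int fd = open(path, O_RDWR);
        if (fd < 0) {
            /* EPERM for the GPUs denied by the cgroup devices controller */
            printf("%s: %s\n", path, strerror(errno));
        } else {
            printf("%s: accessible\n", path);
            close(fd);
        }
    }
    return 0;
}[/code]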


Could you please advise on this, and confirm that setting CUDA_VISIBLE_DEVICES should have an effect on the cudaGetDeviceCount() function? Because it apparently has none.

#3
Posted 10/16/2014 01:06 AM   
The enumeration order used by nvidia-smi may not match the enumeration order used by deviceQuery.

Therefore I think you are assuming that, since nvidia-smi does not fail on device "0", deviceQuery should be fine with device "0". But they may not be the same physical device. Try running your deviceQuery command with each individual device selected, one by one, and I think you will find one that works.

CUDA_VISIBLE_DEVICES="0" ./deviceQuery
CUDA_VISIBLE_DEVICES="1" ./deviceQuery
CUDA_VISIBLE_DEVICES="2" ./deviceQuery
etc.

#4
Posted 10/16/2014 01:59 AM   
Hi,

Thanks for your feedback.

The GPU IDs are actually consistent across all tools, otherwise the CUDA_VISIBLE_DEVICES mechanism wouldn't make any sense. And the strace output I posted earlier shows that deviceQuery was trying to access at least two devices (nvidia0 and nvidia1) even though CUDA_VISIBLE_DEVICES contained only one ID.

Anyway, I tried your suggestion, and it doesn't work:

[code]$ for i in {0..7}; do echo -n "CUDA_VISIBLE_DEVICES=$i: "; CUDA_VISIBLE_DEVICES=$i ./deviceQuery | grep Result ; done
CUDA_VISIBLE_DEVICES=0: Result = FAIL
CUDA_VISIBLE_DEVICES=1: Result = FAIL
CUDA_VISIBLE_DEVICES=2: Result = FAIL
CUDA_VISIBLE_DEVICES=3: Result = FAIL
CUDA_VISIBLE_DEVICES=4: Result = FAIL
CUDA_VISIBLE_DEVICES=5: Result = FAIL
CUDA_VISIBLE_DEVICES=6: Result = FAIL
CUDA_VISIBLE_DEVICES=7: Result = FAIL[/code]

#5
Posted 10/16/2014 04:31 PM   
What permissions are being set on the device control files? Do you have control over this?

#6
Posted 10/16/2014 07:10 PM   
The permissions are set through the cgroup devices subsystem, as described here: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-devices.html

The thing is I [b]do[/b] want to restrict access to certain GPUs. I just want cudaGetDeviceCount() to report the number of GPUs listed in CUDA_VISIBLE_DEVICES rather than failing when access to a /dev/nvidiaX device is not allowed.
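
For anyone unfamiliar with that mechanism, the cgroup v1 devices controller is driven through devices.deny/devices.allow files, roughly like this (a sketch with a made-up cgroup path; NVIDIA GPUs are character devices with major number 195, as the ls -l output further down shows):

[code]# deny every NVIDIA GPU device (char major 195) for the job's cgroup,
# then re-allow only the GPU that was allocated (here /dev/nvidia0)
echo 'c 195:* rwm' > /sys/fs/cgroup/devices/slurm/job_1234/devices.deny
echo 'c 195:0 rwm' > /sys/fs/cgroup/devices/slurm/job_1234/devices.allow
# the control device /dev/nvidiactl (195:255) must also remain allowed
echo 'c 195:255 rwm' > /sys/fs/cgroup/devices/slurm/job_1234/devices.allow[/code]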

#7
Posted 10/16/2014 09:51 PM   
Can you list the numerical permissions on the device files in the allocated state, as root if necessary, using ls -l?

That is, create an allocation as if you were assigning a GPU to a user/job, the same condition in which you see the error. Then, in that state, as root, list the permissions of all 8 NVIDIA device files.

#8
Posted 10/16/2014 10:17 PM   
Sure, but since the access restrictions are enforced by the cgroup devices subsystem (and thus by the kernel itself), the permissions on the device files are not modified:

[code]$ ls -al /dev/nvidia[0-9]
crw-rw-rw- 1 root root 195, 0 Oct 2 07:55 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Oct 2 07:55 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Oct 2 07:55 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Oct 2 07:55 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Oct 2 07:55 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Oct 2 07:55 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Oct 2 07:55 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Oct 2 07:55 /dev/nvidia7[/code]


Yet:
[code]$ cat /dev/nvidia0
cat: /dev/nvidia0: Invalid argument[/code]

which is OK; cat doesn't really make sense on /dev/nvidia0, but this just shows that read() is actually allowed.

And:
[code]$ cat /dev/nvidia1
cat: /dev/nvidia1: Operation not permitted[/code]

Access is denied.

#9
Posted 10/17/2014 01:08 AM   
For the record, I got confirmation from NVIDIA engineering that this problem will be fixed in CUDA 7.

#10
Posted 10/17/2014 09:43 PM   