OpenCL 3D kernel execution error with clEnqueueNDRangeKernel
Dear NVidia Developers,

I am running into strange limitations on the number of arguments that can be passed to a kernel.
I've designed a simple 3D/2D test to demonstrate the problem (see the attached oclSimple3d.zip).
My system is Windows 7 64-bit, an NVIDIA GeForce GTX 260 (driver 285.62), and CUDA 4.0.

According to the "NVIDIA OpenCL Best Practices Guide", "Shared memory holds the parameters
or arguments that are passed to kernels at launch." The shared memory limit is about 16 KB per thread block.
The limit on the total size of kernel arguments is also fairly large; see CL_DEVICE_MAX_PARAMETER_SIZE.
The minimum supported value must be at least 256 bytes, and on my test system CL_DEVICE_MAX_PARAMETER_SIZE = 4352.
Please keep in mind that my 2D/3D test kernels are very simple (see the "simple3d_test1" kernel).
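For reference, the limit can be queried from the host with clGetDeviceInfo; a minimal sketch, assuming `device` is a valid cl_device_id:

[code]
#include <CL/cl.h>
#include <stdio.h>

/* Print the maximum total size, in bytes, of all arguments that
   can be passed to a kernel on the given device. */
void print_max_parameter_size(cl_device_id device)
{
    size_t max_param_size = 0;
    cl_int err = clGetDeviceInfo(device, CL_DEVICE_MAX_PARAMETER_SIZE,
                                 sizeof(max_param_size),
                                 &max_param_size, NULL);
    if (err == CL_SUCCESS)
        printf("CL_DEVICE_MAX_PARAMETER_SIZE = %u bytes\n",
               (unsigned)max_param_size);
}
[/code]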

This "simple3d_test1" kernel executes with no problem on a 3D image of size 256x256x256.
However, adding more parameters to the kernel (see "simple3d_test2") completely breaks the execution:
the GPU result DOES NOT MATCH the CPU result within the allowable tolerance for "simple3d_test2".
In the resulting 3D array, 98% of the pixels are zeros. I've tried many variations of parameters for this kernel
(see kernels simple3d_test2...simple3d_test5); the result is always the same, with 98% of the pixels being zeros.
It seems that driver 285.62 does not execute it correctly, which makes it impossible to use with 3D arrays.
I've also tested a similar kernel on a simple 2D array, where it works fine
(see "simple2d_test2" for example; also set test3D = false in oclSimple3D.cpp).
Building for x86 or x64 does not affect the test results.

Could you please explain or confirm this?

Thanks,
-Denis

[code]
// working kernel in 3D
// test1
__kernel void simple3d_test1(const __global float* in,
                             uint4 in_size,
                             __global float* out,
                             uint4 out_size)
{
    uint3 index;
    index.x = get_global_id(0);
    index.y = get_global_id(1);
    index.z = get_global_id(2);

    bool isValid = true;
    if(index.x >= out_size.x) isValid = false;
    if(index.y >= out_size.y) isValid = false;
    if(index.z >= out_size.z) isValid = false;

    if(isValid)
    {
        uint gidx = out_size.x * (index.z * out_size.y + index.y) + index.x;
        out[gidx] = 1;
    }
}
[/code]
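For completeness, the host-side launch of this kernel looks essentially like the sketch below (simplified; the variable names are illustrative, and the actual code is in the attached oclSimple3d.zip):

[code]
/* Sketch of the 3D launch; queue, kernel, d_in and d_out are assumed
   to be a valid command queue, kernel and device buffers. */
cl_uint4 in_size  = {{ 256, 256, 256, 0 }};
cl_uint4 out_size = {{ 256, 256, 256, 0 }};

clSetKernelArg(kernel, 0, sizeof(cl_mem),   &d_in);
clSetKernelArg(kernel, 1, sizeof(cl_uint4), &in_size);
clSetKernelArg(kernel, 2, sizeof(cl_mem),   &d_out);
clSetKernelArg(kernel, 3, sizeof(cl_uint4), &out_size);

size_t global[3] = { 256, 256, 256 };   /* one work-item per voxel */
size_t local[3]  = { 8, 8, 8 };         /* 512 work-items per group */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3 /* work_dim */,
                                    NULL, global, local, 0, NULL, NULL);
[/code]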

[code]
// NOT working kernel in 3D, what is wrong?
// test2
__kernel void simple3d_test2(const __global float* in,
                             uint4 in_size,
                             float4 f11,
                             float4 f12,
                             float16 a1,
                             float16 b1,
                             float16 c1,
                             __global float* out,
                             uint4 out_size,
                             float4 f21,
                             float4 f22,
                             float16 a2,
                             float16 b2,
                             float16 c2)
{
    uint3 index;
    index.x = get_global_id(0);
    index.y = get_global_id(1);
    index.z = get_global_id(2);

    bool isValid = true;
    if(index.x >= out_size.x) isValid = false;
    if(index.y >= out_size.y) isValid = false;
    if(index.z >= out_size.z) isValid = false;

    if(isValid)
    {
        uint gidx = out_size.x * (index.z * out_size.y + index.y) + index.x;
        out[gidx] = 1;
    }
}
[/code]

[code]
// working kernel in 2D
// test2 2D
__kernel void simple2d_test2(const __global float* in,
                             uint2 in_size,
                             float2 f11,
                             float2 f12,
                             float16 a1,
                             float16 b1,
                             float16 c1,
                             float16 d1,
                             __global float* out,
                             uint2 out_size,
                             float2 f21,
                             float2 f22,
                             float16 a2,
                             float16 b2,
                             float16 c2,
                             float16 d2)
{
    uint2 index;
    index.x = get_global_id(0);
    index.y = get_global_id(1);

    bool isValid = true;
    if(index.x >= out_size.x) isValid = false;
    if(index.y >= out_size.y) isValid = false;

    if(isValid)
    {
        unsigned int gidx = out_size.x * index.y + index.x;
        out[gidx] = 1;
    }
}
[/code]
Attachments: oclSimple3d.zip

#1
Posted 11/14/2011 02:06 PM   
Small update to this problem.
This code has been checked with the Intel® OpenCL SDK 1.5 and the AMD OpenCL SDK:
Intel: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
AMD: http://developer.amd.com/documentation/articles/pages/opencl-and-the-amd-app-sdk.aspx

The result is that this problem does NOT exist in the Intel and AMD implementations of OpenCL 1.1.
I hope it will be fixed in the next NVIDIA 290.x drivers.

Thanks,
-Denis

#2
Posted 12/07/2011 01:08 PM   
[quote name='dpshamonin' date='07 December 2011 - 03:08 PM' timestamp='1323263317' post='1338828']
Small update to this problem.
This code has been checked with the Intel® OpenCL SDK 1.5 and the AMD OpenCL SDK:
Intel: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
AMD: http://developer.amd.com/documentation/articles/pages/opencl-and-the-amd-app-sdk.aspx

The result is that this problem does NOT exist in the Intel and AMD implementations of OpenCL 1.1.
I hope it will be fixed in the next NVIDIA 290.x drivers.

Thanks,
-Denis
[/quote]

It seems strange looking at the code, but two things come to mind:

1. The maximum parameter size under compute 1.x is 256 bytes; under Fermi it has grown to 4 KB.
2. The grid under compute 1.x is 2D, not 3D; if you pass a z dimension that is not 1, you will get a kernel launch failure. That has changed under Fermi as well.
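For what it's worth, here is a rough, hypothetical tally of the argument bytes in simple3d_test2 against that 256-byte figure (the driver's exact per-argument accounting is an assumption here; each buffer argument is counted as a pointer):

[code]
/* Hypothetical byte tally for the simple3d_test2 argument list.
   A buffer argument is counted as a pointer (4 or 8 bytes depending
   on the device); the vector type sizes follow the OpenCL spec. */
size_t total = 2 * sizeof(void*)       /* in, out:            8-16 */
             + 2 * sizeof(cl_uint4)    /* in_size, out_size:    32 */
             + 4 * sizeof(cl_float4)   /* f11, f12, f21, f22:   64 */
             + 6 * sizeof(cl_float16); /* a1-c1, a2-c2:        384 */
/* total is roughly 488-496 bytes: over the 256-byte compute 1.x
   figure, but well under the reported CL_DEVICE_MAX_PARAMETER_SIZE
   of 4352. */
[/code]

Note, though, that the working simple2d_test2 carries an even larger argument list (eight float16s, roughly 576 bytes in the same accounting), so parameter size alone may not fully explain the 2D/3D difference.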

#3
Posted 12/11/2011 01:53 PM   