OpenCL kernel execution error for 3d with clEnqueueNDRangeKernel

Dear NVidia Developers,

I am running into strange limitations on the number of arguments that can be passed to a kernel.

I've designed a simple 3D/2D test to demonstrate the problem (see the attached oclSimple3d.zip).

My system is Windows 7 64-bit, an NVidia GeForce GTX 260 (driver 285.62) and CUDA 4.0.

According to the “NVIDIA OpenCL Best Practices Guide”, “Shared memory holds the parameters or arguments that are passed to kernels at launch”. The shared memory limit is about 16 KB per multiprocessor.

The limit on the total size of the arguments passed to a kernel is also fairly large; see CL_DEVICE_MAX_PARAMETER_SIZE. The minimum value an implementation must support is 256 bytes, and on my test system the query returns CL_DEVICE_MAX_PARAMETER_SIZE = 4352.
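
Both values can be read back at run time with clGetDeviceInfo; a minimal stand-alone sketch (first GPU on the first platform, error checking omitted) that prints them:

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    size_t         max_param = 0;   /* CL_DEVICE_MAX_PARAMETER_SIZE returns a size_t */
    cl_ulong       local_mem = 0;   /* CL_DEVICE_LOCAL_MEM_SIZE returns a cl_ulong   */

    /* first GPU on the first platform; error checking omitted for brevity */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_MAX_PARAMETER_SIZE,
                    sizeof(max_param), &max_param, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);

    printf("CL_DEVICE_MAX_PARAMETER_SIZE = %u bytes\n", (unsigned)max_param);
    printf("CL_DEVICE_LOCAL_MEM_SIZE     = %u bytes\n", (unsigned)local_mem);
    return 0;
}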

Please keep in mind that my test 2D/3D kernels are very simple (see the “simple3d_test1” kernel).

This “simple3d_test1” kernel executes with no problem on a 256x256x256 3D image.

However, adding more parameters to the kernel (see “simple3d_test2”) completely breaks the execution.

The GPU result DOES NOT MATCH the CPU result within the allowable tolerance for “simple3d_test2”.

The resulting 3D array has 98% of its pixels set to zero. I've tried many variations of the parameters for this kernel (see kernels simple3d_test2…simple3d_test5), and the result is always the same: 98% of the pixels are zeros.

It seems that driver 285.62 does not execute it correctly, which makes it impossible to use with 3D arrays.

I've also tested similar kernels on a simple 2D array, where they work fine; see “simple2d_test2”, for example (and set test3D = false in oclSimple3D.cpp).

Building for x86 or x64 does not affect the test results.

Could you please explain or confirm this?

Thanks,

-Denis

// working kernel in 3D
// test1
__kernel void simple3d_test1(const __global float* in,
                             uint4   in_size,
                             __global float* out,
                             uint4   out_size)
{
  uint3 index;
  index.x = get_global_id(0);
  index.y = get_global_id(1);
  index.z = get_global_id(2);

  bool isValid = true;
  if(index.x >= out_size.x) isValid = false;
  if(index.y >= out_size.y) isValid = false;
  if(index.z >= out_size.z) isValid = false;

  if(isValid)
  {
    uint gidx = out_size.x * (index.z * out_size.y + index.y) + index.x;
    out[gidx] = 1;
  }
}

// NOT working kernel in 3D, what is wrong?
// test2
__kernel void simple3d_test2(const __global float* in,
                             uint4   in_size,
                             float4  f11,
                             float4  f12,
                             float16 a1,
                             float16 b1,
                             float16 c1,
                             __global float* out,
                             uint4   out_size,
                             float4  f21,
                             float4  f22,
                             float16 a2,
                             float16 b2,
                             float16 c2)
{
  uint3 index;
  index.x = get_global_id(0);
  index.y = get_global_id(1);
  index.z = get_global_id(2);

  bool isValid = true;
  if(index.x >= out_size.x) isValid = false;
  if(index.y >= out_size.y) isValid = false;
  if(index.z >= out_size.z) isValid = false;

  if(isValid)
  {
    uint gidx = out_size.x * (index.z * out_size.y + index.y) + index.x;
    out[gidx] = 1;
  }
}

// working kernel in 2D
// test2 2D
__kernel void simple2d_test2(const __global float* in,
                             uint2   in_size,
                             float2  f11,
                             float2  f12,
                             float16 a1,
                             float16 b1,
                             float16 c1,
                             float16 d1,
                             __global float* out,
                             uint2   out_size,
                             float2  f21,
                             float2  f22,
                             float16 a2,
                             float16 b2,
                             float16 c2,
                             float16 d2)
{
  uint2 index;
  index.x = get_global_id(0);
  index.y = get_global_id(1);

  bool isValid = true;
  if(index.x >= out_size.x) isValid = false;
  if(index.y >= out_size.y) isValid = false;

  if(isValid)
  {
    unsigned int gidx = out_size.x * index.y + index.x;
    out[gidx] = 1;
  }
}

oclSimple3d.zip (9.91 KB)
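
Since the full host code is in the attached zip, here is only a rough sketch of how the 3D launch is set up; the names (queue, kernel, d_in, d_out), the zero-filled scalar arguments and the chosen work sizes are illustrative assumptions, not the exact code from oclSimple3d.zip:

#include <CL/cl.h>

/* Rough sketch of the 3D launch for simple3d_test2. queue, kernel, d_in and
   d_out are assumed to have been created already; the scalar arguments are
   zero-filled placeholders. The by-value arguments total 480 bytes here
   (6 x 16 + 6 x 64), well under the reported 4352-byte limit. */
cl_int launch_simple3d_test2(cl_command_queue queue, cl_kernel kernel,
                             cl_mem d_in, cl_mem d_out)
{
    cl_uint4   in_size  = {{256, 256, 256, 0}};
    cl_uint4   out_size = {{256, 256, 256, 0}};
    cl_float4  f4  = {{0.0f, 0.0f, 0.0f, 0.0f}};
    cl_float16 f16 = {{0.0f}};

    cl_uint arg = 0;
    clSetKernelArg(kernel, arg++, sizeof(cl_mem),     &d_in);
    clSetKernelArg(kernel, arg++, sizeof(cl_uint4),   &in_size);
    clSetKernelArg(kernel, arg++, sizeof(cl_float4),  &f4);    /* f11 */
    clSetKernelArg(kernel, arg++, sizeof(cl_float4),  &f4);    /* f12 */
    clSetKernelArg(kernel, arg++, sizeof(cl_float16), &f16);   /* a1  */
    clSetKernelArg(kernel, arg++, sizeof(cl_float16), &f16);   /* b1  */
    clSetKernelArg(kernel, arg++, sizeof(cl_float16), &f16);   /* c1  */
    clSetKernelArg(kernel, arg++, sizeof(cl_mem),     &d_out);
    clSetKernelArg(kernel, arg++, sizeof(cl_uint4),   &out_size);
    clSetKernelArg(kernel, arg++, sizeof(cl_float4),  &f4);    /* f21 */
    clSetKernelArg(kernel, arg++, sizeof(cl_float4),  &f4);    /* f22 */
    clSetKernelArg(kernel, arg++, sizeof(cl_float16), &f16);   /* a2  */
    clSetKernelArg(kernel, arg++, sizeof(cl_float16), &f16);   /* b2  */
    clSetKernelArg(kernel, arg++, sizeof(cl_float16), &f16);   /* c2  */

    size_t global[3] = {256, 256, 256};   /* one work-item per output voxel */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                        global, NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        return err;
    return clFinish(queue);               /* asynchronous failures surface here */
}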

Small update to this problem.
This code has been checked with the Intel® OpenCL SDK 1.5 and the AMD OpenCL SDK:
Intel: Intel® SDK for OpenCL™ Applications
AMD: http://developer.amd.com/documentation/articles/pages/opencl-and-the-amd-app-sdk.aspx

The result is that this problem does NOT exist in the Intel and AMD implementations of OpenCL 1.1.
I hope it will be fixed in the next NVidia 290.x drivers.

Thanks,
-Denis

Seems strange looking at the code, but there are two things I can think of (both can be checked at run time; see the sketch after this list):

  1. The maximum parameter size under compute 1.x is 256 bytes; under Fermi it has grown to 4 KB.

  2. The grid under compute 1.x is 2D, not 3D; if you pass a z dimension that is not 1, you will get a kernel launch failure. That has changed under Fermi as well.
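
A minimal sketch for point 2, assuming queue and kernel already exist (point 1 can be checked with the clGetDeviceInfo query shown earlier in the thread): query the NDRange limits, try a 3D enqueue, and look at the return codes.

#include <stdio.h>
#include <CL/cl.h>

/* Report the NDRange limits and attempt a 3D enqueue.
   queue and kernel are assumed to exist already. */
cl_int try_3d_launch(cl_device_id device, cl_command_queue queue, cl_kernel kernel)
{
    cl_uint max_dims     = 0;
    size_t  max_items[3] = {0, 0, 0};     /* assumes the usual three dimensions */

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS,
                    sizeof(max_dims), &max_dims, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_items), max_items, NULL);
    printf("work-item dims: %u, max sizes: %u x %u x %u\n",
           (unsigned)max_dims, (unsigned)max_items[0],
           (unsigned)max_items[1], (unsigned)max_items[2]);

    size_t global[3] = {256, 256, 256};
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                        global, NULL, 0, NULL, NULL);
    if (err == CL_SUCCESS)
        err = clFinish(queue);            /* asynchronous launch failures surface here */
    if (err != CL_SUCCESS)
        fprintf(stderr, "3D launch failed with error %d\n", err);
    return err;
}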