Questions about global and local work size

Hi,

while digging into OpenCL and reading tutorials, some things remained unclear to me. Here is a collection of my questions regarding local and global work sizes.

1. Must the [font=“Courier New”]global_work_size[/font] be smaller than [font=“Courier New”]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?
On my machine [font=“Courier New”]CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64[/font].

2. Is [font=“Courier New”]CL_KERNEL_WORK_GROUP_SIZE[/font] the recommended [font=“Courier New”]work_group_size[/font] for the kernel being used?
2b. Or is this the only [font=“Courier New”]work_group_size[/font] the GPU allows?

On my machine [font=“Courier New”]CL_KERNEL_WORK_GROUP_SIZE = 512[/font]

3. Do I need to divide the work into work groups, or can I have only one, e.g. by not specifying local_work_size?
3b. What do I have to pay attention to when I only have one work group?

4. What does [font=“Courier New”]CL_DEVICE_MAX_WORK_GROUP_SIZE[/font] mean?
On my machine [font=“Courier New”]CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64[/font]
4b. Does this mean I can have one work group that is as large as [font=“Courier New”]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?

Added by edit:
5. Does [font=“Courier New”]global_work_size[/font] have to be a divisor of [font=“Courier New”]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?
In my code [font=“Courier New”]global_work_size = 20[/font].

Thanks for your help!

I have the same questions exactly… could anyone help please!!

Similar story here. What really confuses me is that I launch the kernels with a local work group size of CL_DEVICE_MAX_WORK_GROUP_SIZE, but when I use the visual profiler, the “local work group size” is always 1 and the “work group sizeX” is the expected 512.

Is this just misleading information or is my local work group size genuinely much smaller than it should be?

Thanks!

Jan

Are these questions too simple? Did I violate some etiquette?

Hey, I’ll try and answer your questions as best I can - I’m no expert, so someone should verify what I write - just beware. Before I begin, though, it sounds like you are getting confused about the differences between work items, work sizes, work groups, etc. It’s a lot of terminology, so I understand. I’d recommend grabbing the NVIDIA or AMD OpenCL Programming Guides and reading the sections on these. Also, check out MacResearch.org - they have some great OpenCL tutorials that really make it understandable for novices like me.

1. Your global_work_size can be larger than your work item sizes. Your global_work_size is essentially the size of your problem, and your work items are basically pieces of that problem. You always want your global work size to be a multiple of your work-group size. Your work-group dimensions need to be less than or equal to the max sizes, as indicated by the result you gave. Your global_work_size needs to be less than or equal to the result of CL_DEVICE_MAX_WORK_GROUP_SIZE.

2. Your CL_KERNEL_WORK_GROUP_SIZE is the maximum size allowed by the kernel itself, and that could be different from your device work group size. Always double-check this. From my understanding, it has something to do with register usage, etc., and it could be smaller than the device work group size. It should never be bigger, though!
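
For what it’s worth, you can query this value per kernel and per device roughly like this (untested sketch; kernel and device are assumed to already exist):

[codebox]
// Ask the runtime how large a work-group this particular kernel supports
// on this particular device.
size_t kernel_wg_size = 0;
cl_int err = clGetKernelWorkGroupInfo(kernel, device,
                                      CL_KERNEL_WORK_GROUP_SIZE,
                                      sizeof(kernel_wg_size),
                                      &kernel_wg_size, NULL);
// When picking local_work_size, stay at or below this value
// (and below the device limits as well).
[/codebox]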

3. You don’t need to manually divide into work groups. You give the OpenCL device a global work size. You can also specify the local work size, or leave it NULL. If you specify it, the OpenCL runtime will “divide up” the global work size into as many pieces as necessary to complete the entire problem. If you leave it NULL, the OpenCL runtime will try to choose “optimal” settings for the local work size - note, though, that its choice of “optimal” might not be the true optimal choice!
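
To make the two options concrete, here is a minimal host-side sketch (assuming queue and kernel are already set up and n is the total number of work-items):

[codebox]
size_t global_work_size = n;   // total number of work-items
size_t local_work_size  = 64;  // work-group size; global must be a multiple of this

// Option 1: let the runtime choose the work-group size.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_work_size, NULL, 0, NULL, NULL);

// Option 2: specify the work-group size yourself.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_work_size, &local_work_size, 0, NULL, NULL);
[/codebox]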

Unfortunately, I’m headed to class now, so I can’t finish responding. Try thinking about the above and see if it helps.

Hi Kevin,

thanks for helping out again!

After reading your answers, one new question arose:

Does this mean that in my case, with [font=“Courier New”]CL_DEVICE_MAX_WORK_GROUP_SIZE[/font] = 512, [font=“Courier New”]globalid[/font] has to be <= 512 as well?

What would I do with a for loop like this?

[codebox]
int my_array[20000];

for (int i = 0; i < 20000; i++) {
    my_array[i] = i;
}
[/codebox]

In any case, thanks for your answer, it is really appreciated. I read through the AMD and the NVIDIA guides, but both left me uncertain about these questions. I will look into MacResearch.

CL_DEVICE_MAX_WORK_GROUP_SIZE is the maximum size of a work-group, as the name of the hardware limit says. It limits how big each piece of the work can be, not the overall size of your parallel problem (I am successfully using thousands of threads with the hardware limit CL_DEVICE_MAX_WORK_GROUP_SIZE = 512 and localid < 512).

So it means that when you call clEnqueueNDRangeKernel, the local_work_size can't be larger than 512. It says nothing about global_work_size.

However, if you ever run into a limit on global_work_size, you can always divide the input array into parts so that each thread works on several input values instead of just one. For most problems, this last approach is the best one.
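
To illustrate, a rough kernel sketch of that idea for the 20000-element array above (the name fill_array is just made up for the example):

[codebox]
// Each work-item strides over the array, so global_work_size does not
// have to match the array length at all.
__kernel void fill_array(__global int *my_array, int n)
{
    int stride = get_global_size(0);
    for (int i = get_global_id(0); i < n; i += stride) {
        my_array[i] = i;
    }
}
[/codebox]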

Hello,

I'm new to OpenCL. I just came across a simple dummy example of matrix multiplication.

I have 2 matrices (e.g. A[MxN] and B[NxP]) that give C[MxP] as a result.

IMPORTANT: M, N and P are less than or equal to 16, so no partitioning is needed; the global and local work sizes are the same, i.e. the example uses a 2D NDRange (grid) with 1 work-group.

The problem is that when M<P, C has incorrect values.

My guess is that it has to do with the work sizes passed to clEnqueueNDRangeKernel().

Host code:

[codebox]
local_worksize[0]  = wb;
local_worksize[1]  = ha;
global_worksize[0] = wb;
global_worksize[1] = ha;

errcode = clEnqueueNDRangeKernel(queue, kernel,
              2, NULL, global_worksize, local_worksize, 0, NULL, NULL);
[/codebox]

where ha is M and wb is P.

Device code:

[codebox]
__kernel void
matrixMul(__global float *C, __global float *A, __global float *B,
          int wA, int wB)
{
    // 2D thread ID
    int tx = get_global_id(0);
    int ty = get_global_id(1);

    // value stores the element that is computed by the thread
    float value = 0;
    for (int k = 0; k < wA; k++) {
        float elementA = A[ty*wA + k];
        float elementB = B[k*wB + tx];
        value += elementA * elementB;
    }

    // write the matrix to device memory (each thread writes one element)
    C[ty*wA + tx] = value;
}
[/codebox]

I have checked the values of the arguments passed to the kernel and they are correct.

Also, as I said, when M>P || M==P, the results are correct.

I tested it on Windows and Linux.

On Windows I ran 2 tests: one for the NVIDIA GPU and one for the AMD Phenom CPU.

On Linux I only ran the GPU test.

Also, when I ran some SDK samples, there was no problem (I see “PASSED” on the console).

Drivers on Linux:

[codebox]
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  256.53  Fri Aug 27 20:27:48 PDT 2010
GCC version:  gcc version 4.3.5 (Debian 4.3.5-4)
[/codebox]

Drivers on Windows: latest.

Also, for the CPU platform I have installed the latest ATI Stream SDK.

All tests show the same results, so this isn't a driver problem (or else AMD and NVIDIA would have the same problem)!

Is there something I should know about work size dimensions?

Thanks in advance.

Try to step through your code running OpenCL on the CPU (e.g. debug on Linux with gdb) and look at where it goes wrong. I suspect it is the fault of your code, not OpenCL.
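
If it helps, picking the CPU device for such a debug run looks roughly like this (sketch only; error checking is omitted, and the “-g” build option is implementation-specific, e.g. on the AMD CPU runtime):

[codebox]
// Select a CPU device instead of the GPU so the kernel code can be
// stepped through with a normal host debugger such as gdb.
cl_platform_id platform;
cl_device_id   device;
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
// ...then create the context/queue for this device and, if your
// implementation supports it, build the program with the "-g" option.
[/codebox]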

Is the result being saved correctly? This line seems to me to be the reason: C[ty*wA+tx]=value; I guess it should be C[ty*wB+tx]=value; but I could be wrong…

You were right, sir. My bad; stupid mistake (C has width wB). Thank you for answering.

Also, I have another question regarding local work sizes.

In the kernel I used, every thread writes one element of the resulting matrix.

For large matrices, I have to partition the problem into sub-problems. But if the resulting matrix has an “irregular” size (e.g. a height or width that is a prime number), how should I choose the local work size? I assume the local work size has to be uniform (that is, if I use a 16x16 local work size, I cannot change that value at the “edges” of the problem so that all work-groups “fit into” the NDRange; and, as I said, I cannot have “excess” threads, because every thread writes an element).

Thanks in advance.
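
(For reference, one common pattern in this situation is to round the global size up to a multiple of the local size and guard the writes with a bounds check, so the “excess” work-items at the edges simply do nothing. A rough sketch; round_up, the 16x16 choice and the extra hA argument are just for illustration:)

[codebox]
// Host side: round each global dimension up to a multiple of the local size.
size_t round_up(size_t x, size_t m) { return ((x + m - 1) / m) * m; }

size_t local_worksize[2]  = { 16, 16 };
size_t global_worksize[2] = { round_up(wb, 16), round_up(ha, 16) };

// Kernel side: skip the padding work-items that fall outside the matrix
// (hA would have to be passed as an extra kernel argument).
//   int tx = get_global_id(0);
//   int ty = get_global_id(1);
//   if (tx < wB && ty < hA) {
//       C[ty*wB + tx] = value;
//   }
[/codebox]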
