Questions about global and local work size
Hi,

While digging into OpenCL through tutorials, a few things stayed unclear to me. Here is a collection of my questions regarding local and global work sizes.

[b]1. Must the [font="Courier New"]global_work_size[/font] be smaller than [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]? [/b]
On my machine [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64[/font].

[b]2. Is [font="Courier New"]CL_KERNEL_WORK_GROUP_SIZE[/font] the recommended [font="Courier New"]work_group_size[/font] for the used kernel?
2b. Or is this the only [font="Courier New"]work_group_size[/font] the GPU allows?[/b]
On my machine [font="Courier New"]CL_KERNEL_WORK_GROUP_SIZE = 512[/font]

[b]3. Do I need to divide the work into work groups, or can I have only one by not specifying local_work_size?
3b. What do I have to pay attention to when I only have one work group?[/b]

[b]4. What does [font="Courier New"]CL_DEVICE_MAX_WORK_GROUP_SIZE[/font] mean? [/b]
On my machine [font="Courier New"]CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64[/font]
4b. Does this mean, I can have one work group which is as large as the [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?

[b]Added by edit: [/b]
[b]5. Does [font="Courier New"]global_work_size[/font] have to be a divisor of [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?[/b]
In my code [font="Courier New"]global_work_size = 20[/font].

Thanks for your help!

#1
Posted 10/14/2010 04:35 PM   
I have the same questions exactly... could anyone help please!!

[quote name='i+d' post='1130767' date='Oct 14 2010, 05:35 PM']Hi,

digging into OpenCl reading tutorials some things stayed unclear for me. Here is a collection of my questions regarding local and global work sizes.

[b]1. Must the [font="Courier New"]global_work_size[/font] be smaller than [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]? [/b]
On my machine [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64[/font].

[b]2. Is [font="Courier New"]CL_KERNEL_WORK_GROUP_SIZE[/font] the recommended [font="Courier New"]work_group_size[/font] for the used kernel?
2b. Or is this the only [font="Courier New"]work_group_size[/font] the GPU allows?[/b]
On my machine [font="Courier New"]CL_KERNEL_WORK_GROUP_SIZE = 512[/font]

[b]3. Do I need to divide into work groups or can I have only one, but not specifying local_work_size?
3b. To what do I have to pay attention, when I only have one work group?[/b]

[b]4. What does [font="Courier New"]CL_DEVICE_MAX_WORK_GROUP_SIZE[/font] mean? [/b]
On my machine [font="Courier New"]CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64[/font]
4b. Does this mean, I can have one work group which is as large as the [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?

[b]Added by edit: [/b]
[b]5. Has [font="Courier New"]global_work_size[/font] to be a divisor of [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?[/b]
In my code [font="Courier New"]global_work_size = 20[/font].

Thanks for your help![/quote]

#3
Posted 10/15/2010 01:54 PM   
[quote name='naroqueen' post='1131156' date='Oct 15 2010, 01:54 PM']I have the same questions exactly... could anyone help please!![/quote]

Similar story here. What really confuses me is that I launch the kernels with a local work group size of CL_DEVICE_MAX_WORK_GROUP_SIZE, but when I use the visual profiler, the "local work group size" is always 1 and the "work group sizeX" is the expected 512.
Is this just misleading information or is my local work group size genuinely much smaller than it should be?
Thanks!
Jan

#5
Posted 10/17/2010 02:28 PM   
Are these questions too simple? Did I violate some etiquette? :confused:

#7
Posted 10/18/2010 08:58 AM   
Hey, I'll try and answer your questions as best I can - I'm no expert, so someone should verify what I write, so just beware. Before I begin, though, it sounds like you are getting confused on the differences between work items, work sizes, work groups, etc. It's a lot of terminology, so I understand. I'd recommend grabbing the NVIDIA or AMD OpenCL Programming Guides and read the sections on these. Also, check out MacResearch.org - they have some great OpenCL tutorials that really make it understandable for novices like me.

[b]1. Your global_work_size can be larger than your work item sizes. [/b]Your global_work_size is essentially the size of your problem. Your work items are basically pieces of that problem. You always want your global work size to be a multiple of your local work size (the work-group size). Your local work size in each dimension needs to be less than or equal to the matching CL_DEVICE_MAX_WORK_ITEM_SIZES entry, as indicated by the result you gave. Your global_work_size needs to be less than or equal to the result of the CL_DEVICE_MAX_WORK_GROUP_SIZE.

[b]2. Your CL_KERNEL_WORK_GROUP_SIZE is the maximum size allowed by the kernel itself, and that could be different than your device work group size. [/b] Always double-check this. From my understanding, it has something to do with register usage, etc, and could be smaller than the device work group size. It should never be bigger, though!

[b]3. You don't need to manually divide into work groups. [/b]You give the OpenCL device a global work size. You can also specify the local work size, or leave it NULL. If you specify it, then the OpenCL runtime will "divide up" the global work size into as many work groups of that size as necessary to complete the entire problem. If you leave it NULL, the OpenCL runtime will try to choose "optimal" settings for the local work size. Note, though, that its choice for "optimal" might not be the true optimal choice!
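
To make points 1 to 3 a bit more concrete, here is a minimal host-side sketch in plain C (no OpenCL calls, so it compiles anywhere). It assumes the OpenCL 1.x rule that, when local_work_size is given, each global_work_size dimension must be evenly divisible by it, and that a work group may not exceed either CL_DEVICE_MAX_WORK_GROUP_SIZE or the kernel's CL_KERNEL_WORK_GROUP_SIZE. The helper names clamp_local_size and round_up_global_size are hypothetical, not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a work group may not be larger than the device limit
 * (CL_DEVICE_MAX_WORK_GROUP_SIZE) or the per-kernel limit
 * (CL_KERNEL_WORK_GROUP_SIZE), so take the smallest of the three values. */
static size_t clamp_local_size(size_t wanted, size_t device_max, size_t kernel_max)
{
    size_t limit = device_max < kernel_max ? device_max : kernel_max;
    return wanted < limit ? wanted : limit;
}

/* Hypothetical helper: round the problem size n up to the next multiple of
 * the local size, so the global size is evenly divisible by it. */
static size_t round_up_global_size(size_t n, size_t local)
{
    assert(local > 0);
    return ((n + local - 1) / local) * local;
}
```

With the numbers from this thread, a problem of 20 items and a local size of 512 would be padded to a global size of 512 (one work group), and 20000 items would be padded to 20480 (40 work groups). The kernel then needs a bounds check so the padding work items do nothing.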

Unfortunately, I'm headed to class now, so I can't finish responding. Try thinking about the above and see if it helps.

[quote name='i+d' post='1130767' date='Oct 14 2010, 12:35 PM']Hi,

digging into OpenCl reading tutorials some things stayed unclear for me. Here is a collection of my questions regarding local and global work sizes.

[b]1. Must the [font="Courier New"]global_work_size[/font] be smaller than [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]? [/b]
On my machine [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 64[/font].

[b]2. Is [font="Courier New"]CL_KERNEL_WORK_GROUP_SIZE[/font] the recommended [font="Courier New"]work_group_size[/font] for the used kernel?
2b. Or is this the only [font="Courier New"]work_group_size[/font] the GPU allows?[/b]
On my machine [font="Courier New"]CL_KERNEL_WORK_GROUP_SIZE = 512[/font]

[b]3. Do I need to divide into work groups or can I have only one, but not specifying local_work_size?
3b. To what do I have to pay attention, when I only have one work group?[/b]

[b]4. What does [font="Courier New"]CL_DEVICE_MAX_WORK_GROUP_SIZE[/font] mean? [/b]
On my machine [font="Courier New"]CL_DEVICE_MAX_WORK_GROUP_SIZE = 512, 512, 64[/font]
4b. Does this mean, I can have one work group which is as large as the [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?

[b]Added by edit: [/b]
[b]5. Has [font="Courier New"]global_work_size[/font] to be a divisor of [font="Courier New"]CL_DEVICE_MAX_WORK_ITEM_SIZES[/font]?[/b]
In my code [font="Courier New"]global_work_size = 20[/font].

Thanks for your help![/quote]

#9
Posted 10/18/2010 02:22 PM   
Hi Kevin,

thanks for helping out again!

After reading your answers, one new question arose:
[quote name='HolyGeneralK' post='1132588' date='Oct 18 2010, 04:22 PM']Your global_work_size needs to be less than or equal to the result of the CL_DEVICE_MAX_WORK_GROUP_SIZE.[/quote]
Does this mean, that in my case [font="Courier New"]CL_DEVICE_MAX_WORK_GROUP_SIZE[/font] = 512, [font="Courier New"]globalid[/font] has to be <= 512 as well?
What would I do with such a for loop?

[codebox]int my_array[20000];
for (int i = 0; i < 20000; i++) {
    my_array[i] = i;
}[/codebox]

[quote name='HolyGeneralK' post='1132588' date='Oct 18 2010, 04:22 PM']Hey, I'll try and answer your questions as best I can - I'm no expert, so someone should verify what I write, so just beware. Before I begin, though, it sounds like you are getting confused on the differences between work items, work sizes, work groups, etc. It's a lot of terminology, so I understand. I'd recommend grabbing the NVIDIA or AMD OpenCL Programming Guides and read the sections on these. Also, check out MacResearch.org - they have some great OpenCL tutorials that really make it understandable for novices like me.[/quote]
In any case, thanks for your answer, it is really appreciated. I read through the AMD and the NVIDIA guides, but both left me uncertain about these questions. I will look into MacResearch.

#11
Posted 10/21/2010 01:57 PM   
CL_DEVICE_MAX_WORK_GROUP_SIZE is, as the name says, the hardware limit on the size of a single work-group. It bounds how big each chunk of your problem can be, not the whole size of your parallel problem (I'm successfully using thousands of threads with the hardware limit CL_DEVICE_MAX_WORK_GROUP_SIZE=512, so localid < 512).

So it means, when you call clEnqueueNDRangeKernel, the local_work_size can't be larger than 512. It says nothing about global_work_size.
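
As a rough sketch of what that means for the 20000-element loop asked about earlier in the thread: in OpenCL 1.x, when a local_work_size is passed, global_work_size has to be a multiple of it in each dimension, so a common approach is to round the global size up and let the kernel skip the surplus indices. The helper names below are made up for illustration and are not part of the OpenCL API; this is plain host-side C, assuming local_size > 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): smallest multiple of
   local_size that is >= n. Assumes local_size > 0. */
size_t round_up_to_multiple(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}

/* Hypothetical helper: number of work-groups launched for the padded size. */
size_t work_group_count(size_t n, size_t local_size)
{
    return round_up_to_multiple(n, local_size) / local_size;
}
```

With n = 20000 and a local_work_size of 512 this gives a global_work_size of 20480, i.e. 40 work-groups. The kernel then needs a guard along the lines of `if (get_global_id(0) < n)` so the 480 surplus work-items do nothing.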

However, if you ever come across the limit of global_wrok_size you can always divide input array in parts and all threads will work with several input values, not just one. This last mentioned view of algorithm is for most problems the best one.
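
A minimal sketch of the "several input values per thread" idea, written as plain C that emulates the work-items on the host. In a real kernel the loop would start at get_global_id(0) and step by get_global_size(0); here both are passed in as ordinary parameters, and the function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one work-item: it handles indices
   global_id, global_id + global_size, global_id + 2*global_size, ... */
void fill_strided(int *out, size_t n, size_t global_id, size_t global_size)
{
    assert(global_size > 0);
    for (size_t i = global_id; i < n; i += global_size) {
        out[i] = (int)i;
    }
}

/* Host-side emulation: run every work-item one after another. */
void fill_all(int *out, size_t n, size_t global_size)
{
    for (size_t id = 0; id < global_size; ++id) {
        fill_strided(out, n, id, global_size);
    }
}
```

Calling `fill_all(my_array, 20000, 512)` covers all 20000 elements with only 512 work-items, each of them writing 39 or 40 values.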

#13
Posted 10/21/2010 05:19 PM   
Hello,
I'm new to OpenCL and just came across a simple dummy example of matrix multiplication.
I have 2 matrices, A[MxN] and B[NxP], that give C[MxP] as a result.
IMPORTANT: M, N and P are all less than or equal to 16, so no partitioning is needed: the global and local work sizes are the same, i.e. the example uses a 2D NDRange (grid) with 1 work-group.
The problem is that when M<P, C has incorrect values.
My guess is that it has to do with the work sizes passed to clEnqueueNDRangeKernel().

host code:
[codebox]
local_worksize[0]=wb;
local_worksize[1]=ha;
global_worksize[0]=wb;
global_worksize[1]=ha;

errcode=clEnqueueNDRangeKernel(queue, kernel,
    2, NULL, global_worksize, local_worksize, 0, NULL, NULL
);
[/codebox]
where ha is M and wb is P.

device code:
[codebox]
__kernel void
matrixMul(__global float *C, __global float *A, __global float *B,
          int wA, int wB)
{
    // 2D thread ID
    int tx=get_global_id(0);
    int ty=get_global_id(1);

    // value stores the element that is computed by the thread
    float value=0;

    for(int k=0; k<wA; k++) {
        float elementA=A[ty*wA+k];
        float elementB=B[k*wB+tx];
        value+=elementA*elementB;
    }

    // write the matrix to device memory (each thread writes one element)
    C[ty*wA+tx]=value;
}
[/codebox]

I have checked the values of the arguments passed to the kernel and they are correct.
Also, as I said, when M>=P the results are correct.

I tested it on Windows and Linux.
On Windows I ran 2 tests: one on the NVIDIA GPU and one on the AMD Phenom CPU.
On Linux I only ran the GPU test.
Also, when I run some of the SDK samples, there is no problem (I see "PASSED" on the console).

drivers on linux:
[codebox]
$cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 256.53 Fri Aug 27 20:27:48 PDT 2010
GCC version: gcc version 4.3.5 (Debian 4.3.5-4)
[/codebox]
Drivers on Windows: latest.
Also, for the CPU platform I have installed the latest ATI Stream SDK.

All tests show the same results, so this isn't a driver problem (or else AMD and NVIDIA have the same problem)!

Is there something I should know about work size dimensions?
Thanks in advance
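
Not a definitive diagnosis, but one thing stands out in the kernel as posted: the result is written to C[ty*wA+tx], yet C is an MxP matrix and so has wB columns, not wA. Assuming row-major storage, whenever wA differs from wB the rows of C either overlap or land past the end of the buffer. The host-side reference below is only a sketch for checking GPU results against; row_major_index and matmul_reference are names made up for this example, while hA, wA and wB follow the posted code.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index of element (row, col) in a matrix with `width` columns. */
size_t row_major_index(size_t row, size_t col, size_t width)
{
    assert(col < width);
    return row * width + col;
}

/* Host-side reference multiply: C[hA x wB] = A[hA x wA] * B[wA x wB]. */
void matmul_reference(float *C, const float *A, const float *B,
                      size_t hA, size_t wA, size_t wB)
{
    for (size_t ty = 0; ty < hA; ++ty) {
        for (size_t tx = 0; tx < wB; ++tx) {
            float value = 0.0f;
            for (size_t k = 0; k < wA; ++k) {
                value += A[row_major_index(ty, k, wA)]
                       * B[row_major_index(k, tx, wB)];
            }
            /* C has wB columns, so its row stride is wB, not wA. */
            C[row_major_index(ty, tx, wB)] = value;
        }
    }
}
```

If that is indeed the cause, the corresponding change in the kernel would be to write to C[ty*wB+tx]; the work sizes passed to clEnqueueNDRangeKernel (wb columns by ha rows) would not need to change.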

#15
Posted 10/29/2010 09:13 AM   