launch kernels in parallel?
Hi!

I tried to launch two kernels in parallel by calling "clEnqueueNDRangeKernel" for both kernels and expected this function to return immediately after the kernel has been enqueued, but "clEnqueueNDRangeKernel" only returns after the kernel has completed. Obviously no parallel invocation is possible this way. The command queue has the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property set, and this property is supported for the device (GTX 280) according to "clGetDeviceInfo". The read/write/map functions have a flag to select blocking/non-blocking operation; is there a similar mechanism to enforce a non-blocking kernel launch?
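
For reference, this is roughly how I set it up (a sketch with error checking omitted; the context, device, and both kernel objects are assumed to exist already):

[codebox]
cl_int err;

/* queue with out-of-order execution enabled, as described above */
cl_command_queue queue = clCreateCommandQueue(context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

size_t global_size = 65536;

/* I expected both calls to return immediately after enqueuing... */
err = clEnqueueNDRangeKernel(queue, kernel0, 1, NULL, &global_size, NULL,
    0, NULL, NULL);
err = clEnqueueNDRangeKernel(queue, kernel1, 1, NULL, &global_size, NULL,
    0, NULL, NULL);
/* ...but the first call does not return until kernel0 has completed */

clFinish(queue); /* this is where I expected to wait for both kernels */
[/codebox]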

Thanks & kind regards,
Markus

#1
Posted 07/31/2009 03:17 PM   
To my knowledge, there is no flag, because the call should never block. I don't know why you're seeing this behavior. Out-of-order queuing isn't necessary, either.

#2
Posted 07/31/2009 03:18 PM   
[quote name='hpux735' post='572477' date='Jul 31 2009, 05:18 PM']To my knowledge, there is no flag, because the call should never block. I don't know why you're seeing this behavior. Out-of-order queuing isn't necessary, either.[/quote]
Very interesting... can you confirm that a kernel call (clEnqueueNDRangeKernel) is non-blocking on your system? Which platform are you using?

Thanks,
Markus

#3
Posted 07/31/2009 06:16 PM   
[quote name='grabner' post='572474' date='Jul 31 2009, 05:17 PM']I tried to launch two kernels in parallel by calling "clEnqueueNDRangeKernel" for both kernels and expected this function to return immediately after the kernel has been enqueued, but "clEnqueueNDRangeKernel" only returns after the kernel has completed.[/quote]
A simple example to demonstrate this behavior is attached. It uses the OpenCL profiling functions to record the timing, and it is evident from the program's output that processing is strictly sequential (the numbers are nanoseconds elapsed since queuing the first kernel):

kernel 0:
command queued: 0
command submit: 0
command start: 34496
command end: 925968896

kernel 1:
command queued: 927198144
command submit: 927198144
command start: 927212640
command end: 1853145888
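
For reference, the four time stamps are queried roughly like this (a sketch; "event" is the cl_event returned by clEnqueueNDRangeKernel, and the queue was created with CL_QUEUE_PROFILING_ENABLE):

[codebox]
cl_ulong queued, submit, start, end;

clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED,
    sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,
    sizeof(cl_ulong), &submit, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
    sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
    sizeof(cl_ulong), &end, NULL);

/* all values are device time stamps in nanoseconds; the output above
   is printed relative to the "queued" time of kernel 0 */
[/codebox]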

Kernel 1 gets queued only after kernel 0 has completed. So I have two questions:
*) Does OpenCL support parallel execution of different kernels (or more than one instance of the same kernel as in this example)?
*) If yes, how can two kernels be launched to execute concurrently?

Thanks & kind regards,
Markus
Attachment: concurrency.zip

#4
Posted 08/03/2009 02:42 PM   
I'm not sure about this but I would be surprised if it was possible to launch concurrent kernels on GPUs. You can't do that in CUDA. Unless you mean just async queuing?

#5
Posted 08/04/2009 12:25 AM   
[quote name='_Big_Mac_' post='573788' date='Aug 4 2009, 02:25 AM']I'm not sure about this but I would be surprised if it was possible to launch concurrent kernels on GPUs. You can't do that in CUDA.[/quote]
Yes, but is this a restriction of the CUDA API/driver or the underlying hardware? In the latter case, it obviously won't work in OpenCL either.

[quote]Unless you mean just async queuing?[/quote]
This would be useful even if the kernels are processed sequentially, but in the example I posted recently, queuing is sequential as well. On the other hand, a CUDA kernel can be launched asynchronously, so I expected the same to be possible in OpenCL (and hoped that it could even run kernels in parallel).

Kind regards,
Markus

#6
Posted 08/04/2009 02:20 PM   
The restriction on concurrent kernel execution is imposed by the hardware, AFAIK.

You [i]should[/i] be able to queue kernels asynchronously, if I read the spec correctly, but don't ask me how; I'm a newbie to OpenCL :)

#7
Posted 08/04/2009 10:45 PM   
[quote name='_Big_Mac_' post='574227' date='Aug 5 2009, 12:45 AM']The restriction on concurrent kernel execution is imposed by the hardware, AFAIK.

You [i]should[/i] be able to queue kernels asynchronously, if I read the spec correctly, but don't ask me how; I'm a newbie to OpenCL :)[/quote]
Yes, I also can't find any blocking requirement for clEnqueueNDRangeKernel in the spec. This seems like a bug in Nvidia's OpenCL implementation, or am I overlooking something?

Kind regards,
Markus

#8
Posted 08/05/2009 11:41 AM   
[quote name='grabner' post='572557' date='Jul 31 2009, 11:16 AM']Very interesting... can you confirm that a kernel call (clEnqueueNDRangeKernel) is non-blocking on your system? Which platform are you using?

Thanks,
Markus[/quote]

Yah, I can demonstrate that enqueuing a kernel in OpenCL doesn't block. I actually have to wait on the returned event before I copy memory otherwise I get stale data. I can't say what I'm using, but it isn't the NVIDIA implementation.
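
Roughly like this (a sketch; the buffer and size names are made up):

[codebox]
cl_event done;

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
    0, NULL, &done);
/* returns immediately here on my implementation, so I have to... */
clWaitForEvents(1, &done);
/* ...wait before copying, otherwise the host buffer contains stale data */
clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0, result_bytes,
    host_result, 0, NULL, NULL);
[/codebox]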

#9
Posted 08/07/2009 06:25 PM   
On a single GPU, concurrent execution does not apply to two kernels but to the execution of a kernel in parallel with a memory transfer operation, e.g. parallel execution of a 'clEnqueueNDRangeKernel' and a 'clEnqueueCopyBuffer' command.

I did not manage to demonstrate this behavior on my machine.

Did anybody manage to demonstrate the concurrent execution?

[edit] - After a few tries I managed to perform concurrent copies.
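
For reference, the attempt looks roughly like this (a sketch, assuming two queues on the same device, with error checking omitted):

[codebox]
/* two queues so the driver can schedule the commands independently */
cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &err);

clEnqueueNDRangeKernel(q0, kernel, 1, NULL, &global_size, NULL,
    0, NULL, NULL);
clEnqueueCopyBuffer(q1, src_buf, dst_buf, 0, 0, copy_bytes,
    0, NULL, NULL);
clFlush(q0);
clFlush(q1);

/* ideally the copy on q1 overlaps with the kernel on q0 */
clFinish(q0);
clFinish(q1);
[/codebox]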

#10
Posted 03/08/2010 09:54 AM   
[quote name='grabner' post='574453' date='Aug 5 2009, 07:41 PM']Yes, I also can't find any blocking requirement for clEnqueueNDRangeKernel in the spec. This seems like a bug in Nvidia's OpenCL implementation, or am I overlooking something?

Kind regards,
Markus[/quote]

Hi All,

I've hit this same problem with version 3 of the CUDA Toolkit and GPU Computing SDK (my driver version is 197.16) - it seems that clEnqueueNDRangeKernel blocks until the kernel has completed execution. This makes it impossible for a single thread on the host to enqueue work for multiple devices.

In the Jumpstart Guide (http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf) on page 13 it states, "Both kernel launch functions (CUDA and OpenCL) are asynchronous, i.e. they return immediately after scheduling the kernel to be executed on the GPU." It appears that this statement is incorrect.

If anyone from NVidia is reading, are you planning on changing this behavior in subsequent releases? That is, can we expect to be able to enqueue OpenCL kernels for asynchronous execution in later versions of your OpenCL implementation?

Thanks,

Dan

#11
Posted 05/05/2010 04:01 AM   
[quote name='Daniel Paull' post='1050582' date='May 5 2010, 12:01 PM']Hi All,

I've hit this same problem with version 3 of the CUDA Toolkit and GPU Computing SDK (my driver version is 197.16) - it seems that clEnqueueNDRangeKernel blocks until the kernel has completed execution. This makes it impossible for a single thread on the host to enqueue work for multiple devices.

In the Jumpstart Guide (http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf) on page 13 it states, "Both kernel launch functions (CUDA and OpenCL) are asynchronous, i.e. they return immediately after scheduling the kernel to be executed on the GPU." It appears that this statement is incorrect.

If anyone from NVidia is reading, are you planning on changing this behavior in subsequent releases? That is, can we expect to be able to enqueue OpenCL kernels for asynchronous execution in later versions of your OpenCL implementation?

Thanks,

Dan[/quote]

As a little more evidence that clEnqueueNDRangeKernel() blocks, I modified the oclSimpleMultiGPU sample provided in the GPU Computing SDK so that it reports command start and end times when profiling is enabled, rather than just the execution duration. I also modified the kernel code so that it takes much longer to execute. The output when using two GPUs shows quite clearly that the kernel for the second GPU does not start executing until the kernel running on the first GPU completes its calculation. The relevant output is as follows (times are in microseconds; the three numbers are, in order of appearance, start, end, and duration):

[codebox]
Profiling Information for GPU Processing:

Device 0 : GeForce GTX 260M
Reduce Kernel : 189901803 222986345 33084542 us
Copy Device->Host : 222987730 256593873 33606143 us


Device 1 : GeForce GTX 260M
Reduce Kernel : 227082505 260682302 33599797 us
Copy Device->Host : 260684612 260684871 258 us
[/codebox]

Notice that on Device 1, the kernel start time (227,082,505) is greater than the kernel end time for Device 0 (222,986,345). Furthermore, watching the pretty graphs produced by GPUz (on the Sensors page), you can see that the periods during which the two GPUs are under load are mutually exclusive.

I'm pretty unimpressed that NVidia supplied an example of using multiple GPUs that does not achieve concurrent execution of kernels on separate devices. What exactly is meant to be demonstrated by the example?

Given the blocking nature of clEnqueueNDRangeKernel(), what strategies are used by others to achieve concurrent execution of kernels on separate devices?
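
The obvious workaround I can think of is one host thread per device, so that a blocking enqueue only stalls its own thread (a sketch using pthreads; the per-device queues and kernels are assumed to be set up already):

[codebox]
#include <pthread.h>

typedef struct {
    cl_command_queue queue;
    cl_kernel kernel;
    size_t global_size;
} device_work;

static void *launch(void *arg)
{
    device_work *w = (device_work *)arg;
    /* a blocking enqueue now only stalls this thread */
    clEnqueueNDRangeKernel(w->queue, w->kernel, 1, NULL,
        &w->global_size, NULL, 0, NULL, NULL);
    clFinish(w->queue);
    return NULL;
}

static void run_all(device_work work[], int num_devices)
{
    pthread_t threads[16];
    int i;
    for (i = 0; i < num_devices; ++i)
        pthread_create(&threads[i], NULL, launch, &work[i]);
    for (i = 0; i < num_devices; ++i)
        pthread_join(threads[i], NULL);
}
[/codebox]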

Cheers,

Dan

#12
Posted 05/05/2010 06:05 AM   
Hi all,

I've had some discussion over at the Khronos forums and the consensus is that a blocking clEnqueueNDRangeKernel is a bug. See this thread:

[url="http://www.khronos.org/message_boards/viewtopic.php?f=28&t=1990"]http://www.khronos.org/message_boards/view...f=28&t=1990[/url]

And in this post in that thread it was suggested that, "As far as your comment on being surprised that clEnqueueNDRangeKernel on some implementations is blocking, I would suggest that you file a bug and work with the vendor in question to resolve this issue. It is certainly the intent of the spec and I know more than one implementation where this is not the case."

[url="http://www.khronos.org/message_boards/viewtopic.php?p=7321#p7321"]http://www.khronos.org/message_boards/view...hp?p=7321#p7321[/url]

I could not work out where to file a bug report - is this forum the place to do it, or is there an issue tracking system that I should be using?

Cheers,

Dan

#13
Posted 05/13/2010 07:28 AM   
If you look at the Tesla info sheet on page 2 you will see:
- NVIDIA GIGATHREAD ENGINE
Maximizes the throughput by faster context switching that is 10X faster than previous architecture, concurrent kernel execution, and improved thread block scheduling.

Source: [url="http://www.nvidia.com/docs/IO/43395/NV_DS_...final_lores.pdf"]http://www.nvidia.com/docs/IO/43395/NV_DS_...final_lores.pdf[/url]

I would not switch to out-of-order execution lightly without testing on a GF100, I think.

#14
Posted 05/13/2010 02:54 PM   
[quote name='jcpalmer' post='1055094' date='May 13 2010, 10:54 PM']I would not switch to out-of-order execution lightly without testing on a GF100, I think.[/quote]

No one is changing anything, except maybe NVidia to fall in line with the OpenCL 1.0 spec. clEnqueueNDRangeKernel must not block - though this is not stated explicitly, I am convinced it can be deduced from various parts of the spec. That is, unless NVidia does not want to support concurrent execution of kernels on separate devices?

Cheers,

Dan

#15
Posted 05/13/2010 11:55 PM   