Out-of-order execution doesn't work!
Hi, /wallbash.gif' class='bbc_emoticon' alt=':wallbash:' />

I'd like to write a simple sample to demonstrate the out-of-order execution model and synchronization mechanism based on events. However it doesn't work as expected and everything is same as in-order execution. I test it on OpenSUSE 12.1 with GTX 570 and driver is 295.20.
The key codes are simple. There are three kernels: empty, caesarCipher and divergence.
[code]
//The most simple kernel. Just do nothing
__kernel void empty(){
return;
}

//Naive cipher.
__kernel void caesarCipher(__global char* plainText){
size_t id=get_global_id(0);
plainText[id]=(plainText[id]+id)%256;
}

//Simple divergence computation for 1D data.
__kernel void divergence(__global float* fieldData, __global float* divergenceData, float factor){
size_t id=get_global_id(0);

float upValue=fieldData[id];
float bottomValue=upValue;

if(id>=2){
bottomValue=fieldData[id-2];
}
size_t globalSize=get_global_size(0);
if(id<globalSize-2){
upValue=fieldData[id+2];
}

divergenceData[id]=(upValue-bottomValue)*factor;
}

[/code]
Kernel caesarCipher and deivergence need source data. So there five main steps includes writing two data buffer and executing three kernel.
[code]
clCommandQueue= cl::CommandQueue(clContext, *(clDevices.begin()),CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&clError);
cl::Kernel emptyKernel(clProgram, "empty");
cl::Kernel divergenceKernel(clProgram, "divergence");
cl::Kernel caesarCipherKernel(clProgram, "caesarCipher");

//Data buffer
cl::Buffer plaintextBuffer(clContext, CL_MEM_READ_WRITE, num*sizeof(cl_char));
cl::Buffer fieldDataBuffer(clContext, CL_MEM_READ_ONLY, num*sizeof(cl_float));
cl::Buffer divergenceDataBuffer(clContext, CL_MEM_WRITE_ONLY, num*sizeof(cl_float));
caesarCipherKernel.setArg<cl::Buffer > (0, plaintextBuffer);
divergenceKernel.setArg<cl::Buffer > (0, fieldDataBuffer) ;
divergenceKernel.setArg<cl::Buffer > (1, divergenceDataBuffer) ;
divergenceKernel.setArg<cl_float > (2, 0.25) ;

cl::Event writePlainTextEvent;
cl::Event writeFieldDataEvent;
cl::Event executeEmptyKernelEvent;
cl::Event executeCaesarKernelEvent;
cl::Event executeDivergenceKernelEvent;

//Write source data for kernel divergence
clCommandQueue.enqueueWriteBuffer(fieldDataBuffer, CL_FALSE, 0, num*sizeof(cl_float), fieldData.data(), NULL, &writeFieldDataEvent);

//Execute the kernel empty
clCommandQueue.enqueueNDRangeKernel(emptyKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,NULL,&executeEmptyKernelEvent);

//Execute the kernel divergence
std::vector<cl::Event> divergenceWaitEvents;
divergenceWaitEvents.push_back(writeFieldDataEvent);
clCommandQueue.enqueueNDRangeKernel(divergenceKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&divergenceWaitEvents,&executeDivergenceKernelEvent);

//Write source data for kernel caesarCipher
clCommandQueue.enqueueWriteBuffer(plaintextBuffer, CL_FALSE, 0, num*sizeof(cl_char),plaintext.data(),NULL,&writePlainTextEvent) ;

//Execute the kernel caesarCipher
std::vector<cl::Event> caesarCipherWaitEvents;
caesarCipherWaitEvents.push_back(writePlainTextEvent);
clCommandQueue.enqueueNDRangeKernel(caesarCipherKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&caesarCipherWaitEvents,&executeCaesarKernelEvent);

clCommandQueue.finish();
[/code]

I set the flag CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE for command queue and use waiting events for synchronization. I think it is enough for out-of-order execution. The kernel empty depends on no data so it should execute at the very beginning and the writing data into plaintextBuffer should begin just after the finish of writing into fieldDataBuffer. But the result is very same as in order. See the results by Nvidia visual profiler
in-order: [attachment=24708:in-order.png] out-of-order [attachment=24709:out-of-oder.png]

Where is the problem? Anybody has the out-of-order sample? Thanks!
Hi, /wallbash.gif' class='bbc_emoticon' alt=':wallbash:' />



I'd like to write a simple sample to demonstrate the out-of-order execution model and synchronization mechanism based on events. However it doesn't work as expected and everything is same as in-order execution. I test it on OpenSUSE 12.1 with GTX 570 and driver is 295.20.

The key codes are simple. There are three kernels: empty, caesarCipher and divergence.



//The most simple kernel. Just do nothing

__kernel void empty(){

return;

}



//Naive cipher.

__kernel void caesarCipher(__global char* plainText){

size_t id=get_global_id(0);

plainText[id]=(plainText[id]+id)%256;

}



//Simple divergence computation for 1D data.

__kernel void divergence(__global float* fieldData, __global float* divergenceData, float factor){

size_t id=get_global_id(0);



float upValue=fieldData[id];

float bottomValue=upValue;



if(id>=2){

bottomValue=fieldData[id-2];

}

size_t globalSize=get_global_size(0);

if(id<globalSize-2){

upValue=fieldData[id+2];

}



divergenceData[id]=(upValue-bottomValue)*factor;

}




Kernel caesarCipher and deivergence need source data. So there five main steps includes writing two data buffer and executing three kernel.



clCommandQueue= cl::CommandQueue(clContext, *(clDevices.begin()),CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&clError);

cl::Kernel emptyKernel(clProgram, "empty");

cl::Kernel divergenceKernel(clProgram, "divergence");

cl::Kernel caesarCipherKernel(clProgram, "caesarCipher");



//Data buffer

cl::Buffer plaintextBuffer(clContext, CL_MEM_READ_WRITE, num*sizeof(cl_char));

cl::Buffer fieldDataBuffer(clContext, CL_MEM_READ_ONLY, num*sizeof(cl_float));

cl::Buffer divergenceDataBuffer(clContext, CL_MEM_WRITE_ONLY, num*sizeof(cl_float));

caesarCipherKernel.setArg<cl::Buffer > (0, plaintextBuffer);

divergenceKernel.setArg<cl::Buffer > (0, fieldDataBuffer) ;

divergenceKernel.setArg<cl::Buffer > (1, divergenceDataBuffer) ;

divergenceKernel.setArg<cl_float > (2, 0.25) ;



cl::Event writePlainTextEvent;

cl::Event writeFieldDataEvent;

cl::Event executeEmptyKernelEvent;

cl::Event executeCaesarKernelEvent;

cl::Event executeDivergenceKernelEvent;



//Write source data for kernel divergence

clCommandQueue.enqueueWriteBuffer(fieldDataBuffer, CL_FALSE, 0, num*sizeof(cl_float), fieldData.data(), NULL, &writeFieldDataEvent);



//Execute the kernel empty

clCommandQueue.enqueueNDRangeKernel(emptyKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,NULL,&executeEmptyKernelEvent);



//Execute the kernel divergence

std::vector<cl::Event> divergenceWaitEvents;

divergenceWaitEvents.push_back(writeFieldDataEvent);

clCommandQueue.enqueueNDRangeKernel(divergenceKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&divergenceWaitEvents,&executeDivergenceKernelEvent);



//Write source data for kernel caesarCipher

clCommandQueue.enqueueWriteBuffer(plaintextBuffer, CL_FALSE, 0, num*sizeof(cl_char),plaintext.data(),NULL,&writePlainTextEvent) ;



//Execute the kernel caesarCipher

std::vector<cl::Event> caesarCipherWaitEvents;

caesarCipherWaitEvents.push_back(writePlainTextEvent);

clCommandQueue.enqueueNDRangeKernel(caesarCipherKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&caesarCipherWaitEvents,&executeCaesarKernelEvent);



clCommandQueue.finish();




I set the flag CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE for command queue and use waiting events for synchronization. I think it is enough for out-of-order execution. The kernel empty depends on no data so it should execute at the very beginning and the writing data into plaintextBuffer should begin just after the finish of writing into fieldDataBuffer. But the result is very same as in order. See the results by Nvidia visual profiler

in-order: [attachment=24708:in-order.png] out-of-order [attachment=24709:out-of-oder.png]



Where is the problem? Anybody has the out-of-order sample? Thanks!

#1
Posted 03/08/2012 04:16 AM   
Any advice? Nobody tried the out-of-order mode?
Besides, it is slow to get reply in this forum. I doubt there are staffs from Nvidia who deal with these posts.
Any advice? Nobody tried the out-of-order mode?

Besides, it is slow to get reply in this forum. I doubt there are staffs from Nvidia who deal with these posts.

#2
Posted 03/12/2012 04:56 AM   
Our C2050 testbed seems to be out-of-order, so I cannot check myself. But does the platform support out-of-order exec at all? AFAIK Intel is the only vendor who has implemented out-of-order queues. AMD definately works in-order, even when told otherwise, and I'd bet a mediocre amount NV has not implemented it yet either.
Our C2050 testbed seems to be out-of-order, so I cannot check myself. But does the platform support out-of-order exec at all? AFAIK Intel is the only vendor who has implemented out-of-order queues. AMD definately works in-order, even when told otherwise, and I'd bet a mediocre amount NV has not implemented it yet either.

#3
Posted 03/12/2012 07:55 AM   
[quote name='Meteorhead' date='12 March 2012 - 03:55 PM' timestamp='1331538953' post='1381611']
Our C2050 testbed seems to be out-of-order, so I cannot check myself. But does the platform support out-of-order exec at all? AFAIK Intel is the only vendor who has implemented out-of-order queues. AMD definately works in-order, even when told otherwise, and I'd bet a mediocre amount NV has not implemented it yet either.
[/quote]

Thanks for your reply.
According to the OpenCL Programming Guide for the CUDA Architecture, my GTX 570 should be capable of out-of-order execution.

[quote]3.2.2
Device Level
At a lower level, the application should maximize parallel execution between the multiprocessors of a device.
For devices of compute capability 1.x, only one kernel can execute on a device at one time, so the kernel should be launched with at least as many thread blocks as there are multiprocessors in the device.
For devices of compute capability 2.0, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using queues to enable enough kernels to execute concurrently.
[/quote]

Maybe the driver doesn't implement it. There is no news of OpenCL from Nvidia for long time. I am not sure they are still working on it.
[quote name='Meteorhead' date='12 March 2012 - 03:55 PM' timestamp='1331538953' post='1381611']

Our C2050 testbed seems to be out-of-order, so I cannot check myself. But does the platform support out-of-order exec at all? AFAIK Intel is the only vendor who has implemented out-of-order queues. AMD definately works in-order, even when told otherwise, and I'd bet a mediocre amount NV has not implemented it yet either.





Thanks for your reply.

According to the OpenCL Programming Guide for the CUDA Architecture, my GTX 570 should be capable of out-of-order execution.



3.2.2

Device Level

At a lower level, the application should maximize parallel execution between the multiprocessors of a device.

For devices of compute capability 1.x, only one kernel can execute on a device at one time, so the kernel should be launched with at least as many thread blocks as there are multiprocessors in the device.

For devices of compute capability 2.0, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using queues to enable enough kernels to execute concurrently.





Maybe the driver doesn't implement it. There is no news of OpenCL from Nvidia for long time. I am not sure they are still working on it.

#4
Posted 03/12/2012 10:38 AM   
If DeivceQuery sample application does not return CL_OUT_OF_ORDER_EXEC platform capability, then host side out-of-order exec is not implemented (in the runtime/drivers). Indeed, NV does not advertise OpenCL that much, however that will change when many applications will support it. However, competition by then will be way ahead.
If DeivceQuery sample application does not return CL_OUT_OF_ORDER_EXEC platform capability, then host side out-of-order exec is not implemented (in the runtime/drivers). Indeed, NV does not advertise OpenCL that much, however that will change when many applications will support it. However, competition by then will be way ahead.

#5
Posted 03/12/2012 10:05 PM   
[quote name='Meteorhead' date='13 March 2012 - 06:05 AM' timestamp='1331589909' post='1381858']
If DeivceQuery sample application does not return CL_OUT_OF_ORDER_EXEC platform capability, then host side out-of-order exec is not implemented (in the runtime/drivers). Indeed, NV does not advertise OpenCL that much, however that will change when many applications will support it. However, competition by then will be way ahead.
[/quote]

Thanks for your reply!
The sample oclDeviceQuery showed the result as follow.
[quote]
OpenCL SW Info:

CL_PLATFORM_NAME: NVIDIA CUDA
CL_PLATFORM_VERSION: OpenCL 1.1 CUDA 4.2.1
OpenCL SDK Revision: 7027912


OpenCL Device Info:

2 devices found supporting OpenCL:

---------------------------------
Device GeForce GTX 570
---------------------------------
CL_DEVICE_NAME: GeForce GTX 570
CL_DEVICE_VENDOR: NVIDIA Corporation
CL_DRIVER_VERSION: 295.20
CL_DEVICE_VERSION: OpenCL 1.1 CUDA
CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.1
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 15
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 1464 MHz
CL_DEVICE_ADDRESS_BITS: 32
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 319 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1279 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest r
...
[/quote]

We can find "CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE". So out-of-order mode should be supported by Nvidia' implementation. Anybody gave a try?
[quote name='Meteorhead' date='13 March 2012 - 06:05 AM' timestamp='1331589909' post='1381858']

If DeivceQuery sample application does not return CL_OUT_OF_ORDER_EXEC platform capability, then host side out-of-order exec is not implemented (in the runtime/drivers). Indeed, NV does not advertise OpenCL that much, however that will change when many applications will support it. However, competition by then will be way ahead.





Thanks for your reply!

The sample oclDeviceQuery showed the result as follow.



OpenCL SW Info:



CL_PLATFORM_NAME: NVIDIA CUDA

CL_PLATFORM_VERSION: OpenCL 1.1 CUDA 4.2.1

OpenCL SDK Revision: 7027912





OpenCL Device Info:



2 devices found supporting OpenCL:



---------------------------------

Device GeForce GTX 570

---------------------------------

CL_DEVICE_NAME: GeForce GTX 570

CL_DEVICE_VENDOR: NVIDIA Corporation

CL_DRIVER_VERSION: 295.20

CL_DEVICE_VERSION: OpenCL 1.1 CUDA

CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.1

CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU

CL_DEVICE_MAX_COMPUTE_UNITS: 15

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3

CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64

CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024

CL_DEVICE_MAX_CLOCK_FREQUENCY: 1464 MHz

CL_DEVICE_ADDRESS_BITS: 32

CL_DEVICE_MAX_MEM_ALLOC_SIZE: 319 MByte

CL_DEVICE_GLOBAL_MEM_SIZE: 1279 MByte

CL_DEVICE_ERROR_CORRECTION_SUPPORT: no

CL_DEVICE_LOCAL_MEM_TYPE: local

CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte

CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte

CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE

CL_DEVICE_IMAGE_SUPPORT: 1

CL_DEVICE_MAX_READ_IMAGE_ARGS: 128

CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8

CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest r

...





We can find "CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE". So out-of-order mode should be supported by Nvidia' implementation. Anybody gave a try?

#6
Posted 03/13/2012 03:15 AM   
well done![img]http://www.nobod.info/g.gif[/img]
well done!Image

#7
Posted 03/14/2012 04:17 AM   
Hello,

did anyone get a solution to this problem?? since I have the same thing.
Hello,



did anyone get a solution to this problem?? since I have the same thing.

#8
Posted 03/29/2012 01:53 PM   
[quote name='mohd' date='29 March 2012 - 09:53 PM' timestamp='1333029232' post='1389521']
Hello,

did anyone get a solution to this problem?? since I have the same thing.
[/quote]

I have given up finding the solution. Let's assume there is no out-of-order execution in current implementation of Nvidia.
You can get out-of-order execution using the Intel OpenCL SDK although the schedule of commands is not very predictable.
[quote name='mohd' date='29 March 2012 - 09:53 PM' timestamp='1333029232' post='1389521']

Hello,



did anyone get a solution to this problem?? since I have the same thing.





I have given up finding the solution. Let's assume there is no out-of-order execution in current implementation of Nvidia.

You can get out-of-order execution using the Intel OpenCL SDK although the schedule of commands is not very predictable.

#9
Posted 03/30/2012 02:27 AM   
[quote name='ZHAO Peng' date='30 March 2012 - 04:27 AM' timestamp='1333074441' post='1389766']
I have given up finding the solution. Let's assume there is no out-of-order execution in current implementation of Nvidia.
You can get out-of-order execution using the Intel OpenCL SDK although the schedule of commands is not very predictable.
[/quote]


Thank you for your reply, the problem that I have no choice, only to work with fermi (Nvidia card ) /angry.gif' class='bbc_emoticon' alt=':angry:' />
[quote name='ZHAO Peng' date='30 March 2012 - 04:27 AM' timestamp='1333074441' post='1389766']

I have given up finding the solution. Let's assume there is no out-of-order execution in current implementation of Nvidia.

You can get out-of-order execution using the Intel OpenCL SDK although the schedule of commands is not very predictable.







Thank you for your reply, the problem that I have no choice, only to work with fermi (Nvidia card ) /angry.gif' class='bbc_emoticon' alt=':angry:' />

#10
Posted 03/30/2012 02:55 PM   
I have yet to see this confirmed, but in my personal experience out-of-order execution is not working as it should. I reimplemented my in-order pipeline to use events in a dependency graph for out-of-order execution, as there was simultaneous data transfer / computation parallelism possible, yet everything is still handled in the same order as the instructions are sent to the queue.

So my guess is that these cards support it, but only in the sense that your program does not crash when you run your executable. In practice, I so far see no gain. I just keep myself quiet by telling myself I made my code future-proof...
I have yet to see this confirmed, but in my personal experience out-of-order execution is not working as it should. I reimplemented my in-order pipeline to use events in a dependency graph for out-of-order execution, as there was simultaneous data transfer / computation parallelism possible, yet everything is still handled in the same order as the instructions are sent to the queue.



So my guess is that these cards support it, but only in the sense that your program does not crash when you run your executable. In practice, I so far see no gain. I just keep myself quiet by telling myself I made my code future-proof...

#11
Posted 04/16/2012 09:46 AM   
Scroll To Top