Hi, External Image
I’d like to write a simple sample to demonstrate the out-of-order execution model and synchronization mechanism based on events. However it doesn’t work as expected and everything is same as in-order execution. I test it on OpenSUSE 12.1 with GTX 570 and driver is 295.20.
The key codes are simple. There are three kernels: empty, caesarCipher and divergence.
//The most simple kernel. Just do nothing
__kernel void empty(){
return;
}
//Naive cipher.
__kernel void caesarCipher(__global char* plainText){
size_t id=get_global_id(0);
plainText[id]=(plainText[id]+id)%256;
}
//Simple divergence computation for 1D data.
__kernel void divergence(__global float* fieldData, __global float* divergenceData, float factor){
size_t id=get_global_id(0);
float upValue=fieldData[id];
float bottomValue=upValue;
if(id>=2){
bottomValue=fieldData[id-2];
}
size_t globalSize=get_global_size(0);
if(id<globalSize-2){
upValue=fieldData[id+2];
}
divergenceData[id]=(upValue-bottomValue)*factor;
}
Kernel caesarCipher and deivergence need source data. So there five main steps includes writing two data buffer and executing three kernel.
clCommandQueue= cl::CommandQueue(clContext, *(clDevices.begin()),CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&clError);
cl::Kernel emptyKernel(clProgram, "empty");
cl::Kernel divergenceKernel(clProgram, "divergence");
cl::Kernel caesarCipherKernel(clProgram, "caesarCipher");
//Data buffer
cl::Buffer plaintextBuffer(clContext, CL_MEM_READ_WRITE, num*sizeof(cl_char));
cl::Buffer fieldDataBuffer(clContext, CL_MEM_READ_ONLY, num*sizeof(cl_float));
cl::Buffer divergenceDataBuffer(clContext, CL_MEM_WRITE_ONLY, num*sizeof(cl_float));
caesarCipherKernel.setArg<cl::Buffer > (0, plaintextBuffer);
divergenceKernel.setArg<cl::Buffer > (0, fieldDataBuffer) ;
divergenceKernel.setArg<cl::Buffer > (1, divergenceDataBuffer) ;
divergenceKernel.setArg<cl_float > (2, 0.25) ;
cl::Event writePlainTextEvent;
cl::Event writeFieldDataEvent;
cl::Event executeEmptyKernelEvent;
cl::Event executeCaesarKernelEvent;
cl::Event executeDivergenceKernelEvent;
//Write source data for kernel divergence
clCommandQueue.enqueueWriteBuffer(fieldDataBuffer, CL_FALSE, 0, num*sizeof(cl_float), fieldData.data(), NULL, &writeFieldDataEvent);
//Execute the kernel empty
clCommandQueue.enqueueNDRangeKernel(emptyKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,NULL,&executeEmptyKernelEvent);
//Execute the kernel divergence
std::vector<cl::Event> divergenceWaitEvents;
divergenceWaitEvents.push_back(writeFieldDataEvent);
clCommandQueue.enqueueNDRangeKernel(divergenceKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&divergenceWaitEvents,&executeDivergenceKernelEvent);
//Write source data for kernel caesarCipher
clCommandQueue.enqueueWriteBuffer(plaintextBuffer, CL_FALSE, 0, num*sizeof(cl_char),plaintext.data(),NULL,&writePlainTextEvent) ;
//Execute the kernel caesarCipher
std::vector<cl::Event> caesarCipherWaitEvents;
caesarCipherWaitEvents.push_back(writePlainTextEvent);
clCommandQueue.enqueueNDRangeKernel(caesarCipherKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&caesarCipherWaitEvents,&executeCaesarKernelEvent);
clCommandQueue.finish();
I set the flag CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE for command queue and use waiting events for synchronization. I think it is enough for out-of-order execution. The kernel empty depends on no data so it should execute at the very beginning and the writing data into plaintextBuffer should begin just after the finish of writing into fieldDataBuffer. But the result is very same as in order. See the results by Nvidia visual profiler
in-order:
out-of-order
Where is the problem? Anybody has the out-of-order sample? Thanks!