Out-of-order execution doesn't work!

Hi,

I’d like to write a simple sample that demonstrates the out-of-order execution model and event-based synchronization. However, it doesn’t work as expected: everything behaves exactly like in-order execution. I tested it on openSUSE 12.1 with a GTX 570 and driver 295.20.

The key code is simple. There are three kernels: empty, caesarCipher, and divergence.

// The simplest possible kernel: just does nothing.
__kernel void empty() {
    return;
}

// Naive Caesar cipher: shifts each byte by its global ID.
__kernel void caesarCipher(__global char* plainText) {
    size_t id = get_global_id(0);
    plainText[id] = (plainText[id] + id) % 256;
}

// Simple divergence computation for 1D data (stride-2 central difference).
__kernel void divergence(__global float* fieldData, __global float* divergenceData, float factor) {
    size_t id = get_global_id(0);
    float upValue = fieldData[id];
    float bottomValue = upValue;
    if (id >= 2) {
        bottomValue = fieldData[id - 2];
    }
    size_t globalSize = get_global_size(0);
    if (id < globalSize - 2) {
        upValue = fieldData[id + 2];
    }
    divergenceData[id] = (upValue - bottomValue) * factor;
}

The caesarCipher and divergence kernels need source data, so there are five main steps: writing two data buffers and executing three kernels.

clCommandQueue = cl::CommandQueue(clContext, *(clDevices.begin()),
        CL_QUEUE_PROFILING_ENABLE | CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &clError);

cl::Kernel emptyKernel(clProgram, "empty");
cl::Kernel divergenceKernel(clProgram, "divergence");
cl::Kernel caesarCipherKernel(clProgram, "caesarCipher");

// Data buffers
cl::Buffer plaintextBuffer(clContext, CL_MEM_READ_WRITE, num * sizeof(cl_char));
cl::Buffer fieldDataBuffer(clContext, CL_MEM_READ_ONLY, num * sizeof(cl_float));
cl::Buffer divergenceDataBuffer(clContext, CL_MEM_WRITE_ONLY, num * sizeof(cl_float));

caesarCipherKernel.setArg<cl::Buffer>(0, plaintextBuffer);
divergenceKernel.setArg<cl::Buffer>(0, fieldDataBuffer);
divergenceKernel.setArg<cl::Buffer>(1, divergenceDataBuffer);
divergenceKernel.setArg<cl_float>(2, 0.25f);

cl::Event writePlainTextEvent;
cl::Event writeFieldDataEvent;
cl::Event executeEmptyKernelEvent;
cl::Event executeCaesarKernelEvent;
cl::Event executeDivergenceKernelEvent;

// Write source data for the divergence kernel
clCommandQueue.enqueueWriteBuffer(fieldDataBuffer, CL_FALSE, 0, num * sizeof(cl_float),
        fieldData.data(), NULL, &writeFieldDataEvent);

// Execute the empty kernel (no dependencies)
clCommandQueue.enqueueNDRangeKernel(emptyKernel, cl::NullRange, cl::NDRange(num),
        cl::NullRange, NULL, &executeEmptyKernelEvent);

// Execute the divergence kernel, waiting only on its input transfer
std::vector<cl::Event> divergenceWaitEvents;
divergenceWaitEvents.push_back(writeFieldDataEvent);
clCommandQueue.enqueueNDRangeKernel(divergenceKernel, cl::NullRange, cl::NDRange(num),
        cl::NullRange, &divergenceWaitEvents, &executeDivergenceKernelEvent);

// Write source data for the caesarCipher kernel
clCommandQueue.enqueueWriteBuffer(plaintextBuffer, CL_FALSE, 0, num * sizeof(cl_char),
        plaintext.data(), NULL, &writePlainTextEvent);

// Execute the caesarCipher kernel, waiting only on its input transfer
std::vector<cl::Event> caesarCipherWaitEvents;
caesarCipherWaitEvents.push_back(writePlainTextEvent);
clCommandQueue.enqueueNDRangeKernel(caesarCipherKernel, cl::NullRange, cl::NDRange(num),
        cl::NullRange, &caesarCipherWaitEvents, &executeCaesarKernelEvent);

clCommandQueue.finish();

I set the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag on the command queue and use wait-event lists for synchronization, which I believe is all that is needed for out-of-order execution. The empty kernel depends on no data, so it should execute at the very beginning, and the write into plaintextBuffer should start right after the write into fieldDataBuffer finishes. But the result is exactly the same as in-order execution. See the results from the NVIDIA Visual Profiler:

in-order: (profiler screenshot)

out-of-order: (profiler screenshot)

Where is the problem? Does anybody have a working out-of-order sample? Thanks!

Any advice? Has nobody tried the out-of-order mode?
Besides, replies in this forum are slow to arrive. I doubt there are NVIDIA staff dealing with these posts.

Our C2050 testbed seems to be down, so I cannot check myself. But does the platform support out-of-order execution at all? AFAIK Intel is the only vendor that has implemented out-of-order queues. AMD definitely works in-order, even when told otherwise, and I’d bet a moderate amount that NVIDIA has not implemented it yet either.

Thanks for your reply.

According to the OpenCL Programming Guide for the CUDA Architecture, my GTX 570 should be capable of out-of-order execution.

Maybe the driver doesn’t implement it. There has been no OpenCL news from NVIDIA for a long time, and I am not sure they are still working on it.

If the DeviceQuery sample application does not report CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE among the device queue properties, then host-side out-of-order execution is not implemented in the runtime/drivers. Indeed, NVIDIA does not advertise OpenCL much; that may change once many applications support it, but by then the competition will be way ahead.

Thanks for your reply!

The oclDeviceQuery sample showed the following result (screenshot).

We can find “CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE” in the output, so out-of-order mode should be supported by NVIDIA’s implementation. Has anybody given it a try?

Well done!

Hello,

Did anyone find a solution to this problem? I am seeing the same thing.

I have given up looking for a solution. Let’s assume there is no out-of-order execution in NVIDIA’s current implementation.

You can get out-of-order execution using the Intel OpenCL SDK, although the scheduling of commands is not very predictable.

Thank you for your reply. The problem is that I have no choice: I can only work with Fermi (an NVIDIA card).

I have yet to see this confirmed, but in my personal experience out-of-order execution is not working as it should. I reimplemented my in-order pipeline to use events in a dependency graph for out-of-order execution, since overlapping data transfer with computation was possible, yet everything is still handled in the same order the commands are enqueued.

So my guess is that these cards support it only in the sense that your program does not crash when you enable the mode. In practice, I see no gain so far. I console myself by saying I made my code future-proof…