Out-of-order execution doesn't work!

ZHAO_Peng · March 8, 2012, 4:16am

I’d like to write a simple sample to demonstrate the out-of-order execution model and synchronization mechanism based on events. However it doesn’t work as expected and everything is same as in-order execution. I test it on OpenSUSE 12.1 with GTX 570 and driver is 295.20.

The key codes are simple. There are three kernels: empty, caesarCipher and divergence.

//The most simple kernel. Just do nothing

__kernel void empty(){

    return;

}

//Naive cipher. 

__kernel void caesarCipher(__global char* plainText){

     size_t id=get_global_id(0);

     plainText[id]=(plainText[id]+id)%256;

}

//Simple divergence computation for 1D data.

__kernel void divergence(__global float* fieldData, __global float* divergenceData, float factor){

     size_t id=get_global_id(0);

float upValue=fieldData[id];

     float bottomValue=upValue;

if(id>=2){

        bottomValue=fieldData[id-2];

     }

     size_t globalSize=get_global_size(0);

     if(id<globalSize-2){

        upValue=fieldData[id+2];

     }

divergenceData[id]=(upValue-bottomValue)*factor;

}

Kernel caesarCipher and deivergence need source data. So there five main steps includes writing two data buffer and executing three kernel.

clCommandQueue= cl::CommandQueue(clContext, *(clDevices.begin()),CL_QUEUE_PROFILING_ENABLE|CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,&clError);

         cl::Kernel emptyKernel(clProgram, "empty");

        cl::Kernel divergenceKernel(clProgram, "divergence");

        cl::Kernel caesarCipherKernel(clProgram, "caesarCipher");

//Data buffer

        cl::Buffer plaintextBuffer(clContext, CL_MEM_READ_WRITE, num*sizeof(cl_char));

        cl::Buffer fieldDataBuffer(clContext, CL_MEM_READ_ONLY, num*sizeof(cl_float));

        cl::Buffer divergenceDataBuffer(clContext, CL_MEM_WRITE_ONLY, num*sizeof(cl_float));

        caesarCipherKernel.setArg<cl::Buffer > (0, plaintextBuffer);

        divergenceKernel.setArg<cl::Buffer > (0, fieldDataBuffer) ;

        divergenceKernel.setArg<cl::Buffer > (1, divergenceDataBuffer) ;

        divergenceKernel.setArg<cl_float >   (2, 0.25)  ;

cl::Event writePlainTextEvent;

        cl::Event writeFieldDataEvent;

        cl::Event executeEmptyKernelEvent;

        cl::Event executeCaesarKernelEvent;

        cl::Event executeDivergenceKernelEvent;

//Write source data for kernel divergence

        clCommandQueue.enqueueWriteBuffer(fieldDataBuffer, CL_FALSE, 0, num*sizeof(cl_float), fieldData.data(), NULL, &writeFieldDataEvent);

//Execute the kernel empty

        clCommandQueue.enqueueNDRangeKernel(emptyKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,NULL,&executeEmptyKernelEvent);

//Execute the kernel divergence

        std::vector<cl::Event> divergenceWaitEvents;

        divergenceWaitEvents.push_back(writeFieldDataEvent);

        clCommandQueue.enqueueNDRangeKernel(divergenceKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&divergenceWaitEvents,&executeDivergenceKernelEvent);

//Write source data for kernel caesarCipher 

        clCommandQueue.enqueueWriteBuffer(plaintextBuffer, CL_FALSE, 0, num*sizeof(cl_char),plaintext.data(),NULL,&writePlainTextEvent) ;

//Execute the kernel caesarCipher

        std::vector<cl::Event> caesarCipherWaitEvents;

        caesarCipherWaitEvents.push_back(writePlainTextEvent);

        clCommandQueue.enqueueNDRangeKernel(caesarCipherKernel, cl::NullRange, cl::NDRange(num), cl::NullRange,&caesarCipherWaitEvents,&executeCaesarKernelEvent);

clCommandQueue.finish();

I set the flag CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE for command queue and use waiting events for synchronization. I think it is enough for out-of-order execution. The kernel empty depends on no data so it should execute at the very beginning and the writing data into plaintextBuffer should begin just after the finish of writing into fieldDataBuffer. But the result is very same as in order. See the results by Nvidia visual profiler

in-order:

out-of-order

Where is the problem? Anybody has the out-of-order sample? Thanks!

ZHAO_Peng · March 12, 2012, 4:56am

Any advice? Nobody tried the out-of-order mode?
Besides, it is slow to get reply in this forum. I doubt there are staffs from Nvidia who deal with these posts.

Meteorhead · March 12, 2012, 7:55am

Our C2050 testbed seems to be out-of-order, so I cannot check myself. But does the platform support out-of-order exec at all? AFAIK Intel is the only vendor who has implemented out-of-order queues. AMD definately works in-order, even when told otherwise, and I’d bet a mediocre amount NV has not implemented it yet either.

ZHAO_Peng · March 12, 2012, 10:38am

Thanks for your reply.

According to the OpenCL Programming Guide for the CUDA Architecture, my GTX 570 should be capable of out-of-order execution.

Maybe the driver doesn’t implement it. There is no news of OpenCL from Nvidia for long time. I am not sure they are still working on it.

Meteorhead · March 12, 2012, 10:05pm

If DeivceQuery sample application does not return CL_OUT_OF_ORDER_EXEC platform capability, then host side out-of-order exec is not implemented (in the runtime/drivers). Indeed, NV does not advertise OpenCL that much, however that will change when many applications will support it. However, competition by then will be way ahead.

ZHAO_Peng · March 13, 2012, 3:15am

Thanks for your reply!

The sample oclDeviceQuery showed the result as follow.

OpenCL SW Info:

CL_PLATFORM_NAME: NVIDIA CUDA

CL_PLATFORM_VERSION: OpenCL 1.1 CUDA 4.2.1

OpenCL SDK Revision: 7027912

OpenCL Device Info:

2 devices found supporting OpenCL:

Device GeForce GTX 570

CL_DEVICE_NAME: GeForce GTX 570

CL_DEVICE_VENDOR: NVIDIA Corporation

CL_DRIVER_VERSION: 295.20

CL_DEVICE_VERSION: OpenCL 1.1 CUDA

CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.1

CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU

CL_DEVICE_MAX_COMPUTE_UNITS: 15

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3

CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64

CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024

CL_DEVICE_MAX_CLOCK_FREQUENCY: 1464 MHz

CL_DEVICE_ADDRESS_BITS: 32

CL_DEVICE_MAX_MEM_ALLOC_SIZE: 319 MByte

CL_DEVICE_GLOBAL_MEM_SIZE: 1279 MByte

CL_DEVICE_ERROR_CORRECTION_SUPPORT: no

CL_DEVICE_LOCAL_MEM_TYPE: local

CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte

CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte

CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE

CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE

CL_DEVICE_IMAGE_SUPPORT: 1

CL_DEVICE_MAX_READ_IMAGE_ARGS: 128

CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8

CL_DEVICE_SINGLE_FP_CONFIG: denorms INF-quietNaNs round-to-nearest r

…

We can find “CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE”. So out-of-order mode should be supported by Nvidia’ implementation. Anybody gave a try?

pass · March 14, 2012, 4:17am

well done!External Media

mohd · March 29, 2012, 1:53pm

Hello,

did anyone get a solution to this problem?? since I have the same thing.

ZHAO_Peng · March 30, 2012, 2:27am

I have given up finding the solution. Let’s assume there is no out-of-order execution in current implementation of Nvidia.

You can get out-of-order execution using the Intel OpenCL SDK although the schedule of commands is not very predictable.

mohd · March 30, 2012, 2:55pm

Thank you for your reply, the problem that I have no choice, only to work with fermi (Nvidia card ) External Image

Cmizzie · April 16, 2012, 9:46am

I have yet to see this confirmed, but in my personal experience out-of-order execution is not working as it should. I reimplemented my in-order pipeline to use events in a dependency graph for out-of-order execution, as there was simultaneous data transfer / computation parallelism possible, yet everything is still handled in the same order as the instructions are sent to the queue.

So my guess is that these cards support it, but only in the sense that your program does not crash when you run your executable. In practice, I so far see no gain. I just keep myself quiet by telling myself I made my code future-proof…