eimunic
September 21, 2011, 9:57am
1
Hello,
I am running in a curious problem concerning the unroll pragma.
When I use “#pragma unroll 4” before a loop within my kernel, I don’t get any performance-improvement and NO compiler-warning.
So I’ve unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.
When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:
warning #161: unrecognized #pragma
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
^
What am I doing wrong?
Btw, I am using CUDA 4.0 and a Quadro 6000.
Best,
eimunic
Hello,
I am running in a curious problem concerning the unroll pragma.
When I use “#pragma unroll 4” before a loop within my kernel, I don’t get any performance-improvement and NO compiler-warning.
So I’ve unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.
When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:
warning #161: unrecognized #pragma
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
^
What am I doing wrong?
Btw, I am using CUDA 4.0 and a Quadro 6000.
Best,
eimunic
This pragma should go in the .cl file not in the .cpp file
eimunic
September 28, 2011, 12:04pm
3
Hi laughingrice,
thanks for the advice. Now it does not produce any compiler-warnings, BUT still no speedup.
Can I somehow determine if the loop unrooling took place (beside looking at the ptx files)?
Best,
eimunic
Hello,
I am still wondering why I do not get any speed up at all.
Here is what the code looks like:
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
__kernel void dense(__global float *Ahat, __global float *bhat, __global float *x,
const int n, const int workload, __global float *buffer)
{
int idx = get_group_id(0) * workload * BLOCKDIM;
int idy = get_group_id(1) * BLOCKDIM + get_local_id(1);
__local float tile[BLOCKDIM][BLOCKDIM+1];
__local float s_x[BLOCKDIM];
/*the dense part*/
float tmp = 0.0f;
int i,j,l;
for(i=0;i < workload;++i){
/*load to shared memory*/
...
barrier(CLK_LOCAL_MEM_FENCE);
/*start computation*/
#ifdef MY_UNROLL
for(j=0;j < get_local_size(1) ;j+=16){
tmp+=tile[get_local_id(1)][j] * s_x[j];
tmp+=tile[get_local_id(1)][j+1] * s_x[j+1];
tmp+=tile[get_local_id(1)][j+2] * s_x[j+2];
tmp+=tile[get_local_id(1)][j+3] * s_x[j+3];
tmp+=tile[get_local_id(1)][j+4] * s_x[j+4];
tmp+=tile[get_local_id(1)][j+5] * s_x[j+5];
tmp+=tile[get_local_id(1)][j+6] * s_x[j+6];
tmp+=tile[get_local_id(1)][j+7] * s_x[j+7];
tmp+=tile[get_local_id(1)][j+8] * s_x[j+8];
tmp+=tile[get_local_id(1)][j+9] * s_x[j+9];
tmp+=tile[get_local_id(1)][j+10] * s_x[j+10];
tmp+=tile[get_local_id(1)][j+11] * s_x[j+11];
tmp+=tile[get_local_id(1)][j+12] * s_x[j+12];
tmp+=tile[get_local_id(1)][j+13] * s_x[j+13];
tmp+=tile[get_local_id(1)][j+14] * s_x[j+14];
tmp+=tile[get_local_id(1)][j+15] * s_x[j+15];
}
#else
#pragma unroll 16
for(j=0;j < get_local_size(1) ;++j){
tmp+=tile[get_local_id(1)][j] * s_x[j];
}
#endif
barrier(CLK_LOCAL_MEM_FENCE);
}
buffer[idy * get_num_groups(0) + get_group_id(0)] = tmp;
}
So if I define MY_UNROLL I get a speedup of two. From my point of view the two versions (with define MY_UNROLL and without) should yield the same runtime, but they don’t.
Thus it seems that the #pragma unroll is not working.
Can someone help me out, please?
Best,
eimunic
Two things that I can think about. Partial unrolling requires an extra conditional to handle the tail cases, maybe that takes away too much of the benefit for some reason. Another thing is that I seem to recall that the compiler had some issue with unrolling either an internal or external loop (don’t recall which). Try manually setting workload to 1 and removing the external loop and see if that helps.
Also watch the compiler output for any warning regarding failed loop unrolling, or compare ptx code and see if the loop is actually unrolled.