Extension cl_nv_pragma_unroll doesn't seem to work
Hello,

I am running in a curious problem concerning the unroll pragma.

When I use "#pragma unroll 4" before a loop within my kernel, I don't get any performance-improvement and NO compiler-warning.
So I've unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.

When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:
[code]
warning #161: unrecognized #pragma
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
^[/code]

What am I doing wrong?

Btw, I am using CUDA 4.0 and a Quadro 6000.

Best,
eimunic
Hello,



I am running in a curious problem concerning the unroll pragma.



When I use "#pragma unroll 4" before a loop within my kernel, I don't get any performance-improvement and NO compiler-warning.

So I've unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.



When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:



warning #161: unrecognized #pragma

#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable

^




What am I doing wrong?



Btw, I am using CUDA 4.0 and a Quadro 6000.



Best,

eimunic

#1
Posted 09/21/2011 09:57 AM   
[quote name='eimunic' date='21 September 2011 - 11:57 AM' timestamp='1316599077' post='1296356']
Hello,

I am running in a curious problem concerning the unroll pragma.

When I use "#pragma unroll 4" before a loop within my kernel, I don't get any performance-improvement and NO compiler-warning.
So I've unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.

When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:
[code]
warning #161: unrecognized #pragma
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
^[/code]

What am I doing wrong?

Btw, I am using CUDA 4.0 and a Quadro 6000.

Best,
eimunic
[/quote]

This pragma should go in the .cl file not in the .cpp file
[quote name='eimunic' date='21 September 2011 - 11:57 AM' timestamp='1316599077' post='1296356']

Hello,



I am running in a curious problem concerning the unroll pragma.



When I use "#pragma unroll 4" before a loop within my kernel, I don't get any performance-improvement and NO compiler-warning.

So I've unrolled the loop myself which resulted in a good performance gain. Thus I assume that #pragma unroll is not working correctly.



When I explicitly try to enable the cl_nv_prama_unroll extension in my .cpp file, I receive the following warning:



warning #161: unrecognized #pragma

#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable

^




What am I doing wrong?



Btw, I am using CUDA 4.0 and a Quadro 6000.



Best,

eimunic





This pragma should go in the .cl file not in the .cpp file

#2
Posted 09/25/2011 10:08 AM   
Hi laughingrice,

thanks for the advice. Now it does not produce any compiler-warnings, BUT still no speedup.

Can I somehow determine if the loop unrooling took place (beside looking at the ptx files)?

Best,
eimunic
Hi laughingrice,



thanks for the advice. Now it does not produce any compiler-warnings, BUT still no speedup.



Can I somehow determine if the loop unrooling took place (beside looking at the ptx files)?



Best,

eimunic

#3
Posted 09/28/2011 12:04 PM   
Hello,

I am still wondering why I do not get any speed up at all.

Here is what the code looks like:

[code]#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable

__kernel void dense(__global float *Ahat, __global float *bhat, __global float *x,
const int n, const int workload, __global float *buffer)
{

int idx = get_group_id(0) * workload * BLOCKDIM;
int idy = get_group_id(1) * BLOCKDIM + get_local_id(1);

__local float tile[BLOCKDIM][BLOCKDIM+1];
__local float s_x[BLOCKDIM];

/*the dense part*/
float tmp = 0.0f;
int i,j,l;

for(i=0;i < workload;++i){
/*load to shared memory*/
...
barrier(CLK_LOCAL_MEM_FENCE);

/*start computation*/
#ifdef MY_UNROLL
for(j=0;j < get_local_size(1) ;j+=16){
tmp+=tile[get_local_id(1)][j] * s_x[j];
tmp+=tile[get_local_id(1)][j+1] * s_x[j+1];
tmp+=tile[get_local_id(1)][j+2] * s_x[j+2];
tmp+=tile[get_local_id(1)][j+3] * s_x[j+3];
tmp+=tile[get_local_id(1)][j+4] * s_x[j+4];
tmp+=tile[get_local_id(1)][j+5] * s_x[j+5];
tmp+=tile[get_local_id(1)][j+6] * s_x[j+6];
tmp+=tile[get_local_id(1)][j+7] * s_x[j+7];
tmp+=tile[get_local_id(1)][j+8] * s_x[j+8];
tmp+=tile[get_local_id(1)][j+9] * s_x[j+9];
tmp+=tile[get_local_id(1)][j+10] * s_x[j+10];
tmp+=tile[get_local_id(1)][j+11] * s_x[j+11];
tmp+=tile[get_local_id(1)][j+12] * s_x[j+12];
tmp+=tile[get_local_id(1)][j+13] * s_x[j+13];
tmp+=tile[get_local_id(1)][j+14] * s_x[j+14];
tmp+=tile[get_local_id(1)][j+15] * s_x[j+15];
}
#else
#pragma unroll 16
for(j=0;j < get_local_size(1) ;++j){
tmp+=tile[get_local_id(1)][j] * s_x[j];
}
#endif
barrier(CLK_LOCAL_MEM_FENCE);
}
buffer[idy * get_num_groups(0) + get_group_id(0)] = tmp;

}
[/code]

So if I define MY_UNROLL I get a speedup of two. From my point of view the two versions (with define MY_UNROLL and without) should yield the same runtime, but they don't.

Thus it seems that the #pragma unroll is not working.

Can someone help me out, please?

Best,
eimunic
Hello,



I am still wondering why I do not get any speed up at all.



Here is what the code looks like:



#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable



__kernel void dense(__global float *Ahat, __global float *bhat, __global float *x,

const int n, const int workload, __global float *buffer)

{



int idx = get_group_id(0) * workload * BLOCKDIM;

int idy = get_group_id(1) * BLOCKDIM + get_local_id(1);



__local float tile[BLOCKDIM][BLOCKDIM+1];

__local float s_x[BLOCKDIM];



/*the dense part*/

float tmp = 0.0f;

int i,j,l;



for(i=0;i < workload;++i){

/*load to shared memory*/

...

barrier(CLK_LOCAL_MEM_FENCE);



/*start computation*/

#ifdef MY_UNROLL

for(j=0;j < get_local_size(1) ;j+=16){

tmp+=tile[get_local_id(1)][j] * s_x[j];

tmp+=tile[get_local_id(1)][j+1] * s_x[j+1];

tmp+=tile[get_local_id(1)][j+2] * s_x[j+2];

tmp+=tile[get_local_id(1)][j+3] * s_x[j+3];

tmp+=tile[get_local_id(1)][j+4] * s_x[j+4];

tmp+=tile[get_local_id(1)][j+5] * s_x[j+5];

tmp+=tile[get_local_id(1)][j+6] * s_x[j+6];

tmp+=tile[get_local_id(1)][j+7] * s_x[j+7];

tmp+=tile[get_local_id(1)][j+8] * s_x[j+8];

tmp+=tile[get_local_id(1)][j+9] * s_x[j+9];

tmp+=tile[get_local_id(1)][j+10] * s_x[j+10];

tmp+=tile[get_local_id(1)][j+11] * s_x[j+11];

tmp+=tile[get_local_id(1)][j+12] * s_x[j+12];

tmp+=tile[get_local_id(1)][j+13] * s_x[j+13];

tmp+=tile[get_local_id(1)][j+14] * s_x[j+14];

tmp+=tile[get_local_id(1)][j+15] * s_x[j+15];

}

#else

#pragma unroll 16

for(j=0;j < get_local_size(1) ;++j){

tmp+=tile[get_local_id(1)][j] * s_x[j];

}

#endif

barrier(CLK_LOCAL_MEM_FENCE);

}

buffer[idy * get_num_groups(0) + get_group_id(0)] = tmp;



}




So if I define MY_UNROLL I get a speedup of two. From my point of view the two versions (with define MY_UNROLL and without) should yield the same runtime, but they don't.



Thus it seems that the #pragma unroll is not working.



Can someone help me out, please?



Best,

eimunic

#4
Posted 10/06/2011 08:55 AM   
Two things that I can think about. Partial unrolling requires an extra conditional to handle the tail cases, maybe that takes away too much of the benefit for some reason. Another thing is that I seem to recall that the compiler had some issue with unrolling either an internal or external loop (don't recall which). Try manually setting workload to 1 and removing the external loop and see if that helps.
Also watch the compiler output for any warning regarding failed loop unrolling, or compare ptx code and see if the loop is actually unrolled.
Two things that I can think about. Partial unrolling requires an extra conditional to handle the tail cases, maybe that takes away too much of the benefit for some reason. Another thing is that I seem to recall that the compiler had some issue with unrolling either an internal or external loop (don't recall which). Try manually setting workload to 1 and removing the external loop and see if that helps.

Also watch the compiler output for any warning regarding failed loop unrolling, or compare ptx code and see if the loop is actually unrolled.

#5
Posted 10/12/2011 10:34 PM   
Scroll To Top