In my program, I need to get the position of the highest set bit of an integer value, i.e. something like an integer log2.
The following code works perfectly, but it is too slow:
for (int i = 31; i >= 0; i--)
{
    if (x >> i)
        return i;
}
return 0;
The following code works almost perfectly, but sometimes it returns a wrong value and the program crashes:
return (uint)(__log2f((x | (x >> 1) | (x >> 2)) - (x >> 1)));
Is there any fast way to calculate an integer logarithm in CUDA?
Thanks in advance
I think what you want is the function __clz(x).
See page 85 of the beta 2.0 programming guide.
__clz(x) returns the number, between 0 and 32 inclusive, of consecutive zero bits
starting at the most significant bit (i.e. bit 31) of integer parameter x.
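For reference, a minimal sketch of how the highest-bit position could be derived from it (the helper name highestBitPos and the zero fallback are my own, chosen to match the slow loop above):

__device__ int highestBitPos(unsigned int x)
{
    // __clz counts leading zeros, so for x > 0 the highest set bit sits at index 31 - __clz(x).
    // Return 0 for x == 0, like the original loop.
    return (x != 0) ? (31 - __clz((int)x)) : 0;
}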
Thanks a lot! I must have missed it.
From /usr/local/cuda/include/device_functions.h:
(can someone explain how this works?)
__device_func__(int __clz(int a))
{
return (a)?(158-(__float_as_int(__uint2float_rz((unsigned int)a))>>23)):32;
}
Ah, this is what I was thinking you would do. Floats store numbers in a base-2 mantissa/exponent format. This line converts the integer to a float, then pulls the exponent bits out of the float representation and manipulates them directly. Notice the minus sign flips it around: __clz() doesn't directly give you log base 2, it counts how many zero bits sit above the most significant one bit, which is 31 - floor(log2(a)) for a > 0.
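To make the constant 158 concrete: for a > 0 the exponent field of the IEEE-754 float produced from a is 127 + floor(log2(a)), and truncating toward zero (what __uint2float_rz does) can never move the leading bit, so 158 - exponent = 31 - floor(log2(a)), which is exactly the leading-zero count. Here is a small host-side check (plain C, my own naming, not the CUDA header) that reproduces the trick with memcpy instead of __float_as_int:

#include <stdio.h>
#include <string.h>

/* Host-side re-creation of the device trick (a sketch, not the CUDA header).
   For a > 0 the float exponent field is 127 + floor(log2(a)), so
   158 - exponent = 31 - floor(log2(a)) = number of leading zeros. */
static int clz_via_float(unsigned int a)
{
    if (a == 0)
        return 32;
    float f = (float)a;  /* NOTE: the host rounds to nearest; the device uses __uint2float_rz (truncate) */
    int bits;
    memcpy(&bits, &f, sizeof bits);
    return 158 - (bits >> 23);
}

int main(void)
{
    printf("%d\n", clz_via_float(1u));          /* 31: exponent field is 127 */
    printf("%d\n", clz_via_float(0x80000000u)); /* 0:  exponent field is 158 */
    printf("%d\n", clz_via_float(0xFFFFFFFFu)); /* -1 here, because the host rounds up to 2^32;
                                                   truncation on the device gives the correct 0 */
    return 0;
}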
Unfortunately, this function is not 100% precise either.
For example, I have a device function that multiplies two binary polynomials a and b, stored in unsigned ints:
for (int i = 31; i >= 0; i--)
    if (a & (1 << i))
    {
        lores ^= (b << i);
        hires ^= (b >> (32 - i));
        a ^= (1 << i);
    }
// modulo operations
This all works perfectly. But when I want to speed it up and use __clz:
while (a)
{
    int i = 31 - __clz(a);
    lores ^= (b << i);
    hires ^= (b >> (32 - i));
    a ^= (1 << i);
}
All other code remains the same, and I get a significant performance boost, but sometimes I get an ULF (unspecified launch failure); it seems to be caused by a wrong __clz() return value and, therefore, an infinite loop. This occurs on different data sets, and in a different place on each run even with the same data set.
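One way to narrow this down (a sketch of my own, not a known fix) would be a brute-force kernel that compares __clz against the slow reference loop and records any input where they disagree:

// Compares __clz against the reference leading-zero loop for a block of
// consecutive values and records the smallest mismatching input.
// Initialise *firstBad to 0xFFFFFFFF on the host before launching.
__global__ void checkClz(unsigned int start, unsigned int *firstBad)
{
    unsigned int x = start + blockIdx.x * blockDim.x + threadIdx.x;

    // Reference count, same logic as the original slow loop.
    int ref = 32;
    for (int i = 31; i >= 0; i--)
        if (x >> i) { ref = 31 - i; break; }

    if (__clz(x) != ref)
        atomicMin(firstBad, x);  // remember the smallest failing value (needs atomics support)
}

If no mismatch ever turns up, the wrong values are more likely to come from the surrounding loop than from __clz itself.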