I took the double emulation from the Mandelbrot example and made some adjustments to it. I converted the functions to operators that work with float2, and I added unary negation, division, and some extra functions to store/retrieve values in/from a double-single float2. I put them together into a header file in case it's useful for anyone else.

EDIT: I updated the header file. I made the existing * and / operators into "fast" functions and used the more precise * (and equivalent /) operator from the Mandelbrot example.

EDIT: I updated the file again, removing the questionable "fast" functions and using the new multiply from Norbert Juffa posted here in the thread.

EDIT: Another update to include Reimar's bit-style splitting and to switch all multiplications to __fmul_rn to ensure no MAD instructions appear in generated PTX.


Keep in mind however that, as hinted in [url="http://forums.nvidia.com/index.php?showtopic=44847"]this thread[/url], there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision. (single-precision MAD on G80 and even GT200 is a very different beast). So the "fast" multiplication and division will not work on current GPUs.

It is not a matter of precision. Both versions of the multiplication have the same accuracy, but the "fast" version is uh... faster when an FMA is present, and broken otherwise.

Additionally, NVCC detects the subexpression a.x * b.x in the following code and computes it only once, which breaks the whole algorithm anyway:
[code]
float c11 = a.x * b.x;
float c21 = a.x * b.x - c11;
[/code]

Actually, GT200 does support the FMA instruction, but only in double precision.

So you can still use this algorithm to perform a double-double multiplication. Something like:
[code]
__device__ inline double2 mul_fast(const double2 a, const double2 b) {
    double2 c;

    // According to DSFUN90, this is the preferred method of computing
    // c11 and c21 on hardware supporting the FMA instruction:

    // Multiply a.x * b.x using Dekker's method.
    double c11 = a.x * b.x;
    double c21 = fma(a.x, b.x, -c11);

    // Compute a.x * b.y + a.y * b.x (only the high-order word is needed).
    double c2 = a.x * b.y + a.y * b.x;

    // Compute (c11, c21) + c2 using Knuth's trick, also adding the low-order product.
    double t1 = c11 + c2;
    double e = t1 - c11;
    double t2 = ((c2 - e) + (c11 - (t1 - e))) + c21 + a.y * b.y;

    // The result is t1 + t2, after normalization.
    c.x = e = t1 + t2;
    c.y = t2 - (e - t1);

    return c;
}
[/code]

To prevent aggressive optimizations and guarantee portability across future GPUs, you may need to turn every + into a __dadd_rn call and every * into a __dmul_rn call...

It is nice to have such libraries for CUDA, thanks for your work!

Note that CUDA 2.0 includes some new intrinsic functions, __fadd_rn() and __fmul_rn() that can be used to avoid the compiler using MADs.

Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.

Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).

[code]
/* For x and y of opposite sign whose magnitude is within a factor of two
of each other either variant below loses accuracy. Otherwise the result
is within 1.5 ulps of the correctly rounded result with 48-bit mantissa.
Based on the dsadd() function in D.H. Bailey's dsfun package from netlib.
*/
__device__ float2 dblsgl_add (float2 x, float2 y)
{
float2 z;
#if defined(__DEVICE_EMULATION__)
volatile float t1, t2, e;
#else
float t1, t2, e;
#endif
t1 = x.y + y.y;
e = t1 - x.y;
t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;
z.y = e = t1 + t2;
z.x = t2 - (e - t1);
return z;
}

/* Based on: Guillaume Da Graça, David Defour. Implementation of Float-Float
* Operators on Graphics Hardware. RNC'7, pp. 23-32, 2006.
*/
__device__ float2 dblsgl_mul (float2 x, float2 y)
{
float2 z;
#if defined(__DEVICE_EMULATION__)
volatile float up, vp, u1, u2, v1, v2, mh, ml;
#else
float up, vp, u1, u2, v1, v2, mh, ml;
#endif
up = x.y * 4097.0f;
u1 = (x.y - up) + up;
u2 = x.y - u1;
vp = y.y * 4097.0f;
v1 = (y.y - vp) + vp;
v2 = y.y - v1;
mh = __fmul_rn(x.y,y.y);
ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
ml = (__fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;
z.y = up = mh + ml;
z.x = (mh - up) + ml;
return z;
}
[/code]

[quote name='Sylvain Collange' date='Jul 25 2008, 02:40 PM']Keep in mind however that, as hinted in [url="http://forums.nvidia.com/index.php?showtopic=44847"]this thread[/url], there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision. (single-precision MAD on G80 and even GT200 is a very different beast). So the "fast" multiplication and division will not work on current GPUs.
[/quote]
My understanding is that there is a FMAD instruction, but it is less accurate as it truncates before the addition. In any case, I've removed the fast functions.

[quote name='Simon Green' date='Jul 25 2008, 05:31 PM']Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.
[/quote]
The addition function is exactly the same as the one I had and is an exact copy of the dsadd function (with the exception of the high and low floats being swapped). The multiplication function is new to me and I've included it in the header file, albeit preserving the word order I had from DSFUN. I hope that's OK.
[quote name='Simon Green' date='Jul 25 2008, 05:31 PM']Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).
[/quote]
I can understand that this is preferable, but there's still occasion to use the double-single. For my applications, most of the calculations are fine with single precision, but one equation is more sensitive and needs just a little more precision. The double-single represents an easy and acceptable way to achieve this on existing hardware.


[quote name='Simon Green' date='Jul 25 2008, 06:31 PM']Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.
[/quote]
Thanks!

[quote name='StickGuy' date='Jul 25 2008, 09:00 PM']My understanding is that there is a FMAD instruction, but it is less accurate as it truncates before the addition.
[/quote]
Yes, it is [b]much[/b] less accurate: a MAD truncates the product to 24 bits before the addition. A true FMA would compute all 48 bits of the product before performing the addition.

The "fast" algorithm works by recovering the 24 low-order bits by making all the high-order bits cancel during the subtraction:
[code]
float c11 = a.x * b.x; // hi-order bits
float c21 = a.x * b.x - c11; // low-order bits
[/code]

If we try to use this algorithm with a MAD, c21 would end up being mostly 0, and it would not be more accurate than just a single-precision multiplication (actually, slightly less accurate).

[quote]The multiplication function is new to me and I've included it in the header file, albeit preserving the word order I had from DSFUN. I hope that's OK.
[/quote]
As long as you're self-consistent, I guess it's OK :)

Edit: Norbert Juffa's / Da Graça-Defour's multiplication seems to neglect the lowest-order bits and to skip a few normalization steps, which probably causes the advertised larger error.
The DSFUN version should return a correctly rounded result, at a higher computational price though.
(not so sure about that - need to check...)


[quote name='Sarnath' date='Jul 29 2008, 06:20 AM']StickGuy,
In the "set" function:
"a.x = (float)b;
a.y = (float)(b-a.x);
"
I don't understand how this one works!

A "64-bit - 32-bit" can only be "32-bit"...?? i.e. a 32-bit + 32-bit can never make a 64-bit number... So I wonder how this would work...
[right][snapback]417469[/snapback][/right]
[/quote]
Any arithmetic performed with a double promotes the other arguments to double as well. The idea is that a.x gets the first 23 bits of b's mantissa by simply truncating b to a float. Then, in (b - a.x), a.x's value is promoted to a double and subtracted from b's value. The resulting value is equivalent to b's value with the first 23 bits of its mantissa cleared; the remaining bits are effectively left-shifted by 23 (with an appropriate change to the exponent). That result is then truncated to a float and stored in a.y, which thus holds the second 23 bits of b's mantissa.


[quote name='StickGuy' date='Jul 29 2008, 02:27 PM']Any arithmetic performed with a double promotes the other arguments to double as well. The idea is that a.x gets the first 23 bits from b's mantissa by simply truncating b to a float. Then, in (b - a.x), a.x's value is promoted to a double, and then the promoted value is subtracted from b's value. The resulting value should be equivalent to b's value with the first 23 bits from b's mantissa cleared. The remaining bits should then be left-shifted by 23 (with an appropriate change to the exponent). The result is then truncated to a float and stored in a.y. This should then store the second 23 bites from b's mantissa in a.y.
[right][snapback]417515[/snapback][/right]
[/quote]

I think I need to go read up on how floating-point numbers are stored in hardware. Thanks for the explanation, though I don't understand much of it!


I've been trying to write a sqrt function with emulated double precision, and while it works in emulation mode, it gives the wrong result when running on the hardware. The code is:

[code]
// This subroutine employs the following formula (due to Alan Karp):
//
// sqrt(d) = (d * x) + 0.5 * [d - (d * x)^2] * x (approx.)
//
// where x is a single precision approximation to the reciprocal square root,
// and where the multiplications d * x and [] * x are performed with only
// single precision.

if (f2.x == 0)
    return setDoublesingle(0);

float t1, t2, t3;
float2 s0, s1;

t1 = rsqrtf(f2.x);
t2 = __fmul_rn(f2.x, t1);

s0.x = t2 * t2;
s0.y = t2 * t2 - s0.x; // PROBLEM HERE

s1 = f2 - s0;
t3 = 0.5f * s1.x * t1;

s0.x = t2;
s0.y = 0;
s1.x = t3;
s1.y = 0;

return s0 + s1;
[/code]

The problem occurs at "PROBLEM HERE" in that this value is always 0 when not running in emulation mode. The .ptx file up to that part is:

[code]
rsqrt.f32 %f105, %f101;
mul.rn.f32 %f106, %f101, %f105;
mul.f32 %f107, %f106, %f106;
mov.f32 %f104, %f107;
sub.f32 %f103, %f107, %f107;
[/code]

It looks like t2 * t2 is stored and then subtracted from itself, which results in 0. However, when I use __fmul_rn for the second t2 * t2, the .ptx file looks like:

[code]
rsqrt.f32 %f105, %f101;
mul.rn.f32 %f106, %f101, %f105;
mul.f32 %f107, %f106, %f106;
mov.f32 %f104, %f107;
mul.rn.f32 %f108, %f106, %f106;
sub.f32 %f103, %f108, %f107;
[/code]

Which I think should work, but doesn't. In fact, this also breaks the code when running in emulation mode.

[quote name='llamaschool' date='Oct 20 2008, 07:36 PM'][code]
s0.x = t2 * t2;
s0.y = t2 * t2 - s0.x; // PROBLEM HERE
[/code]
[right][snapback]454361[/snapback][/right]
[/quote]

That would only work if t2 * t2 were calculated with a higher precision than what
is stored in s0.x, which in the GPU it is not.
You have to use the emulated-double multiplication (splitting t2 in higher and lower part).

When you port routines over from DSFUN, make sure that you use the parts that don't assume the presence of a FMAD instruction. Furthermore, my experiments indicate that using the multiplication operator instead of __fmul_rn has a fairly negligible effect on the accuracy of the double single operations.

I've ported most of the double single routines over to CUDA now, but it's not clear to me how that works with licensing (DSFUN's license is rather vague) so I haven't posted the results here.


[quote name='Reimar' date='Oct 21 2008, 12:21 AM']That would only work if t2 * t2 were calculated with a higher precision than what
is stored in s0.x, which in the GPU it is not.
You have to use the emulated-double multiplication (splitting t2 in higher and lower part).
[right][snapback]454558[/snapback][/right]
[/quote]

Ah, that's what I didn't know. Thanks.

I also found this iterative formula for computing inverse square roots. Using the single-precision result as a first guess, only one iteration is needed to get essentially double precision:

y = 1 / sqrt(x)
y_n+1 = 0.5 * (y_n * (3 - x * y_n^2))

which I guess is probably pretty well known but it does work.

[quote name='Reimar' date='Oct 21 2008, 12:21 AM']That would only work if t2 * t2 were calculated with a higher precision than what

is stored in s0.x, which in the GPU it is not.

You have to use the emulated-double multiplication (splitting t2 in higher and lower part).

[snapback]454558[/snapback]

Ah, that's what I didn't know. Thanks.

I also found this iterative formula to compute inverse square roots, using the single precision as a first guess, only one iteration is needed to get essentially double precision:

y = 1 / sqrt(x)

y_n+1 = 0.5 * (y_n * (3 - x * y_n^2))

which I guess is probably pretty well known but it does work.

So just changing that one line (s0.y = t2 * t2 - s0.x;) so that it does the two operations using emulated double precision (putting t2 and s0.x into the high-order parts and leaving the low-order parts of the new double-single variables zero), then saving the high-order part of the result in s0.y, totally works. Thanks, dudes.

So just changing that one line ( s0.y = t2 * t2 - s0.x; ) so that it does the two operations using emulated double precision (putting t2 and s0.x into the higher order part and leaving the lower order part of the new double single variables zero), then saving the higher order part of that result in s0.y totally works, thanks dudes.

EDIT: I updated the header file. I made the existing * and / operators into "fast" functions and used the more precise * (and equivalent /) operator from the Mandelbrot example.

EDIT: I updated the file again, removing the questionable "fast" functions and using the new multiply from Norbert Juffa posted here in the thread.

EDIT: Another update to include Reimar's bit-style splitting and to switch all multiplications to __fmul_rn to ensure no MAD instructions appear in generated PTX.

EDIT: I updated the header file. I made the existing * and / operators into "fast" functions and used the more precise * (and equivalent /) operator from the Mandelbrot example.

EDIT: I updated the file again, removing the questionable "fast" functions and using the new multiply from Norbert Juffa posted here in the thread.

EDIT: Another update to include Reimar's bit-style splitting and to switch all multiplications to __fmul_rn to ensure no MAD instructions appear in generated PTX.

Keep in mind however that, as hinted in [url="http://forums.nvidia.com/index.php?showtopic=44847"]this thread[/url], there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision. (single-precision MAD on G80 and even GT200 is a very different beast). So the "fast" multiplication and division will not work on current GPUs.

It is not a matter of precision. Both versions of the multiplication have the same accuracy, but the "fast" version is uh... faster when an FMA is present, and broken otherwise.

Additionally, NVCC detects the subexpression a.x * b.x in the following code and computes it only once, which breaks the whole algorithm anyway :

[code]

float c11 = a.x * b.x;

float c21 = a.x * b.x - c11;

[/code]

Actually, GT200 does support the FMA instruction, but only in double precision.

So you can still use this algorithm to perform a double-double multiplication. Something like:

[code]

__device__ inline double2 mul_fast(const double2 a, const double2 b) {

double2 c;

// According to DSFUN90, this is the preferred method of computing

// c11 and c21 on hardware supporting the FMA instruction:

// Multilply a.x * b.x using Dekker's method.

double c11 = a.x * b.x;

double c21 = fma(a.x, b.x, c11);

// Compute a.x * b.y + a.y * b.x (only high-order word is needed).

double c2 = a.x * b.y + a.y * b.x;

// Compute (c11, c21) + c2 using Knuth's trick, also adding low-order product.

double t1 = c11 + c2;

double e = t1 - c11;

double t2 = ((c2 - e) + (c11 - (t1 - e))) + c21 + a.y * b.y;

// The result is t1 + t2, after normalization.

c.x = e = t1 + t2;

c.y = t2 - (e - t1);

return c;

}

[/code]

To prevent aggressive optimizations and guarantee portability across future GPUs, you may need to turn every + into a __dadd_rn call and every * into a __dmul_rn call...

It is nice to have such libraries for CUDA, thanks for your work!

Regards,

Sylvain

Keep in mind however that, as hinted in this thread, there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision. (single-precision MAD on G80 and even GT200 is a very different beast). So the "fast" multiplication and division will not work on current GPUs.

It is not a matter of precision. Both versions of the multiplication have the same accuracy, but the "fast" version is uh... faster when an FMA is present, and broken otherwise.

Additionally, NVCC detects the subexpression a.x * b.x in the following code and computes it only once, which breaks the whole algorithm anyway :

Actually, GT200 does support the FMA instruction, but only in double precision.

So you can still use this algorithm to perform a double-double multiplication. Something like:

To prevent aggressive optimizations and guarantee portability across future GPUs, you may need to turn every + into a __dadd_rn call and every * into a __dmul_rn call...

It is nice to have such libraries for CUDA, thanks for your work!

Regards,

Sylvain

Barra, a CUDA-capable GPU simulator

Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.

Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).

[code]

/* For x and y of opposite sign whose magnitude is within a factor of two

of each other either variant below loses accuracy. Otherwise the result

is within 1.5 ulps of the correctly rounded result with 48-bit mantissa.

Based on the dsadd() function in D.H. Bailey's dsfun package from netlib.

*/

__device__ float2 dblsgl_add (float2 x, float2 y)

{

float2 z;

#if defined(__DEVICE_EMULATION__)

volatile float t1, t2, e;

#else

float t1, t2, e;

#endif

t1 = x.y + y.y;

e = t1 - x.y;

t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;

z.y = e = t1 + t2;

z.x = t2 - (e - t1);

return z;

}

/* Based on: Guillaume Da Graça, David Defour. Implementation of Float-Float

* Operators on Graphics Hardware. RNC'7, pp. 23-32, 2006.

*/

__device__ float2 dblsgl_mul (float2 x, float2 y)

{

float2 z;

#if defined(__DEVICE_EMULATION__)

volatile float up, vp, u1, u2, v1, v2, mh, ml;

#else

float up, vp, u1, u2, v1, v2, mh, ml;

#endif

up = x.y * 4097.0f;

u1 = (x.y - up) + up;

u2 = x.y - u1;

vp = y.y * 4097.0f;

v1 = (y.y - vp) + vp;

v2 = y.y - v1;

mh = __fmul_rn(x.y,y.y);

ml = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;

ml = (__fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;

z.y = up = mh + ml;

z.x = (mh - up) + ml;

return z;

}

[/code]

Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.

Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).

[/quote]

My understanding is that there is a FMAD instruction, but it is less accurate as it truncates before the addition. In any case, I've removed the fast functions.

[quote name='Simon Green' date='Jul 25 2008, 05:31 PM']Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.

[/quote]

The addition function is exactly the same as the one I had and is an exact copy of the dsadd function (with the exception of the high and low floats being swapped). The multiplication function is new to me and I've included it in the header file, albeit preserving the word order I had from DSFUN. I hope that's OK.

[quote name='Simon Green' date='Jul 25 2008, 05:31 PM']Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).

[/quote]

I can understand that this is preferrable, but there's still occasion to use the double single. For my applications, most of the calculations are ok with single precision, but one equation is more sensitive and needs just a little bit more precision. The double single represents an easy and acceptable way to achieve this on existing hardwre.

My understanding is that there is a FMAD instruction, but it is less accurate as it truncates before the addition. In any case, I've removed the fast functions.

[quote name='Simon Green' date='Jul 25 2008, 05:31 PM']Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.[/quote]

Thanks!

[quote name='StickGuy' date='Jul 25 2008, 09:00 PM']My understanding is that there is a FMAD instruction, but it is less accurate as it truncates before the addition.

[/quote]

Yes, it is [b]much[/b] less accurate: a MAD truncates the product to 24 bits before the addition. A true FMA would compute all 48 bits of the product before performing the addition.

The "fast" algorithm works by recovering the 24 low-order bits by making all the hi-order bits cancel during the subtraction :

[code]

float c11 = a.x * b.x; // hi-order bits

float c21 = a.x * b.x - c11; // low-order bits

[/code]

If we try to use this algorithm with a MAD, c21 would end up being mostly 0, and it would not be more accurate than just a single-precision multiplication (actually, slightly less accurate).

[quote]The multiplication function is new to me and I've included it in the header file, albeit preserving the word order I had from DSFUN. I hope that's OK.

[/quote]

As long as you're self-consistent, I guess it's OK :)

Edit: Norbert Juffa's / da Graça-Defour's multiplication seems to neglect the lowest-order bits and to skip a few normalization steps, which probably causes the advertised larger error.

The DSFUN version should return a correctly rounded result, at a higher computational price though.

(not so sure about that - need to check...)


this "splitting" code in the multiplication

[code]

up = a.x * 4097.0f;

u1 = (a.x - up) + up;

[/code]

Can be simplified to something like this:

[code]

uint32_t tmp = *(uint32_t *)&a.x;

tmp &= ~0xfff;

u1 = *(float *)&tmp;

[/code]

which AFAICT is compiled to a single PTX/GPU instruction.

this "splitting" code in the multiplication

Can be simplified to something like this:

which AFAICT is compiled to a single PTX/GPU instruction.

First of all, thanks for sharing!

I have a few doubts - maybe it is just my floating point format ignorance...

In the "set" function:

[code]
a.x = (float)b;
a.y = (float)(b - a.x);
[/code]

I don't understand how this one works!

A "64-bit - 32-bit" can only be "32-bit"...?? i.e., a 32-bit + 32-bit can never make a 64-bit number... So I wonder how this would work.

Can you kindly clarify?

Thanks

Best Regards,

Sarnath


In the "set" function:

"a.x = (float)b;

a.y = (float)(b-a.x);

"

I dont understand how this one works!

A "64-bit - 32-bit" can only be "32-bit"... ?? i.e. A 32-bit + 32-bit can never make a 64-bit number...So, How I wonder how this would work...

[right][snapback]417469[/snapback][/right]

[/quote]

Any arithmetic performed with a double promotes the other arguments to double as well. The idea is that a.x gets the first 23 bits from b's mantissa by simply truncating b to a float. Then, in (b - a.x), a.x's value is promoted to a double, and then the promoted value is subtracted from b's value. The resulting value should be equivalent to b's value with the first 23 bits from b's mantissa cleared. The remaining bits should then be left-shifted by 23 (with an appropriate change to the exponent). The result is then truncated to a float and stored in a.y. This should then store the second 23 bites from b's mantissa in a.y.

In the "set" function:

"a.x = (float)b;

a.y = (float)(b-a.x);

"

I dont understand how this one works!

Any arithmetic performed with a double promotes the other arguments to double as well. The idea is that a.x gets the first 23 bits from b's mantissa by simply truncating b to a float. Then, in (b - a.x), a.x's value is promoted to a double, and then the promoted value is subtracted from b's value. The resulting value should be equivalent to b's value with the first 23 bits from b's mantissa cleared. The remaining bits should then be left-shifted by 23 (with an appropriate change to the exponent). The result is then truncated to a float and stored in a.y. This should then store the second 23 bites from b's mantissa in a.y.

[right][snapback]417515[/snapback][/right]

[/quote]

I think I need to go read up on how floating-point numbers are stored in hardware. Thanks for the explanation, though I don't understand much of it!


[code]
// This subroutine employs the following formula (due to Alan Karp):
//
//   sqrt(d) = (d * x) + 0.5 * [d - (d * x)^2] * x   (approx.)
//
// where x is a single precision approximation to the reciprocal square root,
// and where the multiplications d * x and [] * x are performed with only
// single precision.
if (f2.x == 0)
    return setDoublesingle(0);
float t1, t2, t3;
float2 s0, s1;
t1 = rsqrtf(f2.x);
t2 = __fmul_rn(f2.x, t1);
s0.x = t2 * t2;
s0.y = t2 * t2 - s0.x; // PROBLEM HERE
s1 = f2 - s0;
t3 = float(0.5) * s1.x * t1;
s0.x = t2;
s0.y = 0;
s1.x = t3;
s1.y = 0;
return s0 + s1;
[/code]

The problem occurs at "PROBLEM HERE" in that this value always is 0 when not running in emulation mode. The .ptx file up to that part is:

[code]
rsqrt.f32  %f105, %f101;
mul.rn.f32 %f106, %f101, %f105;
mul.f32    %f107, %f106, %f106;
mov.f32    %f104, %f107;
sub.f32    %f103, %f107, %f107;
[/code]

It looks like t2 * t2 is stored and then subtracted from itself, which results in 0. However, I also tried a __fmul_rn for the second t2 * t2, which made the .ptx file look like:

[code]
rsqrt.f32  %f105, %f101;
mul.rn.f32 %f106, %f101, %f105;
mul.f32    %f107, %f106, %f106;
mov.f32    %f104, %f107;
mul.rn.f32 %f108, %f106, %f106;
sub.f32    %f103, %f108, %f107;
[/code]

I think this should work, but it doesn't. In fact, it also breaks the code when running in emulation mode.

Any help would be appreciated.


[quote][code]
s0.x = t2 * t2;
s0.y = t2 * t2 - s0.x; // PROBLEM HERE
[/code][/quote]

That would only work if t2 * t2 were calculated with a higher precision than what is stored in s0.x, which on the GPU it is not.

You have to use the emulated-double multiplication (splitting t2 into a high and a low part).


I've ported most of the double single routines over to CUDA now, but it's not clear to me how that works with licensing (DSFUN's license is rather vague) so I haven't posted the results here.


[quote]That would only work if t2 * t2 were calculated with a higher precision than what is stored in s0.x, which on the GPU it is not.

You have to use the emulated-double multiplication (splitting t2 into a high and a low part).[/quote]

Ah, that's what I didn't know. Thanks.

I also found this iterative formula to compute inverse square roots; using the single-precision result as a first guess, only one iteration is needed to get essentially double precision:

y = 1 / sqrt(x)

y_{n+1} = 0.5 * y_n * (3 - x * y_n^2)

which I guess is probably pretty well known but it does work.
