Emulated double precision Double single routine header
I took the double emulation from the Mandelbrot example and made some adjustments to it. I converted the functions to operators that work with float2, added unary negation and division, and added some extra functions to store/retrieve values in/from a double-single float2. I put them together into a header file in case it's useful for anyone else.

EDIT: I updated the header file. I made the existing * and / operators into "fast" functions and used the more precise * (and equivalent /) operator from the Mandelbrot example.

EDIT: I updated the file again, removing the questionable "fast" functions and using the new multiply from Norbert Juffa posted here in the thread.

EDIT: Another update to include Reimar's bit-level splitting and to switch all multiplications to __fmul_rn so that no MAD instructions appear in the generated PTX.
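For readers who don't open the attachment, here is a rough sketch of the kind of interface described above. The names (dsset, dsget) and the bodies are illustrative assumptions, not necessarily what dsmath.h actually contains; the only detail taken from the thread is that the high-order word lives in .x and the low-order word in .y.

[code]
// Illustrative sketch only (not the actual dsmath.h).
// Convention (from the set() function quoted later in the thread):
// .x holds the high-order word, .y the low-order word.

// Store a double into a double-single float2.
inline __host__ __device__ float2 dsset(double b)
{
   float2 a;
   a.x = (float)b;           // high-order word
   a.y = (float)(b - a.x);   // low-order word (residual)
   return a;
}

// Retrieve the stored value as a double.
inline __host__ __device__ double dsget(float2 a)
{
   return (double)a.x + (double)a.y;
}

// Unary negation: negate both words.
inline __host__ __device__ float2 operator-(float2 a)
{
   return make_float2(-a.x, -a.y);
}

// operator+, operator*, operator/ would wrap dsadd/dsmul/dsdiv-style
// routines such as the ones posted below.
[/code]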
Attachments

dsmath.h

#1
Posted 07/22/2008 01:25 PM   
Excellent! Thanks for sharing the package; it could be useful to many people.

#2
Posted 07/22/2008 06:45 PM   
Good idea!

Keep in mind, however, that as hinted in [url="http://forums.nvidia.com/index.php?showtopic=44847"]this thread[/url], there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision (single-precision MAD on G80, and even GT200, is a very different beast). So the "fast" multiplication and division will not work on current GPUs.

It is not a matter of precision. Both versions of the multiplication have the same accuracy, but the "fast" version is uh... faster when an FMA is present, and broken otherwise.

Additionally, NVCC detects the subexpression a.x * b.x in the following code and computes it only once, which breaks the whole algorithm anyway:
[code]
float c11 = a.x * b.x;
float c21 = a.x * b.x - c11;
[/code]


Actually, GT200 does support the FMA instruction, but only in double precision.

So you can still use this algorithm to perform a double-double multiplication. Something like:
[code]
__device__ inline double2 mul_fast(const double2 a, const double2 b) {
   double2 c;

   // According to DSFUN90, this is the preferred method of computing
   // c11 and c21 on hardware supporting the FMA instruction:

   // Multiply a.x * b.x using Dekker's method: the FMA recovers the
   // rounding error of c11.
   double c11 = a.x * b.x;
   double c21 = fma(a.x, b.x, -c11);

   // Compute a.x * b.y + a.y * b.x (only high-order word is needed).
   double c2 = a.x * b.y + a.y * b.x;

   // Compute (c11, c21) + c2 using Knuth's trick, also adding low-order product.
   double t1 = c11 + c2;
   double e = t1 - c11;
   double t2 = ((c2 - e) + (c11 - (t1 - e))) + c21 + a.y * b.y;

   // The result is t1 + t2, after normalization.
   c.x = e = t1 + t2;
   c.y = t2 - (e - t1);

   return c;
}
[/code]

To prevent aggressive optimizations and guarantee portability across future GPUs, you may need to turn every + into a __dadd_rn call and every * into a __dmul_rn call...
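For instance, the first few lines of the multiplication above would turn into something like this (just a sketch of the substitution, using __dadd_rn with a negated operand where a subtraction is needed):

[code]
// Sketch: the same operations with the round-to-nearest double intrinsics,
// so the compiler cannot contract or re-associate them.
double c11 = __dmul_rn(a.x, b.x);
double c21 = fma(a.x, b.x, -c11);                       // FMA stays as-is
double c2  = __dadd_rn(__dmul_rn(a.x, b.y), __dmul_rn(a.y, b.x));
double t1  = __dadd_rn(c11, c2);
double e   = __dadd_rn(t1, -c11);                       // t1 - c11
// ... and similarly for the remaining additions and subtractions.
[/code]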

It is nice to have such libraries for CUDA, thanks for your work!

Regards,
Sylvain
#3
Posted 07/25/2008 01:40 PM   
Note that CUDA 2.0 includes some new intrinsic functions, __fadd_rn() and __fmul_rn(), that can be used to keep the compiler from generating MADs.

Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.

Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).

[code]
/* For x and y of opposite sign whose magnitude is within a factor of two
  of each other either variant below loses accuracy. Otherwise the result
  is within 1.5 ulps of the correctly rounded result with 48-bit mantissa.
  Based on the dsadd() function in D.H. Bailey's dsfun package from netlib.
*/
__device__ float2 dblsgl_add (float2 x, float2 y)
{
   float2 z;
#if defined(__DEVICE_EMULATION__)
   volatile float t1, t2, e;
#else
   float t1, t2, e;
#endif
   t1 = x.y + y.y;
   e = t1 - x.y;
   t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;
   z.y = e = t1 + t2;
   z.x = t2 - (e - t1);
   return z;
}

/* Based on: Guillaume Da Graça, David Defour. Implementation of Float-Float
* Operators on Graphics Hardware. RNC'7, pp. 23-32, 2006.
*/
__device__ float2 dblsgl_mul (float2 x, float2 y)
{
   float2 z;
#if defined(__DEVICE_EMULATION__)
   volatile float up, vp, u1, u2, v1, v2, mh, ml;
#else
   float up, vp, u1, u2, v1, v2, mh, ml;
#endif
   up  = x.y * 4097.0f;
   u1  = (x.y - up) + up;
   u2  = x.y - u1;
   vp  = y.y * 4097.0f;
   v1  = (y.y - vp) + vp;
   v2  = y.y - v1;
   mh  = __fmul_rn(x.y,y.y);
   ml  = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
   ml  = (__fmul_rn(x.y,y.x) + __fmul_rn(x.x,y.y)) + ml;
   z.y = up = mh + ml;
   z.x = (mh - up) + ml;
   return z;
}
[/code]
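As a quick illustration of the point about the intrinsics (a sketch, not part of Norbert's code): a plain a * b + c may be contracted into a MAD, which truncates the intermediate product, whereas spelling it out with the intrinsics keeps two separately rounded operations:

[code]
__device__ float mad_may_truncate(float a, float b, float c)
{
   return a * b + c;                      // compiler is free to emit a MAD
}

__device__ float no_mad(float a, float b, float c)
{
   return __fadd_rn(__fmul_rn(a, b), c);  // separate rounded multiply and add
}
[/code]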

#4
Posted 07/25/2008 04:31 PM   
[quote name='Sylvain Collange' date='Jul 25 2008, 02:40 PM']Keep in mind however that, as hinted in [url="http://forums.nvidia.com/index.php?showtopic=44847"]this thread[/url], there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision. (single-precision MAD on G80 and even GT200 is a very different beast). So the "fast" multiplication and division will not work on current GPUs.
[/quote]
My understanding is that there is an FMAD instruction, but it is less accurate because it truncates the product before the addition. In any case, I've removed the fast functions.

[quote name='Simon Green' date='Jul 25 2008, 05:31 PM']Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.
[/quote]
The addition function is the same as the one I had and is an exact copy of the dsadd function (except that the high and low floats are swapped). The multiplication function is new to me and I've included it in the header file, albeit preserving the word order I had from DSFUN. I hope that's OK.
[quote name='Simon Green' date='Jul 25 2008, 05:31 PM']Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).
[/quote]
I can understand that this is preferable, but there's still occasion to use the double single. For my applications, most of the calculations are fine with single precision, but one equation is more sensitive and needs just a little more precision. The double single is an easy and acceptable way to achieve this on existing hardware.

#5
Posted 07/25/2008 07:00 PM   
[quote name='Simon Green' date='Jul 25 2008, 06:31 PM']Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.
[/quote]
Thanks!

[quote name='StickGuy' date='Jul 25 2008, 09:00 PM']My understanding is that there is a FMAD instruction, but it is less accurate as it truncates before the addition.
[/quote]
Yes, it is [b]much[/b] less accurate: a MAD truncates the product to 24 bits before the addition. A true FMA would compute all 48 bits of the product before performing the addition.

The "fast" algorithm works by recovering the 24 low-order bits by making all the hi-order bits cancel during the subtraction :
[code]
float c11 = a.x * b.x;    // hi-order bits
float c21 = a.x * b.x - c11; // low-order bits
[/code]

If we try to use this algorithm with a MAD, c21 would end up being mostly 0, and it would not be more accurate than just a single-precision multiplication (actually, slightly less accurate).

[quote]The multiplication function is new to me and I've included it in the header file, albeit preserving the word order I had from DSFUN.  I hope that's OK.
[/quote]
As long as you're self-consistent, I guess it's OK :)

Edit: the Norbert Juffa / Da Graça-Defour multiplication seems to neglect the lowest-order bits and to skip a few normalization steps, which probably causes the advertised larger error.
The DSFUN version should return a correctly rounded result, at a higher computational price though.
(not so sure about that - need to check...)
#6
Posted 07/26/2008 01:00 PM   
I haven't been able to get clear performance numbers, but going by the number of instructions, this "splitting" code in the multiplication

[code]
up  = a.x * 4097.0f;
u1  = (a.x - up) + up;
[/code]

can be simplified to something like this:
[code]
       uint32_t tmp = *(uint32_t *)&a.x;
       tmp &= ~0xfff;
       u1 = *(float *)&tmp;
[/code]

which AFAICT is compiled to a single PTX/GPU instruction.
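The same bit-level truncation can also be written with the __float_as_int / __int_as_float intrinsics, which avoids the pointer type-punning. A sketch, following the variable names in the snippet above:

[code]
// Sketch: clear the low 12 mantissa bits of a.x via the reinterpret intrinsics.
int   bits = __float_as_int(a.x);
float u1   = __int_as_float(bits & ~0xfff);   // high part (12 mantissa bits kept)
float u2   = a.x - u1;                        // low part (exact remainder)
[/code]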

#7
Posted 07/27/2008 08:46 AM   
StickGuy,

First of all, Thanks for sharing!

I have a few doubts - maybe it is just my floating-point format ignorance...

In the "set" function:
"a.x = (float)b;
a.y = (float)(b-a.x);
"

I don't understand how this one works!

A "64-bit - 32-bit" can only be "32-bit"... ?? i.e. A 32-bit + 32-bit can never make a 64-bit number...So, How I wonder how this would work...

Can you kindly clarify??

Thanks
Best Regards,
Sarnath

#8
Posted 07/29/2008 05:20 AM   
[quote name='Sarnath' date='Jul 29 2008, 06:20 AM']StickGuy,
In the "set" function:
"a.x = (float)b;
a.y = (float)(b-a.x);
"
I dont understand how this one works!

A "64-bit - 32-bit" can only be "32-bit"... ?? i.e. A 32-bit + 32-bit can never make a 64-bit number...So, How I wonder how this would work...
[right][snapback]417469[/snapback][/right]
[/quote]
Any arithmetic performed with a double promotes the other arguments to double as well. The idea is that a.x gets the first 23 bits from b's mantissa by simply truncating b to a float. Then, in (b - a.x), a.x's value is promoted to a double, and then the promoted value is subtracted from b's value. The resulting value should be equivalent to b's value with the first 23 bits from b's mantissa cleared. The remaining bits should then be left-shifted by 23 (with an appropriate change to the exponent). The result is then truncated to a float and stored in a.y. This should then store the second 23 bits from b's mantissa in a.y.
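A tiny host-side example (not from the thread) that shows the split numerically:

[code]
#include <stdio.h>

int main(void)
{
   double b  = 3.141592653589793;        /* value to split               */
   float  hi = (float)b;                 /* leading ~24 significant bits */
   float  lo = (float)(b - (double)hi);  /* next ~24 bits (the residual) */

   printf("b       = %.17g\n", b);
   printf("hi only = %.17g\n", (double)hi);
   printf("hi + lo = %.17g\n", (double)hi + (double)lo);
   return 0;
}
[/code]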

#9
Posted 07/29/2008 08:57 AM   
[quote name='StickGuy' date='Jul 29 2008, 02:27 PM']Any arithmetic performed with a double promotes the other arguments to double as well.  The idea is that a.x gets the first 23 bits from b's mantissa by simply truncating b to a float.  Then, in (b - a.x), a.x's value is promoted to a double, and then the promoted value is subtracted from b's value.  The resulting value should be equivalent to b's value with the first 23 bits from b's mantissa cleared.  The remaining bits should then be left-shifted by 23 (with an appropriate change to the exponent).  The result is then truncated to a float and stored in a.y.  This should then store the second 23 bites from b's mantissa in a.y.
[right][snapback]417515[/snapback][/right]
[/quote]

I think I need to go read up on how floating-point numbers are stored in hardware. Thanks for the explanation, though I don't understand much of it!

#10
Posted 07/29/2008 09:06 AM   
I've been trying to write a sqrt function with emulated double precision, and while it works in emulation mode, it gives the wrong result when running on the hardware. The code is:

[code]
// This subroutine employs the following formula (due to Alan Karp):
//
// sqrt(d) = (d * x) + 0.5 * [d - (d * x)^2] * x (approx.)
//
// where x is a single precision approximation to the reciprocal square root,
// and where the multiplications d * x and [] * x are performed with only
// single precision.

if (f2.x == 0)
 return setDoublesingle(0);

float t1, t2, t3;
float2 s0, s1;

t1 = rsqrtf(f2.x);
t2 = __fmul_rn(f2.x, t1);

s0.x = t2 * t2;
s0.y = t2 * t2 - s0.x;  // PROBLEM HERE

s1 = f2 - s0;

t3 = float(0.5) * s1.x * t1;
s0.x = t2;
s0.y = 0;
s1.x = t3;
s1.y = 0;

return s0 + s1;
[/code]

The problem occurs at "PROBLEM HERE": this value is always 0 when not running in emulation mode. The .ptx file up to that part is:

[code]
rsqrt.f32  %f105, %f101;       //
mul.rn.f32  %f106, %f101, %f105; //
mul.f32  %f107, %f106, %f106;  //
mov.f32  %f104, %f107;         //
sub.f32  %f103, %f107, %f107;  //
[/code]

It looks like t2 * t2 is stored and then subtracted from itself, which results in 0. However, I also tried using __fmul_rn for the second t2 * t2, which made the .ptx file look like:

[code]
rsqrt.f32  %f105, %f101;       //
mul.rn.f32  %f106, %f101, %f105; //
mul.f32  %f107, %f106, %f106;  //
mov.f32  %f104, %f107;         //
mul.rn.f32  %f108, %f106, %f106; //
sub.f32  %f103, %f108, %f107;  //
[/code]

I think this should work, but it doesn't. In fact, this also breaks the code when running in emulation mode.

Any help would be appreciated.

#11
Posted 10/20/2008 06:36 PM   
[quote name='llamaschool' date='Oct 20 2008, 07:36 PM'][code]
s0.x = t2 * t2;
s0.y = t2 * t2 - s0.x;  // PROBLEM HERE
[/code]
[right][snapback]454361[/snapback][/right]
[/quote]

That would only work if t2 * t2 were calculated with higher precision than what is stored in s0.x, which on the GPU it is not.
You have to use the emulated-double multiplication (splitting t2 into a high and a low part).

#12
Posted 10/21/2008 06:21 AM   
When you port routines over from DSFUN, make sure that you use the parts that don't assume the presence of an FMAD instruction. Furthermore, my experiments indicate that using the multiplication operator instead of __fmul_rn has a fairly negligible effect on the accuracy of the double-single operations.

I've ported most of the double single routines over to CUDA now, but it's not clear to me how that works with licensing (DSFUN's license is rather vague) so I haven't posted the results here.

#13
Posted 10/21/2008 12:34 PM   
[quote name='Reimar' date='Oct 21 2008, 12:21 AM']That would only work if t2 * t2 were calculated with a higher precision than what
is stored in s0.x, which in the GPU it is not.
You have to use the emulated-double multiplication (splitting t2 in higher and lower part).
[right][snapback]454558[/snapback][/right]
[/quote]

Ah, that's what I didn't know. Thanks.



I also found this iterative formula for computing inverse square roots; using the single-precision result as a first guess, only one iteration is needed to get essentially double precision:

y = 1 / sqrt(x)
y_n+1 = 0.5 * (y_n * (3 - x * y_n^2))

which I guess is probably pretty well known but it does work.
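In terms of the header's float2 operators, one refinement step might look roughly like this (a sketch only; the function name is hypothetical, the +, -, * overloads are assumed to come from the header, and the high-order word is assumed to be in .x):

[code]
// Sketch: one Newton-Raphson refinement of y ~ 1/sqrt(x) carried out in
// double-single arithmetic, using the float2 operator overloads.
__device__ float2 ds_rsqrt(float2 x)
{
   float2 y, three, half;

   y.x     = rsqrtf(x.x);  y.y     = 0.0f;   // single-precision first guess
   three.x = 3.0f;         three.y = 0.0f;
   half.x  = 0.5f;         half.y  = 0.0f;

   // One iteration: y = 0.5 * (y * (3 - x * y^2)), all in double-single.
   y = half * (y * (three - x * (y * y)));
   return y;
}
[/code]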

#14
Posted 10/21/2008 02:55 PM   
So just changing that one line (s0.y = t2 * t2 - s0.x;) so that it does the two operations using emulated double precision (putting t2 and s0.x into the higher-order part and leaving the lower-order part of the new double-single variables zero), then saving the higher-order part of that result in s0.y, totally works. Thanks, dudes.
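In code, the fix might look roughly like this (a sketch using the header's assumed float2 operators, not the exact code):

[code]
// Sketch of the fix: compute t2*t2 - s0.x with the double-single operators
// (high-order word in .x, low-order word zeroed), then keep the high word.
float2 t2ds, s0xds, diff;

t2ds.x  = t2;    t2ds.y  = 0.0f;
s0xds.x = s0.x;  s0xds.y = 0.0f;

diff = (t2ds * t2ds) - s0xds;   // emulated-double multiply and subtract
s0.y = diff.x;                  // high-order word of the difference
[/code]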

#15
Posted 10/21/2008 04:35 PM   