Emulated double precision: double-single routine header

I took the double emulation from the Mandelbrot example and made some adjustments to it. I converted the functions to operators that work with float2 and I added unary negation, division, and some extra functions to store/retrieve values in/from a double single float2. I put them together into a header file in case it’s useful for anyone else.

EDIT: I updated the header file. I made the existing * and / operators into “fast” functions and used the more precise * (and equivalent /) operator from the Mandelbrot example.

EDIT: I updated the file again, removing the questionable “fast” functions and using the new multiply from Norbert Juffa posted here in the thread.

EDIT: Another update to include Reimar’s bit-style splitting and to switch all multiplications to __fmul_rn to ensure no MAD instructions appear in generated PTX.
dsmath.h (5.7 KB)
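For anyone curious what the operator wrapping looks like, here is a minimal sketch (an illustration only, not the actual contents of dsmath.h) of the DSFUN-style addition as a float2 operator, assuming the high word is stored in .x and the low word in .y:

__device__ inline float2 operator+(float2 a, float2 b)
{
    // Knuth two-sum on the high words, then fold in the low words.
    float t1 = a.x + b.x;
    float e  = t1 - a.x;
    float t2 = ((b.x - e) + (a.x - (t1 - e))) + a.y + b.y;

    float2 c;
    c.x = e = t1 + t2;    // renormalized high word
    c.y = t2 - (e - t1);  // low word
    return c;
}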

Excellent! Thanks for sharing the package, it could be useful to many people.

Good idea!

Keep in mind, however, that, as hinted in this thread, there is currently no GPU that supports the fused multiply-add (FMA) operation in single precision (single-precision MAD on G80 and even GT200 is a very different beast). So the “fast” multiplication and division will not work on current GPUs.

It is not a matter of precision. Both versions of the multiplication have the same accuracy, but the “fast” version is uh… faster when an FMA is present, and broken otherwise.

Additionally, NVCC detects the subexpression a.x * b.x in the following code and computes it only once, which breaks the whole algorithm anyway:

float c11 = a.x * b.x;
float c21 = a.x * b.x - c11;

Actually, GT200 does support the FMA instruction, but only in double precision.

So you can still use this algorithm to perform a double-double multiplication. Something like:

__device__ inline double2 mul_fast(const double2 a, const double2 b) {
    double2 c;
    // According to DSFUN90, this is the preferred method of computing
    // c11 and c21 on hardware supporting the FMA instruction:

    // Multiply a.x * b.x using Dekker's method; the FMA recovers the
    // low-order bits of the product.
    double c11 = a.x * b.x;
    double c21 = fma(a.x, b.x, -c11);

    // Compute a.x * b.y + a.y * b.x (only high-order word is needed).
    double c2 = a.x * b.y + a.y * b.x;

    // Compute (c11, c21) + c2 using Knuth's trick, also adding low-order product.
    double t1 = c11 + c2;
    double e = t1 - c11;
    double t2 = ((c2 - e) + (c11 - (t1 - e))) + c21 + a.y * b.y;

    // The result is t1 + t2, after normalization.
    c.x = e = t1 + t2;
    c.y = t2 - (e - t1);
    return c;
}

To prevent aggressive optimizations and guarantee portability across future GPUs, you may need to turn every + into a __dadd_rn call and every * into a __dmul_rn call…
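For instance, the Knuth two-sum lines above would turn into something like this (a sketch only, with subtraction written as addition of the negated operand):

    double t1 = __dadd_rn(c11, c2);
    double e  = __dadd_rn(t1, -c11);
    // ...and likewise for the remaining additions, with the products
    // a.x * b.y, a.y * b.x and a.y * b.y going through __dmul_rn.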

It is nice to have such libraries for CUDA, thanks for your work!

Regards,

Sylvain

Note that CUDA 2.0 includes some new intrinsic functions, __fadd_rn() and __fmul_rn(), that can be used to prevent the compiler from generating MADs.
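For example, in a sketch like the one below (variable names made up for illustration), the first expression may be contracted into a single MAD, while the intrinsic version keeps the multiply and the add as separately rounded operations:

__device__ float mad_contraction_demo(float a, float b, float c)
{
    float contracted = a * b + c;                      // compiler may emit mad.f32
    float separate   = __fadd_rn(__fmul_rn(a, b), c);  // never merged into a MAD
    return contracted - separate;  // can be nonzero because MAD truncates the product
}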

Below are some validated double-single add and multiply functions written by Norbert Juffa here at NVIDIA.

Note that in general double-single is an imperfect substitute for true double precision, and we encourage customers who want the highest precision to use GPUs that support doubles in hardware (currently Tesla C1060 and GeForce GTX 260 / 280).

/* For x and y of opposite sign whose magnitude is within a factor of two
   of each other, either variant below loses accuracy. Otherwise the result
   is within 1.5 ulps of the correctly rounded result with 48-bit mantissa.
   Based on the dsadd() function in D.H. Bailey's dsfun package from netlib.
*/
__device__ float2 dblsgl_add (float2 x, float2 y)
{
    float2 z;
#if defined(__DEVICE_EMULATION__)
    volatile float t1, t2, e;
#else
    float t1, t2, e;
#endif
    t1 = x.y + y.y;
    e = t1 - x.y;
    t2 = ((y.y - e) + (x.y - (t1 - e))) + x.x + y.x;
    z.y = e = t1 + t2;
    z.x = t2 - (e - t1);
    return z;
}

/* Based on: Guillaume Da Graça, David Defour. Implementation of Float-Float
 * Operators on Graphics Hardware. RNC'7, pp. 23-32, 2006.
 */
__device__ float2 dblsgl_mul (float2 x, float2 y)
{
    float2 z;
#if defined(__DEVICE_EMULATION__)
    volatile float up, vp, u1, u2, v1, v2, mh, ml;
#else
    float up, vp, u1, u2, v1, v2, mh, ml;
#endif
    up  = x.y * 4097.0f;
    u1  = (x.y - up) + up;
    u2  = x.y - u1;
    vp  = y.y * 4097.0f;
    v1  = (y.y - vp) + vp;
    v2  = y.y - v1;
    mh  = __fmul_rn(x.y, y.y);
    ml  = (((u1 * v1 - mh) + u1 * v2) + u2 * v1) + u2 * v2;
    ml  = (__fmul_rn(x.y, y.x) + __fmul_rn(x.x, y.y)) + ml;
    z.y = up = mh + ml;
    z.x = (mh - up) + ml;
    return z;
}
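For anyone trying these out, here is a hedged usage sketch (the helper names to_dblsgl and axpy_dblsgl are made up for this example); note the word order used above, with .y holding the high-order float and .x the low-order float:

// Host-side split of a double into the (low, high) float pair used above.
float2 to_dblsgl(double d)
{
    float2 r;
    r.y = (float)d;            // high-order word
    r.x = (float)(d - r.y);    // low-order word
    return r;
}

// Device-side a*x + y in double-single arithmetic.
__device__ float2 axpy_dblsgl(float2 a, float2 x, float2 y)
{
    return dblsgl_add(dblsgl_mul(a, x), y);
}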

My understanding is that there is an FMAD instruction, but it is less accurate as it truncates before the addition. In any case, I’ve removed the fast functions.

The addition function is exactly the same as the one I had and is an exact copy of the dsadd function (with the exception of the high and low floats being swapped). The multiplication function is new to me and I’ve included it in the header file, albeit preserving the word order I had from DSFUN. I hope that’s OK.

I can understand that this is preferable, but there’s still occasion to use the double single. For my applications, most of the calculations are fine with single precision, but one equation is more sensitive and needs just a little bit more precision. The double single represents an easy and acceptable way to achieve this on existing hardware.

Thanks!

Yes, it is much less accurate: a MAD truncates the product to 24 bits before the addition. A true FMA would compute all 48 bits of the product before performing the addition.

The “fast” algorithm works by recovering the 24 low-order bits by making all the high-order bits cancel during the subtraction:

float c11 = a.x * b.x;       // high-order bits
float c21 = a.x * b.x - c11; // low-order bits

If we try to use this algorithm with a MAD, c21 would end up being mostly 0, and it would not be more accurate than just a single-precision multiplication (actually, slightly less accurate).

As long as you’re self-consistent, I guess it’s OK :)

Edit: Norbert Juffa / Da Graça-Defour’s multiplication seems to neglect the lowest-order bits, and to skip a few normalization steps, which probably causes the advertised larger error.

The DSFUN version should return a correctly rounded result, at a higher computational price though.

(not so sure about that - need to check…)

I haven’t been able to get clear performance numbers, but going by the number of instructions, this “splitting” code in the multiplication

up  = a.x * 4097.0f;
u1  = (a.x - up) + up;

can be simplified to something like this:

uint32_t tmp = *(uint32_t *)&a.x;
tmp &= ~0xfff;
u1 = *(float *)&tmp;

which AFAICT is compiled to a single PTX/GPU instruction.

StickGuy,

First of all, thanks for sharing!

I have a few doubts - maybe it is just my floating point format ignorance…

In the “set” function:

a.x = (float)b;
a.y = (float)(b - a.x);

I don't understand how this one works!

A "64-bit - 32-bit" can only be 32 bits, can't it? I.e., a 32-bit + 32-bit can never make a 64-bit number… So I wonder how this would work…

Can you kindly clarify?

Thanks
Best Regards,
Sarnath

Any arithmetic performed with a double promotes the other arguments to double as well. The idea is that a.x gets the first 23 bits of b's mantissa by simply truncating b to a float. Then, in (b - a.x), a.x's value is promoted to a double, and the promoted value is subtracted from b's value. The resulting value should be equivalent to b's value with the first 23 bits of its mantissa cleared. The remaining bits are then effectively left-shifted by 23 (with an appropriate change to the exponent), and the result is truncated to a float and stored in a.y. This should store the next 23 bits of b's mantissa in a.y.
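In code, the split and reconstruction look roughly like this (a sketch with a hypothetical helper name; the actual header may differ):

// Split a double into a double-single (high word in .x, low word in .y),
// mirroring the "set" function quoted above.
float2 ds_from_double(double b)
{
    float2 a;
    a.x = (float)b;            // the leading bits of b's mantissa
    a.y = (float)(b - a.x);    // what was lost in that first rounding
    return a;
}

// Reconstruction: (double)a.x + (double)a.y recovers b to roughly 48 bits.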

I think I've got to go read up on how floating point numbers are stored in hardware. Thanks for the explanation, though I don't understand much!

I’ve been trying to write a sqrt function with emulated double precision, and while it works in emulation mode, it gives the wrong result when running on the hardware. The code is:

// This subroutine employs the following formula (due to Alan Karp):
//
// sqrt(d) = (d * x) + 0.5 * [d - (d * x)^2] * x (approx.)
//
// where x is a single precision approximation to the reciprocal square root,
// and where the multiplications d * x and [] * x are performed with only
// single precision.
if (f2.x == 0)
    return setDoublesingle(0);

float t1, t2, t3;
float2 s0, s1;

t1 = rsqrtf(f2.x);
t2 = __fmul_rn(f2.x, t1);

s0.x = t2 * t2;
s0.y = t2 * t2 - s0.x;  // PROBLEM HERE
s1 = f2 - s0;

t3 = float(0.5) * s1.x * t1;
s0.x = t2;
s0.y = 0;
s1.x = t3;
s1.y = 0;

return s0 + s1;

The problem occurs at “PROBLEM HERE” in that this value is always 0 when not running in emulation mode. The .ptx file up to that part is:

rsqrt.f32   %f105, %f101;
mul.rn.f32  %f106, %f101, %f105;
mul.f32     %f107, %f106, %f106;
mov.f32     %f104, %f107;
sub.f32     %f103, %f107, %f107;

It looks like t2 * t2 is stored and then subtracted from itself, which results in 0. However, I also tried using __fmul_rn for the second t2 * t2, which made the .ptx file look like:

rsqrt.f32   %f105, %f101;
mul.rn.f32  %f106, %f101, %f105;
mul.f32     %f107, %f106, %f106;
mov.f32     %f104, %f107;
mul.rn.f32  %f108, %f106, %f106;
sub.f32     %f103, %f108, %f107;

I would think this should work, but it doesn't. In fact, this also breaks the code when running in emulation mode.

Any help would be appreciated.

That would only work if t2 * t2 were calculated with a higher precision than what is stored in s0.x, which on the GPU it is not. You have to use the emulated-double multiplication (splitting t2 into high and low parts).

When you port routines over from DSFUN, make sure that you use the parts that don’t assume the presence of a FMAD instruction. Furthermore, my experiments indicate that using the multiplication operator instead of __fmul_rn has a fairly negligible effect on the accuracy of the double single operations.

I’ve ported most of the double single routines over to CUDA now, but it’s not clear to me how that works with licensing (DSFUN’s license is rather vague) so I haven’t posted the results here.

Ah, that’s what I didn’t know. Thanks.

I also found this iterative formula for computing inverse square roots; using the single-precision result as a first guess, only one iteration is needed to get essentially double precision:

y = 1 / sqrt(x)
y_n+1 = 0.5 * (y_n * (3 - x * y_n^2))

which I guess is probably pretty well known, but it does work.
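In ordinary doubles that refinement step looks like this (a sketch; in the double-single setting the same step is evaluated with the emulated operations instead):

__device__ double refined_rsqrt(double x)
{
    double y = (double)rsqrtf((float)x);   // single-precision first guess
    y = 0.5 * (y * (3.0 - x * y * y));     // one Newton-Raphson iteration
    return y;
}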

So just changing that one line (s0.y = t2 * t2 - s0.x;) so that it does the two operations using emulated double precision (putting t2 and s0.x into the high-order parts and leaving the low-order parts of the new double-single variables zero), then saving the high-order part of that result in s0.y, totally works. Thanks, dudes.
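For reference, a sketch of that change, assuming the header's float2 operators (* and -) and the setDoublesingle helper used earlier, with the high-order word in .x:

    float2 t2ds  = setDoublesingle(t2);    // (t2, 0) as a double-single
    float2 s0xds = setDoublesingle(s0.x);  // (s0.x, 0)
    float2 diff  = t2ds * t2ds - s0xds;    // emulated-double multiply and subtract
    s0.y = diff.x;                         // keep only the high-order word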

I noticed you are only declaring it as a device function, not host.

In that case the following is more readable:

__int_as_float(__float_as_int(a) & ~0xfff);

(the initial underscores being optional, unfortunately these functions seem unavailable for host code).

What is the difference between the ‘*’ operator and __fmul_rn?

My floating point operation is

BYTE nOutValue = ( fFloatVaueOne * nIntegerDataOne ) + ( fFloatVaueTwo * nIntegerDataTwo );

For instance, nOutValue is 200 when I check the above equation on the CPU, but in the kernel I'm getting 199 for the same instruction with the same values.

Now I have two solutions for this problem:

1. Use __fmul_rn instead of ‘*’, i.e.,

BYTE nOutValue = __fmul_rn( fFloatVaueOne , nIntegerDataOne ) + __fmul_rn( fFloatVaueTwo , nIntegerDataTwo );

2. Add 0.5 to my result:

BYTE nOutValue = ( fFloatVaueOne * nIntegerDataOne ) + ( fFloatVaueTwo * nIntegerDataTwo ) + 0.5;

Which is the best way?

Is there any performance issue if I use __fmul_rn?

StickGuy, I need your help (sorry for my English).

I'm trying to use your header file on a trivial example.

My kernel:

/* Kernel */
__global__ void square_elements(float2* in, float2* out, int N)
{
    int idx = blockIdx.x*blockDim.x+threadIdx.x;
    if ( idx < N) out[idx]=in[idx]*in[idx];
}

My mex-function:

void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[])
{
    int i, j, m, n;
    double *data1, *data2;
    float2 *data1f, *data2f;

    /* on GPU */
    float2 *data1f_gpu, *data2f_gpu;

    mxClassID category;
    if (nrhs != nlhs)
        mexErrMsgTxt("The number of input and output arguments must be the same.");
    for (i = 0; i < nrhs; i++)
    {
        /* Find the dimensions of the data */
        m = mxGetM(prhs[i]);
        n = mxGetN(prhs[i]);

        /* Create an mxArray for the output data */
        plhs[i] = mxCreateDoubleMatrix(m, n, mxREAL);

        /* Create an input and output data array on the GPU */
        cudaMalloc( (void **) &data1f_gpu, sizeof(float2)*m*n);
        cudaMalloc( (void **) &data2f_gpu, sizeof(float2)*m*n);

        /* Retrieve the input data */
        data1 = mxGetPr(prhs[i]);

        /* Check if the input array is single or double precision */
        category = mxGetClassID(prhs[i]);
        if( category == mxSINGLE_CLASS)
        {
            /* The input array is single precision, it can be sent directly to the card */
            cudaMemcpy( data1f_gpu, data1, sizeof(float)*m*n, cudaMemcpyHostToDevice);
        }
        if( category == mxDOUBLE_CLASS)
        {
            /* The input array is in double precision, it needs to be converted to
               floats before being sent to the card */
            data1f = (float2 *) mxMalloc(sizeof(float2)*m*n);
            for (j = 0; j < m*n; j++)
            {
                set(*data1f, *data1);
            }
            cudaMemcpy( data1f_gpu, data1f, sizeof(float)*n*m, cudaMemcpyHostToDevice);
        }
        data2f = (float2 *) mxMalloc(sizeof(float2)*m*n);

        /* Compute execution configuration using 128 threads per block */
        dim3 dimBlock(128);
        dim3 dimGrid((m*n)/dimBlock.x);
        if ( (n*m) % 128 != 0 ) dimGrid.x += 1;

        /* Call function on GPU */
        square_elements<<<dimGrid,dimBlock>>>(data1f_gpu, data2f_gpu, n*m);

        /* Copy result back to host */
        cudaMemcpy( data2f, data2f_gpu, sizeof(float)*n*m, cudaMemcpyDeviceToHost);

        /* Create a pointer to the output data */
        data2 = mxGetPr(plhs[i]);

        /* Convert from single to double before returning */
        for (j = 0; j < m*n; j++)
        {
            data2[j] = data2f[j].x + data2f[j].y;
        }

        /* Clean-up memory on device and host */
        mxFree(data1f);
        mxFree(data2f);
        cudaFree(data1f_gpu);
        cudaFree(data2f_gpu);
    }
}

When I use type float2 instead of float, I receive almost identical results.

I would be very grateful if you could point out the cause of this problem.

Define the type uint32_t and add a set(double, float2) function.

I’ve been using the double precision emulation with success in complex vector and matrix algebra routines.

My question is: Can this computation principle be extended to three, four, or five floats to get something like an arbitrary precision maths implementation? I'd love to create some Mandelbrot (or 3D Mandelbulb) code that dynamically adds more and more floats (precision!) as you zoom in - but I have no idea how to attack this.

Christian