Accuracy of 1D linear interpolation by CUDA texture interpolation

I’m comparing 1D linear interpolation using a “standard” CUDA implementation and a “texture-based” CUDA implementation on complex numbers (float2).

The “standard” CUDA implementation comprises the following lines:

/*************************************/
/* LINEAR INTERPOLATION KERNEL - GPU */
/*************************************/
__device__ float linear_kernel_GPU(float in)
{
    // triangular (hat) kernel; fabsf keeps the computation in single precision
    return 1.0f - fabsf(in);
}

/**********************************************/
/* LINEAR INTERPOLATION KERNEL FUNCTION - GPU */
/**********************************************/
__global__ void linear_interpolation_kernel_function_GPU(float2* result_d, float2* data_d, float* x_in_d, float* x_out_d, int M, int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if(j<N)
    {
        result_d[j].x = 0.;
        result_d[j].y = 0.;
        for(int k=0; k<M; k++)
        {
            if (fabs(x_out_d[j]-x_in_d[k])<1.) {
                result_d[j].x = result_d[j].x + linear_kernel_GPU(x_out_d[j]-x_in_d[k])*data_d[k].x;
                result_d[j].y = result_d[j].y + linear_kernel_GPU(x_out_d[j]-x_in_d[k])*data_d[k].y; }
        }  
    } 
}

extern "C" void linear_interpolation_function_GPU(cuComplex* result_d, cuComplex* data_d, float* x_in_d, float* x_out_d, int M, int N){

    dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(N/BLOCK_SIZE + (N%BLOCK_SIZE == 0 ? 0:1),1);
    linear_interpolation_kernel_function_GPU<<<dimGrid,dimBlock>>>(result_d, data_d, x_in_d, x_out_d, M, N);

}

The “texture-based” CUDA implementation comprises the following lines:

texture<float2, 1, cudaReadModeElementType> data_d_texture;

// ********************************************************/
// * LINEAR INTERPOLATION KERNEL FUNCTION - GPU - TEXTURE */
// ********************************************************/
__global__ void linear_interpolation_kernel_function_GPU_texture(cuComplex* result_d, float* x_out_d, int M, int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if(j<N) result_d[j] = tex1D(data_d_texture,float(x_out_d[j]+M/2+0.5));

}

// *************************************************/
// * LINEAR INTERPOLATION FUNCTION - GPU - TEXTURE */
// *************************************************/
extern "C" void linear_interpolation_function_GPU_texture(float2* result_d, float2* data, float* x_in_d, float* x_out_d, int M, int N){

    cudaArray* data_d = NULL; cudaMallocArray(&data_d, &data_d_texture.channelDesc, M, 1);
    cudaMemcpyToArray(data_d, 0, 0, data, sizeof(float2)*M, cudaMemcpyHostToDevice);
    data_d_texture.normalized = false;
    data_d_texture.filterMode = cudaFilterModeLinear;   // set the attributes before binding so they take effect
    cudaBindTextureToArray(data_d_texture, data_d);

    dim3 dimBlock(BLOCK_SIZE,1); dim3 dimGrid(N/BLOCK_SIZE + (N%BLOCK_SIZE == 0 ? 0:1),1);
    linear_interpolation_kernel_function_GPU_texture<<<dimGrid,dimBlock>>>(result_d, x_out_d, M, N);

}

The “texture-based” interpolation is more than 20 times faster than the “standard” one. However, I noticed some mismatch in the results, with a root mean square error between the two implementations of about 0.07%.

The CUDA C Programming Guide says that the interpolation coefficients are stored in 9-bit fixed point format with 8 bits of fractional value, which may be the cause for that mismatch.
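For reference, the filtering formula from the Programming Guide can be emulated on the host to see how much error the 8 fractional bits introduce. Below is a minimal sketch; the helper name is made up here, and the exact rounding the hardware uses when quantizing the weight is an assumption:

#include <math.h>

// Host-side sketch of 1D linear texture filtering as described in the Programming Guide:
// tex(x) = (1 - alpha) * T[i] + alpha * T[i+1], with xB = x - 0.5, i = floor(xB),
// alpha = frac(xB) quantized to 9-bit fixed point (8 fractional bits).
// The caller must keep i and i+1 inside the array.
float tex1D_linear_emulated(const float* T, float x)
{
    float xB    = x - 0.5f;
    float i_f   = floorf(xB);
    int   i     = (int)i_f;
    float alpha = xB - i_f;
    alpha = floorf(alpha * 256.0f + 0.5f) / 256.0f;   // assumed round-to-nearest quantization of the weight
    return (1.0f - alpha) * T[i] + alpha * T[i + 1];
}

Comparing this against a full single-precision interpolation gives an idea of how much of the observed 0.07% RMS error is attributable to the 8-bit weights alone.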

I then have two questions:

  1. Is there any “trick” to enhance the accuracy of the “texture-based” interpolation?

  2. I think that this 9-bit representation would limit the accuracy to what I obtained here even if I moved to float4, right? In other words, there would be no point in moving from float2 to float4 to improve the accuracy of the number representation?

Thanks in advance.

What hardware are you running this on? A 20-fold speed difference seems rather odd for any hardware of the Fermi generation or later.

Christian

A 20x performance difference would indicate that there is no coalescing of global memory accesses in the version not using texture. You may want to address the reasons for that, if possible.

In order to get the benefits of both accurate, FMA-based computation and the texture cache, you may want to look into either binding a texture to the existing data and using tex1Dfetch(), or using __ldg() if you are on Kepler. Based on my experience the performance difference should be less than 50% at that point, and the single-precision interpolation will be much more accurate.
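For reference, a minimal sketch of the tex1Dfetch() route for the float2 data used here; the texture name and the wrapper are illustrative, not taken from the original code, and the legacy texture reference API of that CUDA generation is assumed:

texture<float2, 1, cudaReadModeElementType> data_d_tex;   // hypothetical texture reference over linear memory

__global__ void linear_interpolation_kernel_texfetch(float2* result_d, const float* x_in_d,
                                                      const float* x_out_d, int M, int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if (j < N)
    {
        float2 acc   = make_float2(0.0f, 0.0f);
        float  x_out = x_out_d[j];
        for (int k = 0; k < M; k++)
        {
            float d = x_out - x_in_d[k];
            if (fabsf(d) < 1.0f)
            {
                float2 v = tex1Dfetch(data_d_tex, k);   // element fetch through the texture cache, no filtering
                float  w = 1.0f - fabsf(d);             // weight computed in full FP32, FMA-friendly
                acc.x += w * v.x;
                acc.y += w * v.y;
            }
        }
        result_d[j] = acc;
    }
}

// Host side: bind the existing linear device buffer; no cudaArray and no filter mode involved.
extern "C" void bind_data_texture(const float2* data_d, int M)
{
    cudaBindTexture(NULL, data_d_tex, data_d, M * sizeof(float2));
}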

The built-in texture interpolation is indeed based on 9-bit fixed-point arithmetic and there is no way around that that I know of. I don’t understand the part of the question asking about the use of float4, as the accuracy of the interpolation is independent of the storage format.

After a quick look at your “standard” kernel it looks like:

A) Your threads are accessing every other 4-byte value - possibly only 50% coalesced
B) You're not caching reused data in shared memory or constant memory

For (A), declare

float2 reg_result; reg_result.x = 0.0f; reg_result.y = 0.0f;

And accumulate over the for loop in registers only.

Reading data_d[k].x and data_d[k].y separately would cause coalescing issues on older hardware and might still do that. Hence, instead of issuing 2 separate 4-byte accesses, issue one 8-byte access per thread:

float2 reg_data = data_d[k]; // should be a little safer.

This might not be an issue for Fermi/Kepler but I would never trust these things to the compiler :-)

For (B), consider either caching data_d in shared memory first or placing it in constant memory; since all threads of a warp access the same index, this should be beneficial.
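A minimal sketch of the shared-memory variant of (B), assuming blockDim.x == BLOCK_SIZE and reusing the names from the kernel above (illustrative only, not tuned):

__global__ void linear_interpolation_kernel_function_GPU_shared(float2* result_d, const float2* data_d,
                                                                 const float* x_in_d, const float* x_out_d,
                                                                 int M, int N)
{
    __shared__ float2 s_data[BLOCK_SIZE];
    __shared__ float  s_x_in[BLOCK_SIZE];

    int j = threadIdx.x + blockDim.x * blockIdx.x;

    float2 reg_result = make_float2(0.0f, 0.0f);
    float  reg_x_out  = (j < N) ? x_out_d[j] : 0.0f;   // threads past N still help stage the tiles

    for (int tile = 0; tile < M; tile += BLOCK_SIZE)
    {
        int k = tile + threadIdx.x;
        if (k < M) { s_data[threadIdx.x] = data_d[k]; s_x_in[threadIdx.x] = x_in_d[k]; }
        __syncthreads();

        int tile_len = min(BLOCK_SIZE, M - tile);
        for (int t = 0; t < tile_len; t++)
        {
            float d = reg_x_out - s_x_in[t];
            if (fabsf(d) < 1.0f)
            {
                float w = 1.0f - fabsf(d);      // the linear kernel, evaluated once per sample
                reg_result.x += w * s_data[t].x;
                reg_result.y += w * s_data[t].y;
            }
        }
        __syncthreads();
    }

    if (j < N) result_d[j] = reg_result;       // only in-range threads write their result
}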

Thank you for your question, Christian. The hardware is a GeForce GT 540M.

Thank you for your comments, Njuffa.

If I’m understanding correctly, you are suggesting to “replicate” the “standard” kernel using texture memory instead of global memory.

Let me say that I have a significant computational bottleneck and I’m afraid I first have to change the computational strategy to start seeing improvements.

Finally, concerning float2 or float4, I understand your point since, as you mentioned, linear texture interpolation is performed in 9-bit arithmetic.

I hope to let you know more shortly.

Thank you.

Thank you Jimmy for your suggestions.

I modified the kernel accordingly,

__global__ void linear_interpolation_kernel_function_GPU(float2* result_d, float2* data_d, float* x_in_d, float* x_out_d, int M, int N)
{
   int j = threadIdx.x + blockDim.x * blockIdx.x;

   if(j<N)
   {
	  float2 reg_result, reg_data; 
	  reg_result.x = 0.0f; reg_result.y = 0.0f;
	  float reg_kernel, reg_x_out_d=x_out_d[j];
	  for(int k=0; k<M; k++)
	  {
		  reg_data=data_d[k];
		  if (fabs(reg_x_out_d-x_in_d[k])<1.) {
			  reg_kernel = linear_kernel_GPU(reg_x_out_d-x_in_d[k]);
			  reg_result.x = reg_result.x + reg_kernel*reg_data.x;
			  reg_result.y = reg_result.y + reg_kernel*reg_data.y; }
	  }  
	  result_d[j]=reg_result;
   } 
}

with some improvements, but the linear texture interpolation remains significantly faster than the “standard” one. I’m afraid I have to modify the computational strategy of the “standard” linear interpolator.

I must be missing something. I fail to see how the texture and non-texture versions of the code are equivalent. The non-texture version has a loop that’s not there in the texture version, so the non-texture version would appear to perform M times more computational work. Also, x_in_d is never used in the texture version but is used in the non-texture version.

If this is because the texture-based version uses some clever pre-massaged version of the data, that preprocessing should be applicable to the non-texture version. I have worked on linear interpolation code before using three variants: texture interpolation, texture fetch plus FMA-based interpolation, normal array loads plus FMA-based interpolation. Except for the actual fetch and interpolation code at the core, the code was structurally identical for all three versions, which is not the case here. This puzzles me.

Is this an artifact of the forum software sometimes making portions of pasted code magically disappear?

A few things to try in practical terms:

(1) If M is typically fairly large, try requesting the compiler to partially unroll the loop, e.g. with a #pragma unroll 8.

(2) Try giving the compiler more information about the input arrays. Most seem to be read-only, so you can use the const qualifier. They also appear to be non-aliased; if so, you can use the restrict qualifier. E.g.

float2* __restrict__ result_d, const float2* __restrict__ data_d, const float* __restrict__ x_in_d, const float* __restrict__ x_out_d
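Putting (1) and (2) together, the kernel above might look like the following sketch (same launch configuration, behavior unchanged):

__global__ void linear_interpolation_kernel_function_GPU(float2* __restrict__ result_d,
                                                          const float2* __restrict__ data_d,
                                                          const float*  __restrict__ x_in_d,
                                                          const float*  __restrict__ x_out_d,
                                                          int M, int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if (j < N)
    {
        float2 reg_result = make_float2(0.0f, 0.0f);
        float  reg_x_out  = x_out_d[j];
        #pragma unroll 8
        for (int k = 0; k < M; k++)
        {
            float d = reg_x_out - x_in_d[k];
            if (fabsf(d) < 1.0f)
            {
                float  w        = 1.0f - fabsf(d);
                float2 reg_data = data_d[k];
                reg_result.x += w * reg_data.x;
                reg_result.y += w * reg_data.y;
            }
        }
        result_d[j] = reg_result;
    }
}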

Thanks Njuffa, your comments are quite correct. I was using this kernel function as a “template” kernel function valid also for other kinds of interpolators, so the for loop runs over M terms (other interpolation kernels could involve a number of terms greater than 2).

To make a more “fair” comparison, I changed the kernel function to

__global__ void linear_interpolation_kernel_function_GPU(float* result_d, float* data_d, float* x_out_d, int M, int N)
{
   int j = threadIdx.x + blockDim.x * blockIdx.x;

   if(j<N)
   {
      float reg_x_out = x_out_d[j]+M/2;
	  int k = floor(reg_x_out);
	  float a = (reg_x_out)-floorf(reg_x_out);
	  result_d[j]=(1-a)*data_d[k]+a*data_d[k+1];
   } 
}

I have flattened the float2 to float by a cast from float2* to float*, so I achieve 100% global memory store efficiency, but I still have about 40% global memory load efficiency. I also have to check the actual semantic correctness, but more or less this should be quite close to the final version. I will let you know of updates and I will also apply your last suggestions.

Thanks again.

I would suggest rewriting the actual interpolation so it can map to exactly two FMAs (the compiler cannot make this transformation under C/C++ rules as they are not numerically equivalent):

float dk = data_d[k];
float dkp1 = data_d[k+1];
result_d[j] = a * dkp1 + (-dk * a + dk);

As you noted, because of the data access pattern, you cannot get perfect coalescing for this kind of interpolation code. This is where the texture cache can provide benefits. You could replace the data_d accesses in the code above with __ldg() on Kepler, provided the array data_d is read-only across the entire kernel invocation. If the code needs to run on Fermi-class devices as well, you can bind data_d to a texture, then use tex1Dfetch() to retrieve the data. I would expect about 1.5x speedup from reading the data through the texture path, and the speed of the resulting code should be close to the performance of the texture interpolation path.

Hi Njuffa, following your suggestions (thank you very much, especially for the FMA recommendation), I ended up with the following three codes.

Global memory

__global__ void linear_interpolation_kernel_function_GPU(float* __restrict__ result_d, const float* __restrict__ data_d, const float* __restrict__ x_out_d, int M, int N)
{
   int j = threadIdx.x + blockDim.x * blockIdx.x;

   if(j<N)
   {
	  float reg_x_out = x_out_d[j/2]+M/4;
	  int k = floor(reg_x_out);
	  float a = (reg_x_out)-floorf(reg_x_out);
	  float dk = data_d[2*k+(j&1)];
	  float dkp1 = data_d[2*k+2+(j&1)];
	  result_d[j] = a * dkp1 + (-dk * a + dk);
   } 
}

Texture memory

__global__ void linear_interpolation_kernel_function_GPU_texture(float* __restrict__ result_d, const float* __restrict__ x_out_d, const int M, const int N)
{
   int j = threadIdx.x + blockDim.x * blockIdx.x;

   if(j<N)
   {
      float reg_x_out = x_out_d[j/2]+M/4;
      int k = floor(reg_x_out);
      float a = (reg_x_out)-floorf(reg_x_out);
      float dk = tex1Dfetch(data_d_GPU_texture, 2*k+(j&1));
      float dkp1 = tex1Dfetch(data_d_GPU_texture, 2*k+2+(j&1));
      result_d[j] = a * dkp1 + (-dk * a + dk);
   }
}

Texture filtering

__global__ void linear_interpolation_kernel_function_GPU_texture_filtering(float2* __restrict__ result_d, const float* __restrict__ x_out_d, const int M, const int N)
{
   int j = threadIdx.x + blockDim.x * blockIdx.x;

   if(j<N) result_d[j] = tex1D(data_d_texture,float(x_out_d[j]+M/2+0.5));

}

Here are the timings:

Kernel      Time vs. problem size (N x M)
Global      0.0296ms (1024x512) - 0.104ms (1024*4 x 512*4) - 0.399ms (1024*8 x 512*8)
Texture     0.0973ms (1024x512) - 0.374ms (1024*4 x 512*4) - 1.450ms (1024*8 x 512*8)
Filtering   0.1645ms (1024x512) - 0.249ms (1024*4 x 512*4) - 0.620ms (1024*8 x 512*8)

It seems that the version using global memory is now the fastest, which seems a bit strange to me. Does this match your experience? Does the restrict keyword give you any computational benefit?

Thank you very much.

Your performance measurements don’t match what I have seen with interpolation code locally, but of course much will depend on the access patterns as soon as caches are involved.

The interpolation code in question was written in the Fermi time frame. At that time, the variant using global loads with FMA-based interpolation was the slowest, texture loads with FMA-based interpolation was faster, and texture interpolation was the fastest. On my K20c, the texture load plus FMA variant and the texture interpolation variant are essentially the same speed. I haven’t actually timed the global load plus FMA variant; maybe I should do that in light of your results.

From my experiments I recall that there was also some performance variability based on how the work was partitioned among thread blocks and threads. The problem was very large and I used 2D grids and 2D thread blocks, if I recall correctly. I determined the best partitioning experimentally for an M2090. This partitioning may no longer be optimal for the K20. You may want to look into these aspects if you need the absolute best performance on any particular GPU. I assume the performance differences I observed are due to interactions between the caches and the block and warp schedulers, in that the exact stream of addresses presented to the caches and to the memory controller differs based on this partitioning.

As for the use of ‘const’ and ‘restrict’, I have used these wherever possible, since long before Kepler. My reasoning is that it never hurts to give the compiler as much information as possible to exploit during optimization. My observation is that the “never hurts” part is not strictly true. In some rare cases the freedom to move loads around (due to the use of restrict) caused the compiler to use so many registers that there was spilling on Fermi, which provides only 63 registers per thread. I haven’t seen this issue on Kepler yet, and since it was rare on Fermi, I use ‘const’ and ‘restrict’ where applicable, as a “best practice”. I have certainly seen quite a few cases where adding ‘const’ and ‘restrict’ gave an instant 10%-15% performance boost. But of course there are other cases where this additional information doesn’t help at all (possibly because the compiler can determine const-ness and lack of aliasing on its own).

Thanks for your comments. I have done a test on a K20 card, and here are the results:

Global
(1024*2 x 512*2) 0.0161ms   (*16) 0.0530ms   (*64) 0.181ms   (*128) 0.369ms

Texture
(1024*2 x 512*2) 0.0525ms   (*16) 0.191ms   (*64) 0.683ms   (*128) 1.33ms

Filtering
(1024*2 x 512*2) 0.152ms   (*16) 0.589ms   (*64) 1.04ms   (*128) 2.32ms

As a trend, the version using global memory still seems faster…

Again, thank you very much!

I have the feeling that there is some loss of accuracy when calculating

float a = (reg_x_out)-floorf(reg_x_out);

The difference between the CPU and GPU implementations of the code is very low when reg_x_out is lower than approximately 350, and suddenly rises above that value.

Is there any accuracy issue I should know about (or a mistake on my side) when calculating the fractional part of a floating-point number?

Thank you again.

To split a single-precision number into integer and fractional parts for interpolation, one typically uses code like

float t = truncf(x);
int i = (int)t;
float f = x - t;

where truncf(x) is x rounded towards zero. This decomposition is exact, no information is lost and there is no loss of accuracy. For positive numbers

truncf(x) == floorf(x)

where floorf(x) is x rounded to negative infinity, so your code is doing essentially the same thing.

So that step looks uncritical; the result of this computation should match between CPU and GPU. It is likely that there are numerical differences between CPU and GPU in the actual interpolation. On the one hand, the GPU computation uses FMA, which is not available on the host; on the other hand, the CPU computation may happen at much higher than single precision, e.g. when the operands are kept in the registers of the x87 FPU. Numerical differences may also come in during other parts of your computational pipeline. Have you had a chance to read the following whitepaper?

[url]http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf[/url]

To assess the quality of your CPU as well as your GPU results, I would suggest checking the accuracy against a higher precision reference, which is what I did for my interpolation code. I used double-double computation as a reference which gives close to quadruple precision.
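For illustration, here is a minimal host-side double-double sketch built from the standard error-free transformations (two_sum / two_prod); it is not the package from the registered developer website, just an outline of the technique, and a vetted library is preferable for serious use:

#include <math.h>

// double-double value: represents hi + lo, with |lo| much smaller than |hi|
struct dd { double hi, lo; };

static dd make_dd(double x) { dd r = { x, 0.0 }; return r; }

// error-free sum (Knuth's two-sum)
static dd two_sum(double a, double b)
{
    double s  = a + b;
    double bb = s - a;
    double e  = (a - (s - bb)) + (b - bb);
    dd r = { s, e }; return r;
}

// error-free product using fma()
static dd two_prod(double a, double b)
{
    double p = a * b;
    double e = fma(a, b, -p);
    dd r = { p, e }; return r;
}

static dd dd_add(dd a, dd b)
{
    dd s = two_sum(a.hi, b.hi);
    s.lo += a.lo + b.lo;
    return two_sum(s.hi, s.lo);       // simplified renormalization, adequate for FP32 comparisons
}

static dd dd_mul(dd a, dd b)
{
    dd p = two_prod(a.hi, b.hi);
    p.lo += a.hi * b.lo + a.lo * b.hi;
    return two_sum(p.hi, p.lo);       // simplified renormalization
}

// reference for the interpolation step: (1 - a) * d_k + a * d_kp1 in double-double
double lerp_reference(double dk, double dkp1, double a)
{
    dd one_minus_a = two_sum(1.0, -a);                    // error-free
    dd t = dd_add(dd_mul(one_minus_a, make_dd(dk)),
                  dd_mul(make_dd(a),  make_dd(dkp1)));
    return t.hi;
}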

Here is another thought on the relative speed of the three code variants. If you are on an sm_35 platform, the “global load” variant may get the benefit of the texture cache automagically if the compiler can prove that it is safe to use the LDG instruction, which does global loads through the texture cache. To find out whether this is the case, you can run cuobjdump --dump-sass on your binary and grep the sm_35 disassembly for “LDG”.

Thanks again for your comments, Njuffa.

Concerning the truncation, I have tried using truncf as you indicated, and the result is equivalent to using floorf, as you expected, since the numbers involved are positive. Thank you very much for also pointing out the very interesting whitepaper.

Now you are suggesting to build up a (CPU? GPU?) quadruple-precision reference to assess the accuracy of the floating-point linear interpolation. Could you please give me some hints on how to obtain quadruple-precision results from the built-in double-precision data type? This would help me construct references for assessing the accuracy of my codes.

Finally, regarding the 3.5 compute capability architecture, I’m reading back through your previous posts and I realized that you recommended using __ldg() on that architecture. So, the numbers I last gave you on the K20 may be misleading. As far as I understand, I should use

__ldg(data_d_GPU_texture[2*k+(j&1)]);

instead of

tex1Dfetch(data_d_GPU_texture,2*k+(j&1));

Is that correct? If so, tomorrow I will send you the new timings on the K20 card and also the output of the cuobjdump.

Again, thank you very much for your comments and suggestions.

You can find a collection of double-double functions for the GPU as a download on the registered developer website. I created a version of that for the CPU by making the necessary adjustments (lack of FMA on the host), but it is not in good enough shape to share. It is probably faster to search for ready-to-use C code with Google. While I personally find it useful to get an accurate assessment of numerical error by comparison with a high-precision reference (like quasi-quadruple-precision double-double), you could also try to explain the differences between your host and GPU results based on the information in the whitepaper and leave it at that.

In my experience, many differences in numerical results between CPU and GPU disappear (or become minimal) when FMA-merging is turned off. You can pass nvcc -fmad=false if you want to give that a try. Note that turning off FMA merging in the compiler typically has a negative impact both on performance and on accuracy.

While __ldg() reads data through the texture path, it does not require a texture. It is an overloaded function with the prototype T __ldg(const T *), where T is one of CUDA’s built-in types, with the possible exception of char and short, which I have been told are not supported in CUDA 5.0. Unfortunately, the information about this new sm_35 intrinsic was inadvertently omitted from the CUDA 5.0 documentation.
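As a concrete illustration, the earlier “global” kernel could route its data loads through __ldg() along these lines (a sketch, assuming sm_35 and that data_d and x_out_d are read-only for the entire launch):

__global__ void linear_interpolation_kernel_function_GPU_ldg(float* __restrict__ result_d,
                                                              const float* __restrict__ data_d,
                                                              const float* __restrict__ x_out_d,
                                                              int M, int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;

    if (j < N)
    {
        float reg_x_out = x_out_d[j/2] + M/4;
        int   k         = (int)floorf(reg_x_out);          // floorf avoids a double-precision round trip
        float a         = reg_x_out - floorf(reg_x_out);
        float dk        = __ldg(&data_d[2*k + (j&1)]);     // loads go through the read-only/texture path
        float dkp1      = __ldg(&data_d[2*k + 2 + (j&1)]);
        result_d[j]     = a * dkp1 + (-dk * a + dk);
    }
}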

Thank you Njuffa.

I tried to use the __ldg intrinsic in the “global” memory kernel and it indeed speeds it up a lot on my K20. Here are some results.

Global kernel results without __ldg

(1024*2 x 512*2) 0.0161ms   (*16) 0.0530ms   (*64) 0.181ms   (*128) 0.369ms

Global kernel results with __ldg
(1024*2 x 512*2) 0.0122ms   (*16) 0.0192ms   (*64) 0.0396ms   (*128) 0.0696ms

I must say that I do not understand the mechanism. If I understand your previous posts correctly, the __ldg intrinsic tries to exploit the texture cache even though the array has not been bound to a texture. It is then not really clear to me why it is so much faster than the case when I explicitly bind the array to the texture. I recall that the results with the array bound to the texture were

Texture kernel results
(1024*2 x 512*2) 0.0525ms   (*16) 0.191ms   (*64) 0.683ms   (*128) 1.33ms

From the prototype of the __ldg intrinsic, I understand that it does not apply to texture variables, but only to global memory pointers; am I right?

For completeness, here is the result of the cuobjdump command you recommended (regarding the “global” kernel)

Function : _Z40linear_interpolation_kernel_function_GPUPfPKfS1_ii
        /*0008*/     /*0x089c000664c03c00*/     MOV R1, c [0x0] [0x44];
        /*0010*/     /*0xa01fc00274000000*/     MOV32I R0, 0x140;
        /*0018*/     /*0x001c000a7ca00000*/     LDC R2, c [0x0] [R0];
        /*0020*/     /*0xa21fc00274000000*/     MOV32I R0, 0x144;
        /*0028*/     /*0x001c00427ca00000*/     LDC R16, c [0x0] [R0];
        /*0030*/     /*0xa41fc00274000000*/     MOV32I R0, 0x148;
        /*0038*/     /*0x001c00027ca00000*/     LDC R0, c [0x0] [R0];
        /*0048*/     /*0xa61fc00e74000000*/     MOV32I R3, 0x14c;
        /*0050*/     /*0x001c0c0e7ca00000*/     LDC R3, c [0x0] [R3];
        /*0058*/     /*0xa81fc01274000000*/     MOV32I R4, 0x150;
        /*0060*/     /*0x001c10127ca00000*/     LDC R4, c [0x0] [R4];
        /*0068*/     /*0x109c001686400000*/     S2R R5, SR33;
        /*0070*/     /*0x029c001ee4c03c00*/     MOV R7, R5;
        /*0078*/     /*0x051c001664c03c00*/     MOV R5, c [0x0] [0x28];
        /*0088*/     /*0x129c001a86400000*/     S2R R6, SR37;
        /*0090*/     /*0x031c001ae4c03c00*/     MOV R6, R6;
        /*0098*/     /*0x031c1446d1081c00*/     IMAD R17, R5, R6, R7;
        /*00a0*/     /*0x021c441edb181c00*/     ISETP.LT.AND P0, PT, R17, R4, PT;
        /*00a8*/     /*0x001e001e84801c07*/     PSETP.AND.AND P0, PT, !P0, PT, PT;
        /*00b0*/     /*0x001c3c0285800000*/     NOP;
        /*00b8*/     /*0x7c00000014800002*/     SSY 0x5b8;
        /*00c8*/     /*0x7000003c12000002*/     @P0 BRA 0x5b0;
        /*00d0*/     /*0x001c003c12000000*/     BRA 0xd8;
        /*00d8*/     /*0x089c0012e4c03c00*/     MOV R4, R17;
        /*00e0*/     /*0x011fc01674000000*/     MOV32I R5, 0x2;
        /*00e8*/     /*0x7f9c001ee4c03c00*/     MOV R7, RZ;
        /*00f0*/     /*0x029c0016e4c03c00*/     MOV R5, R5;
        /*00f8*/     /*0x029c0016e4c03c00*/     MOV R5, R5;
        /*0108*/     /*0x021c0012e4c03c00*/     MOV R4, R4;
        /*0110*/     /*0x021c001ae4c03c00*/     MOV R6, R4;
        /*0118*/     /*0x029c1812e2002000*/     LOP.XOR R4, R6, R5;
        /*0120*/     /*0x039c103edb681c00*/     ISETP.GE.AND P1, PT, R4, R7, PT;

        /*0128*/     /*0x029c0012e4c03c00*/     MOV R4, R5;
        /*0130*/     /*0x031ce81ae6100000*/     I2I R6, |R6|;
        /*0138*/     /*0x029ce816e6100000*/     I2I R5, |R5|;
        /*0148*/     /*0x029c281ee5c00800*/     I2F.F32.U32.RP R7, R5;
        /*0150*/     /*0x021c1c1e84000000*/     MUFU.RCP R7, R7;
        /*0158*/     /*0x039c001ee4c03c00*/     MOV R7, R7;
        /*0160*/     /*0xff1c1c1d4007ffff*/     IADD32I R7, R7, 0xffffffe;
        /*0168*/     /*0x039c001ee4c03c00*/     MOV R7, R7;
        /*0170*/     /*0x039c281ee5808c00*/     F2I.FTZ.U32.F32.TRUNC R7, R7;
        /*0178*/     /*0x039c1422e1c00000*/     IMUL.U32.U32 R8, R5, R7;
        /*0188*/     /*0x041ce822e6010000*/     I2I R8, -R8;
        /*0190*/     /*0x041c1c22e1c00400*/     IMUL.U32.U32.HI R8, R7, R8;
        /*0198*/     /*0x041c1c1ee0800000*/     IADD R7, R7, R8;
        /*01a0*/     /*0x031c1c1ee1c00400*/     IMUL.U32.U32.HI R7, R7, R6;
        /*01a8*/     /*0x009c1c25c0800000*/     IADD R9, R7, 0x1;
        /*01b0*/     /*0x039c1422e1c00000*/     IMUL.U32.U32 R8, R5, R7;
        /*01b8*/     /*0x041c181ae0880000*/     IADD R6, R6, -R8;
        /*01c8*/     /*0x031c141edb301c00*/     ISETP.LE.U32.AND P0, PT, R5, R6, PT;
        /*01d0*/     /*0x039c2422e5000000*/     SEL R8, R9, R7, P0;
        /*01d8*/     /*0x009c2025c0800000*/     IADD R9, R8, 0x1;
        /*01e0*/     /*0x029c181ee0880000*/     IADD R7, R6, -R5;
        /*01e8*/     /*0x031c1c1ae5000000*/     SEL R6, R7, R6, P0;
        /*01f0*/     /*0x029c181edb601c00*/     ISETP.GE.U32.AND P0, PT, R6, R5, PT;
        /*01f8*/     /*0x041c2416e5000000*/     SEL R5, R9, R8, P0;
        /*0208*/     /*0x001c3c0285800000*/     NOP;
        /*0210*/     /*0x0c00000014800000*/     SSY 0x230;
        /*0218*/     /*0x0404003c12000000*/     @P1 BRA 0x228;
        /*0220*/     /*0x029ce816e6010000*/     I2I R5, -R5;
        /*0228*/     /*0x025ffc1ee2003800*/     LOP.PASS_B.S R7, RZ, ~R4;
        /*0230*/     /*0x7f9c001ae4c03c00*/     MOV R6, RZ;
        /*0238*/     /*0x031c101edb581c00*/     ISETP.NE.AND P0, PT, R4, R6, PT;

        /*0248*/     /*0x001e001e84801c07*/     PSETP.AND.AND P0, PT, !P0, PT, PT;
        /*0250*/     /*0x029c1c12e5000000*/     SEL R4, R7, R5, P0;
        /*0258*/     /*0x021c0016e4c03c00*/     MOV R5, R4;
        /*0260*/     /*0x021c0012e4c03c00*/     MOV R4, R4;
        /*0268*/     /*0x021c0012e4c03c00*/     MOV R4, R4;
        /*0270*/     /*0x021c0012e4c03c00*/     MOV R4, R4;
        /*0278*/     /*0x011c1011c2400000*/     SHL R4, R4, 0x2;
        /*0288*/     /*0x021c0002e0800000*/     IADD R0, R0, R4;
        /*0290*/     /*0x001c0002e4c03c00*/     MOV R0, R0;
        /*0298*/     /*0x001c0000c4000000*/     LD R0, [R0];
        /*02a0*/     /*0x001c0002e4c03c00*/     MOV R0, R0;
        /*02a8*/     /*0x019c000ee4c03c00*/     MOV R3, R3;
        /*02b0*/     /*0x011fc01274000000*/     MOV32I R4, 0x2;
        /*02b8*/     /*0x7f9c001ae4c03c00*/     MOV R6, RZ;
        /*02c8*/     /*0x021c0012e4c03c00*/     MOV R4, R4;
        /*02d0*/     /*0x021c0012e4c03c00*/     MOV R4, R4;
        /*02d8*/     /*0x019c000ee4c03c00*/     MOV R3, R3;
        /*02e0*/     /*0x019c0016e4c03c00*/     MOV R5, R3;
        /*02e8*/     /*0x021c140ee2002000*/     LOP.XOR R3, R5, R4;
        /*02f0*/     /*0x031c0c3edb681c00*/     ISETP.GE.AND P1, PT, R3, R6, PT;

        /*02f8*/     /*0x021c000ee4c03c00*/     MOV R3, R4;
        /*0308*/     /*0x029ce816e6100000*/     I2I R5, |R5|;
        /*0310*/     /*0x021ce812e6100000*/     I2I R4, |R4|;
        /*0318*/     /*0x021c281ae5c00800*/     I2F.F32.U32.RP R6, R4;
        /*0320*/     /*0x021c181a84000000*/     MUFU.RCP R6, R6;
        /*0328*/     /*0x031c001ae4c03c00*/     MOV R6, R6;
        /*0330*/     /*0xff1c18194007ffff*/     IADD32I R6, R6, 0xffffffe;
        /*0338*/     /*0x031c001ae4c03c00*/     MOV R6, R6;
        /*0348*/     /*0x031c281ae5808c00*/     F2I.FTZ.U32.F32.TRUNC R6, R6;
        /*0350*/     /*0x031c101ee1c00000*/     IMUL.U32.U32 R7, R4, R6;
        /*0358*/     /*0x039ce81ee6010000*/     I2I R7, -R7;
        /*0360*/     /*0x039c181ee1c00400*/     IMUL.U32.U32.HI R7, R6, R7;
        /*0368*/     /*0x039c181ae0800000*/     IADD R6, R6, R7;
        /*0370*/     /*0x029c181ae1c00400*/     IMUL.U32.U32.HI R6, R6, R5;
        /*0378*/     /*0x009c1821c0800000*/     IADD R8, R6, 0x1;
        /*0388*/     /*0x031c101ee1c00000*/     IMUL.U32.U32 R7, R4, R6;
        /*0390*/     /*0x039c1416e0880000*/     IADD R5, R5, -R7;
        /*0398*/     /*0x029c101edb301c00*/     ISETP.LE.U32.AND P0, PT, R4, R5, PT;
        /*03a0*/     /*0x031c201ee5000000*/     SEL R7, R8, R6, P0;
        /*03a8*/     /*0x009c1c21c0800000*/     IADD R8, R7, 0x1;
        /*03b0*/     /*0x021c141ae0880000*/     IADD R6, R5, -R4;
        /*03b8*/     /*0x029c1816e5000000*/     SEL R5, R6, R5, P0;
        /*03c8*/     /*0x021c141edb601c00*/     ISETP.GE.U32.AND P0, PT, R5, R4, PT;
        /*03d0*/     /*0x039c2012e5000000*/     SEL R4, R8, R7, P0;
        /*03d8*/     /*0x001c3c0285800000*/     NOP;
        /*03e0*/     /*0x0c00000014800000*/     SSY 0x400;
        /*03e8*/     /*0x0404003c12000000*/     @P1 BRA 0x3f8;
        /*03f0*/     /*0x021ce812e6010000*/     I2I R4, -R4;
        /*03f8*/     /*0x01dffc1ae2003800*/     LOP.PASS_B.S R6, RZ, ~R3;
        /*0408*/     /*0x7f9c0016e4c03c00*/     MOV R5, RZ;
        /*0410*/     /*0x029c0c1edb581c00*/     ISETP.NE.AND P0, PT, R3, R5, PT;

        /*0418*/     /*0x001e001e84801c07*/     PSETP.AND.AND P0, PT, !P0, PT, PT;
        /*0420*/     /*0x021c180ee5000000*/     SEL R3, R6, R4, P0;
        /*0428*/     /*0x019c0012e4c03c00*/     MOV R4, R3;
        /*0430*/     /*0x019c000ee4c03c00*/     MOV R3, R3;
        /*0438*/     /*0x019c000ee4c03c00*/     MOV R3, R3;
        /*0448*/     /*0x019c000ee4c03c00*/     MOV R3, R3;
        /*0450*/     /*0x019ca80ee5c00000*/     I2F R3, R3;
        /*0458*/     /*0x019c004ae2c00000*/     FADD R18, R0, R3;
        /*0460*/     /*0x091c0012e4c03c00*/     MOV R4, R18;
        /*0468*/     /*0x0000010011000000*/     JCAL 0x0;
        /*0470*/     /*0x021c0002e4c03c00*/     MOV R0, R4;
        /*0478*/     /*0x001c004ee4c03c00*/     MOV R19, R0;
        /*0488*/     /*0x099c2c12e5400000*/     F2F.F64.F32 R4, R19;
        /*0490*/     /*0x021c3802e5400000*/     F2F.F32.F64 R0, R4;
        /*0498*/     /*0x001c0012e4c03c00*/     MOV R4, R0;
        /*04a0*/     /*0x0000010011000000*/     JCAL 0x0;
        /*04a8*/     /*0x021c0002e4c03c00*/     MOV R0, R4;
        /*04b0*/     /*0x001c0052e4c03c00*/     MOV R20, R0;
        /*04b8*/     /*0x099c484ae2c10000*/     FADD R18, R18, -R19;
        /*04c8*/     /*0x009c4401c2000000*/     LOP.AND R0, R17, 0x1;
        /*04d0*/     /*0x011c5001a1080000*/     IMAD R0, R20, 0x2, R0;
        /*04d8*/     /*0x011c0001c2400000*/     SHL R0, R0, 0x2;
        /*04e0*/     /*0x001c4002e0800000*/     IADD R0, R16, R0;
        /*04e8*/     /*0x001c0012e4c03c00*/     MOV R4, R0;
        /*04f0*/     /*0x0000010011000000*/     JCAL 0x0;
        /*04f8*/     /*0x021c0002e4c03c00*/     MOV R0, R4;
        /*0508*/     /*0x001c004ee4c03c00*/     MOV R19, R0;
        /*0510*/     /*0x009c4401c2000000*/     LOP.AND R0, R17, 0x1;
        /*0518*/     /*0x011c5001a1080000*/     IMAD R0, R20, 0x2, R0;
        /*0520*/     /*0x011c0001c2400000*/     SHL R0, R0, 0x2;
        /*0528*/     /*0x041c0001c0800000*/     IADD R0, R0, 0x8;
        /*0530*/     /*0x001c4002e0800000*/     IADD R0, R16, R0;
        /*0538*/     /*0x001c0012e4c03c00*/     MOV R4, R0;
        /*0548*/     /*0x0000010011000000*/     JCAL 0x0;
        /*0550*/     /*0x021c0002e4c03c00*/     MOV R0, R4;
        /*0558*/     /*0x001c0002e4c03c00*/     MOV R0, R0;
        /*0560*/     /*0x001c4802e3400000*/     FMUL R0, R18, R0;
        /*0568*/     /*0x099c280ee5410000*/     F2F R3, -R19;
        /*0570*/     /*0x091c0c0ee3400000*/     FMUL R3, R3, R18;
        /*0578*/     /*0x099c0c0ee2c00000*/     FADD R3, R3, R19;
        /*0588*/     /*0x019c000ee2c00000*/     FADD R3, R0, R3;
        /*0590*/     /*0x011c4401c2400000*/     SHL R0, R17, 0x2;
        /*0598*/     /*0x001c0802e0800000*/     IADD R0, R2, R0;
        /*05a0*/     /*0x001c0002e4c03c00*/     MOV R0, R0;
        /*05a8*/     /*0x001c000ce4000000*/     ST [R0], R3;
        /*05b0*/     /*0x005c3c0285800000*/     NOP.S;
        /*05b8*/     /*0x141c003c12000000*/     BRA 0x5e8;
        /*05c8*/     /*0x7f9c03fee4c03c00*/     MOV RZ, RZ;
        /*05d0*/     /*0x001c003c18000000*/     EXIT;
        /*05d8*/     /*0x7f9c03fee4c03c00*/     MOV RZ, RZ;
        /*05e0*/     /*0x001c003c18000000*/     EXIT;
        /*05e8*/     /*0x7f9c03fee4c03c00*/     MOV RZ, RZ;
        /*05f0*/     /*0x001c003c18000000*/     EXIT;
        /*05f8*/     /*0x7f9c03fee4c03c00*/     MOV RZ, RZ;
        /*0608*/     /*0x001c003c18000000*/     EXIT;
        /*0610*/     /*0x001c3c0285800000*/     NOP;
        /*0618*/     /*0x001c3c0285800000*/     NOP;
        /*0620*/     /*0x001c3c0285800000*/     NOP;
        /*0628*/     /*0x001c3c0285800000*/     NOP;
        /*0630*/     /*0x001c3c0285800000*/     NOP;
        /*0638*/     /*0x001c3c0285800000*/     NOP;
        /*0640*/     /*0xfc1c003c12007fff*/     BRA 0x640;
        /*0648*/     /*0x001c3c0285800000*/     NOP;
        /*0650*/     /*0x001c3c0285800000*/     NOP;
        /*0658*/     /*0x001c3c0285800000*/     NOP;
        /*0660*/     /*0x001c3c0285800000*/     NOP;
        /*0668*/     /*0x001c3c0285800000*/     NOP;
        /*0670*/     /*0x001c3c0285800000*/     NOP;
        /*0678*/     /*0x001c3c0285800000*/     NOP;

I will also take some time to read papers on using double-double precision on the GPU.

Again, thanks a lot.