TITAN V / Tesla async 64-bit core

Does it make sense, from a performance perspective, to convert 4 32-bit registers to 2 64-bit registers to perform a 64-bit operation such as add or mul? If a Tesla has separate 64-bit cores, and warp threads on a Tesla can truly be unsynchronized (i.e., individual program counters), could such an approach improve a kernel’s total execution time?

Oops, I meant Volta, rather than Tesla. i.e., I’m using a Titan V.

The only native 64-bit arithmetic operations are floating-point add, multiply, and multiply-add.

With that proviso, I don’t think your question makes much sense. How do you convert two 32-bit floating point numbers into a single 64-bit floating point number to do an add?

Are you talking about integer arithmetic?

Furthermore, there are no 64-bit general purpose registers in the GPU. The GPU architecture is basically a 32-bit architecture with some extensions for 64-bit addressing.

Instructions that natively operate on 64-bit data (double-precision add, multiply, and FMA as mentioned by txbob; conversions to double precision; conversions from floating-point to 64-bit integer; load/store instructions using 64-bit addresses) use aligned register pairs consisting of consecutive even/odd register numbers, such as R2:R3.

64-bit integer operations are emulated via 32-bit integer operations, using two 32-bit registers for each of the operands.
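To make that concrete, here is a minimal sketch of mine (not something you would normally write yourself): a 64-bit add composed from two 32-bit adds with carry, expressed as inline PTX, which is essentially what the compiler generates for a plain 64-bit integer addition:

__device__ unsigned long long add64_emulated (unsigned long long a,
                                              unsigned long long b)
{
    unsigned int alo = (unsigned int)a, ahi = (unsigned int)(a >> 32);
    unsigned int blo = (unsigned int)b, bhi = (unsigned int)(b >> 32);
    unsigned int rlo, rhi;
    asm ("add.cc.u32  %0, %2, %3;\n\t"   // add low halves, set the carry flag
         "addc.u32    %1, %4, %5;\n\t"   // add high halves plus carry
         : "=r"(rlo), "=r"(rhi)
         : "r"(alo), "r"(blo), "r"(ahi), "r"(bhi));
    return ((unsigned long long)rhi << 32) | rlo;
}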

Yes, I am referring to integer arithmetic. Say, for example, I wanted to add the two ints, 3 and 5. I assume they could be safely added together using an FP64 operation rather than an integer operation? Generally, my question is that the diagram of the Volta GPU [url]https://devblogs.nvidia.com/parallelforall/inside-volta/[/url] indicates FP64 cores. Are they just virtual cores and not separate from the int and FP32 cores?

Newbie here…

You can add integers accurately using floating-point operations, provided they are within a certain range. But why would you want to do that? I sense an XY problem here; what is it you are ultimately trying to accomplish, in terms of a high-level task?

Generally speaking, a floating-point multiply returns the top-most bits of the full product, while an integer multiply returns the bottom-most bits of the full product. You may be able to produce the effect of an integer multiply by using denormals to represent integers, but I haven’t thought that through.
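As an illustration (a small kernel of mine, not part of the original reply): for 32-bit operands, the ordinary * operator delivers the bottom 32 bits of the full 64-bit product, while the CUDA intrinsic __umulhi() delivers the top 32 bits:

#include <stdio.h>

__global__ void mul_halves (unsigned int a, unsigned int b)
{
    unsigned int lo = a * b;            // bottom 32 bits of the full 64-bit product
    unsigned int hi = __umulhi (a, b);  // top 32 bits of the full 64-bit product
    printf ("full product of a*b = %08x%08x\n", hi, lo);
}

int main (void)
{
    mul_halves<<<1,1>>>(0xdeadbeef, 0x12345678);
    cudaDeviceSynchronize();
    return 0;
}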

A double-precision floating-point number can only represent integers up to 2**53 exactly, so you can’t even add full 64-bit integers this way.
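A quick way to see that limit (again, just an illustrative host-side snippet): 2**53 + 1 is the first integer a double cannot hold exactly:

#include <stdio.h>

int main (void)
{
    double d = 9007199254740993.0;   /* 2**53 + 1, not representable as a double */
    printf ("%.0f\n", d);            /* prints 9007199254740992: the +1 is lost */
    return 0;
}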

In any event, given that you have a Titan V in hand, why not simply prototype whatever it is you have in mind? Below is a small program that shows how we can do integer addition with double-precision add, without conversion overhead. Note that in terms of performance, this relies on GPUs handling denormals at full speed, which I am pretty sure is the case for all of NVIDIA’s GPUs (but by all means, test that assumption).

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (unsigned long long int a, 
                        unsigned long long int b)
{
    printf ("a+b = %016llx\n", a+b);

    double da = __longlong_as_double (a);
    double db = __longlong_as_double (b);
    double r = da + db;
    unsigned long long int ir = __double_as_longlong (r);

    printf ("a+b = %016llx\n", ir);
}

int main (void)
{
    unsigned long long int a = 0x7dcba98765432ULL;
    unsigned long long int b = 0x8fedcba987654ULL;

    kernel<<<1,1>>>(a, b);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

Maybe what you have in mind is more like the following, where four integers are added pairwise using a single double-precision addition. Note that this requires avoiding overflow conditions (otherwise the lower portion of the mantissa addition will “bleed” into the higher portion, or the higher mantissa portion will “bleed” too much into the exponent field, causing incorrect results).

#include <stdio.h>
#include <stdlib.h>

__global__ void kernel (unsigned int a, 
                        unsigned int b,
                        unsigned int c, 
                        unsigned int d)
{
    printf ("a+b = %08x\n", a+b);
    printf ("c+d = %08x\n\n", c+d);

    double dac = __hiloint2double (a, c);
    double dbd = __hiloint2double (b, d);
    double r = dac + dbd;
    unsigned int ilr = __double2loint (r);
    unsigned int ihr = __double2hiint (r);

    printf ("a+b = %08x\n", ihr);
    printf ("c+d = %08x\n", ilr);
}

int main (void)
{
    unsigned int a = 0x87654;
    unsigned int b = 0x9abcd;
    unsigned int c = 0x7654321;
    unsigned int d = 0x9abcdef;

    kernel<<<1,1>>>(a, b, c, d);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
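For what it’s worth, both programs should build and run on a Titan V with a plain nvcc invocation, e.g. nvcc -arch=sm_70 demo.cu (the architecture flag is my choice here; Volta is compute capability 7.0).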

Thanks for those ideas. I’ll give them a shot and report the results. The ultimate goal is to determine whether, if I had to add up 1M pairs of ints for example, the total time required could be reduced by performing some of the adds as 32-bit and some as 64-bit. This is essentially a Volta architecture question, as I don’t know whether more cores can run simultaneously when some of the ops are 64-bit rather than all of them being 32-bit.

As you said, perhaps it’s best just to try…

I don’t have access to a Volta GPU. The trade-offs also depend on throughput. If 32-bit integers can be added at twice the rate of double-precision floating-point numbers, trying to push part of the additions through the floating-point path doesn’t seem like a win. Since DP operations need to operate on aligned register pairs, the approach also places additional constraints on register allocation that could be harmful to performance.

My recommendation would be to write code in a clear and straightforward manner, and let the profiler guide any “normal” optimizations, then resort to “ninja” optimizations only if you absolutely have to. Even if your envisioned scheme proves to be beneficial on Volta, it could easily be detrimental on other architectures (“brittle” performance).
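If you do end up prototyping, a rough microbenchmark along the following lines could serve as a starting point before digging into the profiler. The kernel names, repetition count, and launch configuration are all mine, and this only compares the two paths in isolation rather than mixing them:

#include <stdio.h>
#include <stdlib.h>

#define N     (1 << 20)   /* 1M elements, matching the use case above */
#define REPS  256         /* enough arithmetic per element to not be purely memory-bound */

__global__ void add_int32 (const unsigned int *a, unsigned int *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int x = a[i], s = 0;
        for (int k = 0; k < REPS; k++) s += x + k;   /* 32-bit integer adds */
        r[i] = s;
    }
}

__global__ void add_fp64 (const double *a, double *r, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double x = a[i], s = 0.0;
        for (int k = 0; k < REPS; k++) s += x + k;   /* double-precision adds */
        r[i] = s;
    }
}

int main (void)
{
    unsigned int *ia, *ir;
    double *da, *dr;
    cudaEvent_t start, stop;
    float ms_int, ms_dbl;
    int threads = 256, blocks = (N + threads - 1) / threads;

    cudaMalloc (&ia, N * sizeof (*ia));
    cudaMalloc (&ir, N * sizeof (*ir));
    cudaMalloc (&da, N * sizeof (*da));
    cudaMalloc (&dr, N * sizeof (*dr));
    cudaMemset (ia, 0, N * sizeof (*ia));
    cudaMemset (da, 0, N * sizeof (*da));
    cudaEventCreate (&start);
    cudaEventCreate (&stop);

    add_int32<<<blocks, threads>>>(ia, ir, N);   /* warm-up */
    cudaEventRecord (start);
    add_int32<<<blocks, threads>>>(ia, ir, N);
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);
    cudaEventElapsedTime (&ms_int, start, stop);

    add_fp64<<<blocks, threads>>>(da, dr, N);    /* warm-up */
    cudaEventRecord (start);
    add_fp64<<<blocks, threads>>>(da, dr, N);
    cudaEventRecord (stop);
    cudaEventSynchronize (stop);
    cudaEventElapsedTime (&ms_dbl, start, stop);

    printf ("int32 add kernel: %.3f ms   fp64 add kernel: %.3f ms\n", ms_int, ms_dbl);
    return EXIT_SUCCESS;
}

Keep in mind that a kernel like this is still affected by memory traffic and occupancy, so treat the numbers as rough indicators and cross-check with the profiler.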