Compile float as 64-bit floating point

Hello, I am writing code for my bachelor thesis and I need to compare 32-bit with 64-bit float precision.

I have written code (which I cannot show here because of its complexity) and implemented a simple switch for toggling between single and double floating point precision:

#define FLOAT32BIT float
#define FLOAT64BIT double
#define FLOAT FLOAT32BIT

and I use FLOAT everywhere in my code (instead of float or double).
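To illustrate, a minimal sketch of how a kernel then looks (this is a made-up example, not my actual code):

__global__ void axpy(FLOAT a, const FLOAT *x, FLOAT *y, int n)
{
    // every parameter and variable uses FLOAT, so changing the one
    // #define above switches the whole kernel between FP32 and FP64
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}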

Now I would expect the performance to be around 32 times better when using 32-bit floats (FP32) instead of 64-bit floats (FP64), according to the FP32:FP64 performance ratio of my Maxwell GTX 960.
However, my code has the same runtime for FP32 and FP64. My conclusion is that the code is compiled as either FP32 or FP64 every time, independently of what I have set in my code. That would explain why the runtime is always the same.

I am compiling with Nsight Eclipse, and it would be great if someone could tell me what I have to do to get the switching between FP32 and FP64 working.

In this article they say that double precision is supported on Compute Capability 2.0 or higher, and I am compiling for CC 5.2.

Any help is appreciated, and I will try to provide any required information as fast as possible.

Another possible conclusion is that your code is limited by something other than FP32 or FP64 throughput.

What you have shown is certainly a possible approach to switching between 32-bit and 64-bit for floating point operations. It seems unlikely that you have made a mistake somewhere else in this respect, but anything is possible for code you have not shown.

You can attempt to ascertain what your code may be limited by using one of the profilers.

Well, I am new to CUDA; my application is indeed not running optimally, and I am not sure why.

I have used the profiler and found that the GPU is only using about 20% of its capacity. However, I am unsure whether this is because I am not providing enough threads for computation or because of, e.g., a memory bottleneck.

I thought that was a separate problem with my code, but it may be that these two problems are related.

Additionally, I have read somewhere that I need to enable double precision by setting the compiler flag -arch sm_13. But to me it looks like this would compile for CC 1.3. However, there is no checkbox for enabling CC 1.3 in Nsight Eclipse, and I also found no option to turn on double-precision compilation without adding something manually.

Can you give me some tips on how to determine with the profiler where the problems in my code are?
I have not seen anything suspicious in the profiler yet.

If you are using CUDA 7.0 or newer, there is nothing you need to do to enable double precision. Since there is no checkbox for cc1.3, it means you are using CUDA 7.0 or newer, which no longer supports these early devices (cc1.x).
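If you want to be explicit about the target architecture anyway, the command-line equivalent of what Nsight EE generates would look roughly like this (myapp.cu is just a placeholder; your project settings may add further flags):

nvcc -gencode arch=compute_52,code=sm_52 -o myapp myapp.cu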

Using nvvp, you can run the guided analysis, and it will analyze your code and make suggestions, including what it thinks your performance is limited by. You will need to get to the point where you select an individual kernel and run “perform kernel analysis”.

nvvp is the same as the profiler built into Nsight EE, so you can just use the profiling facility in Nsight EE.
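If you prefer the command line, nvprof can also tell you directly whether your kernels execute single- or double-precision floating-point operations, for example (metric names assume a reasonably recent CUDA version and device):

nvprof --metrics flop_count_sp,flop_count_dp ./myapp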

Thanks for your comment, it really helped me find the guided application analysis! :)

I am profiling the application right now, and it seems that FP64 is indeed being used.

So this question seems to be answered. However, the question of why my application is so slow is still open, but that is the topic for another question.

Thx for your help!

Another thing to consider is that constants defined without an f suffix are interpreted by the compiler as doubles, so if you perform computations involving constants, it could be that it’s doing them in double precision in both cases.

e.g. 3.00 is interpreted as a double, as opposed to 3.00f which is seen by the compiler as a float.
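If you keep the FLOAT macro approach, one way to keep literals in the matching precision is to route them through a small helper macro (FLOAT_C is just a made-up name for illustration):

#define FLOAT_C(x) ((FLOAT)(x))  // literal takes on whatever precision FLOAT is

FLOAT a = FLOAT_C(3.00) * b;     // float math when FLOAT is float, double math when FLOAT is double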

Excellent point! This is a common but subtle problem with real performance impact. Simple code like

float a=1.0/b;

will invoke the slow path (b is promoted to double, the division is performed in double precision, and the result is converted back to float)!
There is an extremely useful ptxas switch to identify it: --warn-on-double-precision-use. I wish it were enabled by default.
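Passed through nvcc, it would look something like this (sm_52 chosen to match the GTX 960 discussed above):

nvcc -Xptxas --warn-on-double-precision-use -arch=sm_52 -o myapp myapp.cu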

I don’t think turning that flag on by default would be a good idea. At this point, all GPU architectures supported by CUDA provide support for double precision, so it would be entirely reasonable for the CUDA standard math library to make limited use of double precision (in a slow path, for example). The math library code is injected when PTX is compiled to SASS, and is therefore subject to all PTXAS flags.

Given how many questions from CUDA programmers the very limited use of local memory in the CUDA standard math library has raised, I foresee a flood of similar messages about double-precision usage warnings if the flag --warn-on-double-precision-use were turned on by default: “Why am I getting this warning? My source code is ‘float clean’!”