Output difference between Quadro K600 and K620

Hi,

I have an image-processing CUDA application that was previously used with a K600 card; recently the hardware was changed to a K620. The problem is that the output values now come out slightly different from before, so my client is a little concerned about accuracy. If it is a known fact that there are differences in floating-point calculations between Kepler and Maxwell, I would be very grateful if someone could share a link confirming that. [Note: not all of the resulting pixel values change, but a large number of pixels show a change in the decimal part.]

[P.S. I am building my application with the CUDA 4.0 toolkit and VS 2008. There are some instructions on the NVIDIA site for supporting the Maxwell architecture, but my application runs properly on the Maxwell card, so I'm not sure whether I missed something or whether any new changes are needed in the CUDA rule set. I tried putting sm_35 in the rule set (as mentioned on the NVIDIA site for Maxwell), but the build fails with the sm_35 configuration. So I am currently building with the default rule sets of the CUDA 4.0 toolkit.]

With so little information about your platform, the application, and the exact nature of the differences it is difficult to even speculate what the root cause for the differences could be. When I debug such issues I follow the data differences back through the code, until I find where the data first diverges. This kind of debugging can be performed with tools as primitive as a log generated from printf() calls in the code.
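For illustration only, here is a minimal sketch of that kind of instrumentation (the kernel, names, and arithmetic are hypothetical, not taken from the application in question): print an intermediate value for one known pixel on both cards, collect the logs, and diff them to find the first stage where the values diverge. Device-side printf() requires building for sm_20 or higher.

#include <cstdio>

__global__ void process(const float *in, float *out, int width)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    float v = in[y * width + x] * 0.5f + 1.0f;   // stand-in for the real per-pixel math
    if (x == 123 && y == 45)                     // trace one fixed pixel
        printf("stage1 pixel(123,45) = %.9g\n", v);
    out[y * width + x] = v;   // assumes the launch grid exactly covers the image
}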

Here are some basic things you may want to consider.

You are using a very old CUDA version which certainly has no support for the Maxwell-based K620; it probably does not even have support for sm_35. This means that you are relying on JIT compilation of PTX intermediate code into machine code, which can be a source of additional issues such as lower performance. If possible, I would suggest upgrading to CUDA 6.5 (this may require an upgrade to Visual Studio; I don't think VS2008 is supported anymore).

I am not sure which “default” build rules you refer to. The compiler default in CUDA 4.0 was to build for an sm_10 target, which is definitely not what you want. The K600 is based on GK107, so sm_30 would be the appropriate target architecture.
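For example, once you have upgraded to a newer toolkit (CUDA 6.5 or later), a build line along the following lines would generate native machine code for both cards plus PTX for future GPUs; the file names here are placeholders:

nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50 -o myapp mycode.cu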

Make sure your application checks the return status of every CUDA, CUBLAS, CUFFT, etc API call, and every kernel launch.
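One common way to do this (just a sketch, not the only possible idiom; the names in the usage comments are placeholders) is to wrap every runtime API call in a checking macro and to check each kernel launch immediately after it:

#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// usage:
// CUDA_CHECK(cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice));
// myKernel<<<grid, block>>>(d_buf);
// CUDA_CHECK(cudaGetLastError());        // catches launch configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution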

Does your code use floating-point atomics? Since floating-point arithmetic is not associative, this can cause different results depending on the order in which operations occur, which could well be changed by moving to a different GPU.
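To illustrate the point with a hypothetical kernel (not taken from your application): when many threads add into the same location with atomicAdd(), the hardware serializes the additions in whatever order the threads happen to arrive, and because every intermediate sum is rounded, the final value depends on that order.

__global__ void accumulate(const float *in, float *sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(sum, in[i]);   // order of the additions is not deterministic
}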

Check for race conditions and out-of-bounds accesses in your GPU code with cuda-memcheck. Your problem may also be in the host code. Use a tool equivalent to the Linux tool valgrind to check for out-of-bounds accesses and uninitialized data in host code.
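For example, assuming the executable is called myapp:

cuda-memcheck ./myapp

Newer CUDA versions also provide a race detector for shared memory (cuda-memcheck --tool racecheck ./myapp), and on the Linux host side a plain valgrind ./myapp run will flag out-of-bounds accesses and uses of uninitialized data.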

Sorry if my question was not very clear; I only started working with CUDA this week.

By "default rule set" I meant these two:
NvCudaDriverApi.v4.0.rules
NvCudaRuntimeApi.v4.0.rules

As per the Maxwell Compatibility Guide,

Maxwell should be supported on toolkit versions below 5.5.

By difference in resulting pixels, I am referring to this kind of change:
K600 13764.99902
K620 13765.00000
or,
K600 13694.99609
K620 13694.99414

Before the K620, my application was being built for sm_21. This is an old code base, so previously it was not optimized for the K600 either. Now the requirement is to support the K620 (Maxwell), and I want to build the application in a way that uses the Maxwell card properly. So are you absolutely sure that this cannot be done with the CUDA toolkit 4.0 + VS 2008 combination?

Regarding the difference in results, let me confirm one thing: my results always come out the same when the computation is done on the K620 card, so this thread - https://devtalk.nvidia.com/default/topic/782499/cuda-result-changes-time-to-time/?offset=6 - is not the issue I am facing. My problem is the difference between the two cards. So, as I said before, if I can get some official links regarding this "could well be changed by moving to a different GPU" point, that would be very helpful.

— Sorry again if my questions seem naive.

As the Maxwell Compatibility Guide points out, you need CUDA version 6.x to have native support for Maxwell in the tool chain. Using code built with any version prior to 6.0 requires JIT compilation of the PTX intermediate format produced by these older CUDA versions into Maxwell machine language. While this should work just fine from a functional perspective, it is unlikely to let you take full advantage of the Maxwell architecture.

My take on this approach to forward compatibility by PTX JIT compilation is that it is best treated as a temporary solution until programmers have had the time to switch to a CUDA version with native support for the new architecture. Other people may have a different assessment.

In practical terms, one problem with using a very old CUDA version is that very few people will be able to remember specific limitations, issues, caveats, and bugs that may have existed in that version and could have a bearing on a particular observation; I know I certainly can't. In addition, it could be difficult for anybody to reproduce any such observations.

There can certainly be differences between numerical results when different CUDA versions are used, for example due to improvements to the math library, should the code use mathematical functions. Certain compiler optimizations that differ between CUDA versions could also cause differences in the results of floating-point computations. If I understand correctly, in this case you are using the same sm_21 Fermi code base for the Kepler-based K600 and the Maxwell-based K620, therefore excluding this scenario.

An additional aspect is that the JIT compiler could render identical PTX code into machine code with different amounts of contraction of FMUL followed by FADD into FMA (fused multiply-add), based on the target architecture, and this could cause small changes to floating-point results. The only way to confirm this hypothesis would be to disassemble the machine code, which is non-trivial in a JIT environment. However, my understanding is that you are using the same driver, and thus the same JIT compiler, with both the K600 and the K620, which makes this scenario unlikely, as I would expect the FMA-contraction optimization to be machine independent.

Do these result differences occur in a tightly controlled experiment, where all hardware and all software stays exactly the same, and you only swap the K600 with the K620? The floating-point operations provided by the GPU hardware follow the IEEE-754 standard and their results will therefore be identical across Fermi, Kepler, and Maxwell. Any differences in the results of the computation must be either due to different code being run or different input data being processed.

Both of your numerical differences are within the possible variation of floating point arithmetic. And it’s possible that Kepler and Maxwell can have slightly different execution order of threads as well as blocks due to the differences in the SM architecture, as well as differing numbers of SMs per GPU, even if they are executing exactly the same machine code.

These differences in execution order can lead to (typically small) differences in floating point results, due to effects such as the non-associativity of floating point arithmetic (A+B)+C != A+(B+C)
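A tiny host-side example of that non-associativity (the values are chosen purely for illustration):

#include <cstdio>

int main()
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    printf("(a+b)+c = %g\n", (a + b) + c);   // prints 1
    printf("a+(b+c) = %g\n", a + (b + c));   // prints 0, because b+c rounds back to -1e8
    return 0;
}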

Furthermore, to extend the JIT explanation given by njuffa above, CUDA 4.0 natively supported only Fermi devices (cc2.x). Therefore both Kepler (K600, cc3.0) and Maxwell (K620, cc5.0) are supported in this scenario via driver JIT. There is no guarantee that the actual machine code produced by a forward JIT to cc3.0 will be the same machine code produced by a forward JIT to cc5.0. Therefore, the two GPUs are almost certainly running different machine code, and different machine code can lead to slightly different (but valid in either case) numerical results for IEEE-754 floating point arithmetic, due to various effects such as the associativity consideration already mentioned.

Even if you moved to a current toolkit (which would probably be worth doing anyway), there's no guarantee of identical floating point results if you target cc3.0 in one case and cc5.0 in the other. All of the above considerations can still result in different execution order, which, as we've seen, can produce (typically slightly) different numerical results.

Floating point numerical differences, especially small ones like these cannot be explained (except in a hand-waving sort of way) using general discussions. It’s usually necessary to go through the exacting analysis outlined by njuffa.

— Yes, it's a tightly controlled environment. I am just changing the card and running again.
The input data is also the same in both cases; I have compared both the input buffer and the input parameters in both cases, and they are the same.

— Hi, thanks for your input.
But can you please explain whether your statement

" Therefore, the two GPUs are almost certainly running different machine code, and different machine code can lead to slightly different, (but valid in either case,) numerical results for IEEE-754 floating point arithmetic, due to various effects such as the associativity consideration already mentioned"

contradicts njuffa’s statement

“There can certainly be differences between numerical results when different CUDA versions are used, for example due to improvements to the math library, should the code use mathematical functions. Certain compiler optimizations that differ between CUDA versions could also cause differences in the results of floating-point computations. If I understand correctly, in this case you are using the same sm_21 Fermi code base for the Kepler-based K600 and the Maxwell-based K620, therefore excluding this scenario.”

— Or is my understanding a little wrong? I mean, since I am building with sm_21 in both cases, will the generated machine code be the same in both cases or different?

How are you building? Precisely? What is the compile command line?

sm_21 machine code is not compatible with either a cc3.0 or a cc5.0 device.

If you want to prove this to yourself, first make sure your code is doing proper cuda error checking, or else be sure to run your code with cuda-memcheck, so that cuda runtime errors will be observable.

Next, compile your code targeting a cc2.1 device only:

nvcc -gencode arch=compute_20,code=sm_21 mycode.cu …

(if you have any other gencode switches than the above, they must be removed for this experiment)

Then try and run this code on a cc3.0 or cc5.0 device. It will fail with “invalid device function”.

This is because cc2.1 machine code is not compatible with a cc3.0 or cc5.0 device.

Therefore, the only way your code is working is that you have embedded PTX, and the embedded PTX is being forward JIT-compiled to the necessary architecture (cc3.0 or cc5.0). The embedded PTX will be placed in your executable by an additional gencode switch on the compile command line, such as:

-gencode arch=compute_20,code=compute_20

The key fact is that there is no possibility for a CUDA 4.0 toolkit to target a cc3.0 or cc5.0 device. Therefore the machine code for these devices must be generated on the fly. And there is no reason to assume that identical code is generated in both cases (cc3.0 vs. cc5.0 JIT target).
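To make that concrete, the CUDA 4.0 build is in effect doing something like the following (the file names are placeholders): producing sm_21 machine code for Fermi plus compute_20 PTX, and it is that embedded PTX which the driver JIT-compiles separately for the K600 (cc3.0) and the K620 (cc5.0).

nvcc -gencode arch=compute_20,code=sm_21 -gencode arch=compute_20,code=compute_20 -o myapp mycode.cu …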

BUT even if identical code were being generated, the execution order characteristics of the two devices will likely be different. This means that threads will likely be executed in a different order, as will threadblocks. So even in this case (identical code) there is a possibility for numerical differences between the two devices.

By the way, I don’t think njuffa and I are contradicting each other. Notice one of the statements made by njuffa:

“Since floating-point arithmetic is not associative, this can cause different results depending on the order in which operations occur, which could well be changed by moving to a different GPU.”

This means that without any discussion of code differences, even identical code could produce different results, even when moving from a cc3.0 device to another cc3.0 device (e.g. moving from a Quadro K600 to a Quadro K2000).

In fact, I think it’s also remotely conceivable that the same code could produce different results on the same device, if there is some external agent that causes code execution to occur in a different order. Some possibilities might be orthogonal graphics activities, other CUDA kernels executing concurrently, or running your code for example with cuda-memcheck.

Generally speaking, the CUDA compiler is very conservative in its handling of floating-point expressions. This means that it will not apply transformations that can change the result. This includes re-association. The only exception to this is FMA merging as described below. There have been a few bugs in the past where value-changing transformations were applied to floating-point computations inadvertently. All cases I recall involved compile-time constant operands, where the compiler’s constant propagation optimization produced different results than would have been produced by execution on actual GPU hardware.

The only caveat is that the compiler applies the contraction of FMUL/FADD to FMA by default. In general, that transformation is beneficial both numerically and performance-wise, but it is not a value-preserving transformation. FMA merging can be turned off with the compiler switch -fmad=false in newer CUDA versions; I don't recall whether that switch was already present in CUDA 4.0. If it is supported, I would suggest giving this a try.
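For instance, in a fragment like the hypothetical one below, the default (-fmad=true) allows the compiler to contract the multiply and the add into a single FMA, while building with -fmad=false keeps them as two separately rounded operations:

__device__ float blend(float a, float b, float c)
{
    return a * b + c;   // candidate for FMA contraction under -fmad=true
}

nvcc -fmad=false mycode.cu …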

For performance reasons, over the years, the CUDA compiler has gotten ever more clever in its application of FMA-merging, which means code compiled with different compiler versions could contain different sequences of FMAs which could lead to different results. Both the front end of the compiler (that translates C++ to PTX) and the backend of the compiler (that translates PTX to SASS) can apply this optimization. It is theoretically possible that the backend will apply different FMA merging for different architectures, but I would not expect that (as all GPUs support FMA the transformation should not be architecture dependent) and I have no evidence that it does.

In summary, differences in numerical results are generally caused by code differences or by non-deterministic execution. Examples of the former would be different library versions being used, compiler flags like -use_fast_math causing different code to be generated, or architecture-dependent code paths inside libraries or in the application code itself. Examples of the latter would be the use of floating-point atomics or the presence of race-conditions in the CUDA code.

The only way I see to get to the root cause of the differences in a particular case is to debug the code. Or you could take the position that the observed differences are small and thus harmless, and let it go.

Here is another checklist item you would want to look into: Does this application use warp-synchronous programming?

It is easy to use warp-synchronous programming techniques incorrectly, assuming guarantees on execution order that just are not supported by the CUDA programming model. I have fielded numerous questions over the years that were prompted by incorrect results being produced due to faulty warp-synchronous programming.
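As a sketch of the kind of code in question (a generic reduction fragment assuming a block size of 256 threads, not taken from any particular application): a warp-synchronous version would omit the synchronization for strides below the warp size, relying on threads of a warp executing in lockstep, which the programming model does not guarantee. The safe version synchronizes after every step:

__global__ void block_sum(const float *in, float *out)
{
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();   // warp-synchronous variants skip this for stride < 32
    }
    if (t == 0)
        out[blockIdx.x] = s[0];
}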

In fact, the most protracted bug in a CUDA-accelerated application I ever had to resolve was due to incorrect warp-synchronous code that broke when being moved to a new CUDA compiler. It was a fairly large third-party application I was not familiar with, where a colleague had given up trying to identify the root cause. Time to find the problem: ten work days, most of them spent learning how to set up and operate the build environment and the app, with about three days of actual debugging. Time to fix: 5 minutes. Ever since then, my rules about warp-synchronous programming are (based on Michael Jackson's rules on optimization):

  1. Don't do it
  2. [Experts only] Don't do it yet

Just to make sure I understand this correctly: on newer architectures (3.5 and up), assuming the --use_fast_math flag is set, the compiler will attempt to convert instances of a*b+c to the statement fmaf(a,b,c)?

Lately I have been going through some code and looking for instances where I see a*b+c and manually doing that conversion myself. It seems that sometimes there is a resulting modest performance boost, but I do not want to stand in the way of the compiler if it may find a better way to merge to FMA.

No, FMA contraction is an optimization orthogonal to the -use_fast_math flag. -use_fast_math does three things:

(1) it turns on flush-to-zero for subnormal single-precision inputs and results, i.e. -ftz=true
(2) it turns on approximate single-precision division, reciprocal, and square root instead of the default IEEE-754 rounded versions, i.e. -prec-sqrt=false -prec-div=false
(3) it replaces certain single-precision math functions by calls to equivalent intrinsics. There is a list in the Programming Guide detailing these substitutions, e.g. sinf() → __sinf()
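In other words, a build with -use_fast_math behaves roughly like spelling out the individual switches yourself (plus the intrinsic substitutions from the list in the Programming Guide); the file name below is a placeholder:

nvcc -use_fast_math mycode.cu …
nvcc -ftz=true -prec-div=false -prec-sqrt=false mycode.cu …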

There are instances where it actually makes sense to manually perform FMA contraction. For example, one may need the useful properties of FMA, such as protection against subtractive cancellation in products, to have functionally correct code. In the absence of a manually coded fma() or fmaf() such code would break if compiled in debug mode where compiler-initiated FMA contraction does not take place, or if the code were to be compiled with -fmad=false. This is the reason the CUDA math library uses a lot of explicitly coded FMAs.
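As an illustration of such a manually coded FMA (one well-known pattern, sketched here; not something your application necessarily needs): computing a*b - c*d in a way that is protected against subtractive cancellation, because the FMA recovers the rounding error of the product exactly.

__device__ float diff_of_products(float a, float b, float c, float d)
{
    float w = c * d;
    float e = fmaf(-c, d, w);   // w - c*d: the rounding error of w, captured exactly
    float f = fmaf(a, b, -w);   // a*b - w, with a single rounding
    return f + e;               // close to a*b - c*d even when a*b and c*d nearly cancel
}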

Likewise there is the occasional tricky numerical situation where use of FMA would negatively affect a desired numerical property. There are a couple of places like that in the CUDA math library, you could search for comments that say “prevent FMA merging” if you are interested in the details. The way to prevent FMA merging from happening is to explicitly code with __fmul_rn() and __fadd_rn() intrinsics (or the corresponding __dmul_rn() and __dadd_rn() intrinsics for double-precision computations) instead of the standard ‘+’, ‘-’, or ‘*’ operators.
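For example (illustrative only), writing the operation with these intrinsics keeps it as a separately rounded multiply and add even where the compiler would otherwise be free to contract:

__device__ float mul_add_no_fma(float a, float b, float c)
{
    return __fadd_rn(__fmul_rn(a, b), c);   // never contracted into an FMA
}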

On some architectures (notably sm_21, as I recall), the manual introduction or removal of FMAs can affect the pipe steering (how instructions are assigned to internal execution resources when there is more than one choice) of floating-point operations enough that a noticeable performance difference results. Since the effect is somewhat unpredictable (e.g. due to lack of information on the pipe steering), usually minor, and often depends on the dynamic execution context, it is not something that I would suggest targeting as a source-level optimization.

In complex expressions, there can sometimes be numerous ways in which FMA could be applied. I would not expect the compiler to exhaustively enumerate all possibilities before making a choice, it is much more likely that it uses a set of heuristics that guide the FMA contraction. So the choice made by the compiler may not be optimal in each and every case. That said, I have found that the CUDA compiler is generally able to “spot” more opportunities for FMA contraction than I can, for example because the merged operations actually originate in two different inlined functions.

Thanks, njuffa and txbob, for the guidance.
After explaining to my client all the information you provided, he is now OK with the small difference in the accuracy values. Also, I will change the CUDA version and try again in the future to utilize the Maxwell architecture properly.