double precision atomicAdd() problem

CUDA version: 6.5
GPU: Tesla K40c
Compute capability: >= 3.5

For double-precision atomicAdd() I use this code:

__device__ double atomicAdd(double* address, double val) {
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        // Reinterpret the bits, add val, and try to swap the new value in;
        // retry if another thread updated the location in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);

    return __longlong_as_double(old);
}
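For reference, here is a minimal sketch (not from the original post) of the kind of comparison being described: it sums an array on the device through the custom atomicAdd above and on the host with a serial loop, then prints both totals. The kernel name sumKernel, the array size, and the launch configuration are illustrative choices; the file is assumed to contain the atomicAdd definition above, and error checking is omitted. On compute capability 6.0 and newer this overload would clash with the built-in double atomicAdd and would need to be renamed or guarded.

    #include <cstdio>

    __global__ void sumKernel(const double* in, double* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(out, in[i]);   // resolves to the custom double overload on cc 3.5
    }

    int main() {
        const int n = 1 << 20;
        double* h_in = new double[n];
        for (int i = 0; i < n; ++i) h_in[i] = 1.0 / (i + 1);

        double *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(double));
        cudaMalloc(&d_out, sizeof(double));
        cudaMemcpy(d_in, h_in, n * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemset(d_out, 0, sizeof(double));   // all-zero bits == 0.0

        sumKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

        double gpu = 0.0;
        cudaMemcpy(&gpu, d_out, sizeof(double), cudaMemcpyDeviceToHost);

        double cpu = 0.0;
        for (int i = 0; i < n; ++i) cpu += h_in[i];

        // The totals usually agree to roughly 15 significant digits but are
        // rarely bit-identical, because the device adds the terms in a
        // different order than the sequential host loop.
        printf("gpu: %.17g\ncpu: %.17g\n", gpu, cpu);

        cudaFree(d_in); cudaFree(d_out);
        delete[] h_in;
        return 0;
    }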

However, the result of atomicAdd() differs from the result of the CPU code below the 10th decimal place.
Are these differences inevitable?

They might be “inevitable”.
People who expect exact duplication of floating point results between host and device computations are frequently disappointed.

Floating point calculations may produce different results depending on the actual order of operations. Since parallel code running on the device will execute a given algorithm with possibly a different order of operations than the “same” algorithm running on the host, these differences pop up.
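As an illustration of that point (not part of the original answer), the following host-only snippet shows that floating-point addition is not associative, so changing the summation order can change the result:

    #include <cstdio>

    int main() {
        // The same three doubles, summed in two different orders.
        double a = 1.0e16, b = -1.0e16, c = 1.0;
        printf("%.17g\n", (a + b) + c);   // prints 1: a + b cancels exactly, then c is added
        printf("%.17g\n", a + (b + c));   // prints 0: c is absorbed when added to the much larger b
        return 0;
    }

An atomic-based reduction on the GPU effectively adds the terms in whatever order the threads happen to win the compare-and-swap loop, so its rounding differs from a sequential host loop even though each individual operation is correctly rounded.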

If you google “what every computer scientist should know about floating-point arithmetic” you may get some interesting information.

@Robert_Crovella: I am facing the same problem, but if I print the results of only two nodes, I get the same double values. If I print all the values, I get a very weird order and results; the reason, of course, is that the calculations are based on results from different nodes (their neighbors). The problem I am solving is Louvain community detection. Even after applying atomic operations, I am getting the same weird order.

There is not enough information here for me to be able to make any further comments.