How to prevent the compiler from optimizing operations away?

I was trying to use code in a form like this

// ...

if (laneId == 0) atomicAdd(&s, 1);

// ...

for (; (s & 7) != 0;);

// ...

It actually worked in Debug mode. But in Release mode the atomicAdd was gone, and the kernel got stuck in the loop until the GPU reset.

I’ve tried inline assembly too, but it did no good.

It’s impossible to make recommendations based on two isolated lines of code. If code disappears in release builds, that is usually due to dead code elimination. This is a widely used optimization technique in compilers that eliminates code that does not contribute to externally visible program state.

In the context of CUDA, that typically means the code does not contribute to modifications of global memory that are observable at kernel termination time. The CUDA compiler is highly optimizing, and may find “dead” code even in places where it is not obvious to a human at first glance. I have never encountered a situation where the CUDA compiler incorrectly marked code as “dead”. That does not mean such a bug is impossible, but the answer to your question is most likely to be found in your code, which you haven’t shown.
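As a contrived illustration of the kind of code the compiler is entitled to remove (a made-up sketch, not the code in question):

__global__ void dce_example(int *out)
{
    int x = 0;
    for (int i = 0; i < 1000; ++i)
        x += i;
    // Nothing below stores x to global memory. Since the loop has no
    // externally visible effect, dead code elimination may remove it
    // entirely; adding a store such as *out = x; would keep it alive.
}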

Note that violations of the CUDA programming model, as well as violations of the underlying C++ programming model, can lead to code elimination. As soon as undefined behavior is invoked, anything can happen (back in the day, the C community referred to “nasal demons” as one possible outcome of undefined behavior).

In this particular case, maybe declaring s as volatile can help? Call your own finger demons to fight against their nasal ones!

It is usually not a good idea to paper over problematic code by tossing in a few ‘volatile’ modifiers. It would be better to understand what is going on with the OP’s code and to determine what might be an appropriate way to achieve the desired functionality in CUDA.

To everyone,

I removed ‘volatile’ from the definition of s because atomicAdd wants non-volatile. But I forgot to add ‘volatile’ in the for loop…

Guess I need some sleep X|
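For the record, what I meant to write is roughly this (just a sketch, assuming s is a __shared__ counter; the rest of the kernel is omitted): keep s non-volatile so atomicAdd accepts it, but read it through a volatile pointer inside the wait loop.

__shared__ int s;

// ...

if (laneId == 0) atomicAdd(&s, 1);

// ...

// Reading s through a volatile pointer forces a fresh load on every
// iteration instead of spinning on a value cached in a register.
for (; ((*(volatile int *)&s) & 7) != 0;);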

And about the real thing I wanted to test:

Synchronizing this way seems to be much slower than just using bar.sync. But I haven’t looked into the reason yet.
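(By bar.sync I mean the plain block-wide barrier on the CUDA C++ side:)

// The baseline being compared against: the built-in barrier, which the
// compiler lowers to a bar.sync instruction.
__syncthreads();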

When I wrote several MTF implementations in CUDA, I found that some of them worked faster with volatile, while others were faster with __syncthreads(). So it depends.

Maybe less than 200 calculations are just not enough…

The major performance hit, though, is that the compiler misinterprets my logic and creates some huge memory dependency stall. Guess I will have to try pure PTX now.

Just to set the proper expectations: The part of the CUDA toolchain that transforms PTX into SASS (machine code), PTXAS, is an optimizing compiler.

You can influence the optimization level of PTXAS by passing the nvcc command-line argument -Xptxas -O[0|1|2|3]. The nvcc default is -O3.
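For example, to disable PTXAS optimizations entirely (file names here purely for illustration):

nvcc -Xptxas -O0 -o kernel kernel.cu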

Tried that. But lowering the optimization level only made things worse :(

Out of curiosity, what makes you think it’s just the compiler that’s breaking your code? Debug vs Release might actually be revealing undefined behavior present in your code. Or was this determined using an assembly dump and inspecting to make sure the instructions were actually removed?

I only ask because it seems unlikely that a decently-written compiler like this would remove such critical instructions.
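For example, one could dump the generated machine code from the compiled binary and search for the atomic there (binary name just for illustration):

cuobjdump -sass kernel | grep ATOM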

The ATOMS.ADD instruction was absent. Actually the loop was absent, too… The kernel was stuck elsewhere.