Packed datatype in a union - 'volatile' keyword necessary?

I have a list of pixel coordinates, where each pixel coordinate is stored (to save bandwidth) in one ushort2 variable ‘pos’ (taking 4 bytes). In order to load pixel coordinates in a coalesced way, I am union-ing the variable with a float variable ‘posForLoading’ - see the code below. So inside my CUDA kernels, I use ‘posForLoading’ for loading from global memory, and then read the pixel coordinate from ‘pos’.

union
{
  // pixel position (x, y)
  ushort2 pos;
  // for loading the pixel position in a coalesced way from global mem
  float posForLoading;
};

Now my question: Is it ‘safe’ to use this type as it is now, without putting ‘volatile’ in front of the union? In practice, my code works correctly with CUDA Toolkit 7.0 on both Kepler and Maxwell devices.
On the other hand, in several places on the internet (e.g. https://devtalk.nvidia.com/default/topic/371071/cudamemset-or-cudamemset2d-set-memory-with-float-values/?offset=9) and in the CUDA Handbook, page 256 (for portable casting of a 32-bit float into a 32-bit integer), I noticed that the “volatile” keyword is always added in front of the ‘union’ keyword. If I need “volatile” for correctness, it would also be nice to hear why.

Historically, this is a very contentious area that easily leads to heated discussions in which participants argue about the nuances of various clauses and footnotes in the relevant language standards.

As best I understand from reading the C and C++ standards and following two decades of discussions:

In the original C89, type punning by storing to one member of a union and reading from a differently-typed member of the union led to undefined behavior, and once optimizing compilers started taking advantage of undefined behavior in the mid-1990s, such usage could actually break code. This state of affairs was inherited by C++ and applies there to this day. The canonical, standard-sanctioned way to do a bit-wise transfer between different types (type re-interpretation, e.g. float-as-int) in C++ is therefore to use memcpy(); the compiler can optimize this so that no library function is actually called. In C99, on the other hand, type punning via unions was officially sanctioned as an important, frequently needed construct.

In practical terms, gcc in particular guaranteed early on, as a proprietary extension, that type punning via unions works in C programs, and empirically I have never seen type punning via a volatile union fail with any C compiler I have used (and I have used a lot of different ones). So I routinely code type-punning constructs with volatile unions to this day, as may many other programmers of my vintage.

Note, however, that CUDA is a language in the C++ family, not the C family. To avoid the type-punning issue for at least the most frequent use cases (floating-point types to integer types and vice versa), CUDA has supported re-interpretation device function intrinsics like __int_as_float() from the start.
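For reference, a minimal sketch of what using those intrinsics looks like in device code (the function name is hypothetical; __float_as_int() and __int_as_float() are the documented CUDA intrinsics):

```cuda
// Sketch: bit-wise float<->int re-interpretation in a CUDA kernel
// without any union-based type punning.
__device__ float flip_sign_bit(float f)
{
    int bits = __float_as_int(f);      // 32-bit float -> 32-bit int, bit-wise
    bits ^= 0x80000000;                // manipulate the raw bit pattern
    return __int_as_float(bits);       // 32-bit int -> 32-bit float, bit-wise
}
```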

Are you sure you need the union here? ‘ushort2’ is already a 4-byte type with 4-byte alignment, and can therefore be loaded via a single 4-byte access, so the comment in front of the ‘float’ member seems confusing. Also, why was ‘float’ chosen instead of ‘unsigned int’?

That’s right, when I think about it, it really seems to be unnecessary. I ‘inherited’ the code, and back then in 2008/2009 it was written for CC 1.x and CC 2.x architecture GPUs and older toolkits (Toolkit 3.x or so). I think the developer who wrote it told me it was necessary that way to get good global memory access.

If it’s not too much trouble, I would suggest a quick experiment: rip out the union and use the ushort2 directly. My expectation is that performance should not be affected. If there actually is a negative impact, it seems worthwhile to find out what it is, document it, and then spend some time devising the best workaround. That workaround may well be the union, although in a case like this I would personally always prefer a union with a 32-bit integer (instead of float), following the principle of least surprise.
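The suggested experiment might look something like the sketch below (kernel and parameter names are mine, for illustration only); since ushort2 is a 4-byte, 4-byte-aligned type, the compiler should emit one 32-bit load per thread, coalescing exactly as the float member would:

```cuda
// Sketch: reading the pixel positions directly as ushort2,
// with no union and no 'volatile' needed.
__global__ void readPositions(const ushort2 *positions, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        ushort2 pos = positions[idx];  // single coalesced 32-bit load
        // ... use pos.x and pos.y as the pixel coordinate ...
    }
}
```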

This is a good demonstration of why code needs constant refactoring instead of endless recycling (the oft-mentioned “re-use”) that just propagates yesteryear’s coding artifacts further into the future, even as hardware architectures and software environments evolve. This is certainly not a CUDA-specific issue; in some domains people still propagate code from the 1970s (e.g. originally Fortran, then ported to C, then ported to Python).