Benoit, I am sorry for wasting your time. I should have thought more carefully. It did not occur to me that they would be device-only. My fault.
Was anybody actually able to solve this problem? I mean, using [font=“Courier”]cudaMemset[/font] (i.e. a runtime API call) with a [font=“Courier”]float[/font] or any data type other than [font=“Courier”]int[/font]? The type casting method known from C’s [font=“Courier”]memset[/font] doesn’t seem to work.
Only [font=“Courier”]0.0f[/font] gives the expected result (which is actually not surprising at all).
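That's because [font=“Courier”]cudaMemset[/font], like C's [font=“Courier”]memset[/font], replicates a single byte value across the region, so the only floats you can produce are those whose four bytes are identical. [font=“Courier”]0.0f[/font] is 0x00000000, hence it works; [font=“Courier”]1.0f[/font] is 0x3F800000, which no repeated byte can form. A quick sketch (buffer name and size are just for illustration):
#include <cuda_runtime.h>

int main(void)
{
    float *d_buf;
    size_t n = 1024;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    /* byte 0x00 replicated: every element reads back as 0.0f */
    cudaMemset(d_buf, 0, n * sizeof(float));
    /* byte 0x3F replicated: every element becomes 0x3F3F3F3F, about 0.747f, not 1.0f */
    cudaMemset(d_buf, 0x3F, n * sizeof(float));
    cudaFree(d_buf);
    return 0;
}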
Not sure if this is still of interest, but I had to battle this myself and finally figured it out, I think.
pyrtsa was right. You need to transfer your float value’s bit pattern into an unsigned int using
unsigned int val = *((unsigned int*)&float_val)
If you pass the float value directly to cuMemsetD32, it won’t compile because the function wants an unsigned int. If you cast the value, as in
float val = 1000;
cuMemsetD32(pointer, (unsigned int)val, num_elements); // note: the last argument counts 32-bit elements, not bytes
the compiler won’t complain, but the cast converts the value rather than reinterpreting its bits: you end up with the bit pattern of 1000 as an unsigned int, which looks very different:
float value:         1.000000e+03
float bits:          01000100 01111010 00000000 00000000
cast value:          1000
cast bits:           00000000 00000000 00000011 11101000
cast as float value: 1.401298e-42
cast as float bits:  00000000 00000000 00000011 11101000
As you can see, the same bit pattern is a very different value depending on whether it’s interpreted as an unsigned int or as a float.
So if you allocated the GPU memory as type float, your device and host functions will interpret that memory as floats; but if you memset it using a value cast to unsigned int, you set each 32-bit location to the unsigned int representation of your value, which is very different from the float representation, as hopefully shown above.
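If you actually want floats to come out, you have to hand [font=“Courier”]cuMemsetD32[/font] the float's bit pattern. A minimal sketch of a helper (the name [font=“Courier”]memset_float[/font] is made up; [font=“Courier”]memcpy[/font] moves the bits without the pointer-cast issue discussed further down):
#include <string.h>
#include <cuda.h>

/* Fill a device buffer of floats with 'val' by passing the float's bit
   pattern to cuMemsetD32. Note the last argument counts 32-bit elements,
   not bytes. */
static CUresult memset_float(CUdeviceptr dst, float val, size_t num_elements)
{
    unsigned int bits;
    memcpy(&bits, &val, sizeof(bits));  /* e.g. 1000.0f -> 0x447A0000 */
    return cuMemsetD32(dst, bits, num_elements);
}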
I was puzzled by this pseudocode for a day or so:
float* gpu_d; cudaMalloc((void**)&gpu_d, bytes);
float val = 1000;
// print the value I’m setting, as float and as cast unsigned int:
OUTPUT: 1000 and 1000
OK
// cast to satisfy the compiler
cuMemsetD32((CUdeviceptr)gpu_d, (unsigned int)val, bytes / sizeof(float));
float* gpu_h = (float*)malloc(bytes);
// download gpu_d to gpu_h, sync everything…
// print values in gpu_h:
OUTPUT: 1.401298e-42
what??
All the confusion about cuMemsetD32 in the forums made me think it wasn’t doing anything at all (1.4e-42 looks a lot like memory trash), or was doing it wrong, but that is not so. I had a similar story with JIT compilation. Either the documentation could be better, or I missed an important part of it.
Also, CUDA runtime 4 may have addressed this; I haven’t checked. I’m currently using driver API version 4010 and trying to stay portable… good times…
Here’s the code I use to print bit patterns, if anyone wants to fool around:
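A sketch along those lines (the [font=“Courier”]print_bits[/font] helper and the exact formatting are illustrative; it deliberately uses the pointer-cast punning that the next reply comments on):
#include <stdio.h>

/* print the 32 bits of a word, most significant first, in groups of 8 */
static void print_bits(unsigned int u)
{
    for (int i = 31; i >= 0; i--) {
        printf("%u", (u >> i) & 1u);
        if (i % 8 == 0 && i != 0) printf(" ");
    }
    printf("\n");
}

int main(void)
{
    float val = 1000.0f;
    unsigned int as_bits = *((unsigned int *)&val); /* re-interpret the bits */
    unsigned int as_cast = (unsigned int)val;       /* convert the value */
    float back = *((float *)&as_cast);              /* read cast value as float */

    printf("float value:         %e\n", val);
    printf("float bits:          "); print_bits(as_bits);
    printf("cast value:          %u\n", as_cast);
    printf("cast bits:           "); print_bits(as_cast);
    printf("cast as float value: %e\n", back);
    printf("cast as float bits:  "); print_bits(as_cast);
    return 0;
}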
Please note that type punning via pointer cast in the following invokes undefined behavior according to the C/C++ standards:
unsigned int val = *((unsigned int*)&float_val)
Sometimes this will happen to work as intended, but many times it will “fail”, in my real-life experience. For device code, CUDA offers specific type re-interpretation functions for this purpose, such as __int_as_float() and __float_as_int(). For C host code I would recommend the use of a volatile union, as shown below. Note that this usage is also not sanctioned by the standards, but it appears to be safe in practice: I have not seen it fail in 20+ years of use across diverse platforms:
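A sketch of the kind of helpers meant here (the function names are illustrative):
static unsigned int float_as_uint(float a)
{
    volatile union { float f; unsigned int i; } cvt;
    cvt.f = a;   /* write as float, read back as unsigned int */
    return cvt.i;
}

static float uint_as_float(unsigned int a)
{
    volatile union { float f; unsigned int i; } cvt;
    cvt.i = a;   /* write as unsigned int, read back as float */
    return cvt.f;
}
With these, the memset from above becomes, for example, [font=“Courier”]cuMemsetD32(dptr, float_as_uint(1000.0f), num_elements)[/font].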