Hey,
I’ve encountered some very strange behaviour since upgrading to CUDA 2.2 (running windows xp, 32bit, 185.85).
Usually, when accessing memory allocated via cuMemAllocPitch as if it were a 2D array - each row is pitch-bytes in length, thus memory must be accessed something along the lines of: T* pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column; (copy/pasted from docos)
And this has worked for me 100% of the time previously, and still works for me when using primitive types (unsigned char, unsigned int, unsigned long long, etc), HOWEVER - since CUDA 2.2, when accessing memory as if it’s not a primitive type (eg: T = uchar4), I have to access memory as if it’s NOT pitch linear, and the pitch is the actual width of the row, not the pitch… (Similarly, 2D textures to pitch linear memory (of all formats) for some reason have a width of their pitch, not the width specified in the texture - but that’s another problem for another day…)
For those who don’t like reading what I said, and prefer code…[codebox]// Note: ‘data’ for both of these functions is the same memory pointer
device void foo(uchar4 *data, unsigned int width_bytes, unsigned int pitch_bytes)
{
uint2 tid = make_uint2(threadIdx.x, threadIdx.y);
// Note: data is allocated as pitch linear memory - but accessing it as such as a uchar4 does not work as intended? why?
// Note: this used to work previously... not sure why it's changed.
uchar4 works = data[(tid.y * (width_bytes / 4)) + tid.x]; // should NOT work (but it does, and gets the correct pixels at [x,y])
uchar4 breaks = data[(tid.y * (pitch_bytes / 4)) + tid.x]; // SHOULD work, getting the intended pixels at [x,y]
}
device void foo(unsigned int *data, unsigned int width_bytes, unsigned int pitch_bytes)
{
uint2 tid = make_uint2(threadIdx.x, threadIdx.y);
// Note: data is allocated as pitch linear memory - accessing it as such does as an unsigned int DOES work as intended… as expected!
unsigned int breaks = data[(tid.y * (width_bytes / 4)) + tid.x]; // should NOT work (and doesn't work, yay)
unsigned int works = data[(tid.y * (pitch_bytes / 4)) + tid.x]; // SHOULD work (and does work, yay again)
}[/codebox]
I’m hoping this is a driver bug, otherwise my understanding of CUDA has flipped upside down since 2.2 :\
(Also note, uchar4 is 4bytes, as is an unsigned int, when compiling with nvcc/msvc/gcc/whatever - so my hard-coded 4’s are valid in this case)
Edit: I’m less convinced this is to do with the fact I’m accessing the data as uchar4 now, as I’ve been able to replicate the same issue with unsigned int now… Somewhere between allocating the memory (cuMemAllocPitch, returning a pitch not equal to the width bytes) - and my kernels, the memory somehow lost it’s pitch, if that’s even possible…