Pitch linear memory

Hi,

I am unable to understand the following line that I read here
cudaMallocPitch() pads the allocation to get best performance for the memory subsystem of a given piece of hardware.

My question is what is pitch linear memory (though linear memory I know)? and how is the padding going to improve the performance (i.e memory bandwidth or fetching rate)?

Thanks in advance

Pitched linear memory is just a linear memory allocation calculated from the 2D sizes you provide, with padding added as required to ensure row major access will be correctly aligned for coalesced memory access.

1 Like

Thank you very much.

But I understand that in modern GPUs there is no Coalescing requirements… is this padding still required if working on Tesla?

Yes. You still want coalesced reads for optimum performance. It’s just that the hardware on the G200 GPUs makes the penalty for not coalescing much lower (it automatically works out the minimum set of coalesced reads required to satisfy the half-warp’s request). One of the key requirements for coalescing is alignment.

But after pitched memory copy to the device, is the data still in format of 2D array or just 1D linear array? When I want to access the 2D array on device, how can I use 2D threads structure (threadIdx.x, threadIdx.y) to index and process the data?

It’s linear memory with pitch. Meaning, you cannot access it as a 2D array, but rather you have to access it as a 1D array. Here is an example:

__global__ void srcToDest(

float *srcImage, unsigned int nSourcePitch, 

float *dstImage, unsigned int nDestPitch,

unsigned int width, unsigned int height){

int x, y;

		//Get the thread indices

		x = blockIdx.x *  blockDim.x + threadIdx.x;

		y = blockIdx.y *  blockDim.y + threadIdx.y;

		//check the image bounds vs the thread ID:

		//x = width - 1;

		if (x >= width)

			return; 

		// y = height - 1;

		if (y >= height)

			return; 

		dstImage[y * nDestPitch + x] = srcImage[y * nSourcePitch + x];

}

One thing of note, this example kernel has its pitch in terms of elements, not bytes

got you. Thanks:)