Shared memory configuration

Hi,
In my program I use shared memory to prefetch data. A 2D block of threads with dimensions 8 by 4 (32 threads) gets 8 * 4 * 8 * sizeof(float4) bytes of shared memory. Each thread copies 8 float4s in a loop:

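inline __device__ void pack(const float4 *g_src, float4 *s_dst, const unsigned int w, const unsigned int d) {
	// Global (x, y) position of this thread, and its position inside the block.
	const uint2 indx = { blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y };
	const uint2 sindx = { threadIdx.x, threadIdx.y };

	// Each thread copies d consecutive float4s from global to shared memory.
	// The counter is unsigned to match d and avoid a signed/unsigned comparison.
	for (unsigned int i = 0; i < d; ++i) {
		s_dst[(sindx.y * blockDim.x + sindx.x) * d + i] = g_src[(w * indx.y + indx.x) * d + i];
	}
}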

where ‘w’ is the width of the global memory buffer (counted in float4s) and ‘d’ is the number of float4s each thread copies (8 here).
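
To make the shared-memory indexing concrete, here is a rough host-side sketch (plain C++, no GPU needed) that maps each thread's s_dst index to a bank. It is only a simplified model, not a statement of what a particular GPU does. The names bank_of and distinct_banks are hypothetical helpers, and the model assumes 32 banks of 4 bytes each, sizeof(float4) == 16 (4 words), and looks only at the first 32-bit word of each float4. Real hardware may split 128-bit accesses into several transactions, and older devices use 16 banks per half-warp, so the numbers are indicative at best.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions of this model (not taken from the original post):
// 32 banks, each one 32-bit word wide; a float4 occupies 4 consecutive words.
constexpr unsigned kBanks = 32;
constexpr unsigned kWordsPerFloat4 = 4;

// Bank holding the first word of s_dst[(ty * block_w + tx) * d + i],
// i.e. the element written by thread (tx, ty) in loop iteration i.
unsigned bank_of(unsigned tx, unsigned ty, unsigned block_w,
                 unsigned d, unsigned i) {
    const unsigned elem = (ty * block_w + tx) * d + i;  // float4 index
    return (elem * kWordsPerFloat4) % kBanks;           // word index mod banks
}

// Number of different banks touched by a block_w x block_h block of threads
// in loop iteration i. A result of 32 would mean one bank per thread.
unsigned distinct_banks(unsigned block_w, unsigned block_h,
                        unsigned d, unsigned i) {
    assert(i < d);
    std::uint32_t seen = 0;  // bit b is set once bank b has been touched
    for (unsigned ty = 0; ty < block_h; ++ty) {
        for (unsigned tx = 0; tx < block_w; ++tx) {
            seen |= 1u << bank_of(tx, ty, block_w, d, i);
        }
    }
    unsigned count = 0;
    for (; seen != 0; seen &= seen - 1) {  // clear the lowest set bit
        ++count;
    }
    return count;
}
```

Under these assumptions, distinct_banks(8, 4, 8, 0) gives 1 and distinct_banks(8, 4, 5, 0) gives 8. In both cases every thread writes a different address, so in this simplified model the accesses that share a bank are not the same-address case that broadcasting covers.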

Can this configuration, and the later use of this shared memory, lead to bank conflicts, or will broadcasting be applied? Would the same hold if each thread copied only, say, 5 float4s instead of 8?

MK