Coalesced writes

Hi,

I’m trying to coalesce global memory writes, but it isn’t any faster. What am I doing wrong?

I want to do this using warp shuffles where available. The code block for compute capability < 3.0 shows the unoptimized version; once I get the shuffle version working better than it does now, I’ll write a variant that uses shared memory for those older devices (rough sketch of what I mean further down).

#define SHUFFLE_MIN_VER 300

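// lane position within each group of 4 threads that cooperate via shuffles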
const int thread_id = threadIdx.x & 3;

#if __CUDA_ARCH__ < SHUFFLE_MIN_VER
	for (uint32_t i = 0; i < 4; i++) {
		dag_nodes[node_index].uint4s[i] = dag_node.uint4s[i];
	}
#else
	for (uint32_t t = 0; t < 4; t++) {

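		// node_index held by lane t of this group of 4, broadcast to all 4 lanes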
		uint32_t shuffle_index = __shfl(node_index, t, 4);
		uint4 s[4];
		for (uint32_t w = 0; w < 4; w++) {
			s[w] = make_uint4(__shfl(dag_node.uint4s[w].x, t, 4), __shfl(dag_node.uint4s[w].y, t, 4), __shfl(dag_node.uint4s[w].z, t, 4), __shfl(dag_node.uint4s[w].w, t, 4));
		}
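		// all 4 lanes write consecutive uint4s of lane t's node: one 64-byte coalesced store per group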
		dag_nodes[shuffle_index].uint4s[thread_id] = s[thread_id];
	}
#endif
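For the < 3.0 path, the shared-memory variant I have in mind would be roughly this (untested sketch, reusing the same variables as above; it ignores the early-exit bounds check, which would need handling before the __syncthreads()):

	__shared__ uint4 buf[128 * 4];	// blockDim.x is 128 (see __launch_bounds__ below), 64 B per thread = 8 KB

	// stage this thread's whole 64-byte node in shared memory
	for (uint32_t w = 0; w < 4; w++)
		buf[threadIdx.x * 4 + w] = dag_node.uint4s[w];
	__syncthreads();

	// each group of 4 threads then writes its 4 nodes back-to-back,
	// one consecutive uint4 per thread -> 64-byte coalesced stores
	uint32_t group_base = threadIdx.x & ~3u;	// first thread of my group of 4
	for (uint32_t t = 0; t < 4; t++) {
		uint32_t target_index = start + blockIdx.x * blockDim.x + group_base + t;
		dag_nodes[target_index].uint4s[thread_id] = buf[(group_base + t) * 4 + thread_id];
	}

(8 KB of shared memory per block would probably cut into the 7-blocks-per-SM target from __launch_bounds__, so a smaller staging buffer might be needed.)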

dag_node is a struct of four uint4s; dag_nodes is an array of those structs in global memory. The part where I generate a dag_node is omitted from this example.
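To be concrete, the element type behind dag_nodes is 64 bytes, roughly like this (my shorthand; the real type is a union with more views of the same bytes):

	typedef union {
		uint4 uint4s[4];	// 4 x 16 B = one 64-byte node; other members omitted
	} hash64_t;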

When profiling, I get exactly the same kernel runtime and the same number of L2 transactions per write request (16).

Each L2 transaction is 32 bytes long, so you already have the minimum number of transactions (32 threads × 16-byte uint4 = 512 bytes, 512/32 = 16 transactions). The reason is probably the L1 cache, which exists, in particular, to “join” non-coalesced memory writes.

But in the unoptimized code each write is 32 threads × one uint4 each and not coalesced (write 16 bytes, skip 48, write 16, and so on), while in my “optimized” code the writes are (or should be, if I didn’t make a mistake) aligned 64-byte groups (a whole dag_node) written at once by each set of 4 threads.
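To spell out the pattern I mean for the first group of 4 threads (assuming their node_index values are consecutive, n .. n+3, with byte offsets relative to dag_nodes[n]):

	// unoptimized, iteration i = 0: each thread stores 16 B at a stride of 64 B
	//   t0 -> bytes [0..15], t1 -> [64..79], t2 -> [128..143], t3 -> [192..207]
	// shuffled, iteration t = 0: the group stores node n together
	//   t0 -> bytes [0..15], t1 -> [16..31], t2 -> [32..47], t3 -> [48..63]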

Can you explain in a bit more detail, please? When I do the same thing for reads, it speeds up the code nicely. For reference, here’s the complete kernel. It hammers 1 GB+ of data into GPU RAM, to be used later by a proof-of-work algorithm. The first loop, where I calculate the dag_nodes, uses coalesced reads on a smaller piece of GPU data (~100 MB or so), which gives a ~40% speedup. OK, that part runs 512× as many times (ETHASH_DATASET_PARENTS == 512), but I should at least be able to get some improvement on the writes too, right?

__global__ void
__launch_bounds__(128, 7)
ethash_calculate_dag_item(uint32_t start)
{
 	uint32_t const node_index = start + blockIdx.x * blockDim.x + threadIdx.x;
	if (node_index > d_dag_size * 2) return;
	
	hash200_t dag_node;
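	// seed the node from the smaller buffer (d_light, the ~100 MB piece mentioned above)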
	copy(dag_node.uint4s, d_light[node_index % d_light_size].uint4s, 4);
	dag_node.words[0] ^= node_index;
	SHA3_512(dag_node.uint2s);

	const int thread_id = threadIdx.x & 3;

	for (uint32_t i = 0; i != ETHASH_DATASET_PARENTS; ++i) {	
		uint32_t parent_index = fnv(node_index ^ i, dag_node.words[i % NODE_WORDS]) % d_light_size;	
#if __CUDA_ARCH__ < SHUFFLE_MIN_VER
		for (unsigned w = 0; w != 4; ++w) {
			dag_node.uint4s[w] = fnv4(dag_node.uint4s[w], d_light[parent_index].uint4s[w]);
		}
#else
		for (uint32_t t = 0; t < 4; t++) {
			uint32_t shuffle_index = __shfl(parent_index, t, 4); 
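			// all 4 lanes load consecutive uint4s of lane t's parent: one 64-byte coalesced read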
			uint4 p4 = d_light[shuffle_index].uint4s[thread_id];
				
			for (int w = 0; w < 4; w++) {
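				// shuffle the pieces back so lane t ends up with its own parent's uint4s in order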
				uint4 s4 = make_uint4(__shfl(p4.x, w, 4), __shfl(p4.y, w, 4), __shfl(p4.z, w, 4), __shfl(p4.w, w, 4));
				if (t == thread_id) { 
					dag_node.uint4s[w] = fnv4(dag_node.uint4s[w], s4);
				}
			}
			
		}
#endif		
	}
	SHA3_512(dag_node.uint2s);
	hash64_t * dag_nodes = (hash64_t *)d_dag;

#if __CUDA_ARCH__ < SHUFFLE_MIN_VER
	for (uint32_t i = 0; i < 4; i++) {
		dag_nodes[node_index].uint4s[i] = dag_node.uint4s[i];
	}
#else
	for (uint32_t t = 0; t < 4; t++) {

		uint32_t shuffle_index = __shfl(node_index, t, 4);
		uint4 s[4];
		for (uint32_t w = 0; w < 4; w++) {
			s[w] = make_uint4(__shfl(dag_node.uint4s[w].x, t, 4), __shfl(dag_node.uint4s[w].y, t, 4), __shfl(dag_node.uint4s[w].z, t, 4), __shfl(dag_node.uint4s[w].w, t, 4));
		}
		dag_nodes[shuffle_index].uint4s[thread_id] = s[thread_id];
	}
#endif		 
}