But in the unoptimized code I have 32 threads * uint4 per write not coalesced (write 16 bytes, skip 48, write 16, etc.), while in my “optimized” code they are (or should be if i didn’t make a mistake) aligned in groups of 64 (the whole dag_node) at once (per 4 threads).
Can you explain in a bit more detail please? When I do the same thing for reads, it speeds up the code nicely. For reference, here’s the complete kernel. It hammers down 1GB+ of data into GPU RAM, later to be used for a proof of work algorithm. The first loop where I calculate the dag_nodes, uses coalesced reads on a smaller piece of GPU data (~100MB or so), which speeds up 40%. Ok it’s done 512x as many times (ETHASH_DATASET_PARENTS==512), but I could at least make some improvement, right?
__global__ void
__launch_bounds__(128, 7)
ethash_calculate_dag_item(uint32_t start)
{
uint32_t const node_index = start + blockIdx.x * blockDim.x + threadIdx.x;
if (node_index > d_dag_size * 2) return;
hash200_t dag_node;
copy(dag_node.uint4s, d_light[node_index % d_light_size].uint4s, 4);
dag_node.words[0] ^= node_index;
SHA3_512(dag_node.uint2s);
const int thread_id = threadIdx.x & 3;
for (uint32_t i = 0; i != ETHASH_DATASET_PARENTS; ++i) {
uint32_t parent_index = fnv(node_index ^ i, dag_node.words[i % NODE_WORDS]) % d_light_size;
#if __CUDA_ARCH__ < SHUFFLE_MIN_VER
for (unsigned w = 0; w != 4; ++w) {
dag_node.uint4s[w] = fnv4(dag_node.uint4s[w], d_light[parent_index].uint4s[w]);
}
#else
for (uint32_t t = 0; t < 4; t++) {
uint32_t shuffle_index = __shfl(parent_index, t, 4);
uint4 p4 = d_light[shuffle_index].uint4s[thread_id];
for (int w = 0; w < 4; w++) {
uint4 s4 = make_uint4(__shfl(p4.x, w, 4), __shfl(p4.y, w, 4), __shfl(p4.z, w, 4), __shfl(p4.w, w, 4));
if (t == thread_id) {
dag_node.uint4s[w] = fnv4(dag_node.uint4s[w], s4);
}
}
}
#endif
}
SHA3_512(dag_node.uint2s);
hash64_t * dag_nodes = (hash64_t *)d_dag;
#if __CUDA_ARCH__ < SHUFFLE_MIN_VER
for (uint32_t i = 0; i < 4; t++) {
dag_nodes[node_index].uint4s[i] = dag_node.uint4s[i];
#else
for (uint32_t t = 0; t < 4; t++) {
uint32_t shuffle_index = __shfl(node_index, t, 4);
uint4 s[4];
for (uint32_t w = 0; w < 4; w++) {
s[w] = make_uint4(__shfl(dag_node.uint4s[w].x, t, 4), __shfl(dag_node.uint4s[w].y, t, 4), __shfl(dag_node.uint4s[w].z, t, 4), __shfl(dag_node.uint4s[w].w, t, 4));
}
dag_nodes[shuffle_index].uint4s[thread_id] = s[thread_id];
}
#endif
}