Parallel reduction of n×n blocks in an m×m matrix

Hi, I’m new to CUDA programming and wanted to get some ideas on how to approach this algorithm design. I’d like to take an m×m matrix and sum up each n×n block inside it, to end up with (assuming m is divisible by n) an (m/n)×(m/n) matrix of sums. I’ve seen the parallel reduction algorithms that reduce an array to a single sum, but I’m not sure how to handle this particular case efficiently. I also realize that you can think of this as an n×n convolution with stride n and a kernel of all ones, but I’d like to try a direct approach rather than using cuDNN.
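To make the question concrete, here is a minimal sketch of one possible direct approach, not a tuned implementation. It assumes a row-major float matrix, m divisible by n, and one thread block per n×n tile; the names tile_sum_kernel, tile_sums and THREADS are made up for illustration, and error checking is omitted for brevity. Each thread accumulates a strided subset of its tile, then the block combines the partial sums with the usual shared-memory tree reduction.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Assumption: block size is a power of two, which the tree reduction below relies on.
constexpr int THREADS = 256;

// One thread block per n x n tile.
// `in` is a row-major m x m matrix; `out` is a row-major (m/n) x (m/n) matrix.
__global__ void tile_sum_kernel(const float* in, float* out, int m, int n) {
    __shared__ float partial[THREADS];

    const int tiles_per_row = m / n;
    const int tile_row = blockIdx.y;
    const int tile_col = blockIdx.x;

    // Each thread sums a strided subset of the tile's n*n elements.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < n * n; i += blockDim.x) {
        const int r = tile_row * n + i / n;  // global row
        const int c = tile_col * n + i % n;  // global column
        acc += in[static_cast<size_t>(r) * m + c];
    }
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Standard shared-memory tree reduction over the per-thread partial sums.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            partial[threadIdx.x] += partial[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        out[tile_row * tiles_per_row + tile_col] = partial[0];
    }
}

// Hypothetical host wrapper: copies the input to the device, launches one
// block per tile, and returns the (m/n) x (m/n) sums in row-major order.
std::vector<float> tile_sums(const std::vector<float>& h_in, int m, int n) {
    const int t = m / n;  // tiles per side
    std::vector<float> h_out(static_cast<size_t>(t) * t);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, sizeof(float) * h_in.size());
    cudaMalloc(&d_out, sizeof(float) * h_out.size());
    cudaMemcpy(d_in, h_in.data(), sizeof(float) * h_in.size(),
               cudaMemcpyHostToDevice);

    dim3 grid(t, t);  // grid.x indexes the tile column, grid.y the tile row
    tile_sum_kernel<<<grid, THREADS>>>(d_in, d_out, m, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, sizeof(float) * h_out.size(),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

For small n (say n ≤ 16) a whole block per tile leaves most threads idle, so one thread or one warp per tile is probably the better mapping there; the sketch is only meant to show the general shape of the tiled reduction.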

Thanks!