CUBLAS and coalesced operations
Access patterns inside CUBLAS code?

Can memory access patterns inside CUBLAS code be optimized?

As a concrete example, I profiled the simpleCUBLAS application that multiplies two NxN matrices with the command:

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

When N=200, I see that there are uncoalesced stores: gst_uncoalesced is 6400 (with gst_coalesced = 800). This was a surprise; I had expected CUBLAS code to use only coalesced operations. Can this be fixed, and should I be worried about it? (For example, should I try to write my own matrix-multiply kernel that uses shared memory and avoids uncoalesced accesses?)

I searched the forum but couldn’t find an answer. Any pointers appreciated. Thanks.

There are cases (for example, a simple matrix transpose) where some uncoalesced reads or writes will always be necessary, so I wouldn’t worry about it too much.

The CUBLAS code is written for maximum performance, and we're continuing to optimize it.

Thanks for the reply!

I realize that some operations (such as a matrix transpose) will require uncoalesced accesses somewhere. However, one possibility is to rewrite such an operation using shared memory so that all accesses to global memory are coalesced. In the case of matrix transpose, this is faster as well.
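For illustration, here is a minimal sketch of that idea, along the lines of the well-known tiled-transpose technique: each block stages a tile in shared memory so that both the global read and the global write are coalesced. The kernel name, the 16x16 tile size, and the row-major layout are my assumptions, not anything taken from CUBLAS.

#define TILE 16

// Transpose a width x height row-major matrix. Launch with
// block = dim3(TILE, TILE) and grid covering the matrix in TILE x TILE tiles.
__global__ void transpose_coalesced(float *out, const float *in,
                                    int width, int height)
{
    // the +1 column of padding avoids shared-memory bank conflicts
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // swap the block indices so the write is also contiguous in memory
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}

The uncoalesced access pattern doesn't disappear; it just moves into shared memory, where it is cheap.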

So, to reword my question: How would cublasSgemm compare with a matrix multiply kernel that makes only coalesced accesses? Shouldn’t it be possible to speed up cublasSgemm by making only coalesced accesses?

cublasSgemm calls a lot of different kernels, depending on the size of the matrices.
The fastest path is taken when m%64=0, k%16=0, and n%16=0; it is based on the implementation written by Volkov at UC Berkeley.
Good luck beating his code…

What is the fastest path for cublasDgemm?

The same: m%64=0, k%16=0, n%16=0.
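Expressed as code, the fast-path condition from both replies is simply the following (hits_fast_path is a hypothetical helper name, not a CUBLAS function):

/* true if (m, n, k) satisfies the fast-path restrictions quoted above */
int hits_fast_path(int m, int n, int k)
{
    return (m % 64 == 0) && (k % 16 == 0) && (n % 16 == 0);
}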

If you want to improve speed for large matrices whose size is not a multiple of 64, you have to pad your matrix with zeros in device memory up to a size that is a multiple of 64, and then call the kernel.

You can gain more than 2x in performance this way…

So you do not need to beat Volkov's code; you just need to use it…
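Here is a minimal sketch of that padding trick, using the same legacy CUBLAS API as the thread. sgemm_padded and Npad are my own names, error checking is omitted, and alpha/beta are fixed at 1 and 0 for brevity:

#include <cublas.h>
#include <cuda_runtime.h>

/* C = A * B for N x N column-major matrices already in device memory,
   computed via Npad x Npad zero-padded copies so the fast path is taken */
void sgemm_padded(const float *d_A, const float *d_B, float *d_C, int N)
{
    int Npad = ((N + 63) / 64) * 64;   /* round N up to a multiple of 64 */
    size_t bytes = (size_t)Npad * Npad * sizeof(float);

    float *d_Ap, *d_Bp, *d_Cp;
    cudaMalloc((void**)&d_Ap, bytes);
    cudaMalloc((void**)&d_Bp, bytes);
    cudaMalloc((void**)&d_Cp, bytes);

    /* zero-fill so the padding rows/columns contribute nothing */
    cudaMemset(d_Ap, 0, bytes);
    cudaMemset(d_Bp, 0, bytes);

    /* copy the N x N payload into the top-left corner of each padded buffer;
       one "row" of the 2D copy is one column of N floats */
    cudaMemcpy2D(d_Ap, Npad * sizeof(float), d_A, N * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyDeviceToDevice);
    cudaMemcpy2D(d_Bp, Npad * sizeof(float), d_B, N * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyDeviceToDevice);

    /* m, n, k and all leading dimensions are now multiples of 64 */
    cublasSgemm('n', 'n', Npad, Npad, Npad, 1.0f,
                d_Ap, Npad, d_Bp, Npad, 0.0f, d_Cp, Npad);

    /* extract the N x N result from the padded output */
    cudaMemcpy2D(d_C, N * sizeof(float), d_Cp, Npad * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyDeviceToDevice);

    cudaFree(d_Ap); cudaFree(d_Bp); cudaFree(d_Cp);
}

The extra allocations and copies are not free, so as the post above says, this pays off mainly for large matrices.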

Keeping these fast-path restrictions in mind gave me a 20% speedup in one case. Thanks for the advice.