GET STARTED

GET INVOLVED

Authorization Required

Not a member? Register Now

As a concrete example, I profiled the simpleCUBLAS application that multiplies two NxN matrices with the command:

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

When N=200, I see that there are uncoalesced stores: gst_uncoalesced is 6400 (with gst_coalesced=800). This was a surprise, I had expected CUBLAS code to use only coalesced operations. Can this be fixed, and should I be worried about this? (For example, should I try to write my own matrix multiply kernel using shared memory and without uncoalesced accesses?)

I searched the forum but couldn't find an answer. Any pointers appreciated. Thanks.

As a concrete example, I profiled the simpleCUBLAS application that multiplies two NxN matrices with the command:

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

When N=200, I see that there are uncoalesced stores: gst_uncoalesced is 6400 (with gst_coalesced=800). This was a surprise, I had expected CUBLAS code to use only coalesced operations. Can this be fixed, and should I be worried about this? (For example, should I try to write my own matrix multiply kernel using shared memory and without uncoalesced accesses?)

I searched the forum but couldn't find an answer. Any pointers appreciated. Thanks.

The CUBLAS code is written to maximum performance and we're continuing to optimize it.

The CUBLAS code is written to maximum performance and we're continuing to optimize it.

I realize that some operations (such as a matrix transpose) will require uncoalesced accesses in global memory. However, one possibility is to rewrite the operation (such as a transpose) using shared memory, with only coalesced accesses to global memory. In the case of matrix transpose, this is faster as well.

So, to reword my question: How would cublasSgemm compare with a matrix multiply kernel that makes only coalesced accesses? Shouldn't it be possible to speed up cublasSgemm by making only coalesced accesses?

[quote name='Simon Green' date='Sep 9 2008, 04:40 AM']There are cases (for example, a simple matrix transpose) where some uncoalesced reads or writes will always be necessary, so I wouldn't worry about it too much.

[right][snapback]437455[/snapback][/right]

[/quote]

I realize that some operations (such as a matrix transpose) will require uncoalesced accesses in global memory. However, one possibility is to rewrite the operation (such as a transpose) using shared memory, with only coalesced accesses to global memory. In the case of matrix transpose, this is faster as well.

So, to reword my question: How would cublasSgemm compare with a matrix multiply kernel that makes only coalesced accesses? Shouldn't it be possible to speed up cublasSgemm by making only coalesced accesses?

[quote name='Simon Green' date='Sep 9 2008, 04:40 AM']There are cases (for example, a simple matrix transpose) where some uncoalesced reads or writes will always be necessary, so I wouldn't worry about it too much.

The fastest path is for m%64=0, k%16=0, n%16=0 and it is based on the implementation written by Volkov at UC Berkeley.

Good luck in beating his code......

The fastest path is for m%64=0, k%16=0, n%16=0 and it is based on the implementation written by Volkov at UC Berkeley.

Good luck in beating his code......

The fastest path is for m%64=0, k%16=0, n%16=0 and it is based on the implementation written by Volkov at UC Berkeley.

[right][snapback]437609[/snapback][/right]

[/quote]

What is the fastest path for cublasDgemm?

What is the fastest path for cublasDgemm?

[right][snapback]437331[/snapback][/right]

[/quote]

If you want to improve speed for large matrices with size that's not multiple of 64, you have to add padding (zeros) to your matrix in device memory to obtain size multiple of 64, and then call the kernel.

You can gain also more than 2x of performance..

So you do not need to beat Volkov code, just you need to use it..

If you want to improve speed for large matrices with size that's not multiple of 64, you have to add padding (zeros) to your matrix in device memory to obtain size multiple of 64, and then call the kernel.

You can gain also more than 2x of performance..

So you do not need to beat Volkov code, just you need to use it..

Davide Barbieri