CUBLAS and coalesced operations: Access patterns inside CUBLAS code?
Can memory access patterns inside CUBLAS code be optimized?

As a concrete example, I profiled the simpleCUBLAS application that multiplies two NxN matrices with the command:

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);

When N=200, I see that there are uncoalesced stores: gst_uncoalesced is 6400 (with gst_coalesced=800). This was a surprise; I had expected CUBLAS code to use only coalesced operations. Can this be fixed, and should I be worried about it? (For example, should I try to write my own matrix multiply kernel that uses shared memory and makes no uncoalesced accesses?)
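For reference, the kind of kernel I have in mind is something like the following rough, untested sketch. It assumes row-major storage (unlike CUBLAS, which is column-major) and that N is a multiple of the 16x16 tile size (which N=200 is not, so it would need padding or boundary checks); under those assumptions every global load and store should be coalesced.

#define TILE 16

// Tiled single-precision matrix multiply: C = A * B for square N x N matrices.
// Assumes row-major storage and N a multiple of TILE (no boundary checks).
// Launch with: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE);
//              matmulTiled<<<grid, block>>>(d_A, d_B, d_C, N);
__global__ void matmulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Consecutive threads read consecutive addresses, so both loads are coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    // Consecutive threads write consecutive addresses: coalesced store.
    C[row * N + col] = acc;
}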

I searched the forum but couldn't find an answer. Any pointers appreciated. Thanks.

#1
Posted 09/09/2008 06:14 AM   
There are cases (for example, a simple matrix transpose) where some uncoalesced reads or writes will always be necessary, so I wouldn't worry about it too much.

The CUBLAS code is written for maximum performance, and we're continuing to optimize it.

#2
Posted 09/09/2008 12:40 PM   
Thanks for the reply!

I realize that some operations (such as a matrix transpose) will require uncoalesced accesses in global memory. However, one possibility is to rewrite the operation (such as a transpose) using shared memory, with only coalesced accesses to global memory. In the case of matrix transpose, this is faster as well.
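To illustrate what I mean by rewriting the transpose through shared memory, here is a rough, untested sketch along the lines of the SDK transpose sample; it assumes row-major storage and dimensions that are multiples of TILE_DIM, and the names are mine, just for illustration.

#define TILE_DIM 16

// Transpose of a height x width row-major matrix with only coalesced global
// accesses: a tile is staged in shared memory and re-read transposed.
// The +1 padding avoids shared-memory bank conflicts.
// Launch with: dim3 block(TILE_DIM, TILE_DIM);
//              dim3 grid(width / TILE_DIM, height / TILE_DIM);
__global__ void transposeCoalesced(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read

    __syncthreads();

    // Write the transposed tile; consecutive threads again hit consecutive addresses.
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}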

So, to reword my question: How would cublasSgemm compare with a matrix multiply kernel that makes only coalesced accesses? Shouldn't it be possible to speed up cublasSgemm by making only coalesced accesses?


[quote name='Simon Green' date='Sep 9 2008, 04:40 AM']There are cases (for example, a simple matrix transpose) where some uncoalesced reads or writes will always be necessary, so I wouldn't worry about it too much.

[/quote]

#3
Posted 09/09/2008 05:09 PM   
cublasSgemm calls a lot of different kernels, depending on the size of the matrices.
The fastest path is for m%64=0, k%16=0, n%16=0, and it is based on the implementation written by Volkov at UC Berkeley.
Good luck beating his code...
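Spelled out in code (the helper name below is made up, not a CUBLAS API), the fast path is taken when all three conditions hold:

bool hitsFastSgemmPath(int m, int n, int k)
{
    // All three dimensions must satisfy the alignment restrictions above.
    return (m % 64 == 0) && (k % 16 == 0) && (n % 16 == 0);
}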

#4
Posted 09/09/2008 07:48 PM   
[quote name='mfatica' date='Sep 9 2008, 08:48 PM']cublasSgemm calls a lot of different kernels, depending on the size of the matrices.
The fastest path is for m%64=0, k%16=0, n%16=0 and it is based on the implementation written by Volkov at UC Berkeley.
[/quote]

What is the fastest path for cublasDgemm?

#5
Posted 09/12/2008 06:14 AM   
The same, m%64=0, k%16=0, n%16=0

#6
Posted 09/12/2008 02:09 PM   
[quote name='cublasuser' date='Sep 9 2008, 07:14 AM']Can this be fixed, and should I be worried about this?
[/quote]

If you want to improve speed for large matrices whose size is not a multiple of 64, you can pad your matrix with zeros in device memory so that its size becomes a multiple of 64, and then call the kernel.

You can gain more than 2x in performance.

So you do not need to beat Volkov's code; you just need to use it.
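Something along these lines should work (an untested sketch using the legacy CUBLAS API; the names roundUp and paddedSgemm are just illustrative, and error checking is omitted):

#include <cublas.h>
#include <cuda_runtime.h>

// Round n up to the next multiple of m.
static int roundUp(int n, int m) { return ((n + m - 1) / m) * m; }

// Multiply two N x N column-major matrices with cublasSgemm after zero-padding
// them to Np x Np, where Np is a multiple of 64, so the fast kernel can be used.
void paddedSgemm(const float *h_A, const float *h_B, float *h_C, int N)
{
    int Np = roundUp(N, 64);
    size_t bytes = (size_t)Np * Np * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_B, bytes);
    cudaMalloc((void **)&d_C, bytes);

    // Zero the padded buffers, then copy the N x N data into the top-left corner.
    // For column-major data, each "row" of the 2D copy is one column of N floats.
    cudaMemset(d_A, 0, bytes);
    cudaMemset(d_B, 0, bytes);
    cudaMemcpy2D(d_A, Np * sizeof(float), h_A, N * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyHostToDevice);
    cudaMemcpy2D(d_B, Np * sizeof(float), h_B, N * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyHostToDevice);

    // GEMM on the padded size: the extra rows and columns are zeros, so the
    // N x N result in the top-left corner of d_C is unchanged.
    cublasSgemm('n', 'n', Np, Np, Np, 1.0f, d_A, Np, d_B, Np, 0.0f, d_C, Np);

    // Copy back only the N x N block we care about.
    cudaMemcpy2D(h_C, N * sizeof(float), d_C, Np * sizeof(float),
                 N * sizeof(float), N, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}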
Davide Barbieri

#7
Posted 09/12/2008 03:15 PM   
Keeping these fast-path restrictions in mind gave me a 20% speedup in one case. Thanks for the advice.

#8
Posted 09/21/2008 04:26 PM   