What's the best matrix size for cublasSgemm performance?

I'm working on DNN optimization, and most of the work is matrix multiplication.
I tested square matrices of different sizes with cublasSgemm().
I tested on a GTX 1080 board with CUDA 8.0 and found that performance varies with the matrix size N.
When N < 512, 1000 multiplications of N x N matrices take 0 to 3 ms in total, and the time increases smoothly with N.
But at N = 513 the time jumps to 80 ms. After that, the time steps up to a new level roughly every time N increases by 100.

1. What determines the critical matrix size for cublasSgemm() performance? Device memory or compute unit resources?

2. How should I tune the matrix size for best performance?

3. When using multiple streams, how should I tune the matrix size so that SGEMM calls in different streams run concurrently on one GPU?

Please excuse my poor English.

Thanks.

Here is my test result table:

A typical graph plotting performance vs size for any BLAS library shows a sawtooth pattern, due to various granularity effects in the tiling (blocking) used. In general SGEMM is a compute-bound task.

I am reasonably sure that NVIDIA gives guidance on (S)GEMM performance somewhere in the documentation. From memory, for best performance (as measured in GFLOPS) the matrices should be

(1) square
(2) dimensions multiple of 32
(3) large (but performance plateaus for dimension > 4K x 4K)

There is some performance impact from transposition mode, but typically < 10%. If you have a choice there, worth experimenting. I seem to recall that transpose-B, notranspose-A is often the fastest, but my memory is hazy.