cublasZgemmBatched low performance for 2x2 matrices; how to increase performance?

I get low performance of cublasZgemmBatched on a GTX 970 with matrix size 2 x 2.

I am using the following scheme:
I have 13 streams, each of which calls cublasZgemmBatched 70 times with a batch size of 10 000.
This results in 9.1 million matrix multiplications which take 1.5 sec on my GTX 970.

I assume 98 floating-point operations per matrix multiplication, which works out to about 0.6 GFLOPS (9.1 × 10⁶ × 98 / 1.5 s).

How can I increase performance?

Performing 2x2 matrix multiplies would appear to be a memory-bound, rather than a compute-bound, task, even if the matrix elements are complex. My back-of-the-envelope calculation indicates 0.33 FLOP/byte. In addition, the required access pattern for 2x2 matrices is likely to cause sub-optimal memory throughput.
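To sketch the arithmetic behind that estimate: a 2x2 cuDoubleComplex matrix is 4 × 16 = 64 bytes, so reading A and B and writing C moves roughly 192 bytes per product, against something like 56-64 useful FLOPs, which lands in the neighbourhood of 0.3 FLOP/byte (the exact figure depends on how one counts the operations).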

You did not mention the context in which these 2x2 matrix multiplies are being computed. I would suggest researching whether these multiplies can be incorporated into other code running on the GPU, with the goal of minimizing overall data movement. Writing your own local 2x2 matrix multiply function should be easy; you may even consider inlining the function, as it is only about 32 FMAs per matrix multiply.
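For example, such a helper could look roughly like this (the name mul2x2 and the column-major 2x2 layout are assumptions on my part; it only needs <cuComplex.h>):

// Hypothetical helper: C = A * B for 2x2 column-major complex matrices,
// computed entirely in registers (roughly 32 real FMAs per product).
__device__ __forceinline__ void mul2x2(const cuDoubleComplex a[4],
                                       const cuDoubleComplex b[4],
                                       cuDoubleComplex c[4])
{
	// c(i,j) = a(i,0)*b(0,j) + a(i,1)*b(1,j), with m(i,j) stored at m[i + 2*j]
	for (int j = 0; j < 2; ++j)
		for (int i = 0; i < 2; ++i)
			c[i + 2 * j] = cuCadd(cuCmul(a[i],     b[2 * j]),
			                      cuCmul(a[i + 2], b[2 * j + 1]));
}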

I think CuBLAS has poor performance for small matrices and should not be used for them.

I have written a simple kernel which performs 4 times faster. It takes 0.38 sec to calculate 9.1 million matrix multiplications on a GTX 970. There still is room for optimisation. Any suggestions?

#include "matrixmul.h"
#define THREADMULTIPLIER 8

#include <complex>

using namespace std;

__global__ void matrixMul(cuDoubleComplex **a, int lda, cuDoubleComplex **b, int ldb, cuDoubleComplex **c, int ldc){
	cuDoubleComplex *aa = *(a + THREADMULTIPLIER * blockIdx.x + threadIdx.z);
	cuDoubleComplex *bb = *(b + THREADMULTIPLIER * blockIdx.x + threadIdx.z);
	cuDoubleComplex *cc = *(c + THREADMULTIPLIER * blockIdx.x + threadIdx.z);
	cc[threadIdx.x + ldc * threadIdx.y] = cuCadd(cuCmul(aa[threadIdx.x], bb[ldb * threadIdx.y]),
		cuCmul(aa[threadIdx.x + lda], bb[threadIdx.y * ldb + 1]));
};

void batchedMatrixMul(complex<double> **a, int lda, complex<double> **b, int ldb, complex<double> **c, int ldc, int batchSize, cudaStream_t *stream){
	dim3 threads(2, 2, THREADMULTIPLIER);   // 2x2 threads per matrix, THREADMULTIPLIER matrices per block
	matrixMul<<<batchSize / THREADMULTIPLIER, threads, 0, *stream>>>((cuDoubleComplex**)a, lda, (cuDoubleComplex**)b, ldb, (cuDoubleComplex**)c, ldc);
	cudaError_t err = cudaGetLastError();
	if (cudaSuccess != err)
		printf("Error: %s\n", cudaGetErrorString(err));
}

The context of my problem is rather complicated:

  1. I start with two vectors of size N in host memory, which are copied to device memory
  2. These 2 vectors are used to compute X sets of 3 vectors (each of size N)
  3. For each set, these 5 vectors (the original 2 plus the set's 3) are used to compute 2*N 4x4 matrices
  4. Each set computes N 4x4 matrices from 3*N matrices (its own 2*N plus N from another set)
  5. N 4x4 matrices are computed from the N 4x4 matrices of each set using 2x2 matrix multiplications (and inversions)
  6. The final result is calculated from multiple Nx2 matrices (calculated from a set of 5 vectors) and a subset of the N 4x4 matrices
  7. The result is copied back to the host

I have about 7-10 sets; the number can change.

I have gone from 0.38 s to 0.30 s by using cuCfma(x, y, cuCmul(v, w)).
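Applied to the kernel posted above, that change presumably replaces the cuCadd/cuCmul pair with a single fused call, roughly:

	// the accumulation c(x,y) = a(x,0)*b(0,y) + a(x,1)*b(1,y), now with cuCfma
	cc[threadIdx.x + ldc * threadIdx.y] =
		cuCfma(aa[threadIdx.x], bb[ldb * threadIdx.y],
		       cuCmul(aa[threadIdx.x + lda], bb[threadIdx.y * ldb + 1]));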

That is 5 times faster than CuBLAS…

I see you already found cuCfma(), which is what I would have suggested using. I assume you are already using the latest CUDA version and thus latest compiler?

Generally speaking, all software libraries are compromises between diverse customer needs, code size and complexity, development effort, etc. It is practically impossible to provide a GEMM implementation that performs optimally for any combination of GPU architecture, data type, matrix dimensions, and transpose mode. The same applies to other operations in CUBLAS, and any library functions in general. By the same token, it can be quite easy to beat a library function by putting some effort into custom code for one specific case.

That said, you should feel free to file an enhancement request with NVIDIA. To do so, use the bug reporting form linked off the registered developer website and prefix the synopsis with “RFE:” to mark it as an enhancement request rather than a functional bug. It is entirely possible that there have not been many customer requests for faster GEMMs that are extremely small, and development efforts tend to gravitate towards functionality with the most significant customer impact.

I use CUDA 6.5. Even with my custom kernel, I can only achieve 1.6 double-precision GFLOPS (measured with Nsight).
Is this the limit of the GTX 970?
Is this operation memory-bound, as you mentioned before?
Can I achieve higher performance by adjusting my kernel?

I have the following issue-efficiency numbers:

Warp Issue Efficiency
24% one or more eligible
76% no eligible warp

Issue Stall Reasons
60% Memory Dependency
34% Execution Dependency
4% Other

At 1.6 GFLOPS, I don’t see how the code would have reached the memory bandwidth limit yet, even considering that an unfavorable access pattern may reduce effective memory bandwidth to something like 20 GB/sec. What does the CUDA profiler tell you about bottlenecks in the code?

From your description, I have a hard time figuring out what your code actually does. However, all (or most) steps seem to involve very small matrices only. I think you would want to look into an approach where each thread takes one sub-set of the data through the entire processing pipeline (or at least steps 3-6), keeping all intermediate data in registers. In a sense, each thread would work on one slice of the problem, start to finish. You would want to have at least 10,000 threads active to fill the GPU reasonably well. I have used this slice-by-thread approach successfully for things like batched matrix inversions of very small matrices (like 5x5).
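Here is a rough sketch of what I mean, reduced to a single 2x2 product per thread so it stays short (all names are made up, and it assumes <cuComplex.h> plus contiguously stored matrices; in your case each thread would carry steps 3-6 in registers as well):

// One thread owns one matrix (slice) of the batch, start to finish.
__global__ void slicePerThread(const cuDoubleComplex *a,
                               const cuDoubleComplex *b,
                               cuDoubleComplex *c,
                               int batch)
{
	int m = blockIdx.x * blockDim.x + threadIdx.x;   // one matrix per thread
	if (m >= batch) return;

	cuDoubleComplex ra[4], rb[4], rc[4];
	for (int k = 0; k < 4; ++k) {                    // pull both operands into registers
		ra[k] = a[4 * m + k];
		rb[k] = b[4 * m + k];
	}
	for (int j = 0; j < 2; ++j)                      // c(i,j) = a(i,0)*b(0,j) + a(i,1)*b(1,j)
		for (int i = 0; i < 2; ++i)
			rc[i + 2 * j] = cuCfma(ra[i], rb[2 * j],
			                       cuCmul(ra[i + 2], rb[2 * j + 1]));
	for (int k = 0; k < 4; ++k)                      // write the 2x2 result back
		c[4 * m + k] = rc[k];
}

Launched with, say, 256 threads per block and enough blocks to cover the batch, this gives you the 10,000+ concurrent threads mentioned above.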

How are you storing the matrices? My alarm bells go off when I see ‘cuDoubleComplex **a’. Why isn’t this just ‘cuDoubleComplex *a’, meaning the matrices are passed as flattened arrays? Once you have simple pointers, you could try adding ‘const’ and ‘__restrict__’ qualifiers, which may help read performance (see the Best Practices Guide).
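For instance, the flattened, read-only-annotated signature could look like this (just a sketch of the suggestion; matrixMulFlat is a made-up name):

// Flat batch: matrix m occupies elements [4*m, 4*m + 3] of each array.
// 'const' plus '__restrict__' promises the compiler that loads through a and b
// cannot alias the stores through c, which can enable the read-only data cache.
__global__ void matrixMulFlat(const cuDoubleComplex* __restrict__ a,
                              const cuDoubleComplex* __restrict__ b,
                              cuDoubleComplex* __restrict__ c);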

How do I identify bottlenecks in a single kernel?

I am trying to keep my code as high level as possible, by using CuBLAS. Now I see that I need to write kernels for some operations due to the small size.

I believe writing a kernel that can handle my entire pipeline will take a lot of time.

Just for clarification: N will be about 10 000. The entire process (steps 1 to 7) can be run 10 000 - 100 000 times per execution of the program. I have implemented my program with 13 workers (one per SM), which calculate the pipeline repeatedly for small variations of some input parameters (those variations are what distinguishes the 10 000 - 100 000 runs).

The kernel that I wrote and posted here calculates a 2x2 matrix multiplication taken from a 4x4 matrix.
All these matrices are calculated in batch; A, B, and C are each N 4x4 matrices. I implemented the equivalent of the cublasZgemmBatched interface, which needs an array of pointers to matrices (complex<double>**).

I use std::complex<double> in my C++ host functions and cast it to cuDoubleComplex because the two are binary-compatible.

The CUDA profiler is a useful tool to identify the bottlenecks in CUDA code. Have you had a chance to try it?

I am not sure what you mean when you say that you have “13 workers”. You need about 10,000 threads running concurrently to fill the GPU so basic latencies (like memory load latency) can be covered via thread-level parallelism. How many threads are running concurrently for your kernel launch? What are the grid and block dimensions for the launch?

By design, cuDoubleComplex data layout is compatible with the data layout of complex types in C, C++, and Fortran.

The extra level of indirection (array of pointers to matrices) is likely a contributor to the poor performance of the code. Depending on how the matrices are actually arranged in memory, the access pattern may be almost random, giving poor memory throughput. Is there any chance you can store the matrices consecutively and compact, that is, as an array of matrices, which can then be simply re-interpreted as an array of cuDoubleComplex?
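For illustration, one way such an upload could look (a minimal sketch; the function name and the use of std::vector are my assumptions, not existing code):

// Copies N contiguously stored 2x2 complex matrices to the device and returns
// a plain cuDoubleComplex pointer; no array of per-matrix pointers is needed.
cuDoubleComplex* uploadMatrices(const std::vector<std::complex<double>> &h_a)
{
	cuDoubleComplex *d_a = nullptr;
	size_t bytes = h_a.size() * sizeof(cuDoubleComplex);   // layouts are binary-compatible
	cudaMalloc((void**)&d_a, bytes);
	cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
	return d_a;
}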

I was, and still am, storing them in bulk. I only used the array of pointers to mimic the cublasZgemmBatched interface; I have since discarded that idea.

The “13 workers” are 13 streams, each of which performs my basic steps. I use them to hide cases where a single kernel does not fully utilise the GPU.

Let me rephrase my program:

I need to simulate a device for many variations of its parameters. I can easily need to calculate 10 000 - 100 000 variations of the device. In my code I use 13 C++ threads (as many as the GPU has SMs), which all launch kernels in their own stream, in the order of the steps I mentioned, to simulate the device.
These 13 workers each get their portion of the 10 000 - 100 000 variations to calculate.

The kernel that I am optimising now is step 4, which launches 13 * 70 = 910 kernels.
Each kernel launch uses 625 blocks with 64 threads per block → 40 000 threads, covering the 10 000 matrices of the batch with 4 threads per matrix.

// Flat-array version: each thread computes one element
// C(x,y) = A(x,0) * B(0,y) + A(x,1) * B(1,y) of one 2x2 product,
// with blockDim.z matrices handled per block (16 in the launcher below).
__global__ void bulkMatrixMul(const cuDoubleComplex *a, int lda, const cuDoubleComplex *b, int ldb, cuDoubleComplex *c, int ldc){
	int pos = blockDim.z * blockIdx.x + threadIdx.z;   // which matrix of the batch
	int ia = pos * lda * lda + threadIdx.x;            // A(x,0) of this matrix
	int ib = (pos * ldb + threadIdx.y) * ldb;          // B(0,y) of this matrix
	cuDoubleComplex cR;
	// real part: Re(A(x,0))Re(B(0,y)) - Im(A(x,0))Im(B(0,y)) + Re(A(x,1))Re(B(1,y)) - Im(A(x,1))Im(B(1,y))
	cR.x = a[ia].x * b[ib].x;
	cR.x -= a[ia].y * b[ib].y;
	cR.x += a[ia + lda].x * b[ib + 1].x;
	cR.x -= a[ia + lda].y * b[ib + 1].y;

	// imaginary part: Re(A(x,0))Im(B(0,y)) + Im(A(x,0))Re(B(0,y)) + Re(A(x,1))Im(B(1,y)) + Im(A(x,1))Re(B(1,y))
	cR.y = a[ia].x * b[ib].y;
	cR.y += a[ia].y * b[ib].x;
	cR.y += a[ia + lda].x * b[ib + 1].y;
	cR.y += a[ia + lda].y * b[ib + 1].x;

	c[(pos * ldc + threadIdx.y) * ldc + threadIdx.x] = cR;
}

void bulkBatchedMatrixMul(complex<double> *a, int lda, complex<double> *b, int ldb, complex<double> *c, int ldc, int batchSize, cudaStream_t *stream){
	dim3 threads(2, 2, 16);   // 2x2 threads per matrix, 16 matrices per block
	bulkMatrixMul<<<batchSize / 16, threads, 0, *stream>>>((cuDoubleComplex*)a, lda, (cuDoubleComplex*)b, ldb, (cuDoubleComplex*)c, ldc);
	cudaError_t err = cudaGetLastError();
	if (cudaSuccess != err)
		printf("Error: %s\n", cudaGetErrorString(err));
}
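For completeness, a sketch of how I imagine a call site for one batch of step 4 could look (the function name, leading dimensions of 2, and the assumption that d_a, d_b, d_c already hold 10 000 contiguous 2x2 matrices on the device are all mine):

// Hypothetical driver for one batch of step 4.
void runStep4Batch(complex<double> *d_a, complex<double> *d_b, complex<double> *d_c)
{
	cudaStream_t stream;
	cudaStreamCreate(&stream);
	bulkBatchedMatrixMul(d_a, 2, d_b, 2, d_c, 2, 10000, &stream);
	cudaStreamSynchronize(stream);   // wait for the batch before the next pipeline step
	cudaStreamDestroy(stream);
}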

I can calculate 9.1 million GEMMs in 37 ms on a GTX 970 → about 15 double-precision GFLOPS.

I am using Nsight, but I am unable to translate its data into actual code improvements.