I was/am storing them in bulk. I tried to mimic the cublasZgemmBatched interface with an array of pointers, but I have discarded that idea.
The “13 workers” are 13 streams, each of which performs my basic steps. I use them to hide cases where a single kernel does not keep the whole GPU busy.
Let me rephrase my program:
I need to simulate a device for many variations of its parameters. I can easily need to calculate 10 000 - 100 000 variations of the device. In my code I use 13 C++ threads (as many as the number of SMs I have), each of which launches kernels into its own stream, in the order of the steps I mentioned, to simulate the device.
These 13 workers each get their portion of the 10 000 - 100 000 variations to calculate, roughly as in the sketch below.
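A minimal sketch of that layout (runAllVariations, NUM_WORKERS and simulateVariation are hypothetical names for illustration, not my actual code):

#include <thread>
#include <vector>
#include <cuda_runtime.h>

const int NUM_WORKERS = 13; // one worker per SM on my GTX 970

// Hypothetical stand-in for steps 1-5 on one parameter variation.
void simulateVariation(int variation, cudaStream_t stream){
    // launch the per-step kernels into 'stream' here
}

void runAllVariations(int totalVariations){
    std::vector<std::thread> workers;
    for (int w = 0; w < NUM_WORKERS; ++w){
        workers.emplace_back([=]{
            cudaStream_t stream;
            cudaStreamCreate(&stream);
            // Round-robin split: worker w handles variations w, w + 13, w + 26, ...
            for (int v = w; v < totalVariations; v += NUM_WORKERS)
                simulateVariation(v, stream);
            cudaStreamSynchronize(stream);
            cudaStreamDestroy(stream);
        });
    }
    for (auto &t : workers) t.join();
}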
The kernel that I am optimising now is step 4, which launches 13 * 70 = 910 kernels in total.
Each kernel launches 625 blocks with 64 threads per block → 40 000 threads, handling 10 000 of the 2×2 multiplications (16 per block, one thread per output element).
#include <cuComplex.h>

#define THREADMULTIPLIER 16 // matrices per block (threadIdx.z)
#define LD 2                // leading dimension of the 2x2 matrices

__global__ void bulkMatrixMul(const cuDoubleComplex *a, int lda, const cuDoubleComplex *b, int ldb, cuDoubleComplex *c, int ldc){
    // Index of the matrix pair this thread works on.
    int pos = THREADMULTIPLIER * blockIdx.x + threadIdx.z;
    // a[ia] is element (threadIdx.x, 0) of matrix 'pos'; a[ia + LD] is element (threadIdx.x, 1).
    int ia = pos * lda * lda + threadIdx.x;
    // b[ib] is element (0, threadIdx.y) of matrix 'pos'; b[ib + 1] is element (1, threadIdx.y).
    int ib = (pos * ldb + threadIdx.y) * ldb;
    cuDoubleComplex cR;
    // Real part of row threadIdx.x of a dotted with column threadIdx.y of b.
    cR.x = a[ia].x * b[ib].x;
    cR.x -= a[ia].y * b[ib].y;
    cR.x += a[ia + LD].x * b[ib + 1].x;
    cR.x -= a[ia + LD].y * b[ib + 1].y;
    // Imaginary part.
    cR.y = a[ia].x * b[ib].y;
    cR.y += a[ia].y * b[ib].x;
    cR.y += a[ia + LD].x * b[ib + 1].y;
    cR.y += a[ia + LD].y * b[ib + 1].x;
    c[(pos * ldc + threadIdx.y) * ldc + threadIdx.x] = cR;
}
#include <complex>
#include <cstdio>
using std::complex;

void bulkBatchedMatrixMul(complex<double> *a, int lda, complex<double> *b, int ldb, complex<double> *c, int ldc, int batchSize, cudaStream_t *stream){
    // 2 x 2 threads per matrix, THREADMULTIPLIER matrices per block.
    dim3 threads(2, 2, THREADMULTIPLIER);
    bulkMatrixMul<<<batchSize / THREADMULTIPLIER, threads, 0, *stream>>>((cuDoubleComplex*)a, lda, (cuDoubleComplex*)b, ldb, (cuDoubleComplex*)c, ldc);
    if (cudaSuccess != cudaGetLastError())
        printf("Error!\n");
}
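A usage sketch for one worker (the device pointers d_a, d_b, d_c and the batch size are hypothetical):

// One worker multiplying 10 000 2x2 pairs on its own stream.
complex<double> *d_a, *d_b, *d_c; // assumed: device buffers already allocated and filled
cudaStream_t stream;
cudaStreamCreate(&stream);
bulkBatchedMatrixMul(d_a, 2, d_b, 2, d_c, 2, 10000, &stream);
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);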
I can calculate 9.1 million GEMMs in 37 ms on a GTX 970 → about 15 double-precision GFLOPS.
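For reference, the arithmetic behind that figure (counting 6 flops per complex multiply and 2 per complex add, which may differ from your convention): each 2×2 complex GEMM is 8 multiplies + 4 adds = 56 flops, so 9.1e6 × 56 / 0.037 s ≈ 13.8 GFLOPS, consistent with the ~15 GFLOPS estimate above.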
I am using Nsight, but I am unable to connect its profiling data to concrete code improvements.