Matrix multiplication + reduction with atomicAdd is slower than CPU code, how can I make it faster?

The program I'm writing currently has a huge bottleneck that takes >99% of the runtime. The bottleneck occurs when I multiply two matrices A and B together and subtract a vector y from each column of the resulting matrix product. Finally, I square these differences and reduce the results down to a vector.

Since most of this consists of basic linear algebra operations, it could normally be done efficiently with cuBLAS. However, as I see it there are two reasons why I can't do that in my situation:

  1. I only want to use specific rows and columns of A and B during the matrix multiplication, not the entire matrices. Which rows and columns to use is indicated by two Boolean vectors.
  2. I would like to be able to run this kernel many times simultaneously using dynamic parallelism, so if I create the full matrix product C = A*B before reducing it I suspect I will run into memory issues very quickly since the matrices are large. Hence I want to reduce the result on the fly and never keep the full C in memory.

In high-level Matlab notation what I’m trying to do is this:

m = 1e6; n = 256; p = 25; % size variables
UseRow = rand(m,1)>0.5;   % boolean vector with rows of A to use 
UseCol = rand(1,n)>0.5;   % boolean vector with columns of A to use (and rows of B) 
A = rand(m,n);            % A matrix
B = rand(n,p);            % B matrix
y = rand(m,1);            % y vector
SSE = sum( (A(UseRow, UseCol)*B(UseCol,:) - y(UseRow)).^2 )

I have managed to create a working version of this which uses atomicAdd to perform the reduction as the matrix multiplication is being calculated. The only problem is that, to my surprise, the CUDA version is 3-4 times slower than a sequential version of the exact same thing on the CPU. Since I'm quite new to CUDA I suspect there are things that could be done a lot more efficiently. This is what my kernel looks like:

__global__ void MatmultRed(double* A, double* y, double* B, double* SSE, int m, int n, int p, bool* UseCol, bool* UseRow) {
	double temp;
	/* Loop all rows of A matrix. */
	for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < m; row += blockDim.x * gridDim.x) {
		/* Only do the matrix mult for rows marked as true. */
		if (UseRow[row] == true) {
			/* Loop columns of B. */
			for (int col = blockIdx.y * blockDim.y + threadIdx.y; col < p; col += blockDim.y * gridDim.y) {
				/* Initialize temporary variable temp to hold values of current row/col accumulation. */
				temp = 0;
				/* Accumulation loop over columns of A / rows of B.   */
				for (int steps = 0; steps < n; steps++) {
					/* Only accumulate if the column in A is marked as true. */
					if (UseCol[steps] == true) {
						temp += A[row + steps*m] * B[steps + col*n];
					}
				}
				/* Reduce squared error between temp and the current row of y into SSE. */
				atomicAdd(SSE + col, ((temp - y[row])*(temp - y[row])));
			}
			__syncthreads();
		}
	}
}

Any advice on how I can make this go faster would be greatly appreciated. To me this seems like an extremely parallel problem that should benefit a lot from being done on a GPU.

Do you have specific matrix sizes that you work with?

Because then loop unrolling for these specific sizes might work wonders.

No, the sizes can vary quite substantially. But in general, the number of rows of A (m) is always much larger than anything else: hundreds of thousands to millions. The number of columns of A (n) is typically around 50-300, and the number of columns of B (p) is around 10-50.

This problem doesn't seem all that different from a vector/matrix reduction to me (it just has a few extra steps), and after some googling it appears that vector/matrix summation can be done much faster without atomicAdd. But I still fail to see how I could change my code to work without atomicAdd. Are there any experts on this topic who could provide hints / clarify whether this can be done in a way other than what I have now?

“Doctor, it hurts if I do this…”

The general theme to avoid atomic operations is to have (at most) one thread per output, i.e. per element of SSE in your Matlab code.
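
For example, something along these lines (an untested sketch, assuming column-major data and 256 threads per block; the kernel name is mine) gives each element of SSE exactly one writer, so no atomics are needed:

__global__ void SSE_per_col(const double* __restrict__ A, const double* __restrict__ y,
                            const double* __restrict__ B, double* __restrict__ SSE,
                            int m, int n, int p,
                            const bool* __restrict__ UseCol, const bool* __restrict__ UseRow)
{
    int col = blockIdx.x;                  /* this block owns SSE[col]; launch with p blocks */
    __shared__ double s[256];              /* one partial sum per thread */
    double partial = 0.0;
    for (int row = threadIdx.x; row < m; row += blockDim.x) {   /* block-stride loop over rows */
        if (UseRow[row]) {
            double temp = 0.0;
            for (int k = 0; k < n; k++)
                if (UseCol[k]) temp += A[row + k*m] * B[k + col*n];
            double e = temp - y[row];
            partial += e * e;
        }
    }
    s[threadIdx.x] = partial;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {  /* tree reduction in shared memory */
        if (threadIdx.x < stride) s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) SSE[col] = s[0]; /* single writer per output element, no atomics */
}
// launch: SSE_per_col<<<p, 256>>>(A, y, B, SSE, m, n, p, UseCol, UseRow);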

Apart from getting rid of the atomic operations:
Depending on the level of sparsity it may also be more efficient to store vectors of the indices of participating rows and columns, instead of a bitmap.
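
E.g. something like this on the host side (just a sketch, the function name is mine):

#include <vector>
// Build a compact list of the indices where the mask is true, so the kernel can
// loop over only the participating rows/columns instead of testing a predicate
// for every element.
std::vector<int> activeIndices(const bool* mask, int len) {
    std::vector<int> idx;
    for (int i = 0; i < len; i++)
        if (mask[i]) idx.push_back(i);
    return idx;
}
// usage: auto rowIdx = activeIndices(UseRow, m);  auto colIdx = activeIndices(UseCol, n);
// then copy rowIdx.data() / colIdx.data() to the device.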

I assume this:

SSE = sum( (X(UseRow, UseCol)*B(UseCol,:) - y(UseRow)).^2 )

is supposed to be this:

SSE = sum( (A(UseRow, UseCol)*B(UseCol,:) - y(UseRow)).^2 )

I guess your data is column major? (would be if it is coming from matlab, I think)

Also, I imagine you are running on a fairly recent GPU (cc 6.0 or higher) that supports double atomicAdd natively?

Oops, yes, X does not exist, so that should be A just like you said. And I'm using a 1080 Ti with cc 6.1.

Is your data column major?

Also shouldn’t this:

atomicAdd(SSE + col, ((temp - y[row])*(temp - y[row])));
                ^^^

be this:

atomicAdd(SSE + row, ((temp - y[row])*(temp - y[row])));
                ^^^

Your matrix product can have up to as many rows as A and up to as many columns as B. Then you are taking a column of that product, and doing an element-wise difference with y. You are then doing an element-wise squaring, and adding that to the result SSE. Shouldn’t SSE be a column vector with as many rows as A? (just like y is a column vector with as many rows as A?) So shouldn’t you be indexing into the SSE vector using the row variable, just like you are indexing into y using the row variable?

Yes the data is column major.

But no, I don’t think it should be ‘SSE + row’ inside the atomicAdd. The final SSE variable I want is of dimension 1-by-p, so it grows with the columns of the B matrix.

In essence I’m trying to multiply the A and B matrices together (which normally would form an m-by-p matrix C) and then subtract the y vector from each column vector of the matrix product to form a matrix of errors. The elements of the error matrix are then squared and summed along the rows to form the 1-by-p Sum-Squared-Error (SSE) row vector. However, because m is much larger than all other involved dimensions I’m trying to reduce C along the rows with atomicAdd as it is being calculated down to a 1-by-p vector directly, so that no variable is ever m-by-p.

Got it, thanks, my mistake.

Some comments:

  1. I’m guessing that you are doing some kind of overall timing, i.e. including data transfer time to/from the GPU. My guess is you are working directly from matlab, and this stuff is accomplished via mex/mexfunction/ptx interface. All of that is fine, however nothing I’m about to discuss will affect any of the data transfer times to/from the GPU. If this is the only operation you are offloading to the GPU, then of course that is a valid goodness metric. But if you can move more workload to the GPU, then it may help, effectively “amortizing” data transfer costs over more work.

  2. Your GPU isn't the best choice for double-precision throughput. GeForce GPUs generally have much lower throughput for double-precision calculations than for single-precision floating-point calculations; your 1080 Ti has 32x lower peak double-precision throughput compared to single precision. I don't know if your algorithm can tolerate computations in float instead of double, but if so, you may be able to get a significant speedup. Alternatively, you could find a faster GPU, e.g. one of the various Tesla GPUs that have much higher DP throughput.

  3. With some fairly simple mods, according to my testing on a GTX 960, your code can run ~2x faster. First, change the kernel prototype to use const and restrict decorations:

__global__ void MatmultRed(const double * __restrict__  A, const double * __restrict__  y, const double * __restrict__  B, double * __restrict__  SSE, const int m, const int n, const int p, const bool * __restrict__  UseCol, const bool * __restrict__ UseRow) {
...

Second, get rid of the unnecessary syncthreads():

}
		//	__syncthreads();  comment this line out
		}

Those two changes resulted in a ~2x speedup on a GeForce GPU when I tested it. Again, the 2x I'm talking about here is just comparing your kernel execution time to mine, not the data transfer times. Also, you may see something quite different on your 1080 Ti, perhaps no benefit at all.

  4. There may be another algorithm realization that can work faster than what you have. Basically, the idea is to break down your math into individual steps, to fully expose the matrix multiply in a fashion where it can be directly fed into cublas, to take advantage of a high-quality matrix-multiply operation. To do this, we have to write helper kernels that prepare the data for multiplication by cublas, and then perform the remaining multiply and reduce steps. I acknowledge that you stated you did not want to keep C in memory, and this approach does exactly that, so it may not fit your needs. However, the size of C is less than the size of A (maybe about 1/5 the size of A?), so the cost here is less than 2x on memory allocation. Given the size of this operation, I think you're unlikely to get any performance benefit trying to run multiple instances via dynamic parallelism or any other method.

Here's a fully worked example of my test code, built around what you have shown. It doesn't use matlab, and it's entirely possible I may have made mistakes – I'm doing minimal results validation, so use it at your own risk. On this particular code, comparing the situation where we are using all rows and all columns, the cublas method is about 5x faster on a Tesla P100. The MY_R factor can be used to vary the density of true/false elements in the UseRow and UseCol vectors.

$ cat t272.cu
#include <stdio.h>
#include <cublas_v2.h>

const int my_m = 100000;
const int my_n = 256;
const int my_p = 25;

#ifdef USE_FLOAT
typedef float mytype;
#else
typedef double mytype;
#endif

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

#if defined(__CUDA_ARCH__) &&  __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
 unsigned long long int* address_as_ull = (unsigned long long int*)address;
 unsigned long long int old = *address_as_ull, assumed;
 do {
      assumed = old;
      old = atomicCAS(address_as_ull, assumed, __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
 return __longlong_as_double(old);
}
#endif

__global__ void MatmultRed(double* A, double* y, double* B, double* SSE, int m, int n, int p, bool* UseCol, bool* UseRow) {
        double temp;
        /* Loop all rows of A matrix. */
        for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < m; row += blockDim.x * gridDim.x) {
                /* Only do the matrix mult for rows marked as true. */
                if (UseRow[row] == true) {
                        /* Loop columns of B. */
                        for (int col = blockIdx.y * blockDim.y + threadIdx.y; col < p; col += blockDim.y * gridDim.y) {
                                /* Initialize temporary variable temp to hold values of current row/col accumulation. */
                                temp = 0;
                                /* Accumulation loop over columns of A / rows of B.   */
                                for (int steps = 0; steps < n; steps++) {
                                        /* Only accumulate if the column in A is marked as true. */
                                        if (UseCol[steps] == true) {
                                                temp += A[row + steps*m] * B[steps + col*n];
                                        }
                                }
                                /* Reduce squared error between temp and the current row of y into SSE. */
                                atomicAdd(SSE + col, ((temp - y[row])*(temp - y[row])));
                        }
                        __syncthreads();
                }
        }
}
#ifdef USE_CONSTANT_B
__constant__ mytype c_B[my_n*my_p] = {0};
#endif
template <typename T>
__global__ void MatmultRed_i(const T * __restrict__  A, const T * __restrict__  y, const T * __restrict__  B, T * __restrict__  SSE, const int m, const int n, const int p, const bool * __restrict__  UseCol, const bool * __restrict__ UseRow) {
        T temp;
        /* Loop all rows of A matrix. */
        for (int row = blockIdx.x * blockDim.x + threadIdx.x; row < m; row += blockDim.x * gridDim.x) {
                /* Only do the matrix mult for rows marked as true. */
                if (UseRow[row] == true) {
                        /* Loop columns of B. */
                        for (int col = blockIdx.y * blockDim.y + threadIdx.y; col < p; col += blockDim.y * gridDim.y) {
                                /* Initialize temporary variable temp to hold values of current row/col accumulation. */
                                temp = 0;
                                /* Accumulation loop over columns of A / rows of B.   */
                                for (int steps = 0; steps < n; steps++) {
                                        /* Only accumulate if the column in A is marked as true. */
                                        if (UseCol[steps] == true) {
#ifndef USE_CONSTANT_B
                                                temp += A[row + steps*m] * B[steps + col*n];
#else
                                                temp += A[row + steps*m] * c_B[steps + col*n];
#endif
                                        }
                                }
                                /* Reduce squared error between temp and the current row of y into SSE. */
                                atomicAdd(SSE + col, ((temp - y[row])*(temp - y[row])));
                        }
//                      __syncthreads();
                }
        }
}

template <typename T>
__global__ void zero_kernel(T * __restrict__ A, const int m, const int n, const bool * __restrict__ UseCol){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < m)
    for (int i = 0; i < n; i++)
      if (!(UseCol[i])) A[i*m+idx] = 0;
}
template <typename T>
__global__ void red_kernel(const T * __restrict__ AB, const T * __restrict__ y, T * __restrict__ SSE, const int m, const int p, const bool * __restrict__ UseRow){

  int idx = threadIdx.x;
  __shared__ T s[1024];
  s[threadIdx.x] = 0;
  while (idx < m) { // block-stride loop
    if (UseRow[idx]) s[threadIdx.x] += (AB[idx + m*blockIdx.x] - y[idx])*(AB[idx + m*blockIdx.x] - y[idx]);
    idx += blockDim.x;}
  __syncthreads();
  for (int i = 512; i > 0; i>>=1){
    if (threadIdx.x < i) s[threadIdx.x] += s[threadIdx.x+i];
    __syncthreads();}
  if (!threadIdx.x) SSE[blockIdx.x] = s[0];
}

template <typename T>
void mmr(cublasHandle_t h, T * __restrict__  A, const T * __restrict__  y, const T * __restrict__  B, T * __restrict__  SSE, T * __restrict__ AB, const int m, const int n, const int p, const bool * __restrict__  UseCol, const bool * __restrict__ UseRow) {

  zero_kernel<<<(m+255)/256, 256>>>(A, m, n, UseCol);
  T alpha = 1.0;
  T beta  = 0.0; // could also prefill AB with y, set beta to -1.0  so as to compute AB - y
#ifdef USE_FLOAT
  cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, p, n, &alpha, A, m, B, n, &beta, AB, m);
#else
  cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, p, n, &alpha, A, m, B, n, &beta, AB, m);
#endif
  red_kernel<<<p, 1024>>>(AB, y, SSE, m, p, UseRow);
}

int main(){

  mytype *d_A, *d_y, *d_B, *d_SSE, *h_A, *h_y, *h_B, *h_SSE, *d_AB, *h_SSE1;
  bool *d_UseCol, *d_UseRow, *h_UseCol, *h_UseRow;

  cudaMalloc(&d_A,   my_m*my_n*sizeof(mytype));
  cudaMalloc(&d_AB,  my_m*my_p*sizeof(mytype));
  cudaMalloc(&d_y,   my_m     *sizeof(mytype));
  cudaMalloc(&d_B,   my_n*my_p*sizeof(mytype));
  cudaMalloc(&d_SSE,      my_p*sizeof(mytype));
  cudaMalloc(&d_UseCol, my_n*sizeof(bool));
  cudaMalloc(&d_UseRow, my_m*sizeof(bool));
  cudaMemset(d_A, 0,  my_m*my_n*sizeof(mytype));
  cudaMemset(d_y, 0,  my_m     *sizeof(mytype));
  cudaMemset(d_B, 0,  my_n*my_p*sizeof(mytype));
  cudaMemset(d_AB, 0,  my_m*my_p*sizeof(mytype));
  cudaMemset(d_SSE,0,      my_p*sizeof(mytype));

  h_A   = new mytype[my_m*my_n];
  h_B   = new mytype[my_p*my_n];
  h_y   = new mytype[my_m];
  h_SSE = new mytype[my_p];
  h_SSE1 = new mytype[my_p];
  h_UseCol = new bool[my_n];
  h_UseRow = new bool[my_m];

#ifndef MY_R
#define MY_R 1
#endif

  for (int i = 0; i < my_n; i++) h_UseCol[i] = (i%MY_R)?false:true;
  for (int i = 0; i < my_m; i++) h_UseRow[i] = (i%MY_R)?false:true;
  for (int i = 0; i < my_m*my_n;i++) h_A[i] = 1.0;
  for (int i = 0; i < my_m;i++)      h_y[i] = 1.0;
  for (int i = 0; i < my_n*my_p;i++) h_B[i] = 1.0;

  cudaMemcpy(d_UseCol, h_UseCol, my_n*sizeof(bool), cudaMemcpyHostToDevice);
  cudaMemcpy(d_UseRow, h_UseRow, my_m*sizeof(bool), cudaMemcpyHostToDevice);
#ifndef NO_INIT
  cudaMemcpy(d_y, h_y, my_m*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, my_p*my_n*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_A, h_A, my_m*my_n*sizeof(mytype), cudaMemcpyHostToDevice);
#endif
#ifdef USE_CONSTANT_B
  cudaMemcpyToSymbol(c_B, h_B, my_p*my_n*sizeof(mytype));
#endif
  dim3 threads(32, 8);
  dim3 blocks((my_m+threads.x-1)/threads.x, (my_p+threads.y-1)/threads.y);
  long long dt = dtime_usec(0);
#ifndef TEST
  MatmultRed<<<blocks,threads>>>(d_A, d_y, d_B, d_SSE, my_m, my_n, my_p, d_UseCol, d_UseRow);
#else
  MatmultRed_i<<<blocks,threads>>>(d_A, d_y, d_B, d_SSE, my_m, my_n, my_p, d_UseCol, d_UseRow);
#endif
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("elapsed time: %f\n", dt/(float)USECPSEC);
  cudaMemcpy(h_SSE, d_SSE, my_p*sizeof(mytype), cudaMemcpyDeviceToHost);
  cublasHandle_t hdl;
  cublasCreate(&hdl);
  cudaDeviceSynchronize();
  dt = dtime_usec(0);
  mmr(hdl, d_A, d_y, d_B, d_SSE, d_AB, my_m, my_n, my_p, d_UseCol, d_UseRow);
  cudaDeviceSynchronize();
  dt = dtime_usec(dt);
  printf("elapsed time: %f\n", dt/(float)USECPSEC);
  cudaMemcpy(h_SSE1, d_SSE, my_p*sizeof(mytype), cudaMemcpyDeviceToHost);
  cudaCheckErrors("some error");
  for (int i = 0; i < my_p; i++)
    if (fabs(h_SSE[i] - h_SSE1[i]) > TOL) {printf("mismatch at %d, was %f, should be %f\n", i, h_SSE1[i], h_SSE[i]); return -1;}
  return 0;
}
$ nvcc -arch=sm_60 -o t272 t272.cu -lineinfo -DMY_R=1 -DTOL=0.1 -lcublas
$ ./t272
elapsed time: 0.007794
elapsed time: 0.001095
$ nvcc -arch=sm_60 -o t272 t272.cu -lineinfo -DTEST -DMY_R=1 -DTOL=0.1 -lcublas
$ ./t272
elapsed time: 0.007289
elapsed time: 0.001088
$ nvcc -arch=sm_60 -o t272 t272.cu -lineinfo -DTEST -DUSE_CONSTANT_B -DMY_R=1 -DTOL=0.1 -lcublas
$ ./t272
elapsed time: 0.005814
elapsed time: 0.001076
$ nvcc -arch=sm_60 -o t272 t272.cu -lineinfo -DTEST -DUSE_CONSTANT_B -DMY_R=2 -DTOL=0.1 -lcublas
$ ./t272
elapsed time: 0.003573
elapsed time: 0.001203
$ nvcc -arch=sm_60 -o t272 t272.cu -lineinfo -DTEST -DUSE_CONSTANT_B -DMY_R=4 -DTOL=0.1 -lcublas
$ ./t272
elapsed time: 0.002607
elapsed time: 0.001298
$

Notes:

  1. I've included an extra optimization here. For some sizes of your B matrix, it can be put into constant memory, and this might give some benefit. The benefit was not that significant when I tested on a GTX 960, but on a Tesla P100 it made a more significant difference. Your size ranges for B won't all fit in constant memory, but some of them will. However, on Tesla P100, I don't see the 2x speedup between your code and the trivial mods I mentioned in item 3 above.

  2. The first test case above is your code with UseRow and UseCol all set to true. 7.8ms for your kernel. The cublas test is ~7x faster on P100.

  3. The second test case is just the prototype decoration and elimination of unnecessary syncthreads. Not much benefit on P100.

  4. The third test case adds in the use of constant memory for B. The cublas method is about 5x faster.

  5. The fourth test case above reduces UseRow and UseCol to 50% population of true. cublas is about 3x faster.

  6. The fifth test case above reduces UseRow and UseCol to 25% population of true. cublas is about 2x faster.

  7. I haven’t included the test here, but when I run the above test case on a GTX 960, comparing your original code to this cublas method, I get something like a 25x speedup, even though we are running cublasDgemm on a GPU that has low-rate DP throughput. However I’m not sure how it will play out on your GTX 1080Ti, but you may want to give it a test comparison. Just run the first test case I have listed here (note 2 above) on your setup, and see what you get.

  8. I don’t actually think the use of atomics is a really big problem for your code. You can get a sense of this by replacing the atomic operation with an ordinary add. Yes, the results are not correct, but the timing changes very little, according to my testing.

  9. I’ve also included a version that uses float instead of double. Be sure to define USE_FLOAT, TEST, and you will need to adjust TOL to a much higher value, if you want results “matching”. I’m guessing this would still show the cublas version run about 5x faster than your code.

Thank you for the code and the restrict suggestion. Since I only write CUDA functions that are callable from Matlab, much of what you wrote looked a bit foreign to me, but I have now migrated your cublas-based code (which was the fastest option in your tests?) into a mex function and timed it on my computer as well. However, as I understand it, you destroy the A matrix by overwriting columns of it with zeros before giving it to cublas? So in other words, I can't use this to calculate C for several different UseCol vectors, because A is ruined after the first one? Then I would have to copy a fresh A from host to device on every iteration.

My goal is to be able to run the kernel → change UseCol → run it again, etc., many, many times.

And when I time your destructive cublas function vs. my non-destructive manual matrix multiplication (now a bit updated from the original code) with UseRow set to 20% true, the manual kernel is about 2x faster than cublas (6 ms vs. 12 ms). This seems to contradict your results; am I doing something wrong in my version (code attached below)? When evaluating 1000 different UseCol vectors (which is not supported in the cublas code, so the resulting numbers are incorrect), the manual code is about 10x faster than cublas (883 ms vs. 8450 ms).

What I’m doing is essentially just matrix multiplication, with some rows and columns skipped, so I find it a bit unlikely that my naïve triple nested loop solution should be the best way to go. There has to be a faster way to do this.

#include "mex.h"
#include "cuda_runtime.h" 
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <cublas_v2.h>

/* Helper kernel 1 for cublas-based matmult reduction */
__global__ void zero_kernel(double * __restrict__ A, const int m, const int n, const bool * __restrict__ UseCol) {
	int idx = threadIdx.x + blockDim.x*blockIdx.x;
	if (idx < m){
		for (int i = 0; i < n; i++){
			if (!(UseCol[i])){
				A[i*m + idx] = 0;
			}
		}
	}
}
/* Helper kernel 2 for cublas-based matmult reduction */
__global__ void red_kernel(const double * __restrict__ AB, const double * __restrict__ y, double * __restrict__ SSE, const int m, const int p, const bool * __restrict__ UseRow) {

	int idx = threadIdx.x;
	__shared__ double s[1024];
	s[threadIdx.x] = 0;
	while (idx < m) { // block-stride loop
		if (UseRow[idx]) { 
			s[threadIdx.x] += (AB[idx + m*blockIdx.x] - y[idx])*(AB[idx + m*blockIdx.x] - y[idx]); 
		}
		idx += blockDim.x;
	}
	__syncthreads();
	for (int i = 512; i > 0; i >>= 1) {
		if (threadIdx.x < i) {
			s[threadIdx.x] += s[threadIdx.x + i];
		}
		__syncthreads();
	}
	if (!threadIdx.x) {
		SSE[blockIdx.x] = s[0];
	}
}

/* Manual matmult + reduction kernel without cublas */
__global__ void ManualMatMultRedKernel(const double *__restrict__ A, const double *__restrict__ B, const double * __restrict__ y, double *SSE, const size_t m, const size_t n, const size_t p, const size_t k, const uint64_T *UseColindex, const uint64_T *smalln, const uint64_T *UseRowIndex, const size_t smallm) {
	double e;
	/* Loop over pages in B (k) */
	for (int v = blockIdx.z * blockDim.z + threadIdx.z; v < k; v += blockDim.z * gridDim.z) {
		/* Loop rows of A (smallm) */
		for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < smallm; r += blockDim.x * gridDim.x) {
			/* Loop columns of B (p) */
			for (int c = blockIdx.y * blockDim.y + threadIdx.y; c < p; c += blockDim.y * gridDim.y) {
				e = 0;
				/* Inner loop over rows of B (smalln[v]) */
				for (int steps = 0; steps < smalln[v]; steps++) {
					e += A[UseRowIndex[r] + m*UseColindex[v + k*steps]] * B[UseColindex[v + k*steps] + n*c + n*p*v];
				}
				e -= y[UseRowIndex[r]];       // Error
				e *= e;						  // Squared error
				atomicAdd(SSE + v + k*c, e);  // Sum of squared errors
			} 
		} 
	}
} 

void mmr(cublasHandle_t h, double * __restrict__  A, const double * __restrict__  y, const double * __restrict__  B, double * __restrict__  SSE, double * __restrict__ AB, const int m, const int n, const int p, const bool * __restrict__  UseCol, const bool * __restrict__ UseRow) {

	zero_kernel<<<(m + 255) / 256, 256 >>>(A, m, n, UseCol);
	
	double alpha = 1.0;
	double beta = 0.0; // could also prefill AB with y, set beta to -1.0  so as to compute AB - y

	cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, p, n, &alpha, A, m, B, n, &beta, AB, m);

	red_kernel<<<p, 1024 >>>(AB, y, SSE, m, p, UseRow);
}

/* ------------------------------ Start of Matlab gateway function ----------------------------- */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
	
	/* Get inputs from Matlab. */
	const double *A, *B, *y;
	const bool *UseCol, *UseRow;
	A       = mxGetPr(prhs[0]);				   // [m-by-n] Matrix A.
	B       = mxGetPr(prhs[1]);				   // [n-by-p-by-k] Matrix B.
	y       = mxGetPr(prhs[2]);				   // [m-by-1] vector y.
	UseCol  = mxGetLogicals(prhs[3]);          // [k-by-n] logical matrix with columns to use.
	UseRow  = mxGetLogicals(prhs[4]);          // [m-by-1] logical vector with rows to use.

	/* Get size variables for inputs. */
	size_t m, n, p, k;

	const size_t *pDims;
	const mwSize ndims = mxGetNumberOfDimensions(prhs[1]);
	pDims = mxGetDimensions(prhs[1]);

	m = mxGetM(prhs[2]);    // Number of rows of A and y.
	n = pDims[0];		    // Number of columns of A/rows in B.
	p = pDims[1];           // Number of columns in B.
	if (ndims == 2) {
		k = 1;			    // Number of pages in B.
	}
	else {
		k = pDims[2];      // Number of pages in B.
	}

	/* Initialize Matlab output matrices. */
	double *SSE1;
	plhs[0] = mxCreateDoubleMatrix(k, p, mxREAL);
	SSE1 = mxGetPr(plhs[0]);
	double *SSE2;
	plhs[1] = mxCreateDoubleMatrix(k, p, mxREAL);
	SSE2 = mxGetPr(plhs[1]);

	/* Allocate memory on GPU and transfer variables.                                          */
	double *dA, *dB, *dy, *dSSE1, *dSSE2;
	bool *dUseCol, *dUseRow;

	/* Allocate memory on GPU.	     */
	cudaMalloc(&dA     , sizeof(double)   * (int)m   * (int)n            );
	cudaMalloc(&dB     , sizeof(double)   * (int)n   * (int)p   * (int)k );
	cudaMalloc(&dy     , sizeof(double)   * (int)m                       );
	cudaMalloc(&dUseCol, sizeof(bool)     * (int)k   * (int)n            );
	cudaMalloc(&dUseRow, sizeof(bool)     * (int)m                       );
	
	cudaMalloc(&dSSE1  , sizeof(double)   * (int)k   * (int)p            );
	cudaMemset(dSSE1, 0, sizeof(double)   * (int)k   * (int)p            ); // make sure SSE1 is initialized to 0
	
	cudaMalloc(&dSSE2  , sizeof(double)   * (int)k   * (int)p            );
	cudaMemset(dSSE2, 0, sizeof(double)   * (int)k   * (int)p            ); // make sure SSE2 is initialized to 0
	
	/* Transfer variables from host to device.												     */
	cudaMemcpy(dA     , A     , sizeof(double) * (int)m * (int)n         , cudaMemcpyHostToDevice);
	cudaMemcpy(dB     , B     , sizeof(double) * (int)n * (int)p * (int)k, cudaMemcpyHostToDevice);
	cudaMemcpy(dy     , y     , sizeof(double) * (int)m                  , cudaMemcpyHostToDevice);
	cudaMemcpy(dUseCol, UseCol, sizeof(bool)   * (int)k   * (int)n       , cudaMemcpyHostToDevice);
	cudaMemcpy(dUseRow, UseRow, sizeof(bool)   * (int)m                  , cudaMemcpyHostToDevice);

	clock_t t1, t2;
	/* ———————————————————————————————————————————————————————————————————————————————————————— */
	/* Time non-destructive manual matrix multiplication + reduction kernel                     */
	t1 = clock();

	// Count active variables in UseCol and store their indices.
	uint64_T *UseColindex, *smalln;
	UseColindex = (uint64_T*)malloc(sizeof(uint64_T) * (int)k * (int)n); // [k-by-n] (overallocated)
	smalln = (uint64_T*)calloc(k, sizeof(uint64_T)); // [k-by-1]
	for (int c = 0; c < n; c++) {
		for (int r = 0; r < k; r++) {
			if (UseCol[r + k*c] == true) {
				UseColindex[r + k*smalln[r]] = c;
				smalln[r] += 1;
			}
		}
	}
	// Copy to device.
	uint64_T *dsmalln, *dUseColindex;
	cudaMalloc(&dsmalln, sizeof(uint64_T)   * (int)k);
	cudaMemcpy(dsmalln, smalln, sizeof(uint64_T) * (int)k, cudaMemcpyHostToDevice);
	cudaMalloc(&dUseColindex, sizeof(uint64_T)   * (int)k * (int)n);
	cudaMemcpy(dUseColindex, UseColindex, sizeof(uint64_T) * (int)k * (int)n, cudaMemcpyHostToDevice);

	// Count active observations in UseRow and store their indices.
	uint64_T *UseRowIndex;
	size_t smallm;
	UseRowIndex = (uint64_T*)malloc(sizeof(uint64_T) * (int)m); // [m-by-1] (overallocated)
	smallm = 0;
	for (int r = 0; r < m; r++) {
		if (UseRow[r] == true) {
			UseRowIndex[smallm] = r;
			smallm++;
		}
	}
	// Copy to device.
	uint64_T *dUseRowIndex;
	cudaMalloc(&dUseRowIndex, sizeof(uint64_T)   * (int)m);
	cudaMemcpy(dUseRowIndex, UseRowIndex, sizeof(uint64_T) * (int)m, cudaMemcpyHostToDevice);

	/* Specify kernel launch configuration.													     */
	int xnTPB = 8; int ynTPB = 16; int znTPB = 8;
	int xnumBlocks = (smallm + xnTPB - 1) / xnTPB;
	int ynumBlocks = (n + ynTPB - 1) / ynTPB;
	int znumBlocks = (k + znTPB - 1) / znTPB;
	dim3 Nblocks(xnumBlocks, ynumBlocks, znumBlocks);
	dim3 nTPB(xnTPB, ynTPB, znTPB);

	// Call manual matrix multiplication + reduction kernel.
	ManualMatMultRedKernel<<<Nblocks, nTPB>>>(dA, dB, dy, dSSE2, m, n, p, k, dUseColindex, dsmalln, dUseRowIndex, smallm);
	cudaDeviceSynchronize();

	free(UseColindex);
	free(UseRowIndex);
	free(smalln);
	cudaFree(dUseColindex);
	cudaFree(dUseRowIndex);
	cudaFree(dsmalln);

	t2 = clock();
	printf("Manual matmult kernel took: %.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);

	/* ———————————————————————————————————————————————————————————————————————————————————————— */
	/* Time destructive cublas based matrix multiplication + reduction kernel                   */
	t1 = clock();

	cublasHandle_t hdl;
	cublasCreate(&hdl);

	cudaDeviceSynchronize();
	double *d_AB;
	cudaMalloc(&d_AB, sizeof(double) * (int)m * (int)p);
	
	// Call kernel k times to simulate different pages of B used (and columns of UseCol) 
	for (int Bpage = 0; Bpage < k; Bpage++){
		mmr(hdl, dA, dy, dB, dSSE1, d_AB, m, n, p, dUseCol, dUseRow);
	}
	cudaDeviceSynchronize();

	cudaFree(d_AB);

	t2 = clock();
	printf("Cublas kernel took: %.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);
	/*	-------------------------------------- 	*/

	// copy calculation result back to host
	cudaMemcpy(SSE1, dSSE1, (int)k * (int)p * sizeof(double), cudaMemcpyDeviceToHost);
	cudaMemcpy(SSE2, dSSE2, (int)k * (int)p * sizeof(double), cudaMemcpyDeviceToHost);
	
	/* Free allocated memory */
	cudaFree(dA);
	cudaFree(dB);
	cudaFree(dy);
	cudaFree(dSSE1);
	cudaFree(dSSE2);
	cudaFree(dUseCol);
	cudaFree(dUseRow);
}

Update: I realized that when multiplying several different n-by-p B matrices with the same A I can just:

  1. Horizontally concatenate the B's next to each other into an n-by-(p*k) matrix
  2. Remove all rows of A which are marked as false by UseRow
and then perform a single dgemm call instead of multiple ones in a loop.

The resulting C should then contain all the products needed, provided that inactive elements of B are given a value of 0. So I tried that, thinking I would see at least 10x performance gains because I'm giving the entire problem to dgemm at once in a nice compact way. But to my surprise, that was still slower than my atomicAdd kernel with nested loops! And that's just considering the dgemm matrix multiplication; I didn't even time the subsequent C - y error calculation followed by squaring and reduction of C! What is going on!? I thought cublas dgemm was supposed to give unrivaled performance, but it's slower than loops!

You’re timing operations like cudaMalloc, cudaMemcpy, cudaFree, etc. I’m not timing those, so it’s not surprising to me that our results are different. Ideally we would do operations like cudaMalloc and cudaFree once, at the beginning and ending of your code, and not do them repetitively in timing-sensitive loops.
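
For example (just a sketch), once the allocations are hoisted out of the timed region, you could time only the repeated device-side work, e.g. with CUDA events:

// Sketch: all cudaMalloc/cudaMemcpy done beforehand; time only the repeated device work.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int Bpage = 0; Bpage < k; Bpage++)
    mmr(hdl, dA, dy, dB, dSSE1, d_AB, m, n, p, dUseCol, dUseRow);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in milliseconds
printf("device work: %.3f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);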

If you need to use A multiple times, then I would copy A once to the device, create an extra copy of A on the device, and perform the zeroing-of-A from A to Acopy with a kernel each time you need a different UseCol or UseRow config. Yes, this costs memory, but using memory to speed things up is a common strategy, and your concern about being able to launch many copies of this algorithm on a single GPU is not well-founded. That strategy will not give you any speedup.
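
A non-destructive variant of the zero_kernel might look roughly like this (untested sketch; Acopy is a second m*n device buffer that you allocate once up front, and the kernel name is mine):

__global__ void mask_copy_kernel(const double* __restrict__ A, double* __restrict__ Acopy,
                                 const int m, const int n, const bool* __restrict__ UseCol)
{
    int idx = threadIdx.x + blockDim.x*blockIdx.x;   // one thread per row, as in zero_kernel
    if (idx < m)
        for (int i = 0; i < n; i++)
            Acopy[i*m + idx] = UseCol[i] ? A[i*m + idx] : 0.0;   // A itself is left untouched
}
// launch with <<<(m+255)/256, 256>>> and then feed Acopy (not A) to cublasDgemm.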

Regarding your update, I have no idea what is going on there; you haven't provided the code (right?). But if you are timing cudaMalloc, cudaMemcpy, cudaFree, etc., and there is a significant difference between the two implementations in that respect (e.g. one uses cudaMalloc, the other does not), my guess would be that that is the issue. cudaMalloc is expensive.

But when I "fake" doing multiple calculations with the dgemm-based mmr function in the code I gave before, I placed the loop after all the mallocs, memcpys etc. so that the comparison is as fair as possible. When I test 1000 iterations (k=1000), the mallocs/memcpys only occur on the first iteration; the other 999 are free since there are no new mallocs/memcpys in the inner loop:

for (int Bpage = 0; Bpage < k; Bpage++){
    mmr(hdl, dA, dy, dB, dSSE1, d_AB, m, n, p, dUseCol, dUseRow);
}

And I don't intend to run several instances of this algorithm on my GPU; having multiple different B matrices is what I meant by running the kernel multiple times (that was before the code was written in a way that accepted 3-dimensional B's). So the code right now does what I want it to do, it's just slower than I would have wished, and it's bottlenecking my entire algorithm.

I'm attaching the code I used for timing the manual matrix multiplication + reduction vs. a single dgemm call below. I just tested it with these dimensions: m=1e5, n=256, p=25, k=50, where k is the number of different B's I use. The reported time for the manual matrix multiplication (which includes allocating and counting the true rows and columns of UseRow and UseCol + the matrix multiplication + squaring of the error + the reduction) is 38 milliseconds. The reported time for ONLY the dgemm call and nothing else is 37 ms. When including the preparation of dSmallA etc., this method is substantially slower, and the results are of course not finished when they come out of dgemm; they still need the squaring, reduction and so on, which I have not done.

So I think I'm giving dgemm the benefit of the doubt here by being generous in what I include in the timing, but it's very unimpressive in its performance.

#include "mex.h"
#include "cuda_runtime.h" 
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <cublas_v2.h>

/* Helper kernel 1 for cublas-based matmult reduction */
__global__ void zero_kernel(double * __restrict__ A, const int m, const int n, const bool * __restrict__ UseCol) {
	int idx = threadIdx.x + blockDim.x*blockIdx.x;
	if (idx < m){
		for (int i = 0; i < n; i++){
			if (!(UseCol[i])){
				A[i*m + idx] = 0;
			}
		}
	}
}
/* Helper kernel 2 for cublas-based matmult reduction */
__global__ void red_kernel(const double * __restrict__ AB, const double * __restrict__ y, double * __restrict__ SSE, const int m, const int p, const bool * __restrict__ UseRow) {

	int idx = threadIdx.x;
	__shared__ double s[1024];
	s[threadIdx.x] = 0;
	while (idx < m) { // block-stride loop
		if (UseRow[idx]) { 
			s[threadIdx.x] += (AB[idx + m*blockIdx.x] - y[idx])*(AB[idx + m*blockIdx.x] - y[idx]); 
		}
		idx += blockDim.x;
	}
	__syncthreads();
	for (int i = 512; i > 0; i >>= 1) {
		if (threadIdx.x < i) {
			s[threadIdx.x] += s[threadIdx.x + i];
		}
		__syncthreads();
	}
	if (!threadIdx.x) {
		SSE[blockIdx.x] = s[0];
	}
}
/* Manual matmult + reduction kernel without cublas */
__global__ void ManualMatMultRedKernel(const double *__restrict__ A, const double *__restrict__ B, const double * __restrict__ y, double *SSE, const size_t m, const size_t n, const size_t p, const size_t k, const uint64_T *UseColindex, const uint64_T *smalln, const uint64_T *UseRowIndex, const size_t smallm) {
	double e;
	/* Loop over pages in B (k) */
	for (int v = blockIdx.z * blockDim.z + threadIdx.z; v < k; v += blockDim.z * gridDim.z) {
		/* Loop rows of A (smallm) */
		for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < smallm; r += blockDim.x * gridDim.x) {
			/* Loop columns of B (p) */
			for (int c = blockIdx.y * blockDim.y + threadIdx.y; c < p; c += blockDim.y * gridDim.y) {
				e = 0;
				/* Inner loop over rows of B (smalln[v]) */
				for (int steps = 0; steps < smalln[v]; steps++) {
					e += A[UseRowIndex[r] + m*UseColindex[v + k*steps]] * B[UseColindex[v + k*steps] + n*c + n*p*v];
				}
				e -= y[UseRowIndex[r]];       // Error
				e *= e;						  // Squared error
				atomicAdd(SSE + v + k*c, e);  // Sum of squared errors
			} 
		} 
	}
} 

void mmr(cublasHandle_t h, double * __restrict__  A, const double * __restrict__  y, const double * __restrict__  B, double * __restrict__  SSE, double * __restrict__ AB, const int m, const int n, const int p, const bool * __restrict__  UseCol, const bool * __restrict__ UseRow) {

	zero_kernel<<<(m + 255) / 256, 256 >>>(A, m, n, UseCol);
	
	double alpha = 1.0;
	double beta = 0.0; // could also prefill AB with y, set beta to -1.0  so as to compute AB - y

	cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, p, n, &alpha, A, m, B, n, &beta, AB, m);

	red_kernel<<<p, 1024 >>>(AB, y, SSE, m, p, UseRow);
}
/* Kernel for filling A temp */
__global__ void FillPartialA(const double *__restrict__ A, const size_t m, const size_t n, const uint64_T *UseRowIndex, const size_t smallm, double *SmallA) {
	/* Loop rows of A (smallm) */
	for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < smallm; r += blockDim.x * gridDim.x) {
		/* Inner loop over columns of A (n) */
		for (int steps = blockIdx.y * blockDim.y + threadIdx.y; steps < n; steps += blockDim.y * gridDim.y) {
			SmallA[r + smallm*steps] = A[UseRowIndex[r] + m*steps];		
		}
	}
}

/* ------------------------------ Start of Matlab gateway function ----------------------------- */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {
	
	/* ————————————————————————————————————————————————————————————————————————————————————————*/
	/* Get inputs from Matlab.																   */
	const double *A, *B, *y;
	const bool *UseCol, *UseRow;
	A       = mxGetPr(prhs[0]);				   // [m-by-n] Matrix A.
	B       = mxGetPr(prhs[1]);				   // [n-by-p-by-k] Matrix B.
	y       = mxGetPr(prhs[2]);				   // [m-by-1] vector y.
	UseCol  = mxGetLogicals(prhs[3]);          // [k-by-n] logical matrix with columns to use.
	UseRow  = mxGetLogicals(prhs[4]);          // [m-by-1] logical vector with rows to use.

	/* Get size variables for inputs. */
	size_t m, n, p, k;

	const size_t *pDims;
	const mwSize ndims = mxGetNumberOfDimensions(prhs[1]);
	pDims = mxGetDimensions(prhs[1]);

	m = mxGetM(prhs[2]);    // Number of rows of A and y.
	n = pDims[0];		    // Number of columns of A/rows in B.
	p = pDims[1];           // Number of columns in B.
	if (ndims == 2) {
		k = 1;			    // Number of pages in B.
	}
	else {
		k = pDims[2];      // Number of pages in B.
	}

	/* Initialize Matlab output matrices. */
	double *SSE1;
	plhs[0] = mxCreateDoubleMatrix(k, p, mxREAL);
	SSE1 = mxGetPr(plhs[0]);
	double *SSE2;
	plhs[1] = mxCreateDoubleMatrix(k, p, mxREAL);
	SSE2 = mxGetPr(plhs[1]);
	/* ————————————————————————————————————————————————————————————————————————————————————————*/
	/* Allocate memory on GPU and transfer variables.                                          */
	double *dA, *dB, *dy, *dSSE1, *dSSE2;
	bool *dUseCol, *dUseRow;

	/* Allocate memory on GPU.																     */
	cudaMalloc(&dA     , sizeof(double)   * (int)m   * (int)n            );
	cudaMalloc(&dB     , sizeof(double)   * (int)n   * (int)p   * (int)k );
	cudaMalloc(&dy     , sizeof(double)   * (int)m                       );
	cudaMalloc(&dUseCol, sizeof(bool)     * (int)k   * (int)n            );
	cudaMalloc(&dUseRow, sizeof(bool)     * (int)m                       );
	
	cudaMalloc(&dSSE1  , sizeof(double)   * (int)k   * (int)p            );
	cudaMemset(dSSE1, 0, sizeof(double)   * (int)k   * (int)p            ); // make sure SSE1 is initialized to 0
	
	cudaMalloc(&dSSE2  , sizeof(double)   * (int)k   * (int)p            );
	cudaMemset(dSSE2, 0, sizeof(double)   * (int)k   * (int)p            ); // make sure SSE2 is initialized to 0
	
	/* Transfer variables from host to device.												     */
	cudaMemcpy(dA     , A     , sizeof(double) * (int)m * (int)n         , cudaMemcpyHostToDevice);
	cudaMemcpy(dB     , B     , sizeof(double) * (int)n * (int)p * (int)k, cudaMemcpyHostToDevice);
	cudaMemcpy(dy     , y     , sizeof(double) * (int)m                  , cudaMemcpyHostToDevice);
	cudaMemcpy(dUseCol, UseCol, sizeof(bool)   * (int)k   * (int)n       , cudaMemcpyHostToDevice);
	cudaMemcpy(dUseRow, UseRow, sizeof(bool)   * (int)m                  , cudaMemcpyHostToDevice);

	clock_t t1, t2, t3, t4;	
	/*	———————————————————————————————————————————————————————————————————————————————————————— */
	/* Time non-destructive manual matrix multiplication + reduction kernel                      */
	/*	———————————————————————————————————————————————————————————————————————————————————————— */
	t1 = clock();

	// Count active variables in UseCol and store their indices.
	uint64_T *UseColindex, *smalln;
	UseColindex = (uint64_T*)malloc(sizeof(uint64_T) * (int)k * (int)n); // [k-by-n] (overallocated)
	smalln = (uint64_T*)calloc(k, sizeof(uint64_T)); // [k-by-1]
	for (int c = 0; c < n; c++) {
		for (int r = 0; r < k; r++) {
			if (UseCol[r + k*c] == true) {
				UseColindex[r + k*smalln[r]] = c;
				smalln[r] += 1;
			}
		}
	}
	// Copy to device.
	uint64_T *dsmalln, *dUseColindex;
	cudaMalloc(&dsmalln, sizeof(uint64_T)   * (int)k);
	cudaMemcpy(dsmalln, smalln, sizeof(uint64_T) * (int)k, cudaMemcpyHostToDevice);
	cudaMalloc(&dUseColindex, sizeof(uint64_T)   * (int)k * (int)n);
	cudaMemcpy(dUseColindex, UseColindex, sizeof(uint64_T) * (int)k * (int)n, cudaMemcpyHostToDevice);

	// Count active observations in UseRow and store their indices.
	uint64_T *UseRowIndex;
	size_t smallm;
	UseRowIndex = (uint64_T*)malloc(sizeof(uint64_T) * (int)m); // [m-by-1] (overallocated)
	smallm = 0;
	for (int r = 0; r < m; r++) {
		if (UseRow[r] == true) {
			UseRowIndex[smallm] = r;
			smallm++;
		}
	}
	// Copy to device.
	uint64_T *dUseRowIndex;
	cudaMalloc(&dUseRowIndex, sizeof(uint64_T)   * (int)m);
	cudaMemcpy(dUseRowIndex, UseRowIndex, sizeof(uint64_T) * (int)m, cudaMemcpyHostToDevice);

	/* Specify kernel launch configuration.													     */
	int xnTPB = 8; int ynTPB = 16; int znTPB = 8;
	int xnumBlocks = (smallm + xnTPB - 1) / xnTPB;
	int ynumBlocks = (n + ynTPB - 1) / ynTPB;
	int znumBlocks = (k + znTPB - 1) / znTPB;
	dim3 Nblocks(xnumBlocks, ynumBlocks, znumBlocks);
	dim3 nTPB(xnTPB, ynTPB, znTPB);

	// Call manual matrix multiplication + reduction kernel.
	ManualMatMultRedKernel<<<Nblocks, nTPB >>>(dA, dB, dy, dSSE2, m, n, p, k, dUseColindex, dsmalln, dUseRowIndex, smallm);
	cudaDeviceSynchronize();

	t2 = clock();
	printf("Manual matmult kernel took: %.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);

	/*———————————————————————————————————————————————————————————————————————————————————————— */
	/* Time non-destructive big cublasDgemm call						                       */
	/*———————————————————————————————————————————————————————————————————————————————————————— */

	t1 = clock();

	cublasHandle_t hdl;
	cublasCreate(&hdl);
	double alpha = 1.0;
	double beta = 0.0; // could also prefill AB with y, set beta to -1.0  so as to compute AB - y

	// Smaller A with only rows which are true in UseRow included and AB.
	double *dSmallA;
	cudaMalloc(&dSmallA, sizeof(double)   * (int)smallm   * (int)n);
	double *dAB;
	cudaMalloc(&dAB, sizeof(double)       * (int)smallm   * (int)p  * (int)k);

	/* Remove rows from A that are not set to true in UseRow.											*/
	int xnTPB2 = 8;	int ynTPB2 = 16; 	int znTPB2 = 1;
	int xnumBlocks2 = (smallm + xnTPB2 - 1) / xnTPB2;
	int ynumBlocks2 = (n + ynTPB2 - 1) / ynTPB2;
	int znumBlocks2 = 1;
	dim3 Nblocks2(xnumBlocks2, ynumBlocks2, znumBlocks2);
	dim3 nTPB2(xnTPB2, ynTPB2, znTPB2);

	FillPartialA<<<Nblocks2, nTPB2>>>(dA, m, n, dUseRowIndex, smallm, dSmallA);
	cudaDeviceSynchronize();
	
	t3 = clock();
	size_t pk = p*k;
	 //Calculate C = A*B  with cublas
	cublasDgemm(hdl,
	  CUBLAS_OP_N,  // No transpose
	  CUBLAS_OP_N,  // No transpose
		   smallm,  // rows of dSmallA
		       pk,	// columns of dB
		        n,  // columns of dSmallA
		   &alpha,  // scalar alpha
	  	  dSmallA,  // dSmallA matrix
  		   smallm,  // LDA
		       dB,  // dB matrix
		        n,  // LDB
		    &beta,  // scalar beta
		      dAB,  // output matrix [smallm-by-p-by-k]
		   smallm); // LDC
	cudaDeviceSynchronize();
	t4 = clock();
	/* If the reported time here is greater than the whole manual matmult+reduction time,
	there is no need to convert the dAB output to a squared error and reduce it, because the method is already slower. */
	printf("dgemm alone took: %.3f ms\n", (double)(t4 - t3) / CLOCKS_PER_SEC * 1000);

	cudaFree(dAB);
	free(UseColindex);
	free(UseRowIndex);
	free(smalln);
	cudaFree(dUseColindex);
	cudaFree(dUseRowIndex);
	cudaFree(dsmalln);

	t2 = clock();
	printf("The whole non-destructive cublas approach took: %.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);

	// copy calculation result back to host
	cudaMemcpy(SSE1, dSSE1, (int)k * (int)p * sizeof(double), cudaMemcpyDeviceToHost);
	cudaMemcpy(SSE2, dSSE2, (int)k * (int)p * sizeof(double), cudaMemcpyDeviceToHost);
	
	/* Free allocated memory */
	cudaFree(dA);
	cudaFree(dB);
	cudaFree(dy);
	cudaFree(dSSE1);
	cudaFree(dSSE2);
	cudaFree(dUseCol);
	cudaFree(dUseRow);
}

And concerning the speed of my manual matrix multiplication kernel, atomicAdd is for sure the bottleneck. If I comment that single line out, the kernel time goes from 40 ms to 4 ms. But you are right that a normal add is not much faster than the atomic. Why is that so slow? It's just an addition. The SSE matrix I'm adding elements to is small (only k-by-p); is there any trick I can use to make accessing it faster? Put it in some sort of constant/shared memory, etc.?
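
For example, would something along these lines be the right idea? (An untested sketch of the same kernel, where each block accumulates into a shared-memory copy of SSE and only flushes it to global memory once per block at the end. It would only work when k*p doubles fit in shared memory, e.g. k=50, p=25 is about 10 KB, so not for k in the hundreds.)

/* Sketch: block-private SSE staged in shared memory; launch with
   <<<Nblocks, nTPB, k*p*sizeof(double)>>> so the dynamic shared allocation holds k*p doubles. */
__global__ void ManualMatMultRedKernelShared(const double *__restrict__ A, const double *__restrict__ B, const double * __restrict__ y, double *SSE, const size_t m, const size_t n, const size_t p, const size_t k, const uint64_T *UseColindex, const uint64_T *smalln, const uint64_T *UseRowIndex, const size_t smallm) {
	extern __shared__ double sSSE[];                        // block-private copy of SSE (k*p doubles)
	int tid = threadIdx.x + blockDim.x*(threadIdx.y + blockDim.y*threadIdx.z);
	int nthreads = blockDim.x*blockDim.y*blockDim.z;
	for (int i = tid; i < k*p; i += nthreads) sSSE[i] = 0;  // zero the shared copy
	__syncthreads();
	for (int v = blockIdx.z * blockDim.z + threadIdx.z; v < k; v += blockDim.z * gridDim.z)
		for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < smallm; r += blockDim.x * gridDim.x)
			for (int c = blockIdx.y * blockDim.y + threadIdx.y; c < p; c += blockDim.y * gridDim.y) {
				double e = 0;
				for (int steps = 0; steps < smalln[v]; steps++)
					e += A[UseRowIndex[r] + m*UseColindex[v + k*steps]] * B[UseColindex[v + k*steps] + n*c + n*p*v];
				e -= y[UseRowIndex[r]];
				atomicAdd(sSSE + v + k*c, e*e);             // shared-memory atomic (native on cc 6.x)
			}
	__syncthreads();
	for (int i = tid; i < k*p; i += nthreads)
		atomicAdd(SSE + i, sSSE[i]);                        // one global atomic per SSE element per block
}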

I took your code that you have now posted, and converted it so I don't need matlab. Then I set UseRow and UseCol to 20% fill each. Then I ran it on a Tesla P100, and also profiled it. The timing report says the cublas routine took zero time; this is because the measurement is less than the granularity of clock() in my setup.

However the profiler reports the cublas dgemm kernel is over 10x faster than your kernel.

$ cat t273.cu
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <cublas_v2.h>

typedef unsigned long long uint64_T;

/* Helper kernel 1 for cublas-based matmult reduction */
__global__ void zero_kernel(double * __restrict__ A, const int m, const int n, const bool * __restrict__ UseCol) {
        int idx = threadIdx.x + blockDim.x*blockIdx.x;
        if (idx < m){
                for (int i = 0; i < n; i++){
                        if (!(UseCol[i])){
                                A[i*m + idx] = 0;
                        }
                }
        }
}
/* Helper kernel 2 for cublas-based matmult reduction */
__global__ void red_kernel(const double * __restrict__ AB, const double * __restrict__ y, double * __restrict__ SSE, const int m, const int p, const bool * __restrict__ UseRow) {

        int idx = threadIdx.x;
        __shared__ double s[1024];
        s[threadIdx.x] = 0;
        while (idx < m) { // block-stride loop
                if (UseRow[idx]) {
                        s[threadIdx.x] += (AB[idx + m*blockIdx.x] - y[idx])*(AB[idx + m*blockIdx.x] - y[idx]);
                }
                idx += blockDim.x;
        }
        __syncthreads();
        for (int i = 512; i > 0; i >>= 1) {
                if (threadIdx.x < i) {
                        s[threadIdx.x] += s[threadIdx.x + i];
                }
                __syncthreads();
        }
        if (!threadIdx.x) {
                SSE[blockIdx.x] = s[0];
        }
}
/* Manual matmult + reduction kernel without cublas */
__global__ void ManualMatMultRedKernel(const double *__restrict__ A, const double *__restrict__ B, const double * __restrict__ y, double *SSE, const size_t m, const size_t n, const size_t p, const size_t k, const uint64_T *UseColindex, const uint64_T *smalln, const uint64_T *UseRowIndex, const size_t smallm) {
        double e;
        /* Loop over pages in B (k) */
        for (int v = blockIdx.z * blockDim.z + threadIdx.z; v < k; v += blockDim.z * gridDim.z) {
                /* Loop rows of A (smallm) */
                for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < smallm; r += blockDim.x * gridDim.x) {
                        /* Loop columns of B (p) */
                        for (int c = blockIdx.y * blockDim.y + threadIdx.y; c < p; c += blockDim.y * gridDim.y) {
                                e = 0;
                                /* Inner loop over rows of B (smalln[v]) */
                                for (int steps = 0; steps < smalln[v]; steps++) {
                                        e += A[UseRowIndex[r] + m*UseColindex[v + k*steps]] * B[UseColindex[v + k*steps] + n*c + n*p*v];
                                }
                                e -= y[UseRowIndex[r]];       // Error
                                e *= e;                                           // Squared error
                                atomicAdd(SSE + v + k*c, e);  // Sum of squared errors
                        }
                }
        }
}

void mmr(cublasHandle_t h, double * __restrict__  A, const double * __restrict__  y, const double * __restrict__  B, double * __restrict__  SSE, double * __restrict__ AB, const int m, const int n, const int p, const bool * __restrict__  UseCol, const bool * __restrict__ UseRow) {

        zero_kernel<<<(m + 255) / 256, 256 >>>(A, m, n, UseCol);

        double alpha = 1.0;
        double beta = 0.0; // could also prefill AB with y, set beta to -1.0  so as to compute AB - y

        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, p, n, &alpha, A, m, B, n, &beta, AB, m);

        red_kernel<<<p, 1024 >>>(AB, y, SSE, m, p, UseRow);
}
/* Kernel for filling A temp */
__global__ void FillPartialA(const double *__restrict__ A, const size_t m, const size_t n, const uint64_T *UseRowIndex, const size_t smallm, double *SmallA) {
        /* Loop rows of A (smallm) */
        for (int r = blockIdx.x * blockDim.x + threadIdx.x; r < smallm; r += blockDim.x * gridDim.x) {
                /* Inner loop over columns of A (n) */
                for (int steps = blockIdx.y * blockDim.y + threadIdx.y; steps < n; steps += blockDim.y * gridDim.y) {
                        SmallA[r + smallm*steps] = A[UseRowIndex[r] + m*steps];
                }
        }
}

/* ------------------------------ Start of Matlab gateway function ----------------------------- */
int  main() {

        /* ————————————————————————————————————————————————————————————————————————————————————————*/
        /* Get inputs from Matlab.                                                                                                                                 */
        const double *A, *B, *y;
        bool *UseCol, *UseRow;

        /* Get size variables for inputs. */
        size_t m, n, p, k;

        m = 10000;  // Number of rows of A and y.
        n = 256;                    // Number of columns of A/rows in B.
        p = 25;           // Number of columns in B.
        k = 50;
        A       = new double[m*n];                                 // [m-by-n] Matrix A.
        B       = new double[n*p*k];                               // [n-by-p-by-k] Matrix B.
        y       = new double[m];                                   // [m-by-1] vector y.
        UseCol  = new bool[k*n];          // [k-by-n] logical matrix with columns to use.
        UseRow  = new bool[m];          // [m-by-1] logical vector with rows to use.
        for (int i = 0; i < k*n; i++) UseCol[i] = (i%5)?false:true;
        for (int i = 0; i < m; i++) UseRow[i] = (i%5)?false:true;
        /* Initialize Matlab output matrices. */
        double *SSE1;
        SSE1 = new double[k*p];
        double *SSE2;
        SSE2 = new double[k*p];
        /* ————————————————————————————————————————————————————————————————————————————————————————*/
        /* Allocate memory on GPU and transfer variables.                                          */
        double *dA, *dB, *dy, *dSSE1, *dSSE2;
        bool *dUseCol, *dUseRow;

        /* Allocate memory on GPU.                                                                                                                                   */
        cudaMalloc(&dA     , sizeof(double)   * (int)m   * (int)n            );
        cudaMalloc(&dB     , sizeof(double)   * (int)n   * (int)p   * (int)k );
        cudaMalloc(&dy     , sizeof(double)   * (int)m                       );
        cudaMalloc(&dUseCol, sizeof(bool)     * (int)k   * (int)n            );
        cudaMalloc(&dUseRow, sizeof(bool)     * (int)m                       );

        cudaMalloc(&dSSE1  , sizeof(double)   * (int)k   * (int)p            );
        cudaMemset(dSSE1, 0, sizeof(double)   * (int)k   * (int)p            ); // make sure SSE1 is initialized to 0

        cudaMalloc(&dSSE2  , sizeof(double)   * (int)k   * (int)p            );
        cudaMemset(dSSE2, 0, sizeof(double)   * (int)k   * (int)p            ); // make sure SSE2 is initialized to 0

        /* Transfer variables from host to device.                                                                                                   */
        cudaMemcpy(dA     , A     , sizeof(double) * (int)m * (int)n         , cudaMemcpyHostToDevice);
        cudaMemcpy(dB     , B     , sizeof(double) * (int)n * (int)p * (int)k, cudaMemcpyHostToDevice);
        cudaMemcpy(dy     , y     , sizeof(double) * (int)m                  , cudaMemcpyHostToDevice);
        cudaMemcpy(dUseCol, UseCol, sizeof(bool)   * (int)k   * (int)n       , cudaMemcpyHostToDevice);
        cudaMemcpy(dUseRow, UseRow, sizeof(bool)   * (int)m                  , cudaMemcpyHostToDevice);

clock_t t1, t2, t3, t4;
        /*      ———————————————————————————————————————————————————————————————————————————————————————— */
        /* Time non-destructive manual matrix multiplication + reduction kernel                      */
        /*      ———————————————————————————————————————————————————————————————————————————————————————— */
        t1 = clock();

        // Count active variables in UseCol and store their indices.
        uint64_T *UseColindex, *smalln;
        UseColindex = (uint64_T*)malloc(sizeof(uint64_T) * (int)k * (int)n); // [k-by-n] (overallocated)
        smalln = (uint64_T*)calloc(k, sizeof(uint64_T)); // [k-by-1]
        for (int c = 0; c < n; c++) {
                for (int r = 0; r < k; r++) {
                        if (UseCol[r + k*c] == true) {
                                UseColindex[r + k*smalln[r]] = c;
                                smalln[r] += 1;
                        }
                }
        }
        // Copy to device.
        uint64_T *dsmalln, *dUseColindex;
        cudaMalloc(&dsmalln, sizeof(uint64_T)   * (int)k);
        cudaMemcpy(dsmalln, smalln, sizeof(uint64_T) * (int)k, cudaMemcpyHostToDevice);
        cudaMalloc(&dUseColindex, sizeof(uint64_T)   * (int)k * (int)n);
        cudaMemcpy(dUseColindex, UseColindex, sizeof(uint64_T) * (int)k * (int)n, cudaMemcpyHostToDevice);

        // Count active observations in UseRow and store their indices.
        uint64_T *UseRowIndex;
        size_t smallm;
        UseRowIndex = (uint64_T*)malloc(sizeof(uint64_T) * (int)m); // [m-by-1] (overallocated)
        smallm = 0;
        for (int r = 0; r < m; r++) {
                if (UseRow[r] == true) {
                        UseRowIndex[smallm] = r;
                        smallm++;
                }
        }
        // Copy to device.
        uint64_T *dUseRowIndex;
        cudaMalloc(&dUseRowIndex, sizeof(uint64_T)   * (int)m);
        cudaMemcpy(dUseRowIndex, UseRowIndex, sizeof(uint64_T) * (int)m, cudaMemcpyHostToDevice);

        /* Specify kernel launch configuration.                                                                                                      */
        int xnTPB = 8; int ynTPB = 16; int znTPB = 8;
        int xnumBlocks = (smallm + xnTPB - 1) / xnTPB;
        int ynumBlocks = (n + ynTPB - 1) / ynTPB;
        int znumBlocks = (k + znTPB - 1) / znTPB;
        dim3 Nblocks(xnumBlocks, ynumBlocks, znumBlocks);
        dim3 nTPB(xnTPB, ynTPB, znTPB);

        // Call manual matrix multiplication + reduction kernel.
        ManualMatMultRedKernel<<<Nblocks, nTPB >>>(dA, dB, dy, dSSE2, m, n, p, k, dUseColindex, dsmalln, dUseRowIndex, smallm);
        cudaDeviceSynchronize();

        t2 = clock();
        printf("Manual matmult kernel took: %.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);

        /*———————————————————————————————————————————————————————————————————————————————————————— */
        /* Time non-destructive big cublasDgemm call                                                                   */
        /*———————————————————————————————————————————————————————————————————————————————————————— */

        t1 = clock();

        cublasHandle_t hdl;
        cublasCreate(&hdl);
        double alpha = 1.0;
        double beta = 0.0; // could also prefill AB with y, set beta to -1.0  so as to compute AB - y

        // Smaller A with only rows which are true in UseRow included and AB.
        double *dSmallA;
        cudaMalloc(&dSmallA, sizeof(double)   * (int)smallm   * (int)n);
        double *dAB;
        cudaMalloc(&dAB, sizeof(double)       * (int)smallm   * (int)p  * (int)k);

        /* Remove rows from A that are not set to true in UseRow.                                                                                       */
        int xnTPB2 = 8; int ynTPB2 = 16; int znTPB2 = 1;
        int xnumBlocks2 = (smallm + xnTPB2 - 1) / xnTPB2;
        int ynumBlocks2 = (n + ynTPB2 - 1) / ynTPB2;
        int znumBlocks2 = 1;
        dim3 Nblocks2(xnumBlocks2, ynumBlocks2, znumBlocks2);
        dim3 nTPB2(xnTPB2, ynTPB2, znTPB2);

        FillPartialA<<<Nblocks2, nTPB2>>>(dA, m, n, dUseRowIndex, smallm, dSmallA);
        cudaDeviceSynchronize();

        t3 = clock();
        size_t pk = p*k;
        // Calculate C = A*B with cublas.
        cublasDgemm(hdl,
          CUBLAS_OP_N,  // No transpose
          CUBLAS_OP_N,  // No transpose
                   smallm,  // rows of dSmallA
                       pk,      // columns of dB
                        n,  // columns of dSmallA
                   &alpha,  // scalar alpha
                  dSmallA,  // dSmallA matrix
                   smallm,  // LDA
                       dB,  // dB matrix
                        n,  // LDB
                    &beta,  // scalar beta
                      dAB,  // output matrix [smallm-by-p-by-k]
                   smallm); // LDC
        cudaDeviceSynchronize();
        t4 = clock();
        /* If the reported time here is greater than the whole manual matmult+reduction time,
        there is no need to convert the dAB output to a squared error and reduce it, because the method is already slower. */
        printf("dgemm alone took: %.3f ms\n", (double)(t4 - t3) / CLOCKS_PER_SEC * 1000);

        cudaFree(dAB);
        free(UseColindex);
        free(UseRowIndex);
        free(smalln);
        cudaFree(dUseColindex);
        cudaFree(dUseRowIndex);
        cudaFree(dsmalln);

        t2 = clock();
        printf("The whole non-destructive cublas approach took: %.3f ms\n", (double)(t2 - t1) / CLOCKS_PER_SEC * 1000);

        // copy calculation result back to host
        cudaMemcpy(SSE1, dSSE1, (int)k * (int)p * sizeof(double), cudaMemcpyDeviceToHost);
        cudaMemcpy(SSE2, dSSE2, (int)k * (int)p * sizeof(double), cudaMemcpyDeviceToHost);

        /* Free allocated memory */
        cudaFree(dA);
        cudaFree(dB);
        cudaFree(dy);
        cudaFree(dSSE1);
        cudaFree(dSSE2);
        cudaFree(dUseCol);
        cudaFree(dUseRow);
}
$ nvcc -arch=sm_60 -o t273 t273.cu -lcublas
$ ./t273
Manual matmult kernel took: 10.000 ms
dgemm alone took: 0.000 ms
The whole non-destructive cublas approach took: 170.000 ms
$ nvprof ./t273
==28478== NVPROF is profiling process 28478, command: ./t273
Manual matmult kernel took: 0.000 ms
dgemm alone took: 0.000 ms
The whole non-destructive cublas approach took: 390.000 ms
==28478== Profiling application: ./t273
==28478== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   65.82%  11.890ms         9  1.3211ms  1.1840us  10.926ms  [CUDA memcpy HtoD]
                   31.76%  5.7382ms         1  5.7382ms  5.7382ms  5.7382ms  ManualMatMultRedKernel(double const *, double const *, double const *, double*, unsigned long, unsigned long, unsigned long, unsigned long, unsigned __int64 const *, unsigned __int64 const *, unsigned __int64 const *, unsigned long)
                    2.10%  379.71us         1  379.71us  379.71us  379.71us  maxwell_dgemm_64x64_nn
                    0.23%  41.760us         1  41.760us  41.760us  41.760us  FillPartialA(double const *, unsigned long, unsigned long, unsigned __int64 const *, unsigned long, double*)
                    0.07%  11.936us         2  5.9680us  1.9840us  9.9520us  [CUDA memcpy DtoH]
                    0.02%  3.2960us         2  1.6480us  1.5360us  1.7600us  [CUDA memset]
      API calls:   48.66%  384.48ms        12  32.040ms  7.1170us  381.88ms  cudaFree
                   47.28%  373.62ms        15  24.908ms  12.526us  370.93ms  cudaMalloc
                    1.72%  13.554ms        11  1.2322ms  22.372us  11.580ms  cudaMemcpy
                    1.04%  8.2494ms       740  11.147us     282ns  466.15us  cuDeviceGetAttribute
                    0.78%  6.1669ms         3  2.0556ms  43.601us  5.7461ms  cudaDeviceSynchronize
                    0.39%  3.0691ms         8  383.64us  224.67us  667.92us  cuDeviceTotalMem
                    0.09%  748.27us         8  93.533us  81.999us  113.36us  cuDeviceGetName
                    0.02%  156.80us         3  52.265us  24.749us  78.973us  cudaLaunch
                    0.01%  53.151us         2  26.575us  12.046us  41.105us  cudaMemset
                    0.00%  30.713us        42     731ns     148ns  8.9550us  cudaSetupArgument
                    0.00%  22.335us        16  1.3950us     710ns  8.6470us  cudaEventCreateWithFlags
                    0.00%  10.194us         1  10.194us  10.194us  10.194us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.00%  9.2680us        12     772ns     346ns  1.6660us  cuDeviceGet
                    0.00%  8.0380us        11     730ns     391ns  3.3040us  cudaDeviceGetAttribute
                    0.00%  7.2590us         3  2.4190us  2.0760us  2.7600us  cudaConfigureCall
                    0.00%  4.9240us         4  1.2310us     480ns  2.7310us  cuDeviceGetCount
                    0.00%  2.5820us         1  2.5820us  2.5820us  2.5820us  cudaGetDevice
                    0.00%  1.6450us         1  1.6450us  1.6450us  1.6450us  cudaGetLastError
                    0.00%  1.5150us         1  1.5150us  1.5150us  1.5150us  cuInit
                    0.00%  1.4060us         1  1.4060us  1.4060us  1.4060us  cuDriverGetVersion
$

Your 1080 won’t perform as well on dgemm - I mentioned this already. I don’t know how much worse it will be. Note that the cudaMalloc, cudaMemcpy, and cudaFree calls are the largest items here, not any of the computation.
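If the evaluation is going to be launched many times, the usual pattern is to hoist the allocations and the transfer of the constant inputs out of the repeated region, so that only the kernel launch is paid per evaluation. A minimal sketch building on the buffers from the listing above (numEvaluations is a hypothetical name, and this assumes the inputs do not change between evaluations):

// Sketch: with dA, dB, dy, dUseColindex, dsmalln, dUseRowIndex already allocated
// and filled once (as in the listing above), each evaluation only resets the
// output buffer and relaunches the kernel -- no cudaMalloc/cudaMemcpy/cudaFree.
for (int iter = 0; iter < numEvaluations; iter++) {
    cudaMemset(dSSE2, 0, sizeof(double) * (int)k * (int)p);
    ManualMatMultRedKernel<<<Nblocks, nTPB>>>(dA, dB, dy, dSSE2, m, n, p, k,
                                              dUseColindex, dsmalln,
                                              dUseRowIndex, smallm);
}
cudaDeviceSynchronize();
// Copy the result back once, outside the loop.
cudaMemcpy(SSE2, dSSE2, sizeof(double) * (int)k * (int)p, cudaMemcpyDeviceToHost);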

Regarding commenting things out and going from 40 ms to 4 ms: this is a typical mistake. When you comment out a write to global state, you also allow the compiler to remove any code that would have produced that data; it is an optimizing compiler. So that methodology of commenting things out is often not useful. The comparison between an ordinary write and atomicAdd is more valid, though, and as you see, there is little difference.
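To make that kind of comparison fair, both variants have to keep the computed value observable so the optimizer cannot delete the arithmetic that produced it. A minimal sketch of what I mean (toy kernels, not your actual code):

// Both kernels keep the result visible in global memory, so the multiply
// cannot be removed as dead code; only the type of write differs.
__global__ void square_with_atomic(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i] * in[i]);   // contended atomic accumulation
}

__global__ void square_with_store(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];          // ordinary global store
}
// Commenting out the write entirely (in either kernel) lets the compiler
// strip the multiply as well, which is why the 40 ms -> 4 ms number is misleading.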

This is very confusing. I just ran the code with the same dimensions as I see in your code: m = 10000, n = 256, p = 25, k = 50, and my results are the opposite of yours:

Manual matmult kernel took: 2.000 ms
dgemm alone took: 4.000 ms

If you have a better GPU than mine, shouldn’t it perform better on the manual matmult kernel as well? My GPU was 5x faster than yours on that one (you had 10 ms?), but significantly slower on dgemm. Even though the absolute ms values may differ, shouldn’t our GPUs agree on which kernel is more efficient? Did you change anything else in the code that I may have missed, or did you just run it as it was?

Your dgemm will be slower than my dgemm, because my GPU has a lot more FP64 throughput than yours.

With respect to the “manual” kernel, the execution time of the kernel on my setup is 5.7 ms, not 10 ms (clock() has very coarse granularity in my case). On your setup, I don’t know what the granularity of clock() is (I’m guessing +/- 2 ms), and it’s possible that your kernel execution is actually taking longer than 2 ms if you time it with a profiler.
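If you want finer resolution than clock() without attaching the profiler, CUDA events are one option; a minimal sketch dropped around the existing launch (the rest of the setup stays as in your code):

// Time only the kernel with CUDA events; resolution is roughly half a microsecond,
// far finer than clock() on most platforms.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
ManualMatMultRedKernel<<<Nblocks, nTPB>>>(dA, dB, dy, dSSE2, m, n, p, k,
                                          dUseColindex, dsmalln, dUseRowIndex, smallm);
cudaEventRecord(stop);
cudaEventSynchronize(stop);               // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("Manual matmult kernel took: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);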

When I run my code with m=100000 to match your original dimensions (that was the only difference), I get the following profiler-measured times for your kernel and dgemm:

27.20%  41.013ms         1  41.013ms  41.013ms  41.013ms  ManualMatMultRedKernel(double const *, double const *, double const *, double*, unsigned long, unsigned long, unsigned long, unsigned long, unsigned __int64 const *, unsigned __int64 const *, unsigned __int64 const *, unsigned long)
2.27%  3.4243ms         1  3.4243ms  3.4243ms  3.4243ms  maxwell_dgemm_128x64_nn

Again, the dgemm kernel is about 10x faster than your kernel.

I profiled the code on my computer now with Nsight, and it reports more or less the same thing as before. But it revealed two things: 1) the dgemm kernel my computer calls is not the same as yours; mine has “raggedMn” added to the name. 2) The occupancy of my dgemm call is only 12.5%, which I guess explains why it’s slow. What could cause it to be so low?

ManualMatMultRedKernel - Duration 2,049.152 μs - Occupancy 100 %
Maxwell_dgemm_128x64_raggedMn_nn - Duration 4,211.520 μs - Occupancy 12.5 %

I’m using CUDA 9.1. If you are using a different CUDA version, it’s possible that your dgemm library call could be different. It’s also possible that the library is making a different choice of dgemm kernel because I was running on a cc6.0 device and you are running on a cc6.1 device.
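For what it’s worth, you can check what each machine is actually running without the profiler by querying the runtime, driver, and cuBLAS versions plus the device’s compute capability; a small sketch using standard query calls (hdl is the cublasHandle_t already created in the code above):

// Report library versions and compute capability so the two setups can be compared.
int runtimeVer = 0, driverVer = 0, cublasVer = 0;
cudaRuntimeGetVersion(&runtimeVer);
cudaDriverGetVersion(&driverVer);
cublasGetVersion(hdl, &cublasVer);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);        // device 0

printf("runtime %d, driver %d, cuBLAS %d, %s (cc %d.%d)\n",
       runtimeVer, driverVer, cublasVer, prop.name, prop.major, prop.minor);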