Help speeding up multi-width boxcar convolution

Hi,

I’m trying to implement a boxcar convolution down one dimension of an image, with 32 boxcars of widths 1…32 samples.

The fastest implementation I have so far gives each thread in a 32-thread warp responsibility for one boxcar. Each thread keeps the sum over the last ibc+1 samples (where ibc is the boxcar index), plus the value of the sample that is about to fall out of its window. For each pixel in a column, the threads in the warp update their state and pass that oldest sample across to the next thread with a __shfl_up.
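
To spell out the update (my notation, not in the code below): thread ibc keeps the running sum S of the last w = ibc+1 samples with the usual sliding-window recurrence

	S(t) = S(t-1) + x(t) - x(t-w)

where the subtracted sample x(t-w) is exactly the value that arrives via the shuffle, so no input sample is ever read twice.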

I’m surprised that I only get 0.68 Gflops and 1.40 GByte/sec on my GeForce GT 750M. Am I doing something wrong?

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <assert.h>

const int N = 1024;
const int NBOX = 32;

#ifdef __cplusplus
extern "C"
#endif

__host__  inline void gpuAssert(cudaError_t code, const char *file, int line)
{
	if (code != cudaSuccess)
	{
		fprintf(stderr,"GPUassert: %s %s:%d\n", cudaGetErrorString(code), file, line);
		assert(code == cudaSuccess);
		exit(code);
	}
}

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

cudaEvent_t m_start;
cudaEvent_t m_stop;

void tic() {
	gpuErrchk(cudaEventRecord(m_start));
	gpuErrchk(cudaEventSynchronize(m_start));
}

float toc() {
	gpuErrchk(cudaEventRecord(m_stop));
	gpuErrchk(cudaEventSynchronize(m_stop));
	float ms;
	gpuErrchk(cudaEventElapsedTime(&ms, m_start, m_stop));
	return ms;
}

__global__ void boxcar_do_kernel (
		const  float* __restrict__ indata,
		float* __restrict__ outdata)
{

	int ibc = threadIdx.x; // boxcar index. ibc=0 is 1 sample long, ibc=1 is 2 samples long, etc

	float state = 0; // Running sum of the last ibc+1 samples - ideally we'd initialise this with a prefix sum, but that's not important now
	float vlast = 0; // value of the sample ibc+1 samples ago, i.e. the one about to leave the window
	int icol = threadIdx.y + blockDim.y*blockIdx.x; // which column this row of threads handles
	const float* __restrict__ iptr = &indata[N*icol]; // start of this column's input
	// Each pixel produces NBOX outputs, so the output column stride is N*NBOX
	float* __restrict__ optr = &outdata[(size_t)N*NBOX*icol + ibc];

	// for each pixel in the column
	for(int t = 0; t < N; ++t) {
		// All threads in warp access the same address location - LDU instruction?
		float vin = *iptr;
		iptr++; // increment for next time

		// Add the current sample and subtract the 'previous' one - the only floating-point work in the loop
		state += vin - vlast;

		// shift previous values one thread to the right. leaves vlast for ibc=0 unchanged.
		vlast = __shfl_up(vlast, 1, NBOX);

		// set vlast to vin for the 1-sample boxcar (i.e. ibc=0)
		if (ibc == 0) {
			vlast = vin;
		}

		// write state into output
		if (outdata != NULL) {
			*optr = state/(sqrtf((float) (ibc + 1))); // Normalise by sqrt(width) so the noise level is the same for every boxcar
			// increment output pointer
			optr += NBOX;
		}

	}

}

int main(int argc, char* argv[])
{
	// Allocate input and output arrays
	float* vin;
	float* vout;
	gpuErrchk(cudaMalloc(&vin, N*N*sizeof(float)));
	gpuErrchk(cudaMalloc(&vout, N*N*NBOX*sizeof(float)));
	gpuErrchk(cudaMemset(vin, 0, N*N*sizeof(float))); // avoid reading uninitialised memory
	gpuErrchk(cudaEventCreate(&m_start));
	gpuErrchk(cudaEventCreate(&m_stop));

	// Print GPU name
	cudaDeviceProp prop;
	gpuErrchk(cudaGetDeviceProperties(&prop, 0));
	printf("Using GPU %s\n", prop.name);

	int niter = 16;

	// Set up grid and block sizes. Each block handles all NBOX boxcars for ncols_per_block columns
	int ncols_per_block = 8;
	int grid_size = N/ncols_per_block; // one block per group of columns
	dim3 block_size(NBOX, ncols_per_block);

	// Calculate statistics
	float nops = 2.0f*N*N*NBOX*niter; // 2 flops (one add, one subtract) per boxcar per pixel
	float nbytes = sizeof(float)*N*N*(NBOX+1)*niter; // 1 read and NBOX writes per pixel
	tic();
	for(int ii = 0; ii < niter; ++ii) {
		boxcar_do_kernel<<<grid_size, block_size>>>(vin, vout);
	}
	float time_sec = toc() / 1e3;

	printf("Ran %d iterations WITH output in %f seconds = %0.2f Gflops and %0.2f GByte/sec\n", niter, time_sec, nops/time_sec/1e9, nbytes/time_sec/1e9);

//	Let's see what happens if we don't write to gmem
	nbytes = sizeof(float)*N*N*niter; // 1 read per pixel
	tic();
	for(int ii = 0; ii < niter; ++ii) {
		boxcar_do_kernel<<<grid_size, block_size>>>(vin, NULL);
	}
	time_sec = toc() / 1e3;

	printf("Ran %d iterations WITHOUT output in %f seconds = %0.2f Gflops and %0.2f GByte/sec\n", niter, time_sec, nops/time_sec/1e9, nbytes/time_sec/1e9);

}
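
Side note: I know the launches themselves are never explicitly checked; the event synchronisation in toc() should surface any errors, but if you reuse this code, a minimal check after each launch loop would be something like:

	gpuErrchk(cudaGetLastError());      // catch bad launch configurations
	gpuErrchk(cudaDeviceSynchronize()); // catch errors raised during kernel execution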

Your code outputs two sets of numbers. Are your reported values for the first or second output?

Is this on Windows or Linux? If on Windows, are you building a debug project or a release project?
If on Linux, what is the compile command line?

Hi txbob! Thanks for replying.

The quoted numbers were for the first output.

Here’s the entire output:

Using GPU GeForce GT 750M
Ran 16 iterations WITH output in 1.309129 seconds = 0.82 Gflops and 1.69 GByte/sec
Ran 16 iterations WITHOUT output in 0.230930 seconds = 4.65 Gflops and 0.29 GByte/sec

On a MacBook Pro with OS X 10.9.5. Compile command line was:

make all 
Building file: ../tboxcar.cu
Invoking: NVCC Compiler
/Developer/NVIDIA/CUDA-7.5/bin/nvcc -G -g -O0 -gencode arch=compute_30,code=sm_30  -odir "." -M -o "tboxcar.d" "../tboxcar.cu"
/Developer/NVIDIA/CUDA-7.5/bin/nvcc -G -g -O0 --compile --relocatable-device-code=false -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30  -x cu -o  "tboxcar.o" "../tboxcar.cu"
Finished building: ../tboxcar.cu
 
Building target: tboxcar
Invoking: NVCC Linker
/Developer/NVIDIA/CUDA-7.5/bin/nvcc --cudart static --relocatable-device-code=false -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30 -link -o  "tboxcar"  ./tboxcar.o   
Finished building target: tboxcar

Compiling with ‘-O3’ makes negligible difference:
Using GPU GeForce GT 750M
Ran 16 iterations WITH output in 1.277818 seconds = 0.84 Gflops and 1.73 GByte/sec
Ran 16 iterations WITHOUT output in 0.227726 seconds = 4.72 Gflops and 0.29 GByte/sec

That -G will slow your device code down a lot. You should never use that switch when evaluating CUDA code for performance. Remove it and compile with -O3 (make those changes on both command lines where they appear).

Well, that’s better, thanks txbob! Here’s the result with -O3 and no -G

Using GPU GeForce GT 750M
Ran 16 iterations WITH output in 0.291004 seconds = 3.69 Gflops and 7.61 GByte/sec
Ran 16 iterations WITHOUT output in 0.056981 seconds = 18.84 Gflops and 1.18 GByte/sec

Still, those numbers feel a little low. According to https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units, the GT 750M can do about 750 Gflops with a memory bandwidth of at least 32 GByte/sec, and I’m quite a way below both.

For reference: Here’s the compile (I’m using nsight)

make all 
Building file: ../tboxcar.cu
Invoking: NVCC Compiler
/Developer/NVIDIA/CUDA-7.5/bin/nvcc -O3 -gencode arch=compute_30,code=sm_30  -odir "." -M -o "tboxcar.d" "../tboxcar.cu"
/Developer/NVIDIA/CUDA-7.5/bin/nvcc -O3 --compile --relocatable-device-code=false -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30  -x cu -o  "tboxcar.o" "../tboxcar.cu"
Finished building: ../tboxcar.cu
 
Building target: tboxcar
Invoking: NVCC Linker
/Developer/NVIDIA/CUDA-7.5/bin/nvcc --cudart static --relocatable-device-code=false -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30 -link -o  "tboxcar"  ./tboxcar.o   
warning: no debug symbols in executable (-arch x86_64)
Finished building target: tboxcar

Your code is not the kind of workload that could come anywhere near the flops peak.

A better estimate of achievable memory bandwidth is that given by the bandwidthTest CUDA sample code, for the Device-to-Device reported number. What is the output for that on your machine?

Ah rats, I thought 64 flops per 4 bytes read (32 boxcars × 2 flops each, in the non-writing case) was enough arithmetic intensity to be interesting. Clearly not.

But I’m still nowhere near the device-to-device bandwidth:

Here’s the result of bandwidthTest:

bandwidthTest$ ../../bin/x86_64/darwin/release/bandwidthTest 
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GT 750M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6293.2

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			6375.0

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			39598.4

Result = PASS

The device-to-device bandwidth is reported as 39.5984 GB/sec by bandwidthTest, based on the output you showed above. That seems about right. What were you expecting?

Hi njuffa,

Perhaps it’s unreasonable, but I was hoping my code (tboxcar, see top) would run at around something like 20 GB/sec, rather than 1.7 GB/sec.

It seems I misunderstood your comment, and your question is: “Why is the effective bandwidth of my code so much lower than the upper limit established by bandwidthTest?”

I don’t have the faintest idea what a “boxcar convolution” is, and how one would go about implementing it efficiently. If you haven’t done so yet, I would suggest (a) reading the Best Practices Guide (b) becoming familiar with the CUDA profiler. That will likely point you in the right direction.
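
For example, with the command-line profiler that ships with CUDA 7.5, something like

	nvprof --metrics gld_efficiency,gst_efficiency,dram_read_throughput,dram_write_throughput ./tboxcar

would show both how efficient the kernel’s memory transactions are and how much DRAM bandwidth it actually achieves (metric names as of that toolkit; ‘nvprof --query-metrics’ lists what your installation supports).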

Each thread is doing 1024 reads from the same column of floats as every other thread in its warp. These will be cached in L2 at least, but you might try loading them once into shared memory and reading them from there instead; a rough sketch follows below.
Not sure if this is your bottleneck, but it did leap out when I did a quick code scan.
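
Something along these lines, perhaps (an untested sketch; TILE and NYDIM are values I picked out of the air, and NYDIM must match the blockDim.y you launch with):

const int NYDIM = 8;   // columns per block; must equal blockDim.y at launch
const int TILE  = 128; // samples staged per column at a time; must divide N

__global__ void boxcar_smem_kernel(
		const float* __restrict__ indata,
		float* __restrict__ outdata)
{
	// One tile of each of this block's columns lives in shared memory
	__shared__ float tile[NYDIM][TILE];

	int ibc  = threadIdx.x;                         // boxcar index, as before
	int icol = threadIdx.y + blockDim.y*blockIdx.x; // column index
	const float* iptr = &indata[N*icol];
	float* optr = (outdata != NULL) ? &outdata[(size_t)N*NBOX*icol + ibc] : NULL;

	float state = 0;
	float vlast = 0;

	for (int t0 = 0; t0 < N; t0 += TILE) {
		// Cooperatively stage the next TILE samples of this column
		for (int i = threadIdx.x; i < TILE; i += blockDim.x)
			tile[threadIdx.y][i] = iptr[t0 + i];
		__syncthreads();

		for (int t = 0; t < TILE; ++t) {
			// Whole warp reads one smem address: a conflict-free broadcast
			float vin = tile[threadIdx.y][t];
			state += vin - vlast;
			vlast = __shfl_up(vlast, 1, NBOX);
			if (ibc == 0)
				vlast = vin;
			if (optr != NULL) {
				*optr = state/(sqrtf((float) (ibc + 1)));
				optr += NBOX;
			}
		}
		__syncthreads(); // don't restage the tile while warps are still reading it
	}
}

Same launch configuration as your original (dim3 block(NBOX, NYDIM) and N/NYDIM blocks); whether it actually beats the L2 cache on a GT 750M is something you’d have to measure.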