NPP libray fucntions call speed issue

I tried to use function nppiMean_stdDev_32f_c1R() and nppsStdDev_32f() and found they are much slower than we expected.

In visual studio 2010, I used Nsight->start performance analysis and found each nppsStdDev_32f() call will lead to multiply CUDA Runtime API calls such as CudaGetDeviceProperties(), cudaGetDevice() and

CudaGetDeviceCount() and so on. These run time API calls take most of process time.

I tried the Nvidia sample project histEqualizationNPP and found the similar issue.

I have two GPUs in the PC. One is GTX Titan black and the other is NVS310. My Cuda version is 6.5.

Is anyone can help to figure out a way to speed up nppsStdDev_32f() function call.

Thanks,

The whole test code is here
int TestnppStdDev_32f()
{
int devID = 0;
cudaSetDevice(devID);
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, devID);

// cpu buffer
int ncols=2048;
int nrows=2048;
int npix = ncols*nrows;
float* buf = new float[npix];
for(int i=0; i<nrows; i++)
	for(int j=0; j<ncols; j++)
		buf[i*ncols+j] = (float)i;

// work buffer
int iWorkBufSize;
NppStatus status = nppsStdDevGetBufferSize_32f(npix, &iWorkBufSize);
if(status != 0)
{
	delete [] buf;
	return -1;
}

// gpu buffer
float *_buf;
cudaError_t error = cudaMalloc((void **) &_buf, npix*sizeof(float)+iWorkBufSize); 
if (error != cudaSuccess) 
{
	delete [] buf;
	return -2;
}
unsigned char* _workBuf = (unsigned char*)(_buf+npix);

float* _stdDev;
cudaMalloc((void **) &_stdDev, sizeof(float)); 
if (error != cudaSuccess)
{	
	cudaFree(_buf);	
	delete [] buf;
	return -3;
}

error = cudaMemcpy(_buf, buf, npix*sizeof(float), cudaMemcpyHostToDevice);
if (error != cudaSuccess)
{	
	cudaFree(_buf);	
	cudaFree(_stdDev);
	delete [] buf;
	return -4;
}

int loops = 1000;
float stdDev;
DWORD start = GetTickCount();
for(int i=0; i<loops; i++)
	status = nppsStdDev_32f (_buf, npix, _stdDev, _workBuf);
DWORD end = GetTickCount();

cudaMemcpy(&stdDev, _stdDev, sizeof(float), cudaMemcpyDeviceToHost);

cudaFree(_buf);
cudaFree(_stdDev);
delete [] buf;

if(status!=0)
	return -5;

return end-start;

}

Where nppsStdDev_32f() called 1000 times, CudaGetDeviceProperties() called 6001 times.

To clarify,
Where nppsStdDev_32f() called 1000 times, CudaGetDeviceProperties() called 6001 times implicitly (not by my code directly).

One call to cudaGetDeviceProperties() per call to nppsStdDev_32f() seems reasonable and necessary to determine the most appropriate code path. Six such calls seems excessive, you may want to file an RFE via the bug reporting form linked from the registered developer website to minimize the number of such calls.

That said, calls to cudaGetDeviceProperties() should be extremely fast, as they should simply reference block of data filled in at CUDA context creation time. If you examine the output of the visual profiler, does it really show that the calls to cudaGetDeviceProperties() are comprising the critical path through the code? What kind of GPU and system platform are you using?

I observed something similar to what is being reported based on the posted code, and using nvprof on a linux system.

I think filing an RFE is a good suggestion.<<

Based on my profiler data, many CUDA API calls are less than a microsecond or at most a few microseconds. cudaGetDeviceProperties seems to take ~250us And it is being called quite a few times. It’s not clear to me why this couldn’t be called once per device at library initialization time, and dispense with the per-call usage altogether, but I’m not a library author.

I’ve attached an excerpt my nvprof data, for two loops through the code (instead of 1000). I would lump everything up to the first cudaLaunch call into “library initialization” and ignore all that. From that point forward, I think the timings reported are instructive. Again, this represents two loops through the test:

715.46ms  263.39us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
715.73ms  55.951ms                    -               -         -         -         -         -           -                -         -         -  cudaMalloc
771.68ms  111.98us                    -               -         -         -         -         -           -                -         -         -  cudaMalloc
771.80ms  7.5160ms                    -               -         -         -         -         -           -                -         -         -  cudaMemcpy
772.32ms  7.1545ms                    -               -         -         -         -  16.777MB  2.3450GB/s  Quadro 5000 (0)         1         7  [CUDA memcpy HtoD]
779.34ms     592ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
779.34ms  2.7590us                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
779.34ms  304.57us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
779.65ms     236ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
779.65ms     642ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
779.65ms  268.27us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
779.98ms     514ns                    -               -         -         -         -         -           -                -         -         -  cuDriverGetVersion
779.98ms     734ns                    -               -         -         -         -         -           -                -         -         -  cuInit
780.01ms     429ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetCount
780.01ms     312ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGet
780.01ms  31.458us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetName
780.04ms  36.034us                    -               -         -         -         -         -           -                -         -         -  cuDeviceTotalMem
780.08ms     462ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.08ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.08ms     257ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.08ms     241ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.08ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.08ms  24.743us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     264ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     269ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     415ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     243ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     266ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.11ms     345ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     246ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     250ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     287ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     313ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     243ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     266ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     287ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.12ms     247ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     256ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     227ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     247ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.13ms  119.11us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.25ms     266ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.25ms     234ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.25ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.25ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.25ms     243ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.25ms     285ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     348ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     269ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     354ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     240ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     235ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.26ms  118.48us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     265ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     247ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     236ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.38ms     316ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGet
780.39ms  28.252us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetName
780.41ms  37.196us                    -               -         -         -         -         -           -                -         -         -  cuDeviceTotalMem
780.45ms     326ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.45ms     248ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.45ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.45ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.45ms     264ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.45ms  26.340us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     295ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     266ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     269ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     254ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.48ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     235ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     227ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     226ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     269ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     226ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     238ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     264ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     226ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     227ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     226ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.49ms     226ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     234ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     244ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     228ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.50ms  131.36us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     279ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     266ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     241ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     235ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     232ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     234ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     269ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     253ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     233ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     245ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     237ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     240ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.64ms  127.11us                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.77ms     261ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.77ms     279ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.77ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.77ms     229ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.77ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.77ms     230ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.78ms     231ns                    -               -         -         -         -         -           -                -         -         -  cuDeviceGetAttribute
780.78ms  5.4230us                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
780.79ms  2.9660us                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
780.79ms  1.0260us                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
780.79ms  56.559ms                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, BasicOperation<float, tReductionOp=0>>>(int, float) [547])
837.35ms  182.86us             (66 1 1)       (256 1 1)        11        0B  1.0240KB         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, BasicOperation<float, tReductionOp=0>>>(int, float) [547]
837.35ms  1.0080us                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
837.35ms     733ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
837.36ms     268ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
837.36ms  8.9480us                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, MeanOperation<float, tReductionOp=0>>>(int, float) [551])
837.37ms     345ns                    -               -         -         -         -         -           -                -         -         -  cudaGetLastError
837.37ms     495ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
837.37ms  1.5950us                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
837.37ms  297.61us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
837.54ms  4.8500us              (1 1 1)       (128 1 1)        11        0B      512B         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, MeanOperation<float, tReductionOp=0>>>(int, float) [551]
837.67ms     236ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
837.67ms     681ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
837.67ms  268.77us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
837.94ms     482ns                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
837.94ms     357ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
837.94ms     281ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
837.95ms  12.138us                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation1<float, float, tReductionOp=0>>>(int, float) [562])
837.96ms     375ns                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
837.96ms  188.45us             (66 1 1)       (256 1 1)        12        0B  1.0240KB         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation1<float, float, tReductionOp=0>>>(int, float) [562]
837.96ms     317ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
837.96ms     243ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
837.96ms  8.2640us                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation2<float, tReductionOp=0>>>(int, float) [566])
837.97ms     285ns                    -               -         -         -         -         -           -                -         -         -  cudaGetLastError
837.97ms     257ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
837.97ms     753ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
837.97ms  267.06us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
838.15ms  5.9460us              (1 1 1)       (128 1 1)        11        0B      512B         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation2<float, tReductionOp=0>>>(int, float) [566]
838.24ms     234ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
838.24ms     631ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
838.24ms  263.24us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
838.50ms     406ns                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
838.51ms     370ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
838.51ms     294ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
838.51ms  9.3740us                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, BasicOperation<float, tReductionOp=0>>>(int, float) [577])
838.52ms     344ns                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
838.52ms  181.46us             (66 1 1)       (256 1 1)        11        0B  1.0240KB         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, BasicOperation<float, tReductionOp=0>>>(int, float) [577]
838.52ms     319ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
838.52ms     244ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
838.52ms  6.5750us                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, MeanOperation<float, tReductionOp=0>>>(int, float) [581])
838.53ms     282ns                    -               -         -         -         -         -           -                -         -         -  cudaGetLastError
838.53ms     253ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
838.53ms     740ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
838.53ms  266.13us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
838.70ms  4.2840us              (1 1 1)       (128 1 1)        11        0B      512B         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, MeanOperation<float, tReductionOp=0>>>(int, float) [581]
838.79ms     240ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceCount
838.79ms     616ns                    -               -         -         -         -         -           -                -         -         -  cudaGetDevice
838.80ms  267.94us                    -               -         -         -         -         -           -                -         -         -  cudaGetDeviceProperties
839.06ms     454ns                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
839.06ms     394ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
839.07ms     280ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
839.07ms  9.8910us                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation1<float, float, tReductionOp=0>>>(int, float) [592])
839.08ms     351ns                    -               -         -         -         -         -           -                -         -         -  cudaConfigureCall
839.08ms  186.20us             (66 1 1)       (256 1 1)        12        0B  1.0240KB         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation1<float, float, tReductionOp=0>>>(int, float) [592]
839.08ms     317ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
839.08ms     244ns                    -               -         -         -         -         -           -                -         -         -  cudaSetupArgument
839.08ms  6.5550us                    -               -         -         -         -         -           -                -         -         -  cudaLaunch (void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation2<float, tReductionOp=0>>>(int, float) [596])
839.09ms     292ns                    -               -         -         -         -         -           -                -         -         -  cudaGetLastError
839.09ms  192.77us                    -               -         -         -         -         -           -                -         -         -  cudaMemcpy
839.27ms  5.4110us              (1 1 1)       (128 1 1)        11        0B      512B         -           -  Quadro 5000 (0)         1         7  void SignalReductionKernel<float, SignalReductionFunctor<float, float, float, StdDevOperation2<float, tReductionOp=0>>>(int, float) [596]
839.27ms  1.8560us                    -               -         -         -         -        4B  2.1552MB/s  Quadro 5000 (0)         1         7  [CUDA memcpy DtoH]
839.28ms  135.97us                    -               -         -         -         -         -           -                -         -         -  cudaFree
839.42ms  90.993us                    -               -         -         -         -         -           -                -         -         -  cudaFree

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

Thanks for generating actual data. I am inclined to say that the relatively high execution time of cudaGetDeviceProperties() is worthy of an RFE as well. Maybe my recollection is wrong, but that API call used to be fast in the past. Conceptually all it has to do is retrieve some information previously determined by the CUDA driver and stored in the CUDART context. Maybe the call has been expanded over the years to include data that needs to be generated dynamically, causing execution time to go up?

I agree that libraries which have a context can and should cache relevant device properties in their own context during library initialization. Maybe NPP does not have a context? I have never used it, it maybe stateless.

I have filed an RFE. (two, actually) One (against npp) to suggest that the number of times that nppsStdDev_32f calls cudaGetDeviceProperties be reduced to at most once (currently appears to be 3 times). Even if npp is stateless, this should be possible I would think. The second (against cuda runtime) to suggest that the execution time of cudaGetDeviceProperties be reduced.

Thanks for the help of txbob and njuffa.

Npp library is stateless

each nppStdDev_32f() calls CudaGetDeviceproperties() 4 times,
each nppiMean_StdDev_32f_C1R() calls CudaGetDeviceproperties() 6 times,
each nppiFilterMedian_16u_C1R() calls CudaGetDeviceproperties() 3 times,

Therefore this probably is a issue across the NPP library.

Hi,

I have a similar issue which I will describe below.

When I call nppiMean_StdDev_8u_C1R on a ROI of 384x284 using CUDA v7.5 the execution time is an order of magnitude greater than when I call the same function with CUDA v4.2.

Looking at the nsight profiler output for “CUDA Launches”, it appears that the newer version (V7.5) launches 9 kernels and the old version (V4.2) launches 6, with the total time for execution of the kernels being around the same for V7.5 and V4.2.

So it does not look like the difference in execution time is down to the kernels. Examining the nsight profiler output for “CUDA Runtime API Calls”, v7.5 calls cudaLaunch 12 times (this includes a call to nppiSet_8u_C1R), however one of the calls takes 1,337,566 micro seconds. Examining the output for v4.2 none of the 9 calls to cudaLaunch exceed 3,988 microseconds.

I am very new to CUDA and may be looking at irrelavent profiler output, but to me it looks like in cudaLaunch is taking too long. Is it likely I am doing something incorrect?

My code listing is below.

int width = 384;
int height = 288;
NppiSize  roiSize = { width, height };

char* dCurrentImg8U;
size_t current_Pitch8U;
cudaMallocPitch((void**)&dCurrentImg8U, &current_Pitch8U, width * sizeof(char), height);

nppiSet_8u_C1R(1, (Npp8u*)dCurrentImg8U, current_Pitch8U, roiSize);

int meanStdDevBufferSize;
Npp8u* dMeanStdDevBuffer;
Npp64f  *dMean;
Npp64f  *dStdDev;

#if defined(NPP_V4_2)
NppStatus BufferStatus = nppiMeanStdDev8uC1RGetBufferHostSize(roiSize, &meanStdDevBufferSize);
#else
NppStatus BufferStatus = nppiMeanStdDevGetBufferHostSize_8u_C1R(roiSize, &meanStdDevBufferSize);
#endif

cudaMalloc((void **)&dMeanStdDevBuffer, meanStdDevBufferSize);
cudaMalloc((void **)(&dMean), sizeof(*dMean));
cudaMalloc((void **)(&dStdDev), sizeof(*dStdDev));

NppStatus Status = nppiMean_StdDev_8u_C1R((Npp8u*)dCurrentImg8U, current_Pitch8U, roiSize,
	dMeanStdDevBuffer, dMean, dStdDev);

Thank you,

James

CUDA libraries have initialization times. The first library call for a given function will often contain this overhead. That “initialization time” is likely what you are seeing in your comment that " cudaLaunch is taking too long"

To get comparative performance without this overhead, you should write benchmark code that does an untimed call as a “warm up” operation, then run the same call again and time it, to observe the actual timing for the function itself, without the overhead.

In that case, I think you’ll observe closer parity between CUDA 4.x and CUDA 7.x for the same function.

The one-time library overhead itself probably cannot be avoided. As these libraries continue to be developed and add features and capabilities, the one time initialization “cost” of using the library tends to increase.

Thanks txbob, I can’t believe I missed that, so v7.5 has a far far greater start initialization time than v4.2. I have profiled again calling the routine multiple times and I am pretty sure my problem is the same after all.

Strangely in my case it looks like it is the increase in duration of a calls to cudaGetDeviceProperties() which is the problem and not the number of calls. That is in v4.2 a single call to nppiMean_StdDev_8u_C1R() results in 13 calls to cudaGetDeviceProperties() with a total duration of only 251μs, however in v7.5 a single call to nppiMean_StdDev_8u_C1R() results in only 8 calls to cudaGetDeviceProperties() but the total duration of these is 10,190μs.

I think this explains why I get the following when timing 4 calls to nppiMean_StdDev_8u_C1R() including the warmup

Call #	Duration v4.2 (μs)	Duration v7.5 (μs)
1	1317.18	                11588.6
2	1017.82	                8627.51
3	905.346	                11030.1
4	1332.15	                9127.44

Is it possible that the nppi routines are sound, it is just the duration of calling cudaGetDeviceProperties() has increased dramatically between v4.2 and v7.5?

James.

I guess I’m not sure what you’re asking exactly or what you are profiling exactly.

If those durations are for a call into an npp library function, then I would say you’re not seeing an overhead effect. I can’t explain it really. If you want to provide a complete test code that I can copy and paste, and run without having to modify it, I will run it and see what I get.

The durations are for calls to nppiMean_StdDev_8u_C1R(), with the code above, when I repeat the call 4 times. It shows the order of magnitude differnce I get between Cuda API v4.2 and v7.5 on the same machine.

The post was to clarify that you were correct and that the difference appears to be due to increase the duration of the calls to cudaGetDeviceProperties() in going from v4.2 to v7.5.

The previous post imply that the problem with npp is due to the number of calls to cudaGetDeviceProperties() , however when I compare it to v4.2 the number of calls has decreased. It appears to be the duration of the calls which has dramatically increased, in the newer API version, however the libaries, npp at least, have not adjusted for this.

I ran a quick check, calling cudaGetDeviceProperties() 4 times. The maximum duration of the calls for v4.2 was 42μs and the maximum duration for v7.5 was 1820μs. Does anyone else get these results?

Is there a way to check the status of an RFE?

Thanks again.

Thanks for all the comments here. An effort was made to address the issues reported earlier in the thread concerning (for example) multiple calls to cudaGetDeviceProperties within NPP library functions. I expect there to be significant improvements in this regard in the next major CUDA release, affecting multiple NPP functions.