NPP function nppiCrossCorrFull_NormLevel_8u32f_C1R too slow???

The NPP function nppiCrossCorrFull_NormLevel_8u32f_C1R is very slow on my computer. With an image size of 300x300 it needs more than 15 s. What's wrong?

???

What GPU are you using? Are you using the latest CUDA version (either 6.5 or the 7.0 release candidate)? Do other NPP functions run at the expected speed? How did you determine that the run time is 15 seconds? Can you post buildable and runnable code that reproduces the issue?

My GPU is a Quadro K620, CUDA version 6.5.

void mytestCrossCorr(int imgsize, int roisize, Npp8u *src_image1, Npp8u *src_image2)
{
    Npp8u *pSrc, *pTpl;
    Npp32f *pDst, *nDst;
    int nSrcStep, nTplStep, nDstStep;
    NppiSize oSrcRoiSize, oTplRoiSize;
    Npp8u *pDeviceBuffer;
    int nBufferSize;

    int x1 = imgsize;
    int y1 = imgsize;
    int x2 = roisize;
    int y2 = roisize;
    nSrcStep = x1;                 // step in bytes (one byte per 8u pixel)
    nTplStep = x1;
    nDstStep = x2 + x2 - 1;        // result width for "full" correlation
    oSrcRoiSize.height = x2;
    oSrcRoiSize.width  = y2;
    oTplRoiSize.height = x2;
    oTplRoiSize.width  = y2;

    cudaMalloc((void**)&pSrc, sizeof(Npp8u)*x1*y1);
    cudaMalloc((void**)&pTpl, sizeof(Npp8u)*x1*y1);
    cudaMalloc((void**)&pDst, sizeof(Npp32f)*(x2+x2-1)*(y2+y2-1));
    cudaMemset(pDst, 0, sizeof(Npp32f)*(x2+x2-1)*(y2+y2-1));

    // query and allocate the scratch buffer required by the NPP call
    nppiFullNormLevelGetBufferHostSize_8u32f_C1R(oSrcRoiSize, &nBufferSize);
    cudaMalloc((void**)&pDeviceBuffer, nBufferSize);

    cudaMemcpy(pSrc, src_image1, sizeof(Npp8u)*x1*y1, cudaMemcpyHostToDevice);
    cudaMemcpy(pTpl, src_image2, sizeof(Npp8u)*x1*y1, cudaMemcpyHostToDevice);

    DWORD tt0 = ::GetTickCount();

    NppStatus rr = nppiCrossCorrFull_NormLevel_8u32f_C1R(pSrc, nSrcStep, oSrcRoiSize, pTpl, nTplStep, oTplRoiSize, pDst, nDstStep*sizeof(Npp32f), pDeviceBuffer);

    //cudaThreadSynchronize();
    nDst = new Npp32f[(x2+x2-1)*(y2+y2-1)];
    cudaError_t tttt = cudaMemcpy(nDst, pDst, sizeof(Npp32f)*(x2+x2-1)*(y2+y2-1), cudaMemcpyDeviceToHost);

    cudaFree(pSrc);
    cudaFree(pTpl);
    cudaFree(pDst);
    cudaFree(pDeviceBuffer);
    DWORD tt1 = ::GetTickCount();

    delete [] nDst;
}

This is my code; the run time is (tt1 - tt0).

thanks.

The Quadro K620 is a pretty low-end device, with memory bandwidth of 28.8 GB/sec and 812 single-precision GFLOPS.

Unfortunately the above code does not appear to be complete, so I cannot run it. The timed region of your code encompasses much more than just the NPP call. How does the timing change if you modify the code as follows to isolate the NPP API call:

cudaDeviceSynchronize();   // make sure all previous GPU work finished
DWORD tt0=::GetTickCount();
NppStatus rr=nppiCrossCorrFull_NormLevel_8u32f_C1R(pSrc, nSrcStep, oSrcRoiSize, pTpl, nTplStep, oTplRoiSize, pDst, nDstStep*sizeof(Npp32f), pDeviceBuffer);
cudaDeviceSynchronize();   // make sure NPP kernel completed
DWORD tt1=::GetTickCount();
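
It is also worth checking both the NPP return status and any pending CUDA runtime error right after the call; a minimal sketch of what I mean:

if (rr != NPP_SUCCESS) {
    printf("NPP returned error status %d\n", (int)rr);
}
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}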

What is the value of “rr” after the call to nppiCrossCorrFull_NormLevel_8u32f_C1R()? GetTickCount() is a low resolution timer that reports the time in milliseconds but that typically only has 10 millisecond resolution. What value (tt1-tt0) do you see? I would suggest using a high-resolution timer instead of GetTickCount(). Below is an alternative timer that reports the time in seconds, with slightly better than microsecond resolution.

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
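
With that timer, the measurement around the NPP call would look something like this (a sketch based on the code you posted):

cudaDeviceSynchronize();   // make sure all previous GPU work finished
double start = second();
NppStatus rr = nppiCrossCorrFull_NormLevel_8u32f_C1R(pSrc, nSrcStep, oSrcRoiSize, pTpl, nTplStep, oTplRoiSize, pDst, nDstStep*sizeof(Npp32f), pDeviceBuffer);
cudaDeviceSynchronize();   // make sure the NPP kernel completed
double stop = second();
printf("NPP call: %.6f seconds, status = %d\n", stop - start, (int)rr);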

Thanks for your reply. I have modified the code like this:
cudaDeviceSynchronize();
DWORD tt0=::GetTickCount();
nppiCrossCorrFull_NormLevel_8u32f_C1R(pSrc, nSrcStep, oSrcRoiSize, pTpl, nTplStep, oTplRoiSize, pDst, nDstStep*sizeof(Npp32f), pDeviceBuffer);
cudaThreadSynchronize();
DWORD tt1=::GetTickCount();

cost time (tt1 - tt0) = 15725 ms,
but the Intel IPP function needs only 62 ms.

What's wrong? Is my GPU too low-end?
My Skype is daifugui, can I chat with you? Thanks.

15 seconds definitely seems extraordinarily long for a 300x300 image, even on a low-end GPU like yours.
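
As a rough sanity check, assuming a naive direct (non-FFT) implementation: full normalized cross-correlation of a 300x300 template over a 300x300 image produces a 599x599 result, i.e. about 599 * 599 * 300 * 300 ≈ 3.2e10 multiply-adds, or roughly 6.5e10 floating-point operations. At the K620's 812 GFLOPS peak that works out to on the order of 0.1 seconds, and even at 10% efficiency to around a second, so 15 seconds cannot be explained by raw device throughput alone.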

Since you only posted partial code, I cannot run it, and therefore cannot tell whether the problem is in your code or in NPP. If you are quite sure that the problem is not in your code (e.g. incorrect data passed to the NPP function), consider filing a bug using the bug reporting form linked from the CUDA registered developer website. If you decide to file a bug, please remember to attach self-contained, buildable, runnable code that reproduces the issue.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <npp.h> // CUDA NPP definitions
#include "nppi.h"
#include "npps.h"
#include "nppversion.h"
#include "nppcore.h"
#include "nppdefs.h"

#include <time.h>
#include <stdio.h>
#include <windows.h>

#include <math.h>

Npp8u *datagenerate(int row_size, int col_size, int max_pix_value)
{
    int i;
    Npp8u *src_image;
    src_image = new Npp8u[row_size * col_size];
    for (i = 0; i < row_size * col_size; i++)
    {
        src_image[i] = rand() % max_pix_value;
    }
    return src_image;
}
cudaError_t mytestCrossCorr(int imgsize, int roisize, Npp8u *src_image1, Npp8u *src_image2)
{
    Npp8u *pSrc, *pTpl;
    Npp32f *pDst, *nDst;
    int nSrcStep, nTplStep, nDstStep;
    NppiSize oSrcRoiSize, oTplRoiSize;
    Npp8u *pDeviceBuffer;
    int nBufferSize;

    cudaError_t tt = cudaSuccess;

    int x1 = imgsize;
    int y1 = imgsize;
    int x2 = roisize;
    int y2 = roisize;
    nSrcStep = x1;
    nTplStep = x1;
    nDstStep = x2 + x2 - 1;
    oSrcRoiSize.height = x2;
    oSrcRoiSize.width  = y2;
    oTplRoiSize.height = x2;
    oTplRoiSize.width  = y2;

    cudaMalloc((void**)&pSrc, sizeof(Npp8u)*x1*y1);
    cudaMalloc((void**)&pTpl, sizeof(Npp8u)*x1*y1);
    cudaMalloc((void**)&pDst, sizeof(Npp32f)*(x2+x2-1)*(y2+y2-1));
    cudaMemset(pDst, 0, sizeof(Npp32f)*(x2+x2-1)*(y2+y2-1));

    nppiFullNormLevelGetBufferHostSize_8u32f_C1R(oSrcRoiSize, &nBufferSize);
    cudaMalloc((void**)&pDeviceBuffer, nBufferSize);

    cudaMemcpy(pSrc, src_image1, sizeof(Npp8u)*x1*y1, cudaMemcpyHostToDevice);
    cudaMemcpy(pTpl, src_image2, sizeof(Npp8u)*x1*y1, cudaMemcpyHostToDevice);

    cudaDeviceSynchronize();   // make sure all previous GPU work finished
    DWORD tt0 = ::GetTickCount();
    nppiCrossCorrFull_NormLevel_8u32f_C1R(pSrc, nSrcStep, oSrcRoiSize, pTpl, nTplStep, oTplRoiSize, pDst, nDstStep*sizeof(Npp32f), pDeviceBuffer);
    cudaThreadSynchronize();   // make sure the NPP kernel completed
    DWORD tt1 = ::GetTickCount();
    nDst = new Npp32f[(x2+x2-1)*(y2+y2-1)];

    cudaError_t tttt = cudaMemcpy(nDst, pDst, sizeof(Npp32f)*(x2+x2-1)*(y2+y2-1), cudaMemcpyDeviceToHost);

    cudaFree(pSrc);
    cudaFree(pTpl);
    cudaFree(pDst);
    cudaFree(pDeviceBuffer);

    printf("cost time:%ld\n", tt1 - tt0);

    delete [] nDst;

    return tt;
}

int main()
{
    cudaError_t cudaStatus;

    // Choose which GPU to run on; change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    int size = 300;

    unsigned char *imga = datagenerate(size, size, 256);
    unsigned char *imgb = datagenerate(size, size, 256);

    mytestCrossCorr(size, size, (Npp8u*)imga, (Npp8u*)imgb);

    delete [] imga;
    delete [] imgb;

    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    return 0;
}

This is my whole code. Can you test it on your computer? Thanks.

I can confirm that the call to nppiCrossCorrFull_NormLevel_8u32f_C1R() takes many seconds on a low-end GPU. This is with CUDA 6.5.

I am not really familiar with NPP. Are you sure that the correct data is passed to the call? I know that, in general, a high percentage of erroneous NPP usage is due to incorrect ROI specification. Based on my limited understanding of this function's specification, I do not see any obvious errors here.
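
For reference, my reading of the NPP documentation is that for the "Full" variant the result has dimensions (Ws + Wt - 1) x (Hs + Ht - 1) and the destination step is specified in bytes. A quick consistency check against your sizes (a sketch, assuming that reading of the documentation is correct):

NppiSize oSrcRoiSize = { 300, 300 };  // width, height
NppiSize oTplRoiSize = { 300, 300 };
int dstWidth  = oSrcRoiSize.width  + oTplRoiSize.width  - 1;   // 599
int dstHeight = oSrcRoiSize.height + oTplRoiSize.height - 1;   // 599
int dstStepBytes = dstWidth * (int)sizeof(Npp32f);             // 2396 bytes

That matches what your code passes (nDstStep*sizeof(Npp32f) with nDstStep = 599), so the ROI setup looks consistent to me.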

As I stated previously, if you are quite sure that you are using the function correctly, consider filing a bug using the bug reporting form linked from the CUDA registered developer website. Please attach the self-contained repro case you posted above to the report.