CUDA timing is inaccurate (Sobel example) Different ways to measure time
I have tried to match the speed of Sobel filter example from CUDA SDK and came to a conclusion that those 40,000 – 70,000 fps on a typical Windows Vista computer with say GeForce 9800 GT is an inaccurate measurement due to CUDA code. Please prove me wrong. The way code measures time is a complete phony!!!

The code occasionally runs Sobel in a display() procedure of open GL and then calls start and stop timer functions. This will give wrong timing since any __global__ Kernel() exits code asynchronously to pass the control back to the host without even running a substantial part of its code. Thus measuring time from Kernel start to its passing control to the host has nothing to do with the actual execution time on the device.

In fact requesting a full completion of code on the GPU by using function cudaThreadSynchronize() will result in fps = 60 or so in Release mode – 3 orders of magnitude slower then CUDA displays. Now, this speed is also inaccurate since it underestimates the effect of thread sliding. But the difference is tremendous.

The right way to measure time is probably to launch 1000 Kernels, then call cudaThreadSynchronize() and then average the time. Using a video buffer as output may also impose some restrictions so it is better to dump the result in regular global memory.
I have tried to match the speed of Sobel filter example from CUDA SDK and came to a conclusion that those 40,000 – 70,000 fps on a typical Windows Vista computer with say GeForce 9800 GT is an inaccurate measurement due to CUDA code. Please prove me wrong. The way code measures time is a complete phony!!!



The code occasionally runs Sobel in a display() procedure of open GL and then calls start and stop timer functions. This will give wrong timing since any __global__ Kernel() exits code asynchronously to pass the control back to the host without even running a substantial part of its code. Thus measuring time from Kernel start to its passing control to the host has nothing to do with the actual execution time on the device.



In fact requesting a full completion of code on the GPU by using function cudaThreadSynchronize() will result in fps = 60 or so in Release mode – 3 orders of magnitude slower then CUDA displays. Now, this speed is also inaccurate since it underestimates the effect of thread sliding. But the difference is tremendous.



The right way to measure time is probably to launch 1000 Kernels, then call cudaThreadSynchronize() and then average the time. Using a video buffer as output may also impose some restrictions so it is better to dump the result in regular global memory.

#1
Posted 06/29/2009 03:51 AM   
First, I'm moving this because it's in the wrong forum.

Second, you're wrong. Kernel calls are asynchronous, but cudaUnbindTexture isn't. So, in the SobelFilter function in SobelFilter_kernels.cu (I'm looking at the 2.1 SDK because that's what I happen to have installed on my home machine, but I doubt it's changed in 2.2), unbind will wait for the kernel to complete.

Just for laughs, I changed the code in the way you suggested:
[code] CUT_SAFE_CALL(cutStartTimer(timer));
sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );
cudaThreadSynchronize();
CUT_SAFE_CALL(cutStopTimer(timer));[/code]
and there's no difference in performance.
First, I'm moving this because it's in the wrong forum.



Second, you're wrong. Kernel calls are asynchronous, but cudaUnbindTexture isn't. So, in the SobelFilter function in SobelFilter_kernels.cu (I'm looking at the 2.1 SDK because that's what I happen to have installed on my home machine, but I doubt it's changed in 2.2), unbind will wait for the kernel to complete.



Just for laughs, I changed the code in the way you suggested:

CUT_SAFE_CALL(cutStartTimer(timer));  

sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );

cudaThreadSynchronize();

CUT_SAFE_CALL(cutStopTimer(timer));


and there's no difference in performance.

#2
Posted 06/29/2009 07:02 AM   
[quote name='tmurray' post='559105' date='Jun 29 2009, 12:02 AM']First, I'm moving this because it's in the wrong forum.

Second, you're wrong. Kernel calls are asynchronous, but cudaUnbindTexture isn't. So, in the SobelFilter function in SobelFilter_kernels.cu (I'm looking at the 2.1 SDK because that's what I happen to have installed on my home machine, but I doubt it's changed in 2.2), unbind will wait for the kernel to complete.

Just for laughs, I changed the code in the way you suggested:
[code] CUT_SAFE_CALL(cutStartTimer(timer));
sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );
cudaThreadSynchronize();
CUT_SAFE_CALL(cutStopTimer(timer));[/code]
and there's no difference in performance.[/quote]

----------------
I'm sorry, where are you moving this post and who you are? I thought this forum is not "just for laughs"
but for a serious discussion. When I said there is a huge difference in performance I meant it. I am running
this code right now in the release mode and the speed 60 fps., not 30, 000 How could you explain that? What
is your speed, graphical card, and a computer?
[quote name='tmurray' post='559105' date='Jun 29 2009, 12:02 AM']First, I'm moving this because it's in the wrong forum.



Second, you're wrong. Kernel calls are asynchronous, but cudaUnbindTexture isn't. So, in the SobelFilter function in SobelFilter_kernels.cu (I'm looking at the 2.1 SDK because that's what I happen to have installed on my home machine, but I doubt it's changed in 2.2), unbind will wait for the kernel to complete.



Just for laughs, I changed the code in the way you suggested:

CUT_SAFE_CALL(cutStartTimer(timer));  

sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );

cudaThreadSynchronize();

CUT_SAFE_CALL(cutStopTimer(timer));


and there's no difference in performance.



----------------

I'm sorry, where are you moving this post and who you are? I thought this forum is not "just for laughs"

but for a serious discussion. When I said there is a huge difference in performance I meant it. I am running

this code right now in the release mode and the speed 60 fps., not 30, 000 How could you explain that? What

is your speed, graphical card, and a computer?

#3
Posted 06/29/2009 05:37 PM   
I moved it from the CUDA contests forum to the CUDA development forum, which is where it should be. I'm an NVIDIA employee and I try to make sure everything stays nicely organized here (as well as answering questions).

Without seeing your code, you're probably timing everything and hitting vsync on the GL, not any CUDA related timing issue. I tested that on Vista 64, 185.85, and a GTX 280. Doing what I outlined above, there was no difference in performance. I tested in debug mode, so it's entirely possible that the cutil* functions were adding cudaThreadSynchronize() calls for the sake of error checking in the first place.
I moved it from the CUDA contests forum to the CUDA development forum, which is where it should be. I'm an NVIDIA employee and I try to make sure everything stays nicely organized here (as well as answering questions).



Without seeing your code, you're probably timing everything and hitting vsync on the GL, not any CUDA related timing issue. I tested that on Vista 64, 185.85, and a GTX 280. Doing what I outlined above, there was no difference in performance. I tested in debug mode, so it's entirely possible that the cutil* functions were adding cudaThreadSynchronize() calls for the sake of error checking in the first place.

#4
Posted 06/29/2009 05:44 PM   
[quote name='Voblin' post='559366' date='Jun 29 2009, 07:37 PM']----------------
I'm sorry, where are you moving this post and who you are? I thought this forum is not "just for laughs"
but for a serious discussion. When I said there is a huge difference in performance I meant it. I am running
this code right now in the release mode and the speed 60 fps., not 30, 000 How could you explain that? What
is your speed, graphical card, and a computer?[/quote]

Maybe you just forgot to disable Vsync. With Vsync enabled, the FPS can never exceed the refresh rate of your screen.

N.
[quote name='Voblin' post='559366' date='Jun 29 2009, 07:37 PM']----------------

I'm sorry, where are you moving this post and who you are? I thought this forum is not "just for laughs"

but for a serious discussion. When I said there is a huge difference in performance I meant it. I am running

this code right now in the release mode and the speed 60 fps., not 30, 000 How could you explain that? What

is your speed, graphical card, and a computer?



Maybe you just forgot to disable Vsync. With Vsync enabled, the FPS can never exceed the refresh rate of your screen.



N.

#5
Posted 06/29/2009 05:45 PM   
[quote name='tmurray' post='559376' date='Jun 29 2009, 10:44 AM']I moved it from the CUDA contests forum to the CUDA development forum, which is where it should be. I'm an NVIDIA employee and I try to make sure everything stays nicely organized here (as well as answering questions).

Without seeing your code, you're probably timing everything and hitting vsync on the GL, not any CUDA related timing issue. I tested that on Vista 64, 185.85, and a GTX 280. Doing what I outlined above, there was no difference in performance. I tested in debug mode, so it's entirely possible that the cutil* functions were adding cudaThreadSynchronize() calls for the sake of error checking in the first place.[/quote]
--------------------------------------------------------

Well, go to RELEASE mode to see the difference. Debug always gives you the same speed.
I am very well aware about Open GL callbacks so it is not the issue. The [b]miracle [/b]speed is due
to the openGL, though. Since a display() function is only occasionally called, as I mentioned
in the original post, a Kernel has a chance to finish its (not timed) operations before the next call to display().

Here is my code, which is exactly what you tried, except that DEBUG mode shatters any changes:

/ This is the normal display path
void display(void)
{
// Sobel operation
Pixel *data = NULL;
CUDA_SAFE_CALL(cudaGLMapBufferObject((void**)&data, pbo_buffer));
CUT_SAFE_CALL(cutStartTimer(timer));

sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );
cudaThreadSynchronize();

CUT_SAFE_CALL(cutStopTimer(timer));
CUDA_SAFE_CALL(cudaGLUnmapBufferObject(pbo_buffer));
...
[quote name='tmurray' post='559376' date='Jun 29 2009, 10:44 AM']I moved it from the CUDA contests forum to the CUDA development forum, which is where it should be. I'm an NVIDIA employee and I try to make sure everything stays nicely organized here (as well as answering questions).



Without seeing your code, you're probably timing everything and hitting vsync on the GL, not any CUDA related timing issue. I tested that on Vista 64, 185.85, and a GTX 280. Doing what I outlined above, there was no difference in performance. I tested in debug mode, so it's entirely possible that the cutil* functions were adding cudaThreadSynchronize() calls for the sake of error checking in the first place.

--------------------------------------------------------



Well, go to RELEASE mode to see the difference. Debug always gives you the same speed.

I am very well aware about Open GL callbacks so it is not the issue. The miracle speed is due

to the openGL, though. Since a display() function is only occasionally called, as I mentioned

in the original post, a Kernel has a chance to finish its (not timed) operations before the next call to display().



Here is my code, which is exactly what you tried, except that DEBUG mode shatters any changes:



/ This is the normal display path

void display(void)

{

// Sobel operation

Pixel *data = NULL;

CUDA_SAFE_CALL(cudaGLMapBufferObject((void**)&data, pbo_buffer));

CUT_SAFE_CALL(cutStartTimer(timer));



sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );

cudaThreadSynchronize();



CUT_SAFE_CALL(cutStopTimer(timer));

CUDA_SAFE_CALL(cudaGLUnmapBufferObject(pbo_buffer));

...

#6
Posted 06/29/2009 07:08 PM   
Like we said, it's a Vsync issue.
Just replace the display function with

[codebox]void display(void)
{

// Sobel operation
Pixel *data = NULL;
cutilSafeCall(cudaGLMapBufferObject((void**)&data, pbo_buffer));

cutilCheckError(cutStartTimer(timer));
sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );
cudaThreadSynchronize();
cutilCheckError(cutStopTimer(timer));

cutilSafeCall(cudaGLUnmapBufferObject(pbo_buffer));

computeFPS();
glutPostRedisplay();
}[/codebox]

And you will see the actual computation time in CUDA. The reason you're only getting 60 FPS is because you do not have Vsync disabled, which causes the glutSwapBuffers() to stall until the next screen refresh.

N.
Like we said, it's a Vsync issue.

Just replace the display function with



[codebox]void display(void)

{



// Sobel operation

Pixel *data = NULL;

cutilSafeCall(cudaGLMapBufferObject((void**)&data, pbo_buffer));



cutilCheckError(cutStartTimer(timer));

sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );

cudaThreadSynchronize();

cutilCheckError(cutStopTimer(timer));



cutilSafeCall(cudaGLUnmapBufferObject(pbo_buffer));



computeFPS();

glutPostRedisplay();

}[/codebox]



And you will see the actual computation time in CUDA. The reason you're only getting 60 FPS is because you do not have Vsync disabled, which causes the glutSwapBuffers() to stall until the next screen refresh.



N.

#7
Posted 06/29/2009 07:35 PM   
For the record, sobel filters are extremely simple filters with a very low filtertap count. Did you really think that a GF9800 would perform Sobel filtering at a mere 60FPS while it can run Crysis at good speeds.

N.

Sorry for the extra post, I hit the reply button instead of the edit button :)
For the record, sobel filters are extremely simple filters with a very low filtertap count. Did you really think that a GF9800 would perform Sobel filtering at a mere 60FPS while it can run Crysis at good speeds.



N.



Sorry for the extra post, I hit the reply button instead of the edit button :)

#8
Posted 06/29/2009 07:41 PM   
[quote name='Nico' post='559440' date='Jun 29 2009, 12:35 PM']Like we said, it's a Vsync issue.
Just replace the display function with

[codebox]void display(void)
{

// Sobel operation
Pixel *data = NULL;
cutilSafeCall(cudaGLMapBufferObject((void**)&data, pbo_buffer));

cutilCheckError(cutStartTimer(timer));
sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );
cudaThreadSynchronize();
cutilCheckError(cutStopTimer(timer));

cutilSafeCall(cudaGLUnmapBufferObject(pbo_buffer));

computeFPS();
glutPostRedisplay();
}[/codebox]

And you will see the actual computation time in CUDA. The reason you're only getting 60 FPS is because you do not have Vsync disabled, which causes the glutSwapBuffers() to stall until the next screen refresh.

N.[/quote]

Well, I acknowledge that waiting for a video buffer can slow a program down. But even if one writes the result to the global memory instead,
as I suggested in my original post, the speed won't be 40,000 fps at all. So let's not blame a video buffer for all slow downs. The real speed
of Sobel that writes to a global memory is about 10,000 fps or if one has to load a new texture each frame - just 1000fps.
That speed should be reported, not 40, 000.
[quote name='Nico' post='559440' date='Jun 29 2009, 12:35 PM']Like we said, it's a Vsync issue.

Just replace the display function with



[codebox]void display(void)

{



// Sobel operation

Pixel *data = NULL;

cutilSafeCall(cudaGLMapBufferObject((void**)&data, pbo_buffer));



cutilCheckError(cutStartTimer(timer));

sobelFilter(data, imWidth, imHeight, g_SobelDisplayMode, imageScale );

cudaThreadSynchronize();

cutilCheckError(cutStopTimer(timer));



cutilSafeCall(cudaGLUnmapBufferObject(pbo_buffer));



computeFPS();

glutPostRedisplay();

}[/codebox]



And you will see the actual computation time in CUDA. The reason you're only getting 60 FPS is because you do not have Vsync disabled, which causes the glutSwapBuffers() to stall until the next screen refresh.



N.



Well, I acknowledge that waiting for a video buffer can slow a program down. But even if one writes the result to the global memory instead,

as I suggested in my original post, the speed won't be 40,000 fps at all. So let's not blame a video buffer for all slow downs. The real speed

of Sobel that writes to a global memory is about 10,000 fps or if one has to load a new texture each frame - just 1000fps.

That speed should be reported, not 40, 000.

#9
Posted 06/29/2009 09:18 PM   
[quote name='Voblin' post='559490' date='Jun 29 2009, 11:18 PM']The real speed of Sobel that writes to a global memory is about 10,000 fps or if one has to load a new texture each frame - just 1000fps.[/quote]

At these speeds you're hitting an inaccuracy of the CUtil timers, I suppose. These timers are far from perfect.
As you stated you'd need to run the sobel kernel in a loop a couple of hundred times to get an accurate reading.

The SDK samples are meant to be simple, and adding a super accurate timing to all of them would have
needlessly complicated them.
[quote name='Voblin' post='559490' date='Jun 29 2009, 11:18 PM']The real speed of Sobel that writes to a global memory is about 10,000 fps or if one has to load a new texture each frame - just 1000fps.



At these speeds you're hitting an inaccuracy of the CUtil timers, I suppose. These timers are far from perfect.

As you stated you'd need to run the sobel kernel in a loop a couple of hundred times to get an accurate reading.



The SDK samples are meant to be simple, and adding a super accurate timing to all of them would have

needlessly complicated them.

#10
Posted 06/29/2009 09:56 PM   
I'm not sure where you got those numbers(40,000-70,000) from, but on my ubuntu laptop with Quadro FX1600 M, I got about 600fps without any modification to the code.

N.
I'm not sure where you got those numbers(40,000-70,000) from, but on my ubuntu laptop with Quadro FX1600 M, I got about 600fps without any modification to the code.



N.

#11
Posted 06/29/2009 10:08 PM   
It is possible because of Multi-core CPUs. I have seen -ve times at times using QueryPerformanceCounter() -- which is wat cutil timers use below, I guess...

Try setting "Thread affinity" to 1 CPU and see if that helps... I forgot that windows API call -- may b, setThreadAffinity(threadHandle, bitmask)
It is possible because of Multi-core CPUs. I have seen -ve times at times using QueryPerformanceCounter() -- which is wat cutil timers use below, I guess...



Try setting "Thread affinity" to 1 CPU and see if that helps... I forgot that windows API call -- may b, setThreadAffinity(threadHandle, bitmask)

Ignorance Rules; Knowledge Liberates!

#12
Posted 06/30/2009 08:09 AM   
I get 143 FPS when I run three 9x9x9 3D-filters on a 128 cube volume, so why wouldn't a simple sobel-filter on an image result in 50.000+ FPS?
I get 143 FPS when I run three 9x9x9 3D-filters on a 128 cube volume, so why wouldn't a simple sobel-filter on an image result in 50.000+ FPS?

#13
Posted 06/30/2009 08:55 AM   
Scroll To Top