I have tried to match the speed of the Sobel filter example from the CUDA SDK and came to the conclusion that the 40,000–70,000 fps reported on a typical Windows Vista computer with, say, a GeForce 9800 GT is an inaccurate measurement caused by the CUDA code itself. Please prove me wrong. The way the code measures time is completely phony!
The code runs Sobel inside the OpenGL display() callback and brackets it with start- and stop-timer calls. This gives the wrong timing, because any __global__ kernel launch is asynchronous: it passes control back to the host before a substantial part of the kernel has even run. So measuring the time from kernel launch to the return of control to the host says nothing about the actual execution time on the device.
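To make the pitfall concrete, here is a hedged sketch (not the SDK's actual code; sobelKernel, d_out, d_in, w, h, grid, block, and timer are placeholders) showing the difference between timing just the launch and timing the completed kernel:

```cuda
// WRONG: a kernel launch returns immediately, so this measures only
// launch overhead (microseconds), not the kernel's execution time.
cutStartTimer(timer);
sobelKernel<<<grid, block>>>(d_out, d_in, w, h);
cutStopTimer(timer);            // kernel may still be running on the device

// BETTER: block until the device has actually finished.
cutResetTimer(timer);
cutStartTimer(timer);
sobelKernel<<<grid, block>>>(d_out, d_in, w, h);
cudaThreadSynchronize();        // wait for the kernel to complete
cutStopTimer(timer);            // elapsed time now includes execution
```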
In fact, forcing full completion of the work on the GPU with cudaThreadSynchronize() brings the frame rate down to about 60 fps in Release mode, roughly three orders of magnitude slower than what the CUDA sample displays. This figure is also inaccurate, since synchronizing after every launch underestimates the overlap you would normally get between successive kernels, but the difference is tremendous.
The right way to measure time is probably to launch 1,000 kernels, then call cudaThreadSynchronize(), and average the elapsed time. Using a video buffer as output may also impose some restrictions, so it is better to dump the result into ordinary global memory.
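A minimal sketch of that averaging approach, again with placeholder names (sobelKernel, grid, block, timer) standing in for the real SDK code:

```cuda
// Launch many kernels back-to-back, synchronize once, and average.
// Synchronizing only at the end preserves the pipelining between launches.
const int N = 1000;
cudaThreadSynchronize();        // drain any previously queued work first
cutStartTimer(timer);
for (int i = 0; i < N; ++i)
    sobelKernel<<<grid, block>>>(d_out, d_in, w, h);
cudaThreadSynchronize();        // wait for all N launches to finish
cutStopTimer(timer);
float msPerFrame = cutGetTimerValue(timer) / N;
```

An alternative is the CUDA event API (cudaEventRecord() around the loop plus cudaEventElapsedTime()), which times on the GPU itself and avoids host-timer quirks entirely.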
First, I’m moving this because it’s in the wrong forum.
Second, you’re wrong. Kernel calls are asynchronous, but cudaUnbindTexture isn’t. So, in the SobelFilter function in SobelFilter_kernels.cu (I’m looking at the 2.1 SDK because that’s what I happen to have installed on my home machine, but I doubt it’s changed in 2.2), unbind will wait for the kernel to complete.
Just for laughs, I changed the code in the way you suggested:
I moved it from the CUDA contests forum to the CUDA development forum, which is where it should be. I’m an NVIDIA employee and I try to make sure everything stays nicely organized here (as well as answering questions).
Without seeing your code, you're probably timing everything and hitting vsync on the GL side, not any CUDA-related timing issue. I tested this on Vista 64, driver 185.85, and a GTX 280. Doing what I outlined above made no difference in performance. I tested in debug mode, so it's entirely possible that the cutil* functions were already adding cudaThreadSynchronize() calls for the sake of error checking in the first place.
And then you will see the actual computation time in CUDA. The reason you're only getting 60 fps is that you do not have vsync disabled, which causes glutSwapBuffers() to stall until the next screen refresh.
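For what it's worth, vsync can also be disabled from inside the program via the swap-control extensions. A hedged sketch (Windows path shown; error handling omitted, and whether the extension is present depends on the driver):

```cuda
// Disable vsync so glutSwapBuffers() no longer waits for the refresh.
// Requires the WGL_EXT_swap_control extension (Windows); on Linux the
// equivalent is glXSwapIntervalSGI(0) from GLX_SGI_swap_control.
typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

void disableVsync(void)
{
    PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
        (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
    if (wglSwapIntervalEXT)
        wglSwapIntervalEXT(0);   // 0 = swap immediately, no vsync
}
```

Otherwise you can simply turn vsync off in the NVIDIA control panel.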
For the record, a Sobel filter is an extremely simple filter with a very low filter-tap count. Did you really think a GF9800 would perform Sobel filtering at a mere 60 fps while it can run Crysis at good speeds?
N.
Sorry for the extra post, I hit the reply button instead of the edit button :)
I’m not sure where you got those numbers (40,000–70,000) from, but on my Ubuntu laptop with a Quadro FX 1600M I got about 600 fps without any modification to the code.
It is possible because of multi-core CPUs. I have occasionally seen negative times when using QueryPerformanceCounter(), which is what the cutil timers use under the hood, I guess…
Try setting the thread affinity to a single CPU and see if that helps… I forgot the exact Windows API call; it’s SetThreadAffinityMask(threadHandle, bitmask) or something close to that.
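A hedged, Windows-only sketch of that idea, pinning the calling thread to CPU 0 before doing any QueryPerformanceCounter() timing (the helper name is made up; the API call itself is real):

```cuda
// Pin the timing thread to one core so successive QueryPerformanceCounter()
// reads come from the same core's counter and can't appear to run backwards.
#include <windows.h>

void pinTimingThreadToCpu0(void)
{
    // Bitmask 0x1 selects CPU 0; returns the previous mask, 0 on failure.
    SetThreadAffinityMask(GetCurrentThread(), 0x1);
}
```

Call it once at startup, before creating any timers.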