Speed up CUDA initialization: how to set the device code translation cache

Hi all,

I have a question. I have been busy with CUDA for a while now, and just got myself a GTX 680.

I have a program that I execute every so often, and each run takes 1–2 seconds for CUDA to ‘start up’, apparently doing some runtime compilation. This is what the NVCC manual says about it:

So clearly I would like to be able to circumvent this initialization. How do I do this properly? Can someone give a practical example? If I google for “CUDA_DEVCODE_CACHE” I get 1 (one) unique result, haha!? Even googling for “gpu monkey” gives me 5 results!

This probably isn’t related to the JIT cache. Use nvidia-smi to enable persistence mode (nvidia-smi -pm 1), so the driver stays loaded between runs.

More info: I use CUDA from OpenCV. The OpenCV FAQ also mentions the thing I described: for example, when benchmarking you should first do a dummy function call with some random data, just to wake up and initialize the GPU. It usually takes a second or more.

Nope, that did not work one bit. Any word on the 3xx.xx drivers for Linux?

Can someone please elaborate on the following? It has nothing to do with persistence mode.

So the problem: I run a C++ OpenCV program, and it first gets some device info and generally starts up the GPU. That part of the code already takes 3 seconds, which is a lot if we invoke the program often. What can we do about this? OpenCV says it has something to do with the device code translation cache, but there is no more documentation on that from CUDA or NVCC than the above quotation… Please help! :)

Bump…

Hi,

I updated OpenCV; now it is a little better, but it still takes 2–3 seconds. Did you find a solution?

Hi Marstiger,

I’m not a CUDA guru, but I had the same problem. You probably already solved your issue, but for other people it may be interesting to read how to solve this.

First I had:
A WaitTrackAndStopWithMovieExport, which ONLY did some workflow activities.
A FollowObjectComplete, which did the tracking after an object was investigated.

Typical code in the FollowObjectComplete function:

WriteLog("FollowObjectComplete: CUDA GoodFeaturesToTrackDetector_GPU.\r\n", tntContext);
cornerDetector = gpu::GoodFeaturesToTrackDetector_GPU(MAX_CORNERS, 0.01, 5.0, 3, 0, 0.04);

GpuMat gpumatImgA = GpuMat(imgA);
GpuMat gpumatImgB = GpuMat(imgB);
GpuMat gpumatforegroundMasked = GpuMat(foreGroundMasked);
GpuMat gpumatCornersA;
cornerDetector(gpumatImgA, gpumatCornersA, gpumatforegroundMasked);

GpuMat gpumatNextPts;
GpuMat gpumatStatus;
GpuMat gpumatError;

gpu::PyrLKOpticalFlow lkTracker;
lkTracker.sparse(gpumatImgA, gpumatImgB, gpumatCornersA, gpumatNextPts, gpumatStatus, &gpumatError);
WriteLog("FollowObjectComplete: CUDA PyrLKOpticalFlow.\r\n", tntContext);

Okay… with the code above in a while loop, the first iteration has to allocate memory on the device. The CUDA runtime may already be awake by then, but memory allocation is an important cost too.

Now, what I’ve done :

Put these heavy objects into the scope of your class (like VideoToolbox):

class VideoToolbox
{
public:
    VideoToolbox();
    int MAX_RINGBUFFER_SIZE; // number of objects in the buffer
    gpu::GoodFeaturesToTrackDetector_GPU cornerDetector;
};

Do a dummy call, but WITH real data, to this cornerDetector object:

In WaitTrackAndStopWithMovieExport(…) I placed this code:

// Wake up the GPU for corner detection. Allocating the memory takes a lot of time.
cornerDetector = gpu::GoodFeaturesToTrackDetector_GPU(MAX_CORNERS, 0.01, 5.0, 3, 0, 0.04);
IplImage *imgA = GetFrameFromSharedMemoryBuffer(tntContext->RingBufferName, 0);
IplImage *imgB = GetFrameFromSharedMemoryBuffer(tntContext->RingBufferName, 0);
IplImage *foreGroundMasked = GetForeGroundMasked(tntContext, 0, tntContext->fgMask, true, imgB);
GpuMat gpumatImgA = GpuMat(imgA);
GpuMat gpumatImgB = GpuMat(imgB);
GpuMat gpumatforegroundMasked = GpuMat(foreGroundMasked);
GpuMat gpumatCornersA;
cornerDetector(gpumatImgA, gpumatCornersA, gpumatforegroundMasked);
cvReleaseImage(&imgA);
cvReleaseImage(&imgB);
cvReleaseImage(&foreGroundMasked);
// end wake-up of CUDA / device memory allocation

Now I didn’t change anything in my FollowObjectComplete function besides that this object is no longer declared there.

The effect is: whenever some time can be spared, do this call, so that not only is the CUDA runtime alive, but the memory for this object is allocated too.

Hope I helped some people.

Rudy