Speed up initialization of CUDA: how to set the device code translation cache
Hi all,

I have a question. I have been busy with CUDA for a while now and just got myself a GTX 680.
I have a program that I execute every so often, and it takes 1~2 seconds for CUDA to 'start up', or to compile some code at runtime. This is what the NVCC manual says about it:

[quote][b]Enabling the device code translation cache[/b]
By default, the result of any runtime compiled ptx code will be used for the lifetime
of the process that compiles it, and then discarded. Runtime compilation is intended
to be an escape situation, but in case it occurs, it might be desirable to keep the
result for later invocations of the executable.
This can be achieved by defining the environment variable
CUDA_DEVCODE_CACHE to the name of a selected code repository. When
defined, the CUDA runtime system will add the result of runtime compiled code to
this repository, after creating it as a directory when it did not exist before.[/quote]

So clearly I would like to be able to circumvent this initialization. How do I do this properly? Can someone give a practical example? If I google for "CUDA_DEVCODE_CACHE" I get 1 (one) unique result, haha!? Even googling for "gpu monkey" gives me 5 results!
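For anyone who finds this thread later: as far as I can tell the variable is simply set in the environment before starting the program. A minimal sketch (the cache path here is just an example I picked; any writable directory should do, and `my_cuda_program` is a placeholder for your own binary):

```shell
# Point the device code translation cache at a persistent directory.
# The CUDA runtime creates it on first use and reuses the cached
# translations on later invocations of the same executable.
export CUDA_DEVCODE_CACHE="$HOME/.cuda_devcode_cache"

# Confirm the variable is set, then start your program from this shell:
echo "$CUDA_DEVCODE_CACHE"
# ./my_cuda_program   <- placeholder for your own binary
```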

#1
Posted 04/19/2012 10:18 PM   
this probably isn't related to the JIT cache. use nvidia-smi to enable persistence mode.
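For reference, enabling persistence mode is a one-liner with nvidia-smi (Linux only, needs root; note this keeps the driver loaded between processes, it does not affect JIT compilation):

```shell
# Keep the NVIDIA kernel driver initialized between runs (1 = on, 0 = off)
nvidia-smi -pm 1

# Verify the setting
nvidia-smi -q | grep -i persistence
```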

#2
Posted 04/19/2012 10:39 PM   
[quote name='tmurray' date='20 April 2012 - 12:39 AM' timestamp='1334875185' post='1398495']
this probably isn't related to the JIT cache. use nvidia-smi to enable persistence mode.
[/quote]
More info: I use CUDA from OpenCV. The OpenCV FAQ also mentions the thing I described. For example, when benchmarking one should first do a dummy function call with some random data, just to wake up and initialize the GPU. It usually takes a second or more.

#3
Posted 04/19/2012 11:07 PM   
[quote name='tmurray' date='20 April 2012 - 12:39 AM' timestamp='1334875185' post='1398495']
this probably isn't related to the JIT cache. use nvidia-smi to enable persistence mode.
[/quote]

Nope, did not work one bit. Any word on the 3xx.xx drivers for linux?

#4
Posted 04/20/2012 10:02 AM   
Can someone please elaborate on the following? It has nothing to do with persistence mode.
[quote][i]Enabling the device code translation cache[/i]
By default, the result of any runtime compiled ptx code will be used for the lifetime
of the process that compiles it, and then discarded. Runtime compilation is intended
to be an escape situation, but in case it occurs, it might be desirable to keep the
result for later invocations of the executable.
This can be achieved by defining the environment variable
CUDA_DEVCODE_CACHE to the name of a selected code repository. When
defined, the CUDA runtime system will add the result of runtime compiled code to
this repository, after creating it as a directory when it did not exist before.[/quote]

So the problem: I run a C++ OpenCV program, and it first gets some device info / generally starts up the GPU. That part of the code already takes 3 seconds, which is a lot if we call the code often. What can we do about this? OpenCV says it has something to do with the device code translation cache, but there is no more documentation on that from CUDA or NVCC than the above quotation. Please help! :)
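One more angle, in case the delay really is PTX JIT compilation: if the device code is compiled ahead of time for the actual GPU architecture, the driver has nothing left to translate at start-up. A GTX 680 is Kepler, compute capability 3.0, so a sketch of the nvcc invocation would be (`kernel.cu` and `my_program` are placeholder names):

```shell
# Embed a native sm_30 binary (plus PTX as a fallback for newer GPUs)
# so no JIT compilation is needed on a GTX 680.
nvcc -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_30,code=compute_30 \
     -o my_program kernel.cu
```

For OpenCV itself the equivalent would presumably be rebuilding it with 3.0 included in the `CUDA_ARCH_BIN` CMake variable, so the gpu module ships native Kepler code instead of JIT-compiling PTX on first use.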

#5
Posted 04/21/2012 08:29 PM   
Bump..

#6
Posted 04/27/2012 05:10 PM   
[quote name='TZaman' date='27 April 2012 - 05:10 PM' timestamp='1335546641' post='1401644']
Bump..
[/quote]

Hi,

I updated OpenCV; now it is a little better, but it still takes 2-3 seconds. Did you find a solution?

#7
Posted 05/06/2012 01:36 AM   
Hi Marstiger,


I'm not a CUDA guru, but I had the same problem. You probably already solved your issue, but for other people it may be interesting to read how to solve this.

First I had:
A WaitTrackAndStopWithMovieExport which ONLY did some workflow activities.
A FollowObjectComplete which did the tracking after an object was investigated.

Typical code in the FollowObject function:

WriteLog("FollowObjectComplete: CUDA GoodFeaturesToTrackDetector_GPU.\r\n", tntContext);
cornerDetector = gpu::GoodFeaturesToTrackDetector_GPU(MAX_CORNERS, 0.01, 5.0, 3, 0, 0.04);

GpuMat gpumatImgA = GpuMat(imgA);
GpuMat gpumatImgB = GpuMat(imgB);
GpuMat gpumatforegroundMasked = GpuMat(foreGroundMasked);
GpuMat gpumatCornersA;
cornerDetector(gpumatImgA, gpumatCornersA, gpumatforegroundMasked);

GpuMat gpumatNextPts;
GpuMat gpumatStatus;
GpuMat gpumatError;

gpu::PyrLKOpticalFlow lkTracker;
lkTracker.sparse(gpumatImgA, gpumatImgB, gpumatCornersA, gpumatNextPts, gpumatStatus, &gpumatError);
WriteLog("FollowObjectComplete: CUDA PyrLKOpticalFlow.\r\n", tntContext);


Okay... with the above code in a while loop, the first time around some memory needs to be allocated on the device. The CUDA runtime may take some time to wake up, but memory allocation is an important issue too.

Now, what I've done:

Put these heavy objects into the global scope of your class (like VideoToolbox):

class VideoToolbox
{
public:
VideoToolbox();
int MAX_RINGBUFFER_SIZE; //number of objects in buffer
gpu::GoodFeaturesToTrackDetector_GPU cornerDetector;
...
...
};


Do a dummy call, but WITH data, to this cornerDetector object:

In WaitTrackAndStopWithMovieExport(...) I placed the code:

// Wake up GPU for corner detection. Allocating memory takes a lot of time.
cornerDetector = gpu::GoodFeaturesToTrackDetector_GPU(MAX_CORNERS, 0.01, 5.0, 3, 0, 0.04);
IplImage *imgA = GetFrameFromSharedMemoryBuffer(tntContext->RingBufferName, 0);
IplImage *imgB = GetFrameFromSharedMemoryBuffer(tntContext->RingBufferName, 0);
IplImage *foreGroundMasked = GetForeGroundMasked(tntContext, 0, tntContext->fgMask, true, imgB);
GpuMat gpumatImgA = GpuMat(imgA);
GpuMat gpumatImgB = GpuMat(imgB);
GpuMat gpumatforegroundMasked = GpuMat(foreGroundMasked);
GpuMat gpumatCornersA;
cornerDetector(gpumatImgA, gpumatCornersA, gpumatforegroundMasked);
cvReleaseImage(&imgA);
cvReleaseImage(&imgB);
cvReleaseImage(&foreGroundMasked);
// end wake-up of CUDA runtime / device memory allocation


Now I didn't change anything in my FollowObjectComplete function, besides that this object is not declared there anymore.



The effect is: when some time can be wasted, do this call, so that not only is the CUDA runtime alive, but the memory is allocated for this object too.

Hope I helped some people.

Rudy

#8
Posted 04/26/2013 08:29 AM   