Porting Over To Copy Engine Texture Processing

I have a fairly complex multithreaded film scanning application developed with Visual Studio C++ for Windows. The current OpenGL implementation is as follows:

  1. Performs all OpenGL Initialization, texture and shader creation in application thread.
  2. Initializes Main Viewing Window with rendering context in application thread.
  3. During scan execution, creates a separate worker thread and continuously uploads images, renders to FBO, and downloads finished images with no parallel GPU activity.
  4. Uses the application GL thread to view finished images on the application viewing window.
  5. Uses a separate worker thread to write finished image buffers to disk.

I have studied the NVIDIA Copy Engine white paper, as well as Ch. 28 and 29 of the OpenGL Insights textbook, but am still somewhat confused as to the proper OpenGL thread construction to take advantage of the Quadro dual copy engines. The OpenGL Insights sample code is difficult for me to parse, as it uses a C++ class that encapsulates many of the OpenGL calls.

My initial questions are:

  1. In the NVIDIA examples, the application thread is used for GL rendering, and shares its rendering context (wglShareLists) with the upload thread. If my upload, render and download threads are all separate worker threads, does this change the context sharing structure?

  2. Why does the render thread only need to share contexts with the upload thread, and not the download thread (finished frame)?

  3. Regarding the Pixel Buffer Object buffers, I don’t quite understand the purpose of using two sets of buffers for both uploading and downloading. Is the reason to use one for even frames and one for odd frames, or is it to use one to load from host memory and then copy the data from the first PBO to the second during one frame transfer?

If there is any dual copy-engine code out there that uses native OpenGL calls exclusively, I’d appreciate a link to it.

From my understanding, the NVIDIA copy engine is nothing more than a glorified DMA controller. PBOs, as specified by OpenGL, allow for asynchronous behavior, so I’m thinking that any OpenGL implementation supporting PBOs would have some form of DMA facility to be efficient. Also, are you working with Quadros?

  1. Going by what you mentioned, if the app thread is sharing only with the upload thread, and the download thread does not share directly (via the app thread) or indirectly (via the upload thread), then the download thread MUST NOT be making any GL calls that use resources used by the other threads. There is no need to share contexts unless resources need to be shared (used) between them.

  2. Ping-ponging buffers is a common practice to prevent GL from stalling on resource usage/modification. Remember that when you submit a GL call, it appears synchronous to the user, but the call may not get executed on the device until several frames later. If you upload data to PBO p in frame n and then do another upload to p in frame n+1, PBO p may still be in use in that frame. To ensure coherency, the driver may have to make a copy of the resource or, even worse, pause all operations until it has finished using the resource. That is an over-generalization of what happens, but you get the point: having several buffers in flight minimizes that particular case.
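
To make the ping-pong argument concrete, here is a rough sketch in plain C++ with no GL calls at all. It is only a toy model: the function name countStalls and the fixed "busy for gpuLatency frames" rule are my own assumptions for illustration, not anything taken from the NVIDIA sample or from how a real driver schedules work. Each slot stands in for one PBO, and a stall is counted whenever we want to write to a slot the GPU would still be reading.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (hypothetical, not real OpenGL): slot i stands in for PBO i.
// Assumption: a slot written in frame f stays busy on the GPU until
// frame f + gpuLatency. In a real application the "busy" test would be
// something like a fence query (glFenceSync / glClientWaitSync), and the
// write would be a glMapBuffer or glBufferSubData on GL_PIXEL_UNPACK_BUFFER.
std::size_t countStalls(std::size_t numBuffers,
                        std::size_t numFrames,
                        std::size_t gpuLatency) {
    assert(numBuffers > 0);  // frame % 0 would be undefined

    const std::size_t kNeverUsed = static_cast<std::size_t>(-1);
    std::vector<std::size_t> lastWrite(numBuffers, kNeverUsed);

    std::size_t stalls = 0;
    for (std::size_t frame = 0; frame < numFrames; ++frame) {
        // Round-robin over the buffers: 2 buffers gives even/odd ping-pong.
        const std::size_t slot = frame % numBuffers;

        // The slot is still in use if it was written fewer than
        // gpuLatency frames ago.
        if (lastWrite[slot] != kNeverUsed &&
            frame < lastWrite[slot] + gpuLatency) {
            ++stalls;  // a real driver would block or copy the resource here
        }
        lastWrite[slot] = frame;
    }
    return stalls;
}
```

In this model, a single buffer with a two-frame latency stalls on every frame after the first, while two buffers never stall. Once the assumed latency grows to three frames, two buffers start stalling again and a third is needed, which is why the right number of buffers in flight depends on how far behind the GPU actually runs.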

Thanks to busta78.

I use ten host buffers for both uploading to GL and downloading from GL. Would it be advisable to create 10 PBO/texture pairs for uploading, and 10 PBOs and FBO render textures for downloading?

The example I described was the NVIDIA example code. My download PBOs will access a texture attached to an FBO that the render thread writes to. So do I need to share the FBO-attached texture with the download thread?