Parallel CPU and GPU processing in real-time imaging

Hi everyone,

I am trying to use NI-IMAQ and CUDA to do real-time image acquisition (IMAQ) and processing (CUDA). My first trial looks like this:

loop()
{
    IMAQfunction(&buffer);  // grab an image into the buffer via NI-IMAQ (CPU)
    CUDAfunction(buffer);   // process the image with CUDA (GPU)
}

Now the problem: the above works fine, but it is sequential. The CUDA processing time + host/device I/O + display() together amount to only about 2/3 of the IMAQ acquisition time, so it would be great to overlap the CUDA and IMAQ work. I am thinking about a "buffer ring" model like below:

buffer[N];  // create a ring of N buffers (outside the loop)

loop()
{
    Parallel()  // the two calls below should run concurrently
    {
        IMAQfunction(&buffer[i % N]);           // grab an image into buffer[i % N] via NI-IMAQ (CPU)
        CUDAfunction(buffer[(i + N - 1) % N]);  // process the previous image with CUDA (GPU);
                                                // (i + N - 1) % N avoids a negative index when i % N == 0
    }
}

I have read a bit about the async functions, but they don't seem to fully solve this simple requirement. Please give me some advice on how to realize the parallel part.
I am also reading about OpenMP, but I don't know whether that really helps.

Thank you very much!

Hi,
I don't know the details of your problem, but you should be able to do what you've described.
The sequence should be:

  1. Read the first frame.
  2. Start two threads.
  3. Thread A runs the CUDA code on the previous frame.
  4. Thread B grabs the next frame.
  5. Block until both threads have completed their work.
  6. Move the new frame over to be the input for the CUDA thread.
  7. Go back to step 3 until there is no more data to read.

That way you overlap the reading of data with its CUDA processing. The time you wait
in step 5 is the larger of the time for step 3 (GPU computation) and the time for
step 4 (acquiring the next frame).

eyal

Thank you so much! That's what I am looking for. So I need to use multithreading, right?

I am not very familiar with this, so it would be great if you could show me a simple example, like how to create a thread for each of the following two functions, assuming their data sources are independent:

{
    function1();
    function2();
}

I learned that OpenMP can do it simply, like this:

#pragma omp parallel sections
{
    #pragma omp section
    {
        function1();
    }

    #pragma omp section
    {
        function2();
    }
}

but I am not sure if this works well with CUDA, so any suggestions?

Thank you!

Hi,
Yes, I guess OpenMP is the easiest; you can also google pthreads (for Linux) or CreateThread (for Windows).
I do think you should first get things working with 2-3 threads in OpenMP (or equivalent) before running thousands of threads on the GPU :)

eyal

Hi,

I’ve been trying something similar with visual studio, and wouldn’t like to comment about threads in linux as I haven’t done it for so long. I found it easiest to start two threads, one for capture and the other for analysis using CreateThread(…). In the capture thread use imgGrab(sid,buffer,1), then ReleaseSemaphore(…), within a loop. In the analysis thread, again in a loop, wait for the semaphore using WaitForSingleObject on the semaphore then do the cuda stuff. As I recall the important issue is the 1 as the last parameter in the imgGrab so that the call returns when the next available frame is transferred, simplifying synchronisation. My application does not copy the data back to the host each frame but I have no trouble analysing the 500fps from a PCIe1429 using a GTX285. You may also keep track of the buffer use when you ReleaseSemaphore and if the gets close to the maximum available suspending drawing to the screen helps!

Hope that helps

cheers

Thank you very much! I tried two threads and it works now; my dual-core CPU runs at almost 100%.

But there is still an imperfection: I have to put all the CUDA operations inside the CUDA thread, so the memory allocation takes an extra ~50% of the CUDA processing time.

My first code sketch, which is the situation I want but which does not work, was like this:

/////////////////////
IMAQwithCUDA.cu
///////////////////////////

#include <cudaheads.h>
#include <imaqheads.h>

int* IMAQ_buffer;
int* CUDA_buffer;

__global__ void CUDA_kernel(int* buffer)
{
    process(buffer);
}

DWORD WINAPI IMAQ_Proc(LPVOID lpParameter)
{
    grab(&IMAQ_buffer);  // get data from the IMAQ functions
    return 0;
}

DWORD WINAPI CUDA_Proc(LPVOID lpParameter)
{
    cudaMemcpy(CUDA_buffer, IMAQ_buffer, SIZE, cudaMemcpyHostToDevice);
    CUDA_kernel<<<GRID, BLOCK>>>(CUDA_buffer);
    return 0;
}

int main()
{
    IMAQ_Malloc(&IMAQ_buffer);
    cudaMalloc((void**)&CUDA_buffer, SIZE);

    HANDLE hThread1 = CreateThread(NULL, 0, IMAQ_Proc, NULL, 0, NULL);
    HANDLE hThread2 = CreateThread(NULL, 0, CUDA_Proc, NULL, 0, NULL);

    Sleep(TIME);  // TIME is a premeasured interval longer than both the IMAQ and CUDA thread times

    CloseHandle(hThread1);
    CloseHandle(hThread2);
}

///////////////////////////////////

///////////////////////////////////

I made both IMAQ_buffer and CUDA_buffer global, and allocated them in the main thread, to save time in CUDA_Proc.

But this doesn't work: the screen just flashes and goes black every time the CUDA_Proc thread runs, and the kernel fails to launch. It seems the global CUDA_buffer is not accessible by CUDA_kernel (presumably because the device pointer was allocated from a different thread).

So I had to make CUDA_buffer local to the CUDA_Proc thread, and then it works, like:

/////////////////////
IMAQwithCUDA.cu
///////////////////////////

#include <cudaheads.h>
#include <imaqheads.h>

int* IMAQ_buffer;

__global__ void CUDA_kernel(int* buffer)
{
    process(buffer);
}

DWORD WINAPI IMAQ_Proc(LPVOID lpParameter)
{
    grab(&IMAQ_buffer);  // get data from the IMAQ functions
    return 0;
}

DWORD WINAPI CUDA_Proc(LPVOID lpParameter)
{
    int* CUDA_buffer;
    cudaMalloc((void**)&CUDA_buffer, SIZE);  // here is the imperfection: this takes extra time on every call

    cudaMemcpy(CUDA_buffer, IMAQ_buffer, SIZE, cudaMemcpyHostToDevice);
    CUDA_kernel<<<GRID, BLOCK>>>(CUDA_buffer);
    return 0;
}

int main()
{
    IMAQ_Malloc(&IMAQ_buffer);

    HANDLE hThread1 = CreateThread(NULL, 0, IMAQ_Proc, NULL, 0, NULL);
    HANDLE hThread2 = CreateThread(NULL, 0, CUDA_Proc, NULL, 0, NULL);

    Sleep(TIME);  // TIME is a premeasured interval longer than both the IMAQ and CUDA thread times

    CloseHandle(hThread1);
    CloseHandle(hThread2);
}

///////////////////////////////////

///////////////////////////////////

Now it works, but as you can see, all the CUDA operations have to go inside the CUDA thread, and the allocation takes nearly 50% of the kernel time.

I guess there is some solution to this, please give me a hand!

Thank you~!

Hi Bendudu,

I’m not exactly sure how you have structured you code, as I can’t tell where the loop is being performed, but I’m guessing you have the loop in the main routine, as below.

int main()
{
    IMAQ_Malloc(&IMAQ_buffer);

    HANDLE hThread1;
    HANDLE hThread2;

    for (;;) {  // I guess this is where your loop is
        hThread1 = CreateThread(NULL, 0, IMAQ_Proc, NULL, 0, NULL);
        hThread2 = CreateThread(NULL, 0, CUDA_Proc, NULL, 0, NULL);

        Sleep(TIME);  // TIME is a premeasured interval longer than both thread times

        CloseHandle(hThread1);
        CloseHandle(hThread2);
    }
}

If this is how you have structured the program, then I think this is where your problem is. I put the loops inside the threads instead, so:

DWORD WINAPI IMAQ_Proc(LPVOID lpParameter)
{
    for (;;) {
        grab(&IMAQ_buffer);  // get data from the IMAQ functions
        ReleaseSemaphore(hBufferAvailable, 1, NULL);  // hBufferAvailable is a HANDLE and needs global scope
    }
}

DWORD WINAPI CUDA_Proc(LPVOID lpParameter)
{
    int* CUDA_buffer;
    cudaMalloc((void**)&CUDA_buffer, SIZE);  // the allocation now happens only once, before the loop

    for (;;) {
        WaitForSingleObject(hBufferAvailable, INFINITE);

        cudaMemcpy(CUDA_buffer, IMAQ_buffer, SIZE, cudaMemcpyHostToDevice);
        CUDA_kernel<<<GRID, BLOCK>>>(CUDA_buffer);
    }
}

Check the NI-IMAQ documentation, but I think you should use imgGrab, and also use a ring buffer for storing the captured images, as you tried in your first post.

Hope that is a bit clearer,

cheers

rob

Indeed, you don’t want to be creating new threads for every frame. In CUDA, there’s an overhead to just start up a new CUDA context, and that overhead is something like 100ms if I remember (don’t quote me!)

It’s a lot more efficient if you create one CUDA thread and leave it persistent. Have it wait at a semaphore until the capture card data is ready then launch your kernel, then go back and wait again (for both kernel completion and for the next capture data).

If you tend to have a slow kernel, you can overlap the memory transfer with the CUDA compute by using an asynchronous memory copy with streams. In fact it may work well to have two streams that ping-pong: you get your captured frame data ready, then fire off the copy and kernel execution on stream 0; you wait for the next captured frame and fire off the copy and kernel execution on stream 1. This alternation smooths out the kernel-launch overhead, since the device now has a queue of work to keep loading from; it isn't waiting on a CPU loop.

There are of course issues if your kernel execution time grows too long and you can't compute as fast as you acquire data. But that's a separate problem, and it's about your computational needs rather than your scheduling.