Design question: CUDA streaming
Dear all,

I have a general question about how to design my application. I have read the CUDA documentation, but still don't know what I should look into. I would really appreciate it if someone could shed some light on this.

I want to do some real-time analytics on stocks, say 100 stocks, and I have a real-time market data feed that streams updated market prices. What I want to do is:
1. Pre-allocate a memory block for each stock on the CUDA card, and keep that memory for the whole trading day.
2. When new data comes in, directly update the corresponding memory on the CUDA card.
3. After the update, issue a signal or trigger an event to start the analytical calculation.
4. When the calculation is done, write the result back to CPU memory.

Here are my questions:
1. What is the most efficient way to stream data from CPU memory to GPU memory? Because I want it in real time, copying a memory snapshot from CPU to GPU every second is not acceptable.
2. I may need to allocate memory blocks for the 100 stocks on both the CPU and the GPU. How do I map each CPU memory cell to its corresponding GPU memory cell?
3. How do I trigger the analytics calculation when new data arrives on the CUDA card?

I am using a Tesla C1060 with CUDA 3.2 on Windows XP.

Thank you very much for any suggestion.

#1
Posted 04/24/2012 02:18 PM   
I would suggest having a persistent kernel running on your GPU that exchanges data in real time with your CPU through pinned mapped memory, using two independent circular queues: one for writing stock data and requests, and one for the GPU to write its results. No lock is involved in either case, because each queue has a write pointer and a read pointer, each maintained by either the CPU or the GPU, never both.

[list=1]
[*]The host writes a request and/or stock data to the input queue.
[*]The GPU kernel polls the input queue continuously until data is available (typically write_ptr != read_ptr).
[*]The GPU processes the data internally.
[*]The GPU writes the result to the output queue.
[*]While the GPU is computing, the CPU can add the next data to the input queue, and fetch results from the output queue as soon as they are available.
[/list]
    Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

    #2
    Posted 04/24/2012 07:17 PM   
    Parallelis, thank you for your reply.

Do you know of any example code for this input/output queue design?


    [quote name='parallelis' date='24 April 2012 - 07:17 PM' timestamp='1335295034' post='1400455']
    I would suggest to have a running Kernel on your GPU that exchange data in real-time with your CPU using Pinned Mapped Memory, creating two independent circular queues, 1 for writing stock data and requests, 1 for writing results from the GPU (no lock involved in each case with a write and read pointer each maintained by CPU or GPU, not both).

    [list=1]
    [*]Host write request and/or stock data on input queue
    [*]GPU Kernel read input queue continuously until there's data available (typically write_ptr <> read_ptr)
    [*]GPU process the data internally
    [*]GPU write the result in the output queue
    [*]During GPU computing, CPU could add next data in the INPUT Queue and get data from the output queue as soon as they are available
    [/list]
    [/quote]
    #3
    Posted 05/01/2012 12:55 AM   
It's pretty basic (except for the pinned mapped memory part, naturally). Here's an example in pseudo-C, where whatever is the struct you want to read or write, preferably written as a single 64-bit or 128-bit store (i.e. using a vector type):
[code]whatever *queue;                      // backed by pinned mapped memory
int queue_length = n;                 // capacity: holds at most n-1 elements
int queue_read = 0, queue_write = 0;  // each index owned by one side only
[/code]

    To read (element):
[code]if (queue_read != queue_write) {
    whatever element = queue[queue_read];
    queue_read = (queue_read + 1) % queue_length;
} else {
    // Queue is empty!
}[/code]

    To write (element):
[code]if (queue_read != (queue_write + 1) % queue_length) {
    queue[queue_write] = element;
    queue_write = (queue_write + 1) % queue_length;
} else {
    // Queue is full, we will have to wait!
}
[/code]

    #4
    Posted 05/01/2012 03:07 PM   