R&D Architect Leveraging CUDA/GPU for High-Frequency Trading
Our client has been successfully applying their disciplined, process-driven investment trading strategies for more than 10 years. These strategies, which are traded across various markets and asset classes, are based on statistical models developed using rigorous mathematical analysis. Their R&D team has an opening for a software engineer to build systems (leveraging CUDA and GPU technologies) that perform calculations and execute trades in under a microsecond. Candidates will have a Bachelor's degree (Master's or PhD preferred, in Machine Learning, Computer Science, Mathematics, Artificial Intelligence, Statistics, etc.) from a top university, and academic or professional experience with a similar skill set.

If you feel this role fits your background, we should arrange a time to discuss my client in further detail. Please call or email at your earliest convenience.


Best regards,

Christopher Taranto
Alpha Advisors, LLC.
516-584-6930 (office)
914-424-0484 (cell)
ctaranto@alphaadvisorsllc.com

#1
Posted 03/01/2012 07:50 PM   
Hello Christopher,

Are you sure you quoted the run time correctly (sub-microsecond calculations)? In my opinion, that desired calculation time is infeasible with today's GPUs.

For example, if we used a GPU with a clock rate of 1 GHz (which is fast), then 1 us equates to 1,000 GPU cycles. Several costs would eat into that cycle budget:

[list]
[*]These calculations would involve at least one global memory write transaction (required if you want to use the output), and that transaction takes at least 400 cycles (CUDA C Programming Guide, section 5.2.3). So roughly 40% of that 1 us would be spent writing a single number so it could be used outside the GPU.
[*]Register latencies come into play when performing dependent arithmetic operations in sequence (e.g., 6*5/10); each costs about 22 cycles according to the same section of the programming guide.
[*]More cycles are consumed transferring data between host memory and the graphics card over PCI-Express.
[/list]

In my experience, the kernels I write very rarely run in under 10 us. Those that do are simple kernels such as sum reductions, which add up a set of numbers. Most take hundreds of microseconds, more than two orders of magnitude above your specification.
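
For reference, a minimal sketch of how one might measure that floor, timing an almost-empty kernel with CUDA events (the numbers will vary with the card, driver and PCI-Express setup; the point is only that even a trivial launch tends to land in the microsecond range):

[code]
// Minimal sketch: time an almost-empty kernel with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void trivialKernel(float *out)
{
    // A single global memory write -- the bare minimum if the host
    // ever wants to see the result.
    out[0] = 6.0f * 5.0f / 10.0f;
}

int main()
{
    float *d_out = 0;
    cudaMalloc((void **)&d_out, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time driver overhead is not measured.
    trivialKernel<<<1, 1>>>(d_out);
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    trivialKernel<<<1, 1>>>(d_out);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // resolution is roughly 0.5 us
    printf("trivial kernel: %.1f us\n", ms * 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
[/code]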

Hence my concern about complex trade computations completing in less than 1 us.

Hope this helps,

-DL

#2
Posted 03/07/2012 11:09 PM   
Hi, will you consider a PhD graduate from HK?

#3
Posted 03/09/2012 04:58 AM   
[quote name='slugwarz' date='07 March 2012 - 07:09 PM' timestamp='1331161782' post='1379956']
Hello Christopher,

Are you sure you quoted the run time correctly (sub-microsecond calculations)? In my opinion, that desired calculation time is infeasible with today's GPUs.

For example, if we used a GPU with a clock rate of 1 GHz (which is fast), then 1 us equates to 1,000 GPU cycles. Several costs would eat into that cycle budget:

[list]
[*]These calculations would involve at least one global memory write transaction (required if you want to use the output), and that transaction takes at least 400 cycles (CUDA C Programming Guide, section 5.2.3). So roughly 40% of that 1 us would be spent writing a single number so it could be used outside the GPU.
[*]Register latencies come into play when performing dependent arithmetic operations in sequence (e.g., 6*5/10); each costs about 22 cycles according to the same section of the programming guide.
[*]More cycles are consumed transferring data between host memory and the graphics card over PCI-Express.
[/list]

In my experience, the kernels I write very rarely run in under 10 us. Those that do are simple kernels such as sum reductions, which add up a set of numbers. Most take hundreds of microseconds, more than two orders of magnitude above your specification.

Hence my concern about complex trade computations completing in less than 1 us.

Hope this helps,

-DL
[/quote]
In any case, it might be an interesting exercise to write a kernel that exchanges information through pinned, mapped memory: the kernel polls that memory over PCI-Express until data is present, does a simple calculation, and writes the result back into the same pinned mapped memory, just to get an idea of the minimum possible communication latency (given that the communication is only a few bytes).
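
A minimal sketch of that experiment might look roughly like this (the mailbox layout and field names are made up for illustration; it assumes a device reporting canMapHostMemory and compute capability >= 2.0 for __threadfence_system(), and note that a spinning kernel can trip the display watchdog on some systems):

[code]
// Host and device share a small "mailbox" in pinned, mapped (zero-copy) host
// memory. A persistent one-thread kernel spins until the host posts a request,
// does a simple calculation, and writes the answer back; the host spins on it.
#include <cstdio>
#include <cuda_runtime.h>

struct Mailbox {
    volatile int   request;   // written by host, read by device
    volatile float input;
    volatile float result;
    volatile int   done;      // written by device, read by host
};

__global__ void pingPongKernel(Mailbox *box)
{
    while (box->request == 0) { /* poll host memory over PCI-Express */ }
    box->result = box->input * 2.0f;   // the "simple calculation"
    __threadfence_system();            // make the result visible before the flag
    box->done = 1;
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);

    Mailbox *h_box = 0;
    cudaHostAlloc((void **)&h_box, sizeof(Mailbox), cudaHostAllocMapped);
    h_box->request = 0;
    h_box->done    = 0;

    Mailbox *d_box = 0;
    cudaHostGetDevicePointer((void **)&d_box, h_box, 0);

    pingPongKernel<<<1, 1>>>(d_box);   // kernel starts polling immediately

    h_box->input   = 21.0f;            // post the request...
    h_box->request = 1;

    while (h_box->done == 0) { /* ...and spin on the reply */ }
    printf("result = %f\n", h_box->result);

    cudaDeviceSynchronize();
    cudaFreeHost(h_box);
    return 0;
}
[/code]

Wrapping the request/reply round trip in a host-side timer would then give a rough lower bound on the PCI-Express communication latency alone, before any real computation is added.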

But, as slugwarz wrote, I am pretty sure the communication and memory-access latency will be so high that a GPU has no chance of beating a fast modern CPU (especially since the CPU will already have the data in its own cache while writing it to the pinned mapped memory!).

Parallelis.com, Parallel-computing technologies and benchmarks. Current Projects: OpenCL Chess & OpenCL Benchmark

#4
Posted 04/11/2012 06:48 PM   
[quote name='Chris @ Alpha Advisors' date='02 March 2012 - 04:50 AM' timestamp='1330631400' post='1377263']
Our client has been successfully applying their disciplined, process-driven investment trading strategies for more than 10 years. These strategies, which are traded across various markets and asset classes, are based on statistical models developed using rigorous mathematical analysis. Their R&D team has an opening for a software engineer to build systems (leveraging CUDA and GPU technologies) that perform calculations and execute trades in under a microsecond. Candidates will have a Bachelor's degree (Master's or PhD preferred, in Machine Learning, Computer Science, Mathematics, Artificial Intelligence, Statistics, etc.) from a top university, and academic or professional experience with a similar skill set.

If you feel this role fits your background, we should arrange a time to discuss my client in further detail. Please call or email at your earliest convenience.


Best regards,

Christopher Taranto
Alpha Advisors, LLC.
516-584-6930 (office)
914-424-0484 (cell)
ctaranto@alphaadvisorsllc.com
[/quote]

Do you hire CUDA developers outside the US?

#5
Posted 04/23/2012 03:01 PM   