I am sharing GPU speedups for binomial and trinomial tree implementations of option pricing.
Parallel algorithm courtesy: Mr. Alexandros Gerbessiotis’ paper on parallelizing binomial & trinomial tree option pricing! A link to the paper is available inside the XLS!
Binomial tree speedup is around 125x against unoptimized CPU code. It hovers around 65x to 85x against optimized (-O2) CPU code!
Trinomial ranges around 96x against unoptimized CPU code, and around 27x against optimized (-O2) CPU code!
The CPU used was an AMD Athlon running at 2.41GHz. The GPU is an 8800GTX.
The speedup factors do NOT include “memcpy” times (for both input and output).
Subtract some 10x to 15x to get the real speedups. More in the notes section of the XLS.
The attached XLS has 3 sheets inside! Two of them are for trinomial (two different approaches to parallelizing) and one is for binomial! Note that the results published in the XLS are only comparisons against unoptimized code! We will be publishing results against optimized CPU code soon!
If you check my username you can see that I’m using:
- Intel Quad Xeon X5355 @ 2.66GHz
- 2GB RAM
- NVIDIA GeForce 8800GTS 320MB
- Linux Fedora Core 6
The sequential algorithm runs on 1 core.
We also ran it with OpenMP, which took it from 3 s to 1 s. That is the calculation of the radiological depth for dose calculation over about 30,000,000 elements.
Assuming the GTS has 16 multiprocessors and 8 cores per MP, that comes to 128 cores running at 1.35GHz.
128 * 1.35 / 2.66 gives me ~65.
As far as processing power is concerned, you can visualize it as 65 CPUs running at 2.66GHz.
Since latency hiding for memory fetches and other register-hazard issues is done extremely well in CUDA, you may get more than 65x.
Also note that the CPU does a similar kind of latency hiding, with multiple pipelines and out-of-order execution… but it generally canNOT beat the super-threaded model of CUDA.
Still, 250x looks like too much for this setup. How do you justify your figure of 250x? Kindly share your views.
I saw in your sheet that you need to split up your program because the maximum number of blocks is 65535. That is, however, only the maximum per grid dimension, so you can have 65535*65535 = 4294836225 blocks. Just use dim3(65535, 65535, 1) as your grid size.
And in your kernel use blockIdx.x + __umul24(blockIdx.y, gridDim.x) instead of blockIdx.x.
Your speedups look very nice to me; I hope I will reach them on my next project, where a single simulation takes 12 hours or longer.
It is not the complete dose calculation we are porting to the GPU; at this time it is only the raytracing part. After some tests we saw that calculating about 70,000,000 voxels takes only 2ms of computation time, but then we need to copy the CT dataset to the device (approximately 50MB) and copy the result back from the device (100MB).
If we keep these good speedups we will switch to Teslas, because they are much cheaper than the current clinical computers, even if you buy 10 of them :P
Yes, I knew it. But I’m wary of going to 2 dimensions… I am more comfortable with a single dimension… Thanks for pointing it out.
__umul24? – 24 bits, right? – so it can only give answers up to 16M. 65535 * 65535 ≈ 4G. So will it still work??? OR does the 24 bits refer to the input operand size?
Thanks for your words on the speedup. I did NOT spend as much time on trinomial as I did on binomial. Maybe there’s scope for improvement there too.
Well, it is not scary at all, just a shame that with 2 dimensions another calculation is needed. And I did not think about __umul24’s operand limit, so that might have to become a normal * :(
My 12-hour project is on hold, I’m afraid. My PC died, or at least it will not boot up anymore; it doesn’t even seem to reach the BIOS…
I’ll check tomorrow. I don’t think the manual mentioned any switches, and as far as I can see of the MB (there’s a big watercooling unit on it), there aren’t any.
It did not happen because of some low-level programming (I find high-level difficult enough). I came back to the PC, which had just a browser open, to find I had no mouse cursor anymore and the keyboard was not working either. So I performed a reset and have not seen anything on screen since; it does not even try to boot from CD/HD.
Well, the good thing is that we have another 3 years of warranty on it; now I just hope that “next business day” means I can continue next week at least :D
Most recent MBs have some kind of POST messaging system and indicators of what is wrong. The Dells, I think, have 4 or 5 LEDs that show what’s wrong; my MB at home has an LED display for it. So look at what POST (Power On Self Test) messages are there.
I was not even next to it when it happened, so I did not kill anybody :D
Trouble with Dell support is that for XPS machines you get someone from India, so last time I had more than a week of emails going back and forth before someone came to fix it. Yesterday I tried to call, but that is a whole different kind of experience…
In my experience, at least with Monte Carlo simulation, financial institutions are more interested in achieving high accuracy with smaller numbers of options than you are running (say 50 to 200 options). How does your performance compare to the CPU on smaller problem sizes (fewer options, possibly with more steps per option)?
After talking to people in the finance industry, we optimized the MonteCarlo example in the SDK to achieve more linear performance as the number of options and path samples are reduced (the goal is a constant samples/second rate regardless of option & sample count). Victor and I put some discussion of this in our update to the MonteCarlo whitepaper in CUDA SDK 1.1.
I would do it for the binomial case, because I use multiple blocks to process one option and so will be able to spawn more blocks — I need 96 blocks to hide my latencies.
I think I have asked this question before, but I would like to re-confirm it. Does NVIDIA offer embedded GPU cores — say, something that can be programmed into an FPGA and accessed over a high-speed bus (not necessarily PCI)?
I have added the necessary details to the “Binomial” sheet. Please download the attachment again.
I have not benchmarked trinomial, as I am sure it’s going to take a beating (I dedicated 1 full option to a block and run only 32 threads per block).
For larger time-step counts, the speedup hovers around 100x (72x to 120x).
For very few time-steps and very few options, the speedup ranges from 20x to 50x.
Let me know if you want me to do this for Trinomial too.
btw,
I am surprised that financial institutions are asking for benchmarks with fewer options… Because, as I understand it, they usually run their analytics on historical data, which runs into tons…
Our own prospective customer wants to run Monte Carlo in large volumes. Well, that’s what our business consultant said…
I was too. But it’s not that they don’t want to run large data sizes; it’s just that in selecting a new computational platform they need to compare on even footing with what they can do now on the processors they presently employ. Therefore, they ask to see performance on smaller data sizes for comparison.
Glad to see your performance doesn’t degrade much with smaller option counts as long as the step count is high enough.