GPU 4 Finance Some results.

I am sharing the GPU speedups for Binomial and trinomial tree implementation for option-pricing.

Parallel Algorithm Courtesy: Mr. Alexandros Gerbessiotis’ paper on parallelizing binomial & trinomial tree option pricing! Links to paper available inside the XLS!

Binomial Tree speedup is around 125x against un-optimized CPU code. It hovers around 65 to 85x against optimized (-O2) CPU code!

Trinomial ranges around 96x against un-optimized CPU code. It is around 27x against optimized (-O2) CPU code!

The CPU used was AMD Athlon running at 2.41GHz. GPU is 8800GTX.

The speedup factors do NOT include “memcopy” times (for both input and output)
Subtract some 10x to 15x for it to know the really real speedups. More in the note section of the XL sheet.

The attached XLS has 3 sheets inside! Two of them are for trinomial (2 different approaches to parallelizing) and one of them is for binomial! Note that the results published in the XL sheets are only for comparisons against un-optimized code! We will be publishing results soon with optimized CPU code!

Best Regards,
Sarnath
GPUForFinance.xls (74 KB)

Hi Sarnath,

Very nice results, keep up the good work…
I one day hope that my Raytracer also has a good speedup. Theoretically it can be sped up to a max of 250x

If it’s ready I will also post my results. I know it’s not financial but still…

Thank you. I would be amazed if you can make it 250x. Thats enormous. Whats the CPU speed against which it is compared to?

OR

Do you intend to use 9800 GX2? :)

Let Good Luck be with you. And, Do post your findings!

Best Regards,

Sarnath

If you check my username then you can see that I’m using a

quad Xeon 2.66

Intel Quad Xeon X5355 @ 2.66GHz

2GB ram

NVIDIA GeForce 8800GTS 320MB

Linux Fedora Core 6

The sequential algorithm is working on 1 core.

We also runned in @ omp then i went from 3 sec to 1 sec. And that is the calculation of the radiological depth for dose calculation for about 30,000,000 elements.

Assuming GTS has 16 Multiprocessors and 8 cores per MP, It comes to 128 cores running at 1.35GHz.

128*1.35/2.66 gives me 65.

As far as processing energy is concerned, you may visualize it as 65 CPUs running at 2.66GHz.

Since latency hiding for memory-fetches and other register-hazard issues are DONE extremely well in CUDA – you may get more than 65x.
Also note that similar kind of latency hiding is done by the CPU as well by having multiple-pipelines and having out-of-order executions… But generally it canNOT beat the super-threaded model of CUDA.

Still – 250x looks too much for this setup. How do you justify your vision of 250x? Kindly share your views.

Best Regards,
Sarnath

I saw in your sheet you need to split up your program because the maximum number of blocks is 65535. It is however only the maximum for each dimension of the grid, so you can have 65535*65535=4294836225 blocks. Just use dim3(65535,65535,1) as input for your gridSize.

And in your kernel use blockIdx.x + __umul24(blockIdx.y * gridDm.x) instead of blockIdx.x

Your speedups look very nice to me, I hope I will reach them for my next project where a single simulation takes 12 hours or longer.

It is not the complete dose calculation we are porting to the GPU, at this time it is only the Raytracing part… After some tests we had seen that if we calculate about 70,000,000 voxels it only takes 2ms that is calculation time and then we need to copy the CT-dataset to the device that is approximately 50MB and copy back to device which is 100MB

If we will keep these good speedups we will switch to Tesla’s because this is much cheaper than the current clinical computers. Even if you buy 10 of them :P

So there’s your finance :)

Hi Denis,

Yes, I knew it. But scary of doing 2 dimensions… I am more comfortable with a single dimension… THanks for pointing out.

umul24? – 24 bits right – it can give answers only to a max of 16MB. 65535 * 65535 = 4GB. So will it still work??? OR does the 24-bits stand for the input operand size?

Thanks for your words on the speedup. I did NOT spend time on trinomial as much as I did for binomial. May b, therez scope for improvement there too.

All right,

Good Luck on your 12hr project,

Keep us posted.

Best Regards,

Sarnath

Well, it is not scary at all, just a shame that with 2 dimensions another calculation is needed. And I did not think about the __umul24, so that might have to become a normal * :(

My 12hr project is on hold I am afraid. My PC died, or at least it will not boot up anymore, it doesn’t even seem to reach the BIOS…

I saw your mesage on the other topic… So sad… btw, Did you switch on your monitor?? :-)

Denis,

I screwed up a 4 CPU massive server box while writing some low-level code. It did not even reach BIOS…

What happened was :

  1. Most BIOSes chips are flashable and there would be switch on your motherboard – that allows writing to BIOS.

  2. This switch was somehow – switched on on my machine.

  3. So, my scrwed up low-level code somehow find its way to re-write the BIOS.

To fix this:

  1. Reach out your motherboard manual or CD.

  2. Usually it will have a BIOS repgramming procedure.

  3. mostly you may need to download software from your BIOS provider

  4. Cut a boot floppy with that flash program.

  5. Change a SWITCH on your motherboard to boot from an alternate ROM (not writable) BIOS which will just boot from floppy and flash the writeable BIOS.

    Go ahead and boot the system and get your BIOS flashed.

  6. Change necessary SWITCH settings again as per your motherboard manual.

  7. Make sure you remove your floppy.

And yes, make sure that “writeable” BIOS switch is OFFed

I’ll check tomorrow. I think I did not see any switches mentioned in the manual, and as far as I can see the MB (big watercooling unit on it), I did not see any.

It did not happen because of some low-level programming (I find high level difficult enough). I came back to the pc that had just a browser open, to find I had no mouse-cursor anymore & keyboard was also not working anymore. So I performed a reset and have not seen anything on screen anymore, also it does not try to boot off cd/hd.

Well, good thing is that we have another 3 years of warranty on it, now I will just hope that next business day means that I can continue next week at least :D

Most of the recent MBs have some kind of post messaging system. And indicators of what is wrong. like the Dells have i think 4 or 5 leds that show whats wrong. My MB at home has a LED display showing whats wrong. So look at what post (Power On Self Test) messages are there.

Yeah, dell has 4 leds, but it stops before… The power-button is a constant orange, which means trouble with some device according to the manual…

So I took all devices away (all but CPU & MB), and still the same trouble…

anyhow, I found out we have next business day help until 2010, so I should be up & running again next week.

I think it is “next day business” warranty and then you should be up and running tomorrow… But it sounds like you killed your CPU or part of the MB.

I was not even next to it when it happened, so I did not kill anybody :D
Trouble with the support from Dell is that for XPS machines you get someone from India, so last time I have had more than a week of emails going back & forth before someone came to fix it. Yesterday I tried to call, but that is a whole different kind of experience…

Hi Sarnath,

In my experience, at least with Monte Carlo simulation, financial institutions are more interested in achieving high accuracy with smaller numbers of options than you are running (say 50 to 200 options). How does your performance compare to the CPU on smaller problem sizes (fewer options, possibly with more steps per option)?

After talking to people in the finance industry, we optimized the MonteCarlo example in the SDK to achieve more linear performance as the number of options and path samples is reduced (the goal is to achieve a constant samples/second rate regardless of option & sample count). Victor and I put some discussion of this in our update to the MonteCarlo whitepaper in CUDA SDK 1.1.

Thanks for your hard work,

Mark

Hi Mark,

Thanks. I will post the results in an hour.

I would do it for the binomial thing – because I use multiple-blocks to process one option and so will be able to spawn more blocks – i need 96 blocks to hide my latencies.

I think I have asked this question before. but I think I would re-confirm it. Does NVIDIA offer embedded GPU cores – say something that can be programmed to an FPGA and accessed by a high-speed bus (not necessarily PCI)?

Best Regards,
Sarnath

Hi Mark,

I have added the necessary details to the “Binomial” sheet. Please download the attachment again.

I have not benchmarked trinomial as I am sure its gonna take a beating (as I dedicated 1 full option to a block and I run only 32 threads per block).

For bigger time-steps, the speed up hovers around 100x… (72 to 120)

For very less time-steps and very less options, the speed up ranges from 20x to 50x.

Let me know if you want me to do this for Trinomial too.

btw,

I am surprised that financial institutions are asking benchmarks for lesser number of options… Coz, As I understand, they usually run their analytics on historical data which runs in tons…

Our own prospective customer wants to run Monte Carlo in large volumes. Well, thats what our business consultant said…

Thank you,

Best Regards,

Sarnath

I was too. But it’s not that they don’t want to run large data sizes, it’s just that in selecting a new computational platfom they need to compare on even footing to what they can do now on the processors they presently employ. Therefore, they ask to see performance on smaller data sizes for comparison.

Glad to see your performance doesn’t degrade much with smaller option counts as long as the step count is high enough.

Thanks!

Mark