GPU 4 Finance Some results.
I am sharing the GPU speedups for binomial and trinomial tree implementations for option pricing.


[b]Parallel Algorithm Courtesy[/b]: Mr. Alexandros Gerbessiotis' paper on parallelizing binomial & trinomial tree option pricing! Links to paper available inside the XLS!

[b]Binomial Tree speedup [/b]is around 125x against un-optimized CPU code. It hovers around [b] 65 to 85x against optimized (-O2) CPU code! [/b]

[i][b]Trinomial [/b][/i]ranges around 96x against un-optimized CPU code. It is around [b] [i]27x against optimized (-O2) CPU code![/i] [/b]

The CPU used was AMD Athlon running at 2.41GHz. GPU is 8800GTX.

The speedup factors do NOT include "memcopy" times (for both input and output).
Subtract roughly 10x to 15x from them to get the real end-to-end speedups. More in the notes section of the XLS sheet.
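
To make the memcopy caveat concrete, here is a minimal sketch of how the kernel-only and end-to-end figures relate. The function name and any timings fed into it are hypothetical, not measurements from the XLS.

```cpp
#include <cassert>

// Speedup of the GPU over the CPU for one pricing run.
// copy_ms is the combined host-to-device and device-to-host transfer time;
// pass 0.0 to get the kernel-only figure of the kind quoted in this post.
// The name and parameters are illustrative assumptions.
double speedup(double cpu_ms, double kernel_ms, double copy_ms) {
    assert(cpu_ms > 0.0 && kernel_ms > 0.0 && copy_ms >= 0.0);
    return cpu_ms / (kernel_ms + copy_ms);
}
```

With made-up timings of 1000 ms on the CPU, 8 ms in the kernel and 2 ms of copies, `speedup(1000.0, 8.0, 0.0)` gives 125 while `speedup(1000.0, 8.0, 2.0)` gives 100, which is how even a short transfer eats into the headline number.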

The attached XLS has 3 sheets inside! Two of them are for trinomial (2 different approaches to parallelizing) and one of them is for binomial! Note that the results published in the XL sheets are only for comparisons against un-optimized code! We will be publishing results soon with optimized CPU code!


Best Regards,
Sarnath
Attachments

GPUForFinance.xls

Ignorance Rules; Knowledge Liberates!

#1
Posted 03/19/2008 11:48 AM   
Hi Sarnath,

Very nice results, keep up the good work...
I one day hope that my Raytracer also has a good speedup. Theoretically it can be sped up to a max of 250x

If it's ready I will also post my results. I know it's not financial but still...

#2
Posted 03/19/2008 02:46 PM   
[quote name='jordyvaneijk' date='Mar 19 2008, 08:16 PM']Hi Sarnath,

Very nice results, keep up the good work...
I one day hope that my Raytracer also has a good speedup. Theoretically it can be sped up to a max of 250x

If it's ready I will also post my results. I know it's not financial but still...
[/quote]

Thank you. I would be amazed if you can make it 250x. That's enormous. What is the CPU speed it is compared against?
OR
Do you intend to use 9800 GX2? :)

Good luck, and do post your findings!

Best Regards,
Sarnath

Ignorance Rules; Knowledge Liberates!

#3
Posted 03/19/2008 03:25 PM   
[quote name='Sarnath' date='Mar 19 2008, 05:25 PM']Thank you. I would be amazed if you can make it 250x. Thats enormous.  Whats the CPU speed against which it is compared to?
OR
Do you intend to use 9800 GX2? :)

Let Good Luck be with you. And, Do post your findings!

Best Regards,
Sarnath
[/quote]

If you check my username then you can see that I'm using a
quad Xeon 2.66

Intel Quad Xeon X5355 @ 2.66GHz
2GB ram
NVIDIA GeForce 8800GTS 320MB
Linux Fedora Core 6

The sequential algorithm is working on 1 core.
We also ran it with OpenMP, which took it from 3 sec to 1 sec. That is the calculation of the radiological depth for dose calculation for about 30,000,000 elements.

#4
Posted 03/19/2008 06:46 PM   
Assuming the GTS has 16 multiprocessors and 8 cores per MP, it comes to 128 cores running at 1.35GHz.

128*1.35/2.66 gives me 65.

As far as raw processing power is concerned, you may visualize it as 65 CPUs running at 2.66GHz.
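
The back-of-envelope arithmetic above can be written down as a small helper. This is only a sketch of the estimate, not a performance model: the function name is made up, and it ignores memory bandwidth, SIMD width and instructions per clock on both sides.

```cpp
#include <cassert>

// Rough yardstick: how many CPU cores at cpu_ghz would match the aggregate
// clock rate of the GPU's scalar cores. Name and signature are hypothetical.
double equivalent_cpu_cores(int multiprocessors, int cores_per_mp,
                            double gpu_ghz, double cpu_ghz) {
    assert(multiprocessors > 0 && cores_per_mp > 0);
    assert(gpu_ghz > 0.0 && cpu_ghz > 0.0);
    return multiprocessors * cores_per_mp * gpu_ghz / cpu_ghz;
}
```

Calling `equivalent_cpu_cores(16, 8, 1.35, 2.66)` reproduces the figure of roughly 65 under the assumed core count and clock.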

Since latency hiding for memory fetches and register hazards is done extremely well in CUDA, you may get more than 65x.
Also note that a similar kind of latency hiding is done by the CPU as well, through multiple pipelines and out-of-order execution, but generally it cannot beat the super-threaded model of CUDA.

Still -- 250x looks too much for this setup. How do you justify your vision of 250x? Kindly share your views.

Best Regards,
Sarnath

Ignorance Rules; Knowledge Liberates!

#5
Posted 03/20/2008 06:54 AM   
I saw in your sheet you need to split up your program because the maximum number of blocks is 65535. It is however only the maximum for each dimension of the grid, so you can have 65535*65535=4294836225 blocks. Just use dim3(65535,65535,1) as input for your gridSize.

And in your kernel use blockIdx.x + __umul24(blockIdx.y, gridDim.x) instead of blockIdx.x
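
For anyone wanting to try the 2D grid, here is a minimal host-side sketch in plain C++ of the bookkeeping involved. `Grid2D`, `make_grid` and `linear_block_id` are names made up for illustration; in a real kernel the last function would read `blockIdx` and `gridDim` directly.

```cpp
#include <cassert>
#include <cstdint>

const uint32_t kMaxGridDim = 65535;  // per-dimension block limit quoted above

struct Grid2D { uint32_t x; uint32_t y; };

// Choose a 2D grid that holds at least total_blocks blocks.
Grid2D make_grid(uint64_t total_blocks) {
    assert(total_blocks > 0);
    assert(total_blocks <= static_cast<uint64_t>(kMaxGridDim) * kMaxGridDim);
    uint32_t gx = total_blocks < kMaxGridDim
                      ? static_cast<uint32_t>(total_blocks)
                      : kMaxGridDim;
    // Round up so that gx * gy covers every block.
    uint32_t gy = static_cast<uint32_t>((total_blocks + gx - 1) / gx);
    return {gx, gy};
}

// Host-side model of the in-kernel expression
//   blockIdx.x + blockIdx.y * gridDim.x
uint32_t linear_block_id(uint32_t block_x, uint32_t block_y, uint32_t grid_x) {
    return block_x + block_y * grid_x;
}
```

Because the grid is rounded up, the last row can contain blocks whose linear ID is at or beyond the real block count, so a kernel using this scheme would need to return early for those.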

Your speedups look very nice to me, I hope I will reach them for my next project where a single simulation takes 12 hours or longer.

#6
Posted 03/20/2008 07:47 AM   
[quote name='Sarnath' date='Mar 20 2008, 08:54 AM']Assuming GTS has 16 Multiprocessors and 8 cores per MP, It comes to 128 cores running at 1.35GHz.

128*1.35/2.66 gives me 65.

As far as processing energy is concerned, you may visualize it as 65 CPUs running at 2.66GHz.

Since latency hiding for memory-fetches and other register-hazard issues are DONE extremely well in CUDA -- you may get more than 65x.
Also note that similar kind of latency hiding is done by the CPU as well by having multiple-pipelines and having out-of-order executions.... But generally it canNOT beat the super-threaded model of CUDA.

Still -- 250x looks too much for this setup. How do you justify your vision of 250x? Kindly share your views.

Best Regards,
Sarnath
[/quote]


It is not the complete dose calculation we are porting to the GPU; at this time it is only the raytracing part. After some tests we saw that calculating about 70,000,000 voxels takes only 2ms of calculation time. On top of that we need to copy the CT dataset to the device, which is approximately 50MB, and copy the result back to the host, which is 100MB.

If we keep these good speedups we will switch to Teslas, because they are much cheaper than the current clinical computers, even if you buy 10 of them :P
So there's your finance :)

#7
Posted 03/20/2008 08:33 AM   
[quote name='DenisR' date='Mar 20 2008, 01:17 PM']I saw in your sheet you need to split up your program because the maximum number of blocks is 65535. It is however only the maximum for each dimension of the grid, so you can have 65535*65535=4294836225 blocks. Just use dim3(65535,65535,1) as input for your gridSize.

And in your kernel use blockIdx.x + __umul24(blockIdx.y, gridDim.x) instead of blockIdx.x

Your speedups look very nice to me, I hope I will reach them for my next project where a single simulation takes 12 hours or longer.
[/quote]

Hi Denis,

Yes, I knew that, but I am scared of doing 2 dimensions... I am more comfortable with a single dimension... Thanks for pointing it out.

umul24? 24 bits, right? Then it can give answers only up to a max of 16M (2^24), while 65535 * 65535 is about 4G (just under 2^32). So will it still work? Or do the 24 bits refer to the input operand size?
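
For reference, the CUDA documentation describes `__umul24` as multiplying the low 24 bits of each operand and returning the low 32 bits of the product, so the 24 bits refer to the inputs. A minimal host-side model in plain C++ (the names `umul24_model` and `check_grid_product` are made up for illustration, and this is not the intrinsic itself):

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of __umul24(x, y): mask each operand to its low 24 bits,
// multiply, and keep the low 32 bits of the product.
uint32_t umul24_model(uint32_t x, uint32_t y) {
    const uint32_t mask24 = 0x00FFFFFFu;
    uint64_t product = static_cast<uint64_t>(x & mask24) *
                       static_cast<uint64_t>(y & mask24);
    return static_cast<uint32_t>(product);  // truncates to the low 32 bits
}

// Grid indices are at most 65535, so masking leaves them unchanged and the
// largest product still fits in 32 bits.
void check_grid_product() {
    assert(umul24_model(65535u, 65535u) == 65535ull * 65535ull);
}
```

Under that reading, block indices and grid dimensions both fit in 24 bits, and 65535 * 65535 = 4294836225 is still below 2^32, so the 2D index expression should not overflow.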

Thanks for your words on the speedup. I did NOT spend as much time on trinomial as I did on binomial. Maybe there is scope for improvement there too.

All right,

Good Luck on your 12hr project,
Keep us posted.

Best Regards,
Sarnath

Ignorance Rules; Knowledge Liberates!

#8
Posted 03/20/2008 08:38 AM   
[quote name='Sarnath' date='Mar 20 2008, 10:38 AM']Hi Denis,

Yes, I knew it. But scary of doing 2 dimensions... I am more comfortable with a single dimension... THanks for pointing out.

umul24? -- 24 bits right -- it can give answers only to a max of 16MB. 65535 * 65535 = 4GB. So will it still work??? OR does the 24-bits stand for the input operand size?

Thanks for your words on the speedup. I did NOT spend time on trinomial as much as I did for binomial. May b, therez scope for improvement there too.

All right,

Good Luck on your 12hr project,
Keep us posted.

Best Regards,
Sarnath
[/quote]

Well, it is not scary at all, just a shame that with 2 dimensions another calculation is needed. And I did not think about the __umul24, so that might have to become a normal * :(

My 12hr project is on hold I am afraid. My PC died, or at least it will not boot up anymore, it doesn't even seem to reach the BIOS...

#9
Posted 03/20/2008 08:47 AM   
[quote name='DenisR' date='Mar 20 2008, 02:17 PM']My 12hr project is on hold I am afraid. My PC died, or at least it will not boot up anymore, it doesn't even seem to reach the BIOS...
[/quote]

I saw your message on the other topic... So sad... btw, did you switch on your monitor?? :-)

Ignorance Rules; Knowledge Liberates!

#10
Posted 03/20/2008 08:50 AM   
[quote name='DenisR' date='Mar 20 2008, 02:17 PM']Well, it is not scary at all, just a shame that with 2 dimensions another calculation is needed. And I did not think about the __umul24, so that might have to become a normal * :(

My 12hr project is on hold I am afraid. My PC died, or at least it will not boot up anymore, it doesn't even seem to reach the BIOS...
[/quote]

Denis,

I screwed up a 4 CPU massive server box while writing some low-level code. It did not even reach BIOS....

What happened was:
1. Most BIOS chips are flashable, and there is usually a switch on the motherboard that allows writing to the BIOS.
2. This switch was somehow switched on on my machine.
3. So my screwed-up low-level code somehow found its way to rewriting the BIOS.

To fix this:
1. Get out your motherboard manual or CD.
2. Usually it will have a BIOS reprogramming procedure.
3. Most likely you will need to download software from your BIOS provider.
4. Make a boot floppy with that flash program.
5. Change a switch on your motherboard to boot from an alternate (non-writable) ROM BIOS, which will just boot from the floppy and flash the writable BIOS. Go ahead and boot the system and get your BIOS flashed.
6. Change the necessary switch settings back as per your motherboard manual.
7. Make sure you remove your floppy.

And yes, make sure that the "writable" BIOS switch is turned off.

Ignorance Rules; Knowledge Liberates!

#11
Posted 03/20/2008 08:55 AM   
I'll check tomorrow. I think I did not see any switches mentioned in the manual, and as far as I can see the MB (big watercooling unit on it), I did not see any.

It did not happen because of some low-level programming (I find high level difficult enough). I came back to the pc that had just a browser open, to find I had no mouse-cursor anymore & keyboard was also not working anymore. So I performed a reset and have not seen anything on screen anymore, also it does not try to boot off cd/hd.

Well, good thing is that we have another 3 years of warranty on it, now I will just hope that next business day means that I can continue next week at least :D

#12
Posted 03/20/2008 11:53 AM   
[quote name='DenisR' date='Mar 20 2008, 01:53 PM']I'll check tomorrow. I think I did not see any switches mentioned in the manual, and as far as I can see the MB (big watercooling unit on it), I did not see any.

It did not happen because of some low-level programming (I find high level difficult enough). I came back to the pc that had just a browser open, to find I had no mouse-cursor anymore & keyboard was also not working anymore. So I performed a reset and have not seen anything on screen anymore, also it does not try to boot off cd/hd.

Well, good thing is that we have another 3 years of warranty on it, now I will just hope that next business day means that I can continue next week at least  :D
[/quote]

Most recent MBs have some kind of POST (Power On Self Test) messaging system, with indicators of what is wrong. The Dells, for example, have I think 4 or 5 LEDs that show what's wrong, and my MB at home has an LED display for the same purpose. So look at what POST messages are there.

#13
Posted 03/20/2008 12:01 PM   
Yeah, dell has 4 leds, but it stops before.... The power-button is a constant orange, which means trouble with some device according to the manual...

So I took all devices away (all but CPU & MB), and still the same trouble...

anyhow, I found out we have next business day help until 2010, so I should be up & running again next week.

#14
Posted 03/20/2008 12:31 PM   
[quote name='DenisR' date='Mar 20 2008, 02:31 PM']Yeah, dell has 4 leds, but it stops before.... The power-button is a constant orange, which means trouble with some device according to the manual...

So I took all devices away (all but CPU & MB), and still the same trouble...

anyhow, I found out we have next business day help until 2010, so I should be up & running again next week.
[/quote]

I think it is "next day business" warranty and then you should be up and running tomorrow... But it sounds like you killed your CPU or part of the MB.

#15
Posted 03/20/2008 01:16 PM   