Concurrent bandwidth test
During the course of testing various configurations, I wrote a concurrent bandwidth test that those of you interested in multi-GPU configurations will probably find useful. It's Linux-only (because I am too lazy to use Windows threads), and you can compile it with

gcc -o concBandwidthTest -std=c99 -I /usr/local/cuda/include concBandwidthTest.c -L /usr/local/cuda/lib -lcuda -lpthread

Replace the paths with whatever is appropriate for your system. I'm fairly confident in the results; it behaves exactly as I would expect with up to 3 devices, but I haven't tested past that. For example, my results with a Harpertown Xeon, an FX 1700 (PCIe 1.0 16x), and a C1060 (PCIe 2.0 16x):

[tim@ concBandwidthTest]$ ./concBandwidthTest 0 1
Device 0 took 1393.192749 ms
Device 1 took 2053.388184 ms
Average HtoD bandwidth in MB/s: 7710.564941
Device 0 took 1935.400879 ms
Device 1 took 2042.895264 ms
Average DtoH bandwidth in MB/s: 6439.616943

As you would probably expect, it's hitting FSB limitations quickly, so I'm interested to see how it runs on Nehalem. The code's fairly ugly to get around some casting nonsense; if you want to clean it up or add anything to it, feel free (or if you want anything added, let me know and I'll see what I can do).
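
For anyone who'd rather roll their own before poking at mine, the basic structure is just one pthread per device: each thread creates its own context, pins a host buffer, times a batch of large copies, and the per-device rates get summed at the end. Here's a stripped-down sketch of that shape (HtoD only; the buffer size, repeat count, and names below are illustrative, not the actual source, and the real tool also does DtoH and, as of 1.1, bidirectional):
[code]/* Stripped-down illustration of the structure, not the actual concBandwidthTest source. */
#define _XOPEN_SOURCE 600   /* for gettimeofday() under -std=c99 */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>
#include <cuda.h>

#define COPY_BYTES (64 << 20)   /* 64 MB per transfer (illustrative) */
#define NUM_COPIES 100          /* repetitions per device (illustrative) */

typedef struct { int device; double ms; } worker_arg;

static double wall_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

static void *htod_worker(void *p)
{
    worker_arg *arg = (worker_arg *)p;
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dbuf;
    void *hbuf;

    cuDeviceGet(&dev, arg->device);
    cuCtxCreate(&ctx, 0, dev);            /* one context per thread */
    cuMemAllocHost(&hbuf, COPY_BYTES);    /* pinned host memory */
    cuMemAlloc(&dbuf, COPY_BYTES);

    double start = wall_ms();
    for (int i = 0; i < NUM_COPIES; i++)
        if (cuMemcpyHtoD(dbuf, hbuf, COPY_BYTES) != CUDA_SUCCESS)
            fprintf(stderr, "cuMemcpyHtoD failed on device %d\n", arg->device);
    arg->ms = wall_ms() - start;

    cuMemFree(dbuf);
    cuMemFreeHost(hbuf);
    cuCtxDestroy(ctx);
    return NULL;
}

int main(int argc, char **argv)
{
    int n = argc - 1;
    pthread_t tid[16];
    worker_arg args[16];

    if (n < 1 || n > 16) {
        fprintf(stderr, "usage: %s dev0 [dev1 ...] (up to 16)\n", argv[0]);
        return 1;
    }
    cuInit(0);

    for (int i = 0; i < n; i++) {
        args[i].device = atoi(argv[i + 1]);
        pthread_create(&tid[i], NULL, htod_worker, &args[i]);
    }

    double total_mbps = 0.0;
    for (int i = 0; i < n; i++) {
        pthread_join(tid[i], NULL);
        printf("Device %d took %f ms\n", args[i].device, args[i].ms);
        total_mbps += ((double)COPY_BYTES / (1 << 20)) * NUM_COPIES
                      / (args[i].ms / 1000.0);
    }
    printf("Aggregate HtoD bandwidth in MB/s: %f\n", total_mbps);
    return 0;
}[/code]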

#1
Posted 01/12/2009 09:44 PM   
stealing this post so I can keep a nice changelog here.

1.0: first release as of 1/12/09.
1.1: 2/19/10, add bidirectional bandwidth test

#2
Posted 01/12/2009 10:00 PM   
Sweet! Thanks, Tim. This is one of those tools I've been meaning to write myself for a while but have never found the free time.

For the record, here are the results on my system (a single 9800 GX2 in an EVGA 780i MB with DDR2 800 and a Q9300 CPU).
[code]$ ./concBandwidthTest.c 0 1
Device 0 took 3217.652344 ms
Device 1 took 3181.621338 ms
Average HtoD bandwidth in MB/s: 4000.580811
Device 0 took 3220.478027 ms
Device 1 took 3093.473389 ms
Average DtoH bandwidth in MB/s: 4056.154419[/code]

And on a Sun Ultra 40 M2 with a single Tesla D870 attached:
[code]Device 0 took 8044.992676 ms
Device 1 took 8044.969727 ms
Average HtoD bandwidth in MB/s: 1591.054016
Device 0 took 6418.201172 ms
Device 1 took 6418.122070 ms
Average DtoH bandwidth in MB/s: 1994.340515[/code]

#3
Posted 01/13/2009 12:08 PM   
[code]./concBandwidthTest 0 1
Device 0 took 3785.703857 ms
Device 1 took 2280.186279 ms
Average HtoD bandwidth in MB/s: 4497.359009
Device 0 took 4099.594238 ms
Device 1 took 2147.289795 ms
Average DtoH bandwidth in MB/s: 4541.631348[/code]

On a Dell XPS720H2C with an FX4800 (device 1) and a C1060 sample (device 0).

greets,
Denis

#4
Posted 01/13/2009 12:52 PM   
Thanks for that really useful tool! Here is another data point:

Phenom 9950 on Asus M3A79-T with two GTX280:

[codebox]ldpaniak@cluster00:~/NVIDIA_CUDA_SDK/ConcBandTest$ ./concBandwidthTest 0 1
Device 0 took 2129.720703 ms
Device 1 took 2124.596436 ms
Average HtoD bandwidth in MB/s: 6017.425537
Device 0 took 2321.431152 ms
Device 1 took 2393.247314 ms
Average DtoH bandwidth in MB/s: 5431.110840[/codebox]

I noticed that the code set a maximum number of devices to test at 16. Is there a limit to the number of devices a single host can access from the point of view of the CUDA driver (especially in Linux...)?
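
As a quick sanity check on any given box, something as small as this (just cuDeviceGetCount via the driver API, compiled the same way as the test) at least shows how many devices the driver actually exposes:
[code]/* Minimal sketch: ask the CUDA driver how many devices it can see. */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    int count = 0;
    if (cuInit(0) != CUDA_SUCCESS) {
        fprintf(stderr, "cuInit failed\n");
        return 1;
    }
    cuDeviceGetCount(&count);
    printf("CUDA driver reports %d device(s)\n", count);
    return 0;
}[/code]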

#5
Posted 01/13/2009 06:59 PM   
[quote name='ldpaniak' post='490591' date='Jan 13 2009, 07:59 PM']Thanks for that really useful tool! Here is another data point:

Phenom 9950 on Asus M3A79-T with two GTX280:

[codebox]ldpaniak@cluster00:~/NVIDIA_CUDA_SDK/ConcBandTest$ ./concBandwidthTest 0 1
Device 0 took 2129.720703 ms
Device 1 took 2124.596436 ms
Average HtoD bandwidth in MB/s: 6017.425537
Device 0 took 2321.431152 ms
Device 1 took 2393.247314 ms
Average DtoH bandwidth in MB/s: 5431.110840[/codebox]

I noticed that the code set a maximum number of devices to test at 16. Is there a limit to the number of devices a single host can access from the point of view of the CUDA driver (especially in Linux...)?[/quote]

I think 16 is a limit that is already hard to reach. More than 8 doesn't seem feasible, since current motherboards don't support more than 4 PCI-E (x16/x8) slots as far as I know.

greets,
Denis

#6
Posted 01/13/2009 07:38 PM   
[quote name='E.D. Riedijk' post='490616' date='Jan 13 2009, 02:38 PM']I think 16 is a limit that is already hard to reach. More than 8 doesn't seem feasible, since current motherboards don't support more than 4 PCI-E (x16/x8) slots as far as I know.[/quote]

If the S1075 really does multiplex 4 cards per PCI-Express connection, you could hit 16 now, but that would be a frightening amount of contention for host memory bandwidth. (Not to mention you'd probably discover at least one BIOS bug for sure.)

#7
Posted 01/13/2009 08:36 PM   
On a Tesla S1070 400 series attached to an AMD Opteron 2218 host machine with two PCIe 1.1 slots, one 16x and the other 8x.

All four devices at once:
./concBandwidthTest 0 1 2 3
Device 0 took 9048.228516 ms
Device 1 took 9245.776367 ms
Device 2 took 9269.258789 ms
Device 3 took 9264.301758 ms
Average HtoD bandwidth in MB/s: 2780.806885
Device 0 took 16982.855469 ms
Device 1 took 15100.006836 ms
Device 2 took 16524.843750 ms
Device 3 took 16907.255859 ms
Average DtoH bandwidth in MB/s: 1566.522888

Combinations:

./concBandwidthTest 0 1
Device 0 took 6742.535645 ms
Device 1 took 6724.684570 ms
Average HtoD bandwidth in MB/s: 1900.915344
Device 0 took 7924.658203 ms
Device 1 took 7718.469727 ms
Average DtoH bandwidth in MB/s: 1636.785767


./concBandwidthTest 0 2
Device 0 took 4122.261230 ms
Device 2 took 4328.962891 ms
Average HtoD bandwidth in MB/s: 3030.960083
Device 0 took 5804.275879 ms
Device 2 took 5668.028809 ms
Average DtoH bandwidth in MB/s: 2231.775757

./concBandwidthTest 2 3
Device 2 took 8493.328125 ms
Device 3 took 8483.612305 ms
Average HtoD bandwidth in MB/s: 1507.928284
Device 2 took 11255.169922 ms
Device 3 took 10999.840820 ms
Average DtoH bandwidth in MB/s: 1150.454163

./concBandwidthTest 1 3
Device 1 took 4178.731934 ms
Device 3 took 4353.884277 ms
Average HtoD bandwidth in MB/s: 3001.516846
Device 1 took 5849.678223 ms
Device 3 took 5841.941895 ms
Average DtoH bandwidth in MB/s: 2189.603394

This seems to confirm that our setup has bandwidth problems, something we already knew!

#8
Posted 01/13/2009 09:54 PM   
Many thanks for the test.
Uh oh... :( My first (hopefully) software-related CUDA problem... help! So far the SDK examples were all OK, including simpleMultiGPU.
I keep the CUDA & pthread libraries in /usr/lib64 on my Fedora 10 x86_64. I compiled successfully:
localhost[75]:~/cuda/projects/concBandwidthTest$ gcc -o concBandwidthTest -std=c99 -I /usr/local/cuda/include concBandwidthTest.c -L /usr/lib64 -lcuda -lpthread

I have devices 0..2 (GTX 280). Device 0 is attached to my monitor, but 1 and 2 also produce some low-res graphics output that I don't display.

Tests on 0+1 cause failures, so I'll first show that the code works OK with cards 0/1 (on the x16 bus, via the northbridge) run concurrently with card 2 (x8, or PCIe rev 1.0, via the southbridge):

localhost[76]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 1 2
Device 1 took 1310.880371 ms
Device 2 took 3865.682373 ms
Average HtoD bandwidth in MB/s: 6537.809204
Device 1 took 1579.436279 ms
Device 2 took 3677.684326 ms
Average DtoH bandwidth in MB/s: 5792.304077

localhost[77]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 0 2
Device 0 took 1416.739380 ms
Device 2 took 3874.536621 ms
Average HtoD bandwidth in MB/s: 6169.225464
Device 0 took 1621.908691 ms
Device 2 took 3733.955078 ms
Average DtoH bandwidth in MB/s: 5659.968262

But now the problems start when I do devices 0 and 1: during the transfer from host to device 0, but not on the way back:

[Here my system actually hung and I had to restart. Other failure modes are the error messages from the program (hopefully this time we'll see them), or -- very rarely, but it happened once -- a correct evaluation of bandwidth with no errors.]
localhost[3]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 0 1
cuMemcpyHtOD failed!
cuMemcpyHtOD failed!
(...) (repeated some 40 times)
cuMemcpyHtOD failed!
cuMemcpyHtOD failed!
Device 0 took 17769188375480380660029325312.000000 ms
Device 1 took 1272.827148 ms
Average HtoD bandwidth in MB/s: 5028.176758
Device 0 took 2369.335693 ms
Device 1 took 2309.379639 ms
Average DtoH bandwidth in MB/s: 5472.486084
:">

This one lucky run looked like this:

localhost[53]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 0 1
Device 0 took 2567.785889 ms
Device 1 took 2052.517090 ms
Average HtoD bandwidth in MB/s: 5610.542236
Device 0 took 2376.507080 ms
Device 1 took 2336.560303 ms
Average DtoH bandwidth in MB/s: 5432.097168

...as if 0 and 1 tried to share bandwidth; it shouldn't be so. I get those kinds of numbers (5+ GB/s total bandwidth) when I try your test with devices 0 0, 1 1 or 2 2 (except, predictably, ~2 GB/s on the third card), but in concurrency with itself the test runs OK.
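
If it helps with debugging, I could also run something like this little standalone check (my own quick hack, not part of concBandwidthTest) to see the raw CUresult codes the driver returns for a single pinned HtoD copy on a given device, instead of just "failed":
[code]/* Quick hack: one pinned HtoD copy on the given device, printing raw CUresult codes. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main(int argc, char **argv)
{
    int ordinal = (argc > 1) ? atoi(argv[1]) : 0;
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dbuf;
    void *hbuf;
    size_t bytes = 64 << 20;   /* 64 MB, arbitrary */

    printf("cuInit:         %d\n", (int)cuInit(0));
    printf("cuDeviceGet:    %d\n", (int)cuDeviceGet(&dev, ordinal));
    printf("cuCtxCreate:    %d\n", (int)cuCtxCreate(&ctx, 0, dev));
    printf("cuMemAllocHost: %d\n", (int)cuMemAllocHost(&hbuf, bytes));
    printf("cuMemAlloc:     %d\n", (int)cuMemAlloc(&dbuf, bytes));
    printf("cuMemcpyHtoD:   %d\n", (int)cuMemcpyHtoD(dbuf, hbuf, bytes));
    return 0;
}[/code]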

#9
Posted 01/14/2009 03:45 AM   
Are you using gcc 4.3? If you are, go back to 4.1 or 4.2 and try again. Also, what motherboard do you have?

#10
Posted 01/14/2009 04:33 AM   
[quote name='seibert' post='490640' date='Jan 13 2009, 09:36 PM']If the S1075 really does multiplex 4 cards per PCI-Express connection, you could hit 16 now, but that would be a frightening amount of contention for host memory bandwidth. (Not to mention you'd probably discover at least one BIOS bug for sure.)[/quote]

As far as I have heard, the S1075 will not leave the 'paper' phase.

greets,
Denis

#11
Posted 01/14/2009 06:55 AM   
[quote name='E.D. Riedijk' post='490616' date='Jan 13 2009, 08:38 PM']I think 16 is a limit that is already hard to reach. More than 8 seems not feasible with current MB's not supporting more than 4 PCI-E (x16/x8) slots as far as I know.[/quote]
Never heard of PCIe backplanes? E.g. this one: [url="http://www.onestopsystems.com/passive_backplanes_b.html"]http://www.onestopsystems.com/passive_backplanes_b.html[/url]
1 host card -> 19 devices
You can use any number of devices with such systems... the only catch: you have almost no bandwidth at all, and the machine needs an hour to initialize the devices when booting. ;-)

#12
Posted 01/14/2009 04:50 PM   
[quote name='Ocire' post='491062' date='Jan 14 2009, 05:50 PM']Never heard of PCIe backplanes? E.g. this one: [url="http://www.onestopsystems.com/passive_backplanes_b.html"]http://www.onestopsystems.com/passive_backplanes_b.html[/url]
1 host card -> 19 devices
You can use any number of devices with such systems... the only catch: you have almost no bandwidth at all, and the machine needs an hour to initialize the devices when booting. ;-)[/quote]
Yeah, I've heard of them, and no, I don't think a lot of those will be used together with S1070s ;)

greets,
Denis

#13
Posted 01/14/2009 08:18 PM   
We have built a test machine with 8 GPUs (2 S1070s) in our lab. The machine contains this chipset:
[url="http://images.anandtech.com/reviews/cpu/intel/nehalem/review/x58.jpg"]http://images.anandtech.com/reviews/cpu/in.../review/x58.jpg[/url]
The 8 GPUs are linked with PCI Express v2 x16. There are 36 lanes in total, which gives a peak bandwidth of 18 GB/s. However, the best number we get is ~10 GB/s, or 55% of the theoretical peak. I am using CUDA 2.3 on Linux.

[code]$ ./bandwidth 0 1 2 3 4 5 6 7
Device 0 took 5318.502930 ms
Device 1 took 5520.778320 ms
Device 2 took 4169.996094 ms
Device 3 took 4174.846680 ms
Device 4 took 4964.340332 ms
Device 5 took 4800.694824 ms
Device 6 took 4888.432617 ms
Device 7 took 4813.492676 ms
Average HtoD bandwidth in MB/s: 10691.510864
Device 0 took 6348.020508 ms
Device 1 took 6627.868652 ms
Device 2 took 4610.388184 ms
Device 3 took 4760.468262 ms
Device 4 took 5756.009766 ms
Device 5 took 5895.148438 ms
Device 6 took 5976.834961 ms
Device 7 took 6057.340820 ms
Average DtoH bandwidth in MB/s: 9031.272766[/code]
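
For reference, here is the back-of-envelope arithmetic behind the 18 GB/s figure (assuming ~500 MB/s per PCIe 2.0 lane per direction after 8b/10b encoding), compared against our best run of ~10 GB/s:
[code]/* Back-of-envelope check: PCIe 2.0 is ~500 MB/s per lane per direction
 * after 8b/10b encoding, so 36 lanes give ~18 GB/s one-way peak. */
#include <stdio.h>

int main(void)
{
    const double mb_per_lane = 500.0;    /* PCIe 2.0, one direction */
    const int    lanes       = 36;
    const double measured_mb = 10000.0;  /* ~10 GB/s best observed */
    double peak_mb = lanes * mb_per_lane;

    printf("theoretical peak: %.0f MB/s\n", peak_mb);
    printf("measured:         %.0f MB/s (%.1f%% of peak)\n",
           measured_mb, 100.0 * measured_mb / peak_mb);
    return 0;
}[/code]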

My questions are:
1) What is the possible bottleneck?
2) Have you achieved better than 10 GB/s with multiple GPUs? If so, could you give more details about your system?

Thanks

#14
Posted 08/15/2009 12:22 AM   
I think QPI is actually a pair of unidirectional buses, so you're getting 90% of the QPI bandwidth.

#15
Posted 08/15/2009 01:12 AM   