Concurrent bandwidth test

During the course of testing various configurations, I wrote a concurrent bandwidth test that those of you interested in multi-GPU configurations will probably find useful. It’s Linux-only (because I am too lazy to use Windows threads), and you can compile it with

gcc -o concBandwidthTest -std=c99 -I /usr/local/cuda/include concBandwidthTest.c -L /usr/local/cuda/lib -lcuda -lpthread

Replace the paths with whatever is appropriate for your system. I’m fairly confident in the results; it behaves exactly as I would expect with up to 3 devices, but I haven’t tested past that. For example, my results with a Harpertown Xeon, an FX 1700 (PCIe 1.0 x16), and a C1060 (PCIe 2.0 x16):

[tim@ concBandwidthTest]$ ./concBandwidthTest 0 1
Device 0 took 1393.192749 ms
Device 1 took 2053.388184 ms
Average HtoD bandwidth in MB/s: 7710.564941
Device 0 took 1935.400879 ms
Device 1 took 2042.895264 ms
Average DtoH bandwidth in MB/s: 6439.616943

As you would probably expect, it’s hitting FSB limitations quickly, so I’m interested to see how it runs on Nehalem. The code’s fairly ugly to get around some casting nonsense; if you want to clean it up or add anything to it, feel free (or if you want anything added, let me know and I’ll see what I can do).
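(Aside: if the casting nonsense bothers you in your own copy, one common alternative is to smuggle the device ID in via uintptr_t and hand the timing back through a heap pointer. This is just a sketch of that idea with placeholder names, not how concBandwidthTest itself is written:)

[codebox]/* Sketch of an alternative argument-passing scheme: device ID in via
 * uintptr_t, elapsed time out via a heap-allocated float. "worker" and
 * "devID" are placeholder names, not functions from concBandwidthTest. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

static void* worker(void* arg)
{
    unsigned int devID = (unsigned int)(uintptr_t)arg;
    float* elapsedMs = malloc(sizeof *elapsedMs);
    *elapsedMs = 0.0f;            /* ...the timed copies for device devID would go here... */
    printf("worker bound to device %u\n", devID);
    return elapsedMs;             /* the joiner reads and frees this */
}

int main(void)
{
    unsigned int devID = 0;
    pthread_t tid;
    pthread_create(&tid, NULL, worker, (void*)(uintptr_t)devID);
    void* ret = NULL;
    pthread_join(tid, &ret);
    printf("device %u took %f ms\n", devID, *(float*)ret);
    free(ret);
    return 0;
}[/codebox]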

stealing this post so I can keep a nice changelog here.

1.0: first release as of 1/12/09.
1.1: 2/19/10, add bidirectional bandwidth test

Sweet! Thanks, Tim. This is one of those tools I’ve been meaning to write myself for a while but have never found the free time.

For the record, here are the results on my system (a single 9800 GX2 in an EVGA 780i MB with DDR2 800 and a Q9300 CPU).

$ ./concBandwidthTest 0 1
Device 0 took 3217.652344 ms
Device 1 took 3181.621338 ms
Average HtoD bandwidth in MB/s: 4000.580811
Device 0 took 3220.478027 ms
Device 1 took 3093.473389 ms
Average DtoH bandwidth in MB/s: 4056.154419

And on a Sun Ultra 40 M2 with a single Tesla D870 attached:

Device 0 took 8044.992676 ms
Device 1 took 8044.969727 ms
Average HtoD bandwidth in MB/s: 1591.054016
Device 0 took 6418.201172 ms
Device 1 took 6418.122070 ms
Average DtoH bandwidth in MB/s: 1994.340515

./concBandwidthTest 0 1
Device 0 took 3785.703857 ms
Device 1 took 2280.186279 ms
Average HtoD bandwidth in MB/s: 4497.359009
Device 0 took 4099.594238 ms
Device 1 took 2147.289795 ms
Average DtoH bandwidth in MB/s: 4541.631348

This was on a Dell XPS720H2C with an FX4800 (device 1) and a C1060 sample (device 0).

Thanks for that really useful tool! Here is another data point:

Phenom 9950 on Asus M3A79-T with two GTX280:

[codebox]ldpaniak@cluster00:~/NVIDIA_CUDA_SDK/ConcBandTest$ ./concBandwidthTest 0 1
Device 0 took 2129.720703 ms
Device 1 took 2124.596436 ms
Average HtoD bandwidth in MB/s: 6017.425537
Device 0 took 2321.431152 ms
Device 1 took 2393.247314 ms
Average DtoH bandwidth in MB/s: 5431.110840[/codebox]

I noticed that the code set a maximum number of devices to test at 16. Is there a limit to the number of devices a single host can access from the point of view of the CUDA driver (especially in Linux…)?
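For what it’s worth, the driver will at least report how many devices it thinks it has; a minimal driver-API check (my own sketch, separate from concBandwidthTest) would be:

[codebox]/* Sketch: ask the CUDA driver how many devices it exposes. Not part of
 * concBandwidthTest; compile roughly the same way (-lcuda). */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    if (cuInit(0) != CUDA_SUCCESS) {
        printf("cuInit failed\n");
        return 1;
    }
    int count = 0;
    cuDeviceGetCount(&count);
    printf("driver reports %d CUDA device(s)\n", count);
    return 0;
}[/codebox]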

I think 16 is a limit that is already hard to reach. More than 8 doesn’t seem feasible, since current motherboards don’t support more than 4 PCI-E (x16/x8) slots as far as I know.

If the S1075 really does multiplex 4 cards per PCI-Express connection, you could hit 16 now, but that would be a frightening amount of contention for host memory bandwidth. (Not to mention you’d probably discover at least one BIOS bug for sure.)

On a Tesla S1070 400 series with an AMD Opteron 2218 host machine with two PCIe 1.1 slots; one slot is x16 while the other is x8.

All four devices at once
./concBandwidthTest 0 1 2 3
Device 0 took 9048.228516 ms
Device 1 took 9245.776367 ms
Device 2 took 9269.258789 ms
Device 3 took 9264.301758 ms
Average HtoD bandwidth in MB/s: 2780.806885
Device 0 took 16982.855469 ms
Device 1 took 15100.006836 ms
Device 2 took 16524.843750 ms
Device 3 took 16907.255859 ms
Average DtoH bandwidth in MB/s: 1566.522888

Combinations:

./concBandwidthTest 0 1
Device 0 took 6742.535645 ms
Device 1 took 6724.684570 ms
Average HtoD bandwidth in MB/s: 1900.915344
Device 0 took 7924.658203 ms
Device 1 took 7718.469727 ms
Average DtoH bandwidth in MB/s: 1636.785767

./concBandwidthTest 0 2
Device 0 took 4122.261230 ms
Device 2 took 4328.962891 ms
Average HtoD bandwidth in MB/s: 3030.960083
Device 0 took 5804.275879 ms
Device 2 took 5668.028809 ms
Average DtoH bandwidth in MB/s: 2231.775757

./concBandwidthTest 2 3
Device 2 took 8493.328125 ms
Device 3 took 8483.612305 ms
Average HtoD bandwidth in MB/s: 1507.928284
Device 2 took 11255.169922 ms
Device 3 took 10999.840820 ms
Average DtoH bandwidth in MB/s: 1150.454163

./concBandwidthTest 1 3
Device 1 took 4178.731934 ms
Device 3 took 4353.884277 ms
Average HtoD bandwidth in MB/s: 3001.516846
Device 1 took 5849.678223 ms
Device 3 took 5841.941895 ms
Average DtoH bandwidth in MB/s: 2189.603394

This seems to confirm that our setup has bandwidth problems, something we already knew!

Many thanks for the test…
Uh oh… :( my first (hopefully) software-related CUDA problem… help!! So far the SDK examples were all OK, including simpleMultiGPU.
I keep the CUDA & pthread libraries in /usr/lib64 on my Fedora 10 x86_64. I compiled successfully:
localhost[75]:~/cuda/projects/concBandwidthTest$ gcc -o concBandwidthTest -std=c99 -I /usr/local/cuda/include concBandwidthTest.c -L /usr/lib64 -lcuda -lpthread

I have devices 0…2 (GTX 280). Device 0 is attached to my monitor, but 1 and 2 also create some low-res graphics that I don’t display.

Tests done on 0+1 cause failures, so I’ll first show you that the code works OK with cards 0/1 (on the x16 bus, via the northbridge) concurrently with card 2 (x8 or PCIe rev 1.0, via the southbridge):

localhost[76]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 1 2
Device 1 took 1310.880371 ms
Device 2 took 3865.682373 ms
Average HtoD bandwidth in MB/s: 6537.809204
Device 1 took 1579.436279 ms
Device 2 took 3677.684326 ms
Average DtoH bandwidth in MB/s: 5792.304077

localhost[77]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 0 2
Device 0 took 1416.739380 ms
Device 2 took 3874.536621 ms
Average HtoD bandwidth in MB/s: 6169.225464
Device 0 took 1621.908691 ms
Device 2 took 3733.955078 ms
Average DtoH bandwidth in MB/s: 5659.968262

But now the problems start when I run devices 0 and 1 together: failures during the transfer from host to device 0, but not on the way back.

[Here my system actually hung and I had to restart. Other modes of failure are the error message from the program (hopefully this time we’ll see it) or, very rarely (but it happened once), a correct evaluation of the bandwidth without errors.]
localhost[3]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 0 1
cuMemcpyHtOD failed!
cuMemcpyHtOD failed!
(…) (repeated some 40 times)
cuMemcpyHtOD failed!
cuMemcpyHtOD failed!
Device 0 took 17769188375480380660029325312.000000 ms
Device 1 took 1272.827148 ms
Average HtoD bandwidth in MB/s: 5028.176758
Device 0 took 2369.335693 ms
Device 1 took 2309.379639 ms
Average DtoH bandwidth in MB/s: 5472.486084
:">

this one lucky run was like this:

localhost[53]:~/cuda/projects/concBandwidthTest$ concBandwidthTest 0 1
Device 0 took 2567.785889 ms
Device 1 took 2052.517090 ms
Average HtoD bandwidth in MB/s: 5610.542236
Device 0 took 2376.507080 ms
Device 1 took 2336.560303 ms
Average DtoH bandwidth in MB/s: 5432.097168

…as if 0 and 1 were trying to share bandwidth… it shouldn’t be so. I get that kind of number (5+ GB/s total bandwidth) when I try your test with devices 0 0, 1 1 or 2 2, except, predictably, 2 GB/s on the third card. So in concurrency with itself, the test runs fine.

Are you using gcc 4.3? If you are, go back to 4.1 or 4.2 and try again. What motherboard do you have as well?

As far as I have heard, the S1075 will not leave the ‘paper’ phase.

Never heard of PCIe backplanes? E.g. this one: http://www.onestopsystems.com/passive_backplanes_b.html

1 host card → 19 devices

You can use any number of devices with such systems… the only catch: you have practically no bandwidth at all, and the machine needs an hour to initialize the devices when booting. ;-)

Yeah I heard of them, and no I don’t think there will be a lot of those used together with S1070’s ;)

We have built a test machine with 8 GPUs (2 S1070) in our lab. The machine contains this chipset:

http://images.anandtech.com/reviews/cpu/in…/review/x58.jpg

The 8 GPUs are linked with PCI Express 2.0 x16. There are 36 lanes in total, which gives a peak bandwidth of 18 GB/s. However, the best number we get is ~10 GB/s, or 55% of the theoretical peak.
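For reference, the arithmetic behind that figure, assuming the usual ~500 MB/s per PCIe 2.0 lane per direction:

[codebox]/* Back-of-the-envelope check. Assumption: ~500 MB/s usable per PCIe 2.0
 * lane per direction (5 GT/s with 8b/10b encoding). */
#include <stdio.h>

int main(void)
{
    const double lanes      = 36.0;
    const double perLaneMBs = 500.0;                 /* MB/s, one direction */
    const double peakMBs    = lanes * perLaneMBs;    /* 18000 MB/s = 18 GB/s */
    const double measured   = 10000.0;               /* the ~10 GB/s best case above */
    printf("peak %.0f MB/s, measured %.0f MB/s, efficiency %.1f%%\n",
           peakMBs, measured, 100.0 * measured / peakMBs);
    return 0;
}[/codebox]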

I am using CUDA 2.3 on Linux.

$ ./bandwidth 0 1 2 3 4 5 6 7
Device 0 took 5318.502930 ms
Device 1 took 5520.778320 ms
Device 2 took 4169.996094 ms
Device 3 took 4174.846680 ms
Device 4 took 4964.340332 ms
Device 5 took 4800.694824 ms
Device 6 took 4888.432617 ms
Device 7 took 4813.492676 ms
Average HtoD bandwidth in MB/s: 10691.510864
Device 0 took 6348.020508 ms
Device 1 took 6627.868652 ms
Device 2 took 4610.388184 ms
Device 3 took 4760.468262 ms
Device 4 took 5756.009766 ms
Device 5 took 5895.148438 ms
Device 6 took 5976.834961 ms
Device 7 took 6057.340820 ms
Average DtoH bandwidth in MB/s: 9031.272766

My questions are:

  1. What is the possible bottleneck?

  2. Have you achieved better bandwidth than 10 GB/s with multiple GPUs? If so, could you give more details about your system?

thanks

I think QPI is actually a pair of unidirectional buses, so you’re getting 90% of the QPI bandwidth.

Thanks for replying. We thought about that.

I modified the concurrent bandwidth code to do two-way communication, D2H on stream 0 and H2D on stream 1, which should get us more than 10 GB/s.

Unfortunately the system is not very stable with it: the two-way communication test either kills the system (dead, not pingable, needs a power cycle) or runs so slowly that it takes 10 times longer to finish. The two-way communication test runs fine on our other multi-GPU systems in the lab.

Any suggestions?

Well, file uploading isn’t working, so I’ll just attach 1.1 (with bidirectional bandwidth measurements) to this post.

[codebox]/*
 * Copyright 1993-2010 NVIDIA Corporation. All rights reserved.
 *
 * NOTICE TO USER:
 *
 * This source code is subject to NVIDIA ownership rights under U.S. and
 * international Copyright laws. Users and possessors of this source code
 * are hereby granted a nonexclusive, royalty-free license to use this code
 * in individual and commercial software.
 *
 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
 * CODE FOR ANY PURPOSE. IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
 * IMPLIED WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
 * OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
 * OR PERFORMANCE OF THIS SOURCE CODE.
 *
 * U.S. Government End Users. This source code is a "commercial item" as
 * that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of
 * "commercial computer software" and "commercial computer software
 * documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
 * and is provided to the U.S. Government only as a commercial end item.
 * Consistent with 48 C.F.R. 12.212 and 48 C.F.R. 227.7202-1 through
 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
 * source code with only those rights set forth herein.
 *
 * Any use of this source code in individual and commercial software must
 * include, in the user documentation and internal comments to the code,
 * the above Disclaimer and U.S. Government End Users Notice.
 */

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <pthread.h>

#define MEMCOPY_ITERATIONS 50
#define MEMCOPY_SIZE (1 << 27) // 128M
#define MAX_DEVICES 16 // supports up to 16 devices at a time. 16 devices should be enough for anyone!

unsigned int devices[MAX_DEVICES];
unsigned int numDevices;
volatile unsigned int numWaiting = 0;
pthread_mutex_t lock;
pthread_cond_t condvar;
pthread_t devThreads[MAX_DEVICES];
float elapsedTimes[MAX_DEVICES];

// Union used to pass a device ID into a thread and a float timing result
// back out through pthreads' void* argument/return value.
typedef union data_t
{
    float f;
    void* v;
    unsigned int ui;
} PackedType;

void* testBandwidthHtoD(void* id)
{
    PackedType arg = (PackedType)(id);
    unsigned int devID = arg.ui;
    CUdevice dev;
    CUcontext ctx;
    CUevent start, stop;
    void* loc1;
    CUdeviceptr loc2;

    cuDeviceGet(&dev, devID);
    if (cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, dev) != CUDA_SUCCESS) {
        printf("Creating a context with devID %u failed, aborting\n", devID);
        pthread_exit((void*)1);
    }
    if (cuMemAllocHost(&loc1, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAllocHost failed, aborting\n");
        pthread_exit((void*)1);
    }
    if (cuMemAlloc(&loc2, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAlloc failed, aborting\n");
        pthread_exit((void*)1);
    }
    cuEventCreate(&start, 0);
    cuEventCreate(&stop, 0);

    // critical section: check in, then wait for main to broadcast the start signal
    pthread_mutex_lock(&lock);
    ++numWaiting;
    pthread_cond_wait(&condvar, &lock);
    pthread_mutex_unlock(&lock);

    cuEventRecord(start, 0);
    for (int i = 0; i < MEMCOPY_ITERATIONS; i++) {
        if (cuMemcpyHtoDAsync(loc2, loc1, MEMCOPY_SIZE, 0) != CUDA_SUCCESS) {
            printf("cuMemcpyHtOD failed!\n");
        }
    }
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);

    float elapsedTime;
    cuEventElapsedTime(&elapsedTime, start, stop);
    PackedType retval;
    retval.f = elapsedTime;
    return (void*)retval.v;
}

void* testBandwidthDtoH(void* id)
{
    PackedType arg = (PackedType)(id);
    unsigned int devID = arg.ui;
    CUdevice dev;
    CUcontext ctx;
    CUevent start, stop;
    CUdeviceptr loc1;
    void* loc2;

    cuDeviceGet(&dev, devID);
    if (cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, dev) != CUDA_SUCCESS) {
        printf("Creating a context with devID %u failed, aborting\n", devID);
        pthread_exit((void*)1);
    }
    if (cuMemAllocHost(&loc2, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAllocHost failed, aborting\n");
        pthread_exit((void*)1);
    }
    if (cuMemAlloc(&loc1, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAlloc failed, aborting\n");
        pthread_exit((void*)1);
    }
    cuEventCreate(&start, 0);
    cuEventCreate(&stop, 0);

    // critical section: check in, then wait for main to broadcast the start signal
    pthread_mutex_lock(&lock);
    ++numWaiting;
    pthread_cond_wait(&condvar, &lock);
    pthread_mutex_unlock(&lock);

    cuEventRecord(start, 0);
    for (int i = 0; i < MEMCOPY_ITERATIONS; i++) {
        if (cuMemcpyDtoHAsync(loc2, loc1, MEMCOPY_SIZE, 0) != CUDA_SUCCESS) {
            printf("cuMemcpyDtOH failed!\n");
        }
    }
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);

    float elapsedTime;
    cuEventElapsedTime(&elapsedTime, start, stop);
    PackedType retval;
    retval.f = elapsedTime;
    return (void*)retval.v;
}

void* testBandwidthBidirectional(void* id)
{
    PackedType arg = (PackedType)(id);
    unsigned int devID = arg.ui;
    CUdevice dev;
    CUcontext ctx;
    CUevent start, stop;
    CUstream stream1, stream2;
    CUdeviceptr loc1, loc3;
    void* loc2, *loc4;

    cuDeviceGet(&dev, devID);
    if (cuCtxCreate(&ctx, CU_CTX_SCHED_AUTO, dev) != CUDA_SUCCESS) {
        // NB: this is really a context-creation failure, despite the message text
        printf("cuStreamCreate failed\n");
        pthread_exit((void*)1);
    }
    if (cuStreamCreate(&stream1, 0) != CUDA_SUCCESS) {
        printf("cuStreamCreate failed\n");
        pthread_exit((void*)1);
    }
    if (cuStreamCreate(&stream2, 0) != CUDA_SUCCESS) {
        printf("cuStreamCreate failed\n");
        pthread_exit((void*)1);
    }
    if (cuMemAllocHost(&loc2, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAllocHost failed, aborting\n");
        pthread_exit((void*)1);
    }
    if (cuMemAllocHost(&loc4, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAllocHost failed, aborting\n");
        pthread_exit((void*)1);
    }
    if (cuMemAlloc(&loc1, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAlloc failed, aborting\n");
        pthread_exit((void*)1);
    }
    if (cuMemAlloc(&loc3, MEMCOPY_SIZE) != CUDA_SUCCESS) {
        printf("cuMemAlloc failed, aborting\n");
        pthread_exit((void*)1);
    }
    cuEventCreate(&start, 0);
    cuEventCreate(&stop, 0);

    // critical section: check in, then wait for main to broadcast the start signal
    pthread_mutex_lock(&lock);
    ++numWaiting;
    pthread_cond_wait(&condvar, &lock);
    pthread_mutex_unlock(&lock);

    cuEventRecord(start, 0);
    for (int i = 0; i < MEMCOPY_ITERATIONS; i++) {
        // DtoH on stream1 and HtoD on stream2, so both directions are in flight at once
        if (cuMemcpyDtoHAsync(loc2, loc1, MEMCOPY_SIZE, stream1) != CUDA_SUCCESS) {
            printf("cuMemcpyDtOH failed!\n");
        }
        if (cuMemcpyHtoDAsync(loc3, loc4, MEMCOPY_SIZE, stream2) != CUDA_SUCCESS) {
            printf("cuMemcpyHtoDAsync failed!\n");
        }
    }
    cuEventRecord(stop, 0);
    cuCtxSynchronize();

    float elapsedTime;
    cuEventElapsedTime(&elapsedTime, start, stop);
    PackedType retval;
    retval.f = elapsedTime;
    return (void*)retval.v;
}

int main(int argc, char** argv)
{
    if (argc == 1) {
        printf("usage: %s deviceID deviceID...\n", argv[0]);
        exit(1);
    }
    if (cuInit(0) != CUDA_SUCCESS) {
        printf("cuInit failed, aborting...\n");
        exit(1);
    }
    for (int i = 0; i < argc - 1; i++) {
        int dev = atoi(argv[i+1]);
        CUdevice device;
        if (cuDeviceGet(&device, dev) != CUDA_SUCCESS) {
            printf("Could not get device %d, aborting\n", dev);
            exit(1);
        }
        devices[i] = dev;
    }
    numDevices = argc - 1;
    pthread_mutex_init(&lock, NULL);
    pthread_cond_init(&condvar, NULL);

    // HtoD test: one thread per device
    for (int i = 0; i < numDevices; i++) {
        PackedType arg;
        arg.ui = devices[i];
        pthread_create(&devThreads[i], NULL, testBandwidthHtoD, arg.v);
    }
    while (numWaiting != numDevices) ; // spin until every worker has checked in
    pthread_cond_broadcast(&condvar);

    void* returnVal = 0;
    float maxElapsedTime = 0.f;
    for (int i = 0; i < numDevices; i++) {
        pthread_join(devThreads[i], &returnVal);
        PackedType d = (PackedType)returnVal;
        printf("Device %u took %f ms\n", devices[i], d.f);
        elapsedTimes[i] = d.f;
        if (d.f > maxElapsedTime) {
            maxElapsedTime = d.f;
        }
    }
    // "Average" here is the sum of the per-device bandwidths
    double bandwidthInMBs = 0;
    for (int i = 0; i < numDevices; i++) {
        bandwidthInMBs += (1e3f * MEMCOPY_SIZE * (float)MEMCOPY_ITERATIONS) / (elapsedTimes[i] * (float)(1 << 20));
    }
    printf("Average HtoD bandwidth in MB/s: %f\n", bandwidthInMBs);

    // DtoH test
    numWaiting = 0;
    for (int i = 0; i < numDevices; i++) {
        PackedType arg;
        arg.ui = devices[i];
        pthread_create(&devThreads[i], NULL, testBandwidthDtoH, arg.v);
    }
    while (numWaiting != numDevices) ;
    pthread_cond_broadcast(&condvar);

    returnVal = 0;
    maxElapsedTime = 0.f;
    for (int i = 0; i < numDevices; i++) {
        pthread_join(devThreads[i], &returnVal);
        PackedType d = (PackedType)returnVal;
        printf("Device %u took %f ms\n", devices[i], d.f);
        elapsedTimes[i] = d.f;
        if (d.f > maxElapsedTime)
            maxElapsedTime = d.f;
    }
    bandwidthInMBs = 0;
    for (int i = 0; i < numDevices; i++) {
        bandwidthInMBs += (1e3f * MEMCOPY_SIZE * (float)MEMCOPY_ITERATIONS) / (elapsedTimes[i] * (float)(1 << 20));
    }
    printf("Average DtoH bandwidth in MB/s: %f\n", bandwidthInMBs);

    // Bidirectional test
    numWaiting = 0;
    for (int i = 0; i < numDevices; i++) {
        PackedType arg;
        arg.ui = devices[i];
        pthread_create(&devThreads[i], NULL, testBandwidthBidirectional, arg.v);
    }
    while (numWaiting != numDevices) ;
    pthread_cond_broadcast(&condvar);

    returnVal = 0;
    maxElapsedTime = 0.f;
    for (int i = 0; i < numDevices; i++) {
        pthread_join(devThreads[i], &returnVal);
        PackedType d = (PackedType)returnVal;
        printf("Device %u took %f ms\n", devices[i], d.f);
        elapsedTimes[i] = d.f;
        if (d.f > maxElapsedTime)
            maxElapsedTime = d.f;
    }
    bandwidthInMBs = 0;
    for (int i = 0; i < numDevices; i++) {
        bandwidthInMBs += (1e3f * MEMCOPY_SIZE * 2 * (float)MEMCOPY_ITERATIONS) / (elapsedTimes[i] * (float)(1 << 20));
    }
    printf("Average bidirectional bandwidth in MB/s: %f\n", bandwidthInMBs);
}
[/codebox]

Ran it on a Nehalem machine with 2 S1070 connected to it.

tmurray - maybe you can have a look at the results and further explain what I get… :)

I don’t understand the different values between runs with the same configuration, or why running more GPUs gives higher average values.

-bash-3.2$ ./concBandwidthTest 0
Device 0 took 1396.748657 ms
Average HtoD bandwidth in MB/s: 4582.069824
Device 0 took 2057.564453 ms
Average DtoH bandwidth in MB/s: 3110.473633
cuStreamCreate failed
cuStreamCreate failed

-bash-3.2$ ./concBandwidthTest 0
Device 0 took 1200.671387 ms
Average HtoD bandwidth in MB/s: 5330.351074
Device 0 took 3532.220215 ms
Average DtoH bandwidth in MB/s: 1811.891602

-bash-3.2$ ./concBandwidthTest 0
Device 0 took 1396.767334 ms
Average HtoD bandwidth in MB/s: 4582.008789
Device 0 took 2057.551758 ms
Average DtoH bandwidth in MB/s: 3110.492920
cuStreamCreate failed

-bash-3.2$ ./concBandwidthTest 0
Device 0 took 1357.357422 ms
Average HtoD bandwidth in MB/s: 4715.043945
Device 0 took 2057.560791 ms
Average DtoH bandwidth in MB/s: 3110.479248
cuStreamCreate failed
cuStreamCreate failed

-bash-3.2$ ./concBandwidthTest 0
Device 0 took 1304.883057 ms
Average HtoD bandwidth in MB/s: 4904.654297
Device 0 took 2057.561523 ms
Average DtoH bandwidth in MB/s: 3110.478027
cuStreamCreate failed
cuStreamCreate failed

-bash-3.2$ ./concBandwidthTest 0
Device 0 took 1200.655273 ms
Average HtoD bandwidth in MB/s: 5330.422363
Device 0 took 2057.501953 ms
Average DtoH bandwidth in MB/s: 3110.568115
cuStreamCreate failed

This is with 4 GPUs:

-bash-3.2$ ./concBandwidthTest 0 1 2 3
Device 0 took 5437.022949 ms
Device 1 took 5437.017090 ms
Device 2 took 5436.961914 ms
Device 3 took 5436.957031 ms
Average HtoD bandwidth in MB/s: 4708.487793
Device 0 took 13258.119141 ms
Device 1 took 13685.979492 ms
Device 2 took 13906.995117 ms
Device 3 took 14127.769531 ms
Average DtoH bandwidth in MB/s: 1863.563538

-bash-3.2$ ./concBandwidthTest 0 1 2 3
Device 0 took 5053.490234 ms
Device 1 took 5053.427246 ms
Device 2 took 5053.377930 ms
Device 3 took 5053.387207 ms
Average HtoD bandwidth in MB/s: 5065.875610
Device 0 took 13258.480469 ms
Device 1 took 13686.190430 ms
Device 2 took 13907.001953 ms
Device 3 took 14127.746094 ms
Average DtoH bandwidth in MB/s: 1863.543701

-bash-3.2$ ./concBandwidthTest 0 1 2 3
Device 0 took 5427.071777 ms
Device 1 took 5427.039551 ms
Device 2 took 5427.074219 ms
Device 3 took 5427.046875 ms
Average HtoD bandwidth in MB/s: 4717.104492
Device 0 took 11163.727539 ms
Device 1 took 11520.315430 ms
Device 2 took 11683.252930 ms
Device 3 took 11812.166992 ms
Average DtoH bandwidth in MB/s: 2218.432434

-bash-3.2$ ./concBandwidthTest 0 1 2 3
Device 0 took 5282.594727 ms
Device 1 took 5282.777832 ms
Device 2 took 5282.791016 ms
Device 3 took 5282.584473 ms
Average HtoD bandwidth in MB/s: 4846.018799
Device 0 took 9035.470703 ms
Device 1 took 9388.610352 ms
Device 2 took 9532.751953 ms
Device 3 took 9753.575195 ms
Average DtoH bandwidth in MB/s: 2717.535828

8 GPUs:

-bash-3.2$ ./concBandwidthTest 0 1 2 3 4 5 6 7
Device 0 took 5775.768066 ms
Device 1 took 5776.991211 ms
Device 2 took 5777.027832 ms
Device 3 took 5776.927246 ms
Device 4 took 5770.483398 ms
Device 5 took 5769.739258 ms
Device 6 took 5769.744629 ms
Device 7 took 5770.465820 ms
Average HtoD bandwidth in MB/s: 8868.270874
Device 0 took 10368.433594 ms
Device 1 took 10708.926758 ms
Device 2 took 10888.794922 ms
Device 3 took 11138.063477 ms
Device 4 took 10298.125977 ms
Device 5 took 10690.266602 ms
Device 6 took 10873.223633 ms
Device 7 took 11054.353516 ms
Average DtoH bandwidth in MB/s: 4764.963684
cuStreamCreate failed

-bash-3.2$ ./concBandwidthTest 0 1 2 3 4 5 6 7
Device 0 took 6023.660645 ms
Device 1 took 6023.886230 ms
Device 2 took 6023.917969 ms
Device 3 took 6023.893555 ms
Device 4 took 6295.713379 ms
Device 5 took 6295.741211 ms
Device 6 took 6295.745117 ms
Device 7 took 6295.710449 ms
Average HtoD bandwidth in MB/s: 8316.030762
Device 0 took 10126.357422 ms
Device 1 took 10440.270508 ms
Device 2 took 10597.230469 ms
Device 3 took 10829.771484 ms
Device 4 took 11077.995117 ms
Device 5 took 11461.150391 ms
Device 6 took 11660.485352 ms
Device 7 took 11880.779297 ms
Average DtoH bandwidth in MB/s: 4663.597290

-bash-3.2$ ./concBandwidthTest 0 1 2 3 4 5 6 7
Device 0 took 5905.445801 ms
Device 1 took 5905.427734 ms
Device 2 took 5905.451660 ms
Device 3 took 5905.144531 ms
Device 4 took 6013.040527 ms
Device 5 took 6013.087402 ms
Device 6 took 6013.110352 ms
Device 7 took 6013.126953 ms
Average HtoD bandwidth in MB/s: 8592.417114
Device 0 took 11750.368164 ms
Device 1 took 12099.052734 ms
Device 2 took 12249.042969 ms
Device 3 took 12377.944336 ms
Device 4 took 10342.506836 ms
Device 5 took 10674.576172 ms
Device 6 took 10902.281250 ms
Device 7 took 11129.486328 ms
Average DtoH bandwidth in MB/s: 4493.612305
cuStreamCreate failed
cuStreamCreate failed

CPU info (/proc/cpuinfo): 8 cores with hyperthreading…

processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
stepping        : 5
cpu MHz         : 1600.000
cache size      : 8192 KB
physical id     : 1
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 23
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida nonstop_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips        : 4533.52
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: [8]
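Regarding why more GPUs give a higher “average”: judging from the source posted above, the printed “Average … bandwidth” line is really the sum of the per-device bandwidths, so it grows with the device count even when each individual GPU gets a smaller share. A toy illustration (made-up timings, same formula as main()):

[codebox]/* Toy illustration of the reporting in main() above: the "Average" line is
 * the SUM of per-device bandwidths. With two made-up devices at ~2 GB/s
 * each, the tool would print ~4000 MB/s. */
#include <stdio.h>

int main(void)
{
    float elapsedMs[2] = { 3200.0f, 3200.0f };   /* hypothetical per-device times */
    double bytes = (double)(1 << 27) * 50;       /* MEMCOPY_SIZE * MEMCOPY_ITERATIONS */
    double total = 0;
    for (int i = 0; i < 2; i++)
        total += (1e3 * bytes) / (elapsedMs[i] * (double)(1 << 20));
    printf("reported \"average\" for 2 devices: %f MB/s\n", total);
    return 0;
}[/codebox]

The run-to-run spread within one configuration is a separate matter; host memory and QPI contention (and which socket the threads happen to land on) would be my first guess.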

It’s not a fire-breathing Nehalem, but I just stuffed 3 GT 240’s (w/GDDR5) into a unique Socket 775 motherboard with an IDT 24-lane PCIe Gen2 switch: http://www.pixel.io/blog/2010/2/23/triple-nvidia-gt-240-gpu-workstation.html

The first slot is connected directly to the P45’s first x8 port and the second x8 port connects to the second and third slots via the IDT switch.

I’ll take a shot at getting concBandwidthTest working with Windows threads… I slapped together a Visual Studio project but will have to put some time into either changing the pthread references to their Win32 equivalents or perhaps just using one of the Win32 pthread libraries that are already out there.
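In case it helps, the rough mapping I have in mind (untested sketch only; needs the Vista-or-later condition-variable API, names mirroring the original globals):

[codebox]/* Rough Win32 equivalents of the pthread pieces used above (requires the
 * Vista+ condition-variable API). This is only a sketch of the mapping,
 * not a tested port of concBandwidthTest. */
#include <windows.h>
#include <stdio.h>

#define MAX_DEVICES 16

CRITICAL_SECTION lock;              /* pthread_mutex_t  -> CRITICAL_SECTION   */
CONDITION_VARIABLE condvar;         /* pthread_cond_t   -> CONDITION_VARIABLE */
HANDLE devThreads[MAX_DEVICES];     /* pthread_t        -> HANDLE             */
volatile LONG numWaiting = 0;

DWORD WINAPI testBandwidthHtoD(LPVOID id)   /* void* fn(void*) -> DWORD WINAPI fn(LPVOID) */
{
    unsigned int devID = (unsigned int)(UINT_PTR)id;
    /* ... per-device CUDA setup exactly as in the Linux version ... */
    EnterCriticalSection(&lock);
    InterlockedIncrement(&numWaiting);
    SleepConditionVariableCS(&condvar, &lock, INFINITE);  /* pthread_cond_wait */
    LeaveCriticalSection(&lock);
    /* ... timed copies ... */
    printf("device %u ready\n", devID);
    return 0;
}

int main(void)
{
    int numDevices = 1;             /* placeholder; parse argv in the real port */
    int i;
    InitializeCriticalSection(&lock);
    InitializeConditionVariable(&condvar);
    for (i = 0; i < numDevices; i++)
        devThreads[i] = CreateThread(NULL, 0, testBandwidthHtoD,
                                     (LPVOID)(UINT_PTR)i, 0, NULL);
    while (numWaiting != numDevices)
        Sleep(1);                   /* same busy-wait idea as the original */
    WakeAllConditionVariable(&condvar);         /* pthread_cond_broadcast */
    WaitForMultipleObjects(numDevices, devThreads, TRUE, INFINITE);  /* pthread_join */
    return 0;
}[/codebox]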

Let me know if someone else has done this already!


I found a bug with this when I ported it to Windows, so I’ll update this at some point. (somebody punch me if I don’t, there’s a dumb race condition that triggers on certain configs)
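For anyone who wants to patch around it locally in the meantime: my guess is that the start-line rendezvous (increment numWaiting, then pthread_cond_wait, with main spinning and broadcasting) can lose the broadcast if it fires before a thread is actually waiting. One way to avoid that class of bug is a pthread barrier; a sketch of the idea, not the official fix:

[codebox]/* Sketch: replace the numWaiting/condvar start-line with a pthread barrier
 * sized for the workers plus main, so the "go" signal cannot be missed.
 * (My guess at the race, not an official concBandwidthTest patch.) */
#define _POSIX_C_SOURCE 200112L   /* expose pthread_barrier_* under -std=c99 */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 2

static pthread_barrier_t startLine;

static void* worker(void* arg)
{
    /* ... per-device CUDA setup would go here, as in the real tool ... */
    pthread_barrier_wait(&startLine);   /* replaces lock / ++numWaiting / cond_wait */
    /* ... timed copies start here, now that everyone is ready ... */
    printf("thread %ld started\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t tids[NUM_THREADS];
    pthread_barrier_init(&startLine, NULL, NUM_THREADS + 1);  /* workers + main */
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tids[i], NULL, worker, (void*)i);
    pthread_barrier_wait(&startLine);   /* replaces the spin + pthread_cond_broadcast */
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tids[i], NULL);
    pthread_barrier_destroy(&startLine);
    return 0;
}[/codebox]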