I ran the “oclSimpleMultiGPU” example on Vista 64-bit with two GeForce 280s. The program runs fine, and both devices are used for the computation. However, the devices are never used simultaneously. I ran the program under the Visual OpenCL Profiler. The “GPU time width plot” shows the workloads for both devices, but I have never managed to reach a state where both devices are busy at the same time: the bars for the two devices never overlap.
How can I get two devices to compute a kernel simultaneously? Have I overlooked a magic compiler flag, or some documentation about this for the current SDK version?
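To be precise about what "overlap" means here, this is a minimal sketch of the check I am doing by eye on the plot. The `busy_interval` type and `overlap_ns` function are made-up names for illustration, and the sketch assumes the start and end times of both kernels are already on one common timeline, as in the profiler's plot. Raw per-device timestamps are not guaranteed to share a clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: the time span one device spends on one kernel,
 * in nanoseconds on a timeline shared by both devices (an assumption). */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} busy_interval;

/* Length in ns of the time during which both devices are busy.
 * Returns 0 when the two intervals do not overlap at all. */
uint64_t overlap_ns(busy_interval a, busy_interval b)
{
    assert(a.start_ns <= a.end_ns && b.start_ns <= b.end_ns);
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

What I am seeing corresponds to `overlap_ns` returning 0 for every pair of kernels from the two devices.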
Are you using the 3.0 release SDK? I think I remember reading in one release note that multi-GPU does not work correctly on Windows 7 and Vista (the work gets serialized), but I don’t see this in the 3.0 release notes. Perhaps it was in the beta version?
I was able to get two devices working together by creating multiple contexts, one for each device. As I understand the OpenCL spec, this should not be necessary, and it is verifiably not required on AMD, but it seems to work in practice on NVIDIA. The catch is that I now seem to run out of memory, as there must be significant overhead with each context created…