four 9800GX2 cards: will it work?

Posted: Thu Mar 20, 2008 2:21 pm


I work as a Computer Scientist at an academic image processing department. In my research, I regularly work with CUDA for GPGPU in the context of image reconstruction problems. It seems possible to me to put four of the new 9800GX2 cards on a single motherboard, for example the MSI K9A2 Platinum V2 (http://global.msi.com.tw/index.php?func=proddesc&prod_no=1395&maincat_no=1), which has four (physical) PCI Express x16 slots with double spacing between them.

I am hoping to get access to eight separate GPUs by using four 9800GX2 cards. My computations can be distributed over these eight GPUs without requiring communication between the different cards (or between the two GPUs on each card).

So far, I have not found any information from others who have tried this before. Cooling and power are obvious problems that can probably be dealt with. Any ideas on problems that I can expect on the software side? In particular, I am looking for more information on the following issues:

(1) Will the (non-SLI) driver see the four 9800GX2 cards as eight independent GPUs? Or is there a maximum limit built into the current WinXP drivers for the non-SLI mode?

(2) Is there a maximum GPU count built into CUDA itself? (A simple device-enumeration test, sketched after these questions, should at least show what the runtime reports.)

(3) I know early versions of CUDA required a separate CPU core for each GPU (so four GPUs would be the maximum on a quad-core CPU). Is this still the case?
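For reference, the enumeration test I have in mind is nothing more than the sketch below (only the standard cudaGetDeviceCount / cudaGetDeviceProperties runtime calls; the printout format is arbitrary):

[code]
/* Minimal device enumeration: with four 9800GX2 cards this should
   hopefully report 8 devices if the driver exposes them all. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA devices reported: %d\n", count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("  device %d: %s, compute %d.%d, %lu MB\n",
               dev, prop.name, prop.major, prop.minor,
               (unsigned long)(prop.totalGlobalMem >> 20));
    }
    return 0;
}
[/code]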

I would greatly appreciate detailed comments on these issues from an NVIDIA staff member.

Any other comments would also be appreciated!

You would need at least eight CPU cores to fully benefit from four 9800GX2 cards. The motherboard you have in mind only has one CPU socket …

– Kuisma

Edit: Oh, you already addressed this issue. :">

I sound like a broken record on this issue, but I have yet to hear of anyone fitting 4 double-wide PCI Express cards into an ATX case. The card in the last slot will extend past the edge of the motherboard, and will probably hit the bottom of the case.

(If anyone knows of a computer case which does not have this problem, please let me know.)

Kuisma: Do you know of a clear reference (for example in the CUDA documentation) for the one-core-per-thread requirement? The opinions on this forum seem to vary (see, e.g., http://forums.nvidia.com/lofiversion/index.php?t57154.html).

If it is not a requirement, but just a recommendation, that would be fine for me, as I have very little CPU activity throughout the computations. Putting two threads on each core would be fine for me.

I also recall (but maybe I’m wrong) that in the early days of CUDA, each kernel execution blocked the CPU thread. Maybe the requirement comes from that version and is not present in the newer versions…

It would be great if someone from NVIDIA could clarify this?

The Lian-Li Armorsuit has 10 slots :)

http://www.overclockers.co.uk/showproduct.php?prodid=CA-117-LL&groupid=701&catid=7&subcat=

No, I really don't. It's just that someone here on the forum reported performance problems when running more GPUs than CPU cores, and solved the problem by replacing a dual-core CPU with a quad-core. There is also the fact that I always see one core 100% busy whenever I run my CUDA applications (one GPU). But I have not seen this confirmed by NVIDIA.

Why not perform a test with one core and two GPUs?

– Kuisma

Sorry, I don’t have a clear reference. Most of the posts that discuss this are kind of old and hard to find. But, I’ll just describe the issues here and you can make your own judgment call: just consider yourself warned.

The problem has nothing to do with the CPU calculations your application performs. CUDA uses 100% of one CPU core whenever you perform a cudaThreadSynchronize(), or whenever an implicit sync is performed for you (i.e. when you do a memcpy or make more than 16 kernel calls in a row). The reason is simple: busy-waiting keeps the latency of detecting kernel completion very low (well, some people don't think it is low: http://forums.nvidia.com/index.php?showtopic=62610). Hence the recommendation by nearly everyone on these forums to have one CPU core per GPU, so they can all busy-wait together without problems.

Is it an absolute 100% requirement? Not in all circumstances. If you make lots of short kernel calls or memcpys, and thus have lots of implicit syncs occurring, I would say it is a requirement. There is one post (I wish I could find it) where a particular user had their 2-GPU code running slower than the 1-GPU version on a single-core system. Once they upgraded to 2 CPU cores, the problem went away and performance doubled as expected.

But if your kernel calls take seconds, a few extra ms of polling can't hurt too much, so you can get by with fewer cores (more on polling below).

The only way to work around the 100% CPU utilization is to insert events into your streams and then write your own busy-wait loop with a short sleep in it, using the stream query facility. This obviously increases the latency of detecting when you reach that point in the stream, but because of the sleep, the overhead of having 2 busy loops on one CPU core should be minimal.
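A rough sketch of that workaround (my own reading of the event API; the helper name and the sleep interval are arbitrary):

[code]
/* Poll an event with a short sleep instead of letting
   cudaThreadSynchronize() spin at 100% CPU. */
#include <cuda_runtime.h>
#ifdef _WIN32
#include <windows.h>
#define SHORT_SLEEP() Sleep(1)        /* ~1 ms */
#else
#include <unistd.h>
#define SHORT_SLEEP() usleep(1000)    /* ~1 ms */
#endif

void wait_without_spinning(cudaStream_t stream)
{
    cudaEvent_t done;
    cudaEventCreate(&done);

    /* ... kernels / async memcpys have already been queued into 'stream' ... */

    cudaEventRecord(done, stream);    /* marker after the queued work */
    while (cudaEventQuery(done) == cudaErrorNotReady)
        SHORT_SLEEP();                /* give the core back instead of spinning */

    cudaEventDestroy(done);
}
[/code]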

Also, I might add that with that many GPUs in a single case you are going to need a monster of a power supply and massive cooling, and I wouldn't trust it not to overheat anywhere but in an air-conditioned server room. They are more expensive, but have you considered a couple of S870 or D870 units to obtain the same number of GPUs per workstation? They at least have their own power supplies and cooling.

Here is an issue related to the one-CPU-core-per-GPU topic: a low-priority background process (not CUDA) caused severe performance degradation in a CUDA program. http://forums.nvidia.com/index.php?showtopic=51980&hl=background

Notice the post by nwilt (NVIDIA rep) near the bottom where the busy loop and thread yielding are discussed.

MisterAnderson42: Thanks a lot! This really clarifies things for me. I agree that buying a stack of Tesla boxes would probably be a better/safer choice, although I think you would need at least three D870s to match the (potential) performance of four 9800GX2 cards.

There is also a certain “fun factor” involved in this, so I may try it temporarily, just to see if it works :) … and then take two cards out and put them in a second PC.

No problem. And I can't argue with the fun factor. Upload a digital picture of the inside of the case when you get it running so we can all see :) Kill-A-Watt measurements of the system's power draw could be entertaining too, if you have access to one.

The 9800 GX2 requires an 8-pin PCI-e power connector.
You will need 4 of them.

The Thermaltake Toughpower 1500W http://www.thermaltake.com/product/Power/T…w0171/w0171.asp seems to be capable of that (and even built for it?).

One thing that still worries me: is there an upper limit on the number of GPUs built into the driver or not?

The problem is going to be on the BIOS side.
I have seen motherboards able to support 4 GPUs, others 6.

Maybe someone has some experience with streams. I am thinking of using 2 GPUs per CPU core (2 cores have to do other things in my case).

I was thinking of doing the following:

receive data from another machine to be processed
determine if it has to go to GPU1 or 2
insert memcopy into the stream for GPU#
insert kernel calls into the stream (about 8)
insert memcopy into the stream for results
insert event

check if the event on GPU1 or GPU2 has been reached
if the event has been reached, transfer the results to the other machine
if data to be processed has been received, do the things above; otherwise check again for an event in the queue for GPU1 or GPU2

That way I have a busy loop that is doing 3 things:
- receiving data
- filling the streams
- moving results to the other machine (when the memcopy is finished)

As I have very little experience in this area (I am basically a MATLAB coder), I would like to know if people already see trouble with this setup. It is for a real-time system, so normally I will be sending and receiving data while the kernels are running, and my busy loop will actually be doing interesting work. When it no longer receives anything, my kernels should be finished, and otherwise I don't mind busy-looping.
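For one GPU's stream, what I have in mind would look roughly like the sketch below. receive_data(), send_results(), my_kernel and the buffer variables are placeholders for my own code; the stream/event calls are the standard runtime API, and host_in/host_out would be allocated with cudaMallocHost() so the async copies can actually overlap. (As far as I understand, each GPU still needs its own host thread and context, even if two such threads end up sharing one CPU core.)

[code]
/* Sketch of the busy loop for one GPU / one stream.
   receive_data(), send_results(), my_kernel and the buffers are placeholders. */

extern int  receive_data(float *host_in);           /* placeholder: new input? */
extern void send_results(const float *host_out);    /* placeholder */
__global__ void my_kernel(const float *in, float *out);  /* placeholder */

void gpu_loop(float *host_in, float *dev_in, float *host_out, float *dev_out,
              size_t in_bytes, size_t out_bytes, dim3 grid, dim3 block,
              volatile int *running)
{
    cudaStream_t stream;
    cudaEvent_t  done;
    cudaStreamCreate(&stream);
    cudaEventCreate(&done);

    int batch_in_flight = 0;

    while (*running) {
        /* queue a whole batch asynchronously when new input arrives */
        if (!batch_in_flight && receive_data(host_in)) {
            cudaMemcpyAsync(dev_in, host_in, in_bytes,
                            cudaMemcpyHostToDevice, stream);
            for (int i = 0; i < 8; ++i)                   /* ~8 kernel calls */
                my_kernel<<<grid, block, 0, stream>>>(dev_in, dev_out);
            cudaMemcpyAsync(host_out, dev_out, out_bytes,
                            cudaMemcpyDeviceToHost, stream);
            cudaEventRecord(done, stream);                /* end-of-batch marker */
            batch_in_flight = 1;
        }

        /* ship the results out once the event has been reached */
        if (batch_in_flight && cudaEventQuery(done) == cudaSuccess) {
            send_results(host_out);
            batch_in_flight = 0;
        }
    }

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
}
[/code]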

DenisR: no offense, but what does your post have to do with the topic of this thread? ;)

Ah, I hadn’t thought about that yet. Have you ever seen a motherboard capable of supporting 8?

Wow, neat! Does it let you offset the motherboard up one slot, then?

Definitely post pictures of this monster when you get it working! :)

No offense taken, but since you yourself posted a message about putting 2 'GPU threads' on 1 CPU core, my post seems quite relevant to me (using streams like this may be a solution to the one-core-per-GPU 'problem' that I will also run into if I swap my 2x 8800GTX for 2x 9800GX2).

I contacted MSI customer support about this:

and got their reply today:

At least the customer support people are not aware of a fixed limitation (but I won’t count on it) :)

Since you are managing your own busy loop, this should work out without too many problems. Eight kernel calls shouldn't trigger an implicit sync, but I've never tested the queue depth with the streaming API.