4 x GTX-295: CUDA only sees 5 x GPU (NOT the usual issues!)

I have been benchmarking various CUDA configurations on an ASUS P6T7 WS motherboard. These configurations include the following cards: 9600GT, GTX-295, Quadro FX5800, and three Tesla C1060s. Everything seems to work as expected except when I have multiple GTX-295 cards in the system. I am running Vista Ultimate 64 SP2 and using the 186.18 drivers.

For the 4 x GTX-295 config: Device Manager sees all 8 GPUs. GPU-Z sees all 8 GPUs. However, the CUDA apps I am using (BOINC and EDPR) only report 5 GPUs in the system. Through several reconfigurations, it appears that the primary video card gets credit for two GPUs and the rest only get one per card. (IOW: 2 cards present 3 GPUs, 3 cards present 4 GPUs, 4 cards present 5 GPUs.)
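In case anyone wants to check the count independently of BOINC or EDPR, a trivial runtime sketch like the following (just cudaGetDeviceCount, compiled with nvcc) should report the same number those apps see:

```cpp
// count_gpus.cu - minimal sketch: ask the CUDA runtime how many GPUs it sees.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA enumerates %d GPU(s)\n", count);  // reports 5 here with 4 x GTX-295
    return 0;
}
```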

I have PhysX enabled and SLI disabled.

I have monitors attached to all video ports on all cards and the desktop is spread across all of them. (A KVM works nicely for this - I can see all desktops independently without actually needing multiple monitors)

I have also tried the "LimitVideoPresentSources" and "DisplayLessPolicy" registry updates, and they make no difference… (Although in my initial (unedited) registry, "DisplayLessPolicy" was not at the 0000 or 0001 level but rather just under the GPUID key. "LimitVideoPresentSources" was at the 0000 level in the unedited registry, however, and was a binary entry - not a DWORD.) In any case, I don't think this really matters if I have all 8 monitors hooked up…
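For anyone searching later, the commonly posted version of this hack looks roughly like the sketch below. This is a sketch only: the {GUID} path is machine-specific, and as noted above my unedited registry disagreed about both where these values live and whether they are DWORD or binary.

```
Windows Registry Editor Version 5.00

; Sketch of the commonly posted hack - replace {GUID} with the adapter's
; actual key. My unedited registry had LimitVideoPresentSources as a
; binary value (not a DWORD), so your mileage may vary.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Video\{GUID}\0000]
"DisplayLessPolicy"=dword:00000001
"LimitVideoPresentSources"=dword:00000001
```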

My current config in focus is only 2 x GTX-295 (which only presents 3 GPUs). I don't see any point in messing with 3 or 4 cards until I get two properly presented…

Anyone have a clue? I spent most of the day yesterday searching the web for helpful info… Could this be a 186.18 driver issue? I also had Tesla cards in the system at one point - could they have “poisoned” the registry and limited functionality of any replacement cards in their positions?

Although I did reinstall the box once already to try and diagnose this issue, I am in the process of a fresh install again to clear the decks…

Thanks in advance,
Ed

Here’s my limited understanding of what’s going on, from reading tons of stuff online and my own experimentation.

  1. Windows Vista and Windows 7, and the related server OSes, need a monitor (or dummy plug) attached to a video port in order to detect and use additional video cards. Usually, making sure that all installed GPUs show up in Windows Device Manager is a prerequisite (but not a guarantee) for having those GPUs detected by CUDA. I think the LimitVideoPresentSources setting is supposed to get around this at the NVIDIA driver level, since it's necessary for Tesla support (in other words, the card won't show up under Display Devices in Windows and you don't need to extend your desktop, but CUDA will still see it). If you already have a KVM or dummy plugs, the registry settings don't seem necessary, but they might help. (For a quick way to see what CUDA actually enumerates, see the deviceQuery-style sketch after this list.)

  2. GTX 295 cards that have the old two-board sandwich design actually have three video output ports: two DVI and one HDMI. Both DVI ports seem to be associated with one GPU, and the HDMI port is associated with the other GPU. This means that attaching monitors/dummy plugs to both DVI ports is not enough; you need something connected to the HDMI port. Ideally this would be an actual monitor and not an HDMI → DVI → VGA dummy plug.

  3. Windows (or BIOS?) has a limit of 8 video cards per computer. Because each GTX 295 has three video outs (2x DVI, 1x HDMI), and GPUs are detected based on video port in Vista/Win7, Windows is going to hit the 8-port limit before it can “see” all 8 GPUs. In other words, just because Windows has detected 8 video cards doesn’t mean it’s actually seeing all 8 GPUs, since (as stated above) it seems that both DVI ports are related to the same GPU. In the real world, this means that you hit the 8 video card limit at the 5th GPU. The theory is that it works something like this:

GTX 295 #1: [DVI - DVI] - [HDMI]
GTX 295 #2: [DVI - DVI] - [HDMI]
GTX 295 #3: [DVI - DVI] - limit reached

The brackets show how the video outs are related to GPUs, and why you hit the limit at 5: Windows has counted 8 video outs at that point. Now, Windows seems to randomize the monitor assignment across cards, so which 5 GPUs are actually used may vary, but I've never seen anyone able to access more than 5 GPUs on Vista. The registry magic might solve this too, but I haven't seen any reports that it does.

  4. There seems to be some acknowledgement that there is a problem with some drivers (not sure if it's just Vista 64 or Vista in general) in that GTX 295 CUDA support is lacking. The theory is that disabling SLI in the driver sometimes isn't enough in Vista, and that you need to remove the physical SLI bridge. Since this is not possible with the all-in-one GTX 295, problems occur. There's speculation that the new CUDA Toolkit and SDK 2.3 betas (forum topic 99797) might resolve this, because the release notes mention "GPUs in an SLI group are now enumerated individually, so you can achieve multi-GPU performance even when SLI is enabled for graphics." NVIDIA drivers in general can be fickle about detecting cards, so I'm still hoping new drivers will resolve a lot of these issues.
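One way to test these theories on a given box is to dump exactly what CUDA enumerates, deviceQuery-style. Here's a rough sketch (the fields shown have been in cudaDeviceProp since the early toolkits); each GPU of a GTX 295 should show up as its own "GeForce GTX 295" entry if things are working:

```cpp
// enum_gpus.cu - sketch: list every device the CUDA runtime enumerates,
// so the list can be compared against Device Manager and GPU-Z.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        printf("CUDA runtime/driver not found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d MB, compute %d.%d\n",
               i, prop.name, (int)(prop.totalGlobalMem >> 20),
               prop.major, prop.minor);
    }
    return 0;
}
```

On a healthy 2 x GTX 295 box you'd expect four identical lines; evanevery is seeing three.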

evanevery, I think you’re running into issue #2, assuming you have the old sandwich-style GTX 295s. Even if you solve it by hooking your KVM to the HDMI ports (or by playing with the registry), you’ll probably run into issue #3 if you have 3 or 4 GTX 295s installed. However, if you could get CUDA to fully utilize 4x GPU with two cards installed, that would be a partial success. (You might wonder why your first card is using both GPUs if you don’t have anything hooked to that HDMI port. I’m guessing it’s because the NVIDIA control panel PhysX/SLI settings are doing something special for that card, since it’s reasonable to assume NVIDIA has tested a single GTX 295 card more than they have tested multiple cards. You might check if you can spot anything unusual in your registry, where one card has different settings than the others.)

My problem is slightly different, in that I have three of the new single-PCB GTX 295s. These cards do not have HDMI outputs; they only have dual DVI ports, so problems #2 and #3 should be avoided. However, so far I’m only able to access the first GPU on each card, even though I can extend the desktop across 6 monitors. I suspect I might be running into problem #4, or some new as-yet-unidentified problem related to this new hardware design. I’ve only tried Windows 7 x64, so that might be the main issue. I’m going to experiment with other OSes this week. My goal is to break the 5-GPU barrier on some OS newer than XP.

SazanEyes,

Thanks for all the good info. However, I think that most of it may be irrelevant in my case:

  1. I have currently fallen back to a 2 x GTX-295 implementation, since I can't even get 4 GPUs visible in that config. This mitigates a lot of the other issues; there is no reason for me to struggle with all four cards if I can't even get two of them properly enumerated. So my testing sits at two cards for the moment. I've not heard of a 5-GPU limit, but it certainly shouldn't be an issue with just 2 x GTX-295s in any case. With 2 x 295s, I only see 3 GPUs max! So I have to get past that first!

  2. I'm using Zotac GTX-295 cards. I have both RevA and RevB boards from that manufacturer: 1 RevA board (dual PCB with HDMI) and about 30 (!!!) RevB boards (single PCB without HDMI). I'm using all RevB boards at the moment so they all match. As they do not have an HDMI interface, that should not be an issue. Even if there were some sort of missing support for enumerating HDMI, I should still see at least 4 GPUs with 2 cards. (And as I add more cards, I still get one more GPU per 295 added, so I did not hit any limit at 2 cards.)

  3. I am using a monitor (a KVM, actually) on all available video ports. The desktop is spread across all screens, and I can confirm that with the KVM. So registry editing should be irrelevant. PhysX enabled, SLI disabled…

  4. I have edited the crap out of my registry in the hopes that it might help (to no avail). I have a GHOST image of the baseline system, so it's really no big deal to restore the image and try again from time to time. After the baseline install, I actually have six registry keys which refer back to the GTX-295. Two of those keys have 0000 and 0001 subkeys which each contain a "Settings" subkey with a Device Description of GTX 295. Two other keys contain only a 0000 subkey that has a GTX 295 description key. So there are a total of four parent "GPUID" keys: two of them have 0000 and 0001 subkeys with 295 descriptions, and two of them have only one 0000 subkey with a 295 description. Anyway, I've tried a myriad of combinations on the test system and can never get more than 3 GPUs to show up with 2 x 295s installed! It would be REALLY helpful if these registry keys were better documented; the "two-liner" description that's floating around doesn't cover what I'm seeing. And it would be really helpful if there were some way to tie the parent GPUID back to the actual driver instance. I've plowed around for hours comparing registry GPUID keys to the properties of each driver instance in the device manager and can't find the link… In any case, none of this SHOULD matter if I have all video ports in use, right?
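To illustrate what I mean about documentation, here's the sort of throwaway I've been poking with - a Win32 sketch (the "Device Description" value name and exactly where it lives under each GUID key seems to vary by driver rev, sometimes under a further "Settings" subkey as I described above) that walks the Control\Video keys and prints whatever descriptions it finds:

```c
/* dump_video_keys.c - rough sketch: walk the Control\Video GUID keys and
 * print any "Device Description" values under the 0000/0001 subkeys.
 * Treat this as a starting point, not gospel.
 * Build with VC++: cl dump_video_keys.c advapi32.lib
 */
#include <windows.h>
#include <stdio.h>

static void print_desc(HKEY hGuid, const char *sub)
{
    HKEY hSub;
    char buf[256] = {0};
    DWORD len = sizeof(buf) - 1, type;

    if (RegOpenKeyExA(hGuid, sub, 0, KEY_READ, &hSub) != ERROR_SUCCESS)
        return;
    if (RegQueryValueExA(hSub, "Device Description", NULL, &type,
                         (LPBYTE)buf, &len) == ERROR_SUCCESS && type == REG_SZ)
        printf("    %s: %s\n", sub, buf);
    RegCloseKey(hSub);
}

int main(void)
{
    HKEY hVideo, hGuid;
    char name[256];
    DWORD i;

    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                      "SYSTEM\\CurrentControlSet\\Control\\Video",
                      0, KEY_READ, &hVideo) != ERROR_SUCCESS) {
        printf("Cannot open Control\\Video\n");
        return 1;
    }
    for (i = 0; ; ++i) {
        DWORD len = sizeof(name);
        if (RegEnumKeyExA(hVideo, i, name, &len, NULL, NULL, NULL, NULL)
                != ERROR_SUCCESS)
            break;                          /* no more GUID subkeys */
        printf("%s\n", name);
        if (RegOpenKeyExA(hVideo, name, 0, KEY_READ, &hGuid) == ERROR_SUCCESS) {
            print_desc(hGuid, "0000");
            print_desc(hGuid, "0001");
            RegCloseKey(hGuid);
        }
    }
    RegCloseKey(hVideo);
    return 0;
}
```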

Once again, since I'm using the RevB boards for testing (single PCB, no HDMI), your issues #2 and #3 seem to be irrelevant (as you also suggest)…

I firmly believe we have a driver or CUDA issue. The first card gets both GPUs enumerated, and any additional card I attach only gets a single GPU enumerated. As you also suggest, the code must support those folks with a single 295 card, so the developers obviously tested that. The code also must support the Tesla boards (1 GPU) augmenting a primary video card, so they obviously should have tested that as well. How much do you want to bet that no one tested multiple 295 cards, and that the additional logic required to support multi-GPU cards simply didn't make it past the "primary" video card?

Is anyone from NVIDIA reading this thread? I have about $23,000 worth of video cards on my test bench for this project (including a Quadro FX5800, 31 x GTX-295 (RevA and RevB), 3 x Tesla C1060, and a crapload of 9600GTs). If anyone from NVIDIA wants to step up and help troubleshoot this issue, I have the hardware standing by!

Ed

That's really interesting. You're the only other person I've found who's seriously tried to get the new (RevB) GTX 295 boards working with CUDA on Vista. Most people seem to stick with XP because it's safe, and I have an XP box with four 9800GX2 cards that makes all 8 GPUs available to CUDA with no issues.

You’re one step ahead of me, because I’ve never gotten more than one GPU to work on a card. I think Windows 7 is part of the problem, and NVIDIA has said on this forum that it’s not supported yet. I’m installing 32-bit Vista on my GTX 295 box now, and I’ll let you know how it goes.

I installed the 190.15 drivers on Vista, and I can see four CUDA devices in BOINC. There’s definitely a difference between Vista and Win7.

I installed an XP image on my test platform, and XP appears to see all GPUs with no issue. I can get up to 3 cards installed before I get a blue screen (XP driver v182.50). I'm thinking the 4th card is a power issue, though, so I've got a bigger PSU coming…

Additionally, the XP drivers did not require monitors to be attached or the registry to be hacked in order to get all GPUs enumerated to CUDA.

I’m also downloading 186.18 drivers for XP to see if that makes a difference (wrt the blue screen). The new PSU will be here tomorrow in any case.

Under XP: 2 cards = 4 GPUs, 3 cards = 6 GPUs, so everything looks like it should. Same exact system and cards as used for the Vista test. This pretty much nails it as a Vista driver issue! (Is anyone from NVIDIA following this thread?)

Where did you get the v190.15 drivers? I'd like to try them…

UPDATE: The v186.18 drivers did NOT resolve the blue screen when adding the 4th card. I'm hoping it's a power issue and that a larger PSU will take care of it. (Currently running 3 cards on 1100W, expecting a 1500W PSU tomorrow…)

UPDATE: Found the 190.15 leaked drivers. They don't change anything with Vista64 (still only sees 5 GPUs max = 2+1+1+1), and they also didn't cure my blue screen under XP when trying to add the 4th card… Unless new info is discovered, I see no point in continuing to test Vista; it simply doesn't appear to support more than 5 GPUs. (Suggestions welcome!) I will continue testing with XP once I get the larger PSU. Someone needs to tell NVIDIA that XP is (unfortunately) EOL and CUDA support under Vista appears broken wrt multi-GPU cards!

It definitely sounds like you have a power issue. My 4x 9800GX2 box running XP has 1500W (two 750W PSUs), and my triple GTX 295 box has 1050W.

NVIDIA has always blamed Microsoft for changing the video architecture in Vista to make multiple monitors or dummy plugs necessary for CUDA detection to work. This is true to some extent, but as the registry hacks show, it's not completely true. In fact, once I got four GPUs to be recognized in Vista last night, I removed all dummy plugs, rebooted, and CUDA was still able to use four GPUs. I had done some registry hacking, and I don't know if that helped or not, but I was running BOINC on four GPUs with three grayed-out monitors in Windows display properties (desktop NOT extended) and only one monitor checked in the multi-display part of the NVIDIA Control Panel. If it's possible with four GPUs, it should be possible with more.

To add to the weirdness, at one point I saw seven monitors in the Windows display properties. At that time, in HKEY_LOCAL_MACHINE\Hardware\DeviceMap\Video there were indeed seven entries pointing to NVIDIA drivers. Six of the entries (maybe Video3 through Video8) were pointing at the 0000 and 0001 entries below three GUIDs in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Video\{GUID}, which seems logical for three GTX 295 cards. There was also a Video10 which pointed to a Control\Video GUID with only a 0000 child. There were a couple of extra entries in Control\Video with GTX 295 descriptions but no corresponding entry in DeviceMap\Video. All these registry entries seem to be controlled by the various display control panels, and can be rewritten by turning PhysX off and back on, for example. I think there's a limit to how much registry hacking can accomplish without better drivers, or at least better documentation of how the NVIDIA drivers work. I really don't want to have to teach myself how to write a video driver for Vista just to get CUDA working.
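By the way, DeviceMap\Video looks like the link evanevery was asking for: each \Device\VideoN value there is a string pointing back at a Control\Video\{GUID}\000x key. Here's a quick sketch along the same lines as his (same caveats apply - it's a throwaway, not gospel):

```c
/* dump_devicemap.c - sketch: print the \Device\VideoN -> Control\Video
 * mapping from HKLM\HARDWARE\DEVICEMAP\VIDEO.
 * Build with VC++: cl dump_devicemap.c advapi32.lib
 */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HKEY hMap;
    DWORD i;

    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE, "HARDWARE\\DEVICEMAP\\VIDEO",
                      0, KEY_READ, &hMap) != ERROR_SUCCESS) {
        printf("Cannot open DEVICEMAP\\VIDEO\n");
        return 1;
    }
    for (i = 0; ; ++i) {
        char name[64], data[512] = {0};
        DWORD nlen = sizeof(name), dlen = sizeof(data) - 1, type;
        if (RegEnumValueA(hMap, i, name, &nlen, NULL, &type,
                          (LPBYTE)data, &dlen) != ERROR_SUCCESS)
            break;                          /* no more VideoN values */
        if (type == REG_SZ)                 /* e.g. \Device\Video3 -> ...\Control\Video\{GUID}\0000 */
            printf("%s -> %s\n", name, data);
    }
    RegCloseKey(hMap);
    return 0;
}
```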

  1. Have you gotten 4 GPUs visible under Vista using only two dual-GPU cards? I can get 4 GPUs enumerated, but I have to use 3 x 295s (2+1+1), a 295 and two Teslas (2+1+1), or a Quadro and three Teslas (1+1+1+1). The only "position" I have EVER gotten to recognize 2 GPUs on a single card is the "primary" video card (in Vista).

  2. I see the same issues in the registry. When I have 4 x 295s installed, I get four registry "parents" with both 0000 and 0001 subkeys, and four registry "parents" with only a 0000 subkey. Additionally, you have to drill down even further in the "0000/0001 parent" keys to find the "295" descriptor (I believe it's under "Settings"). The "0000 parent" keys have the descriptor right at the 0000 level…

I also tried hacking the inf for XP driver v178.13 as noted in the following thread: http://forums.nvidia.com/index.php?showtop…20&start=20

…and I still get bluescreened when trying to add the fourth card (under XP). I'm REALLY hoping this is a PSU issue, as that will be resolved today. Unfortunately, I have found a bunch of other references to folks who get bluescreened specifically when adding a 4th card. It also happens with startling regularity, just as the driver starts to load for the 4th card. Because of the way it happens, I'm also a bit concerned that this is a BIOS (latest) or driver issue and not PSU-related. WRT the blue screen problem, 178.13 is not resolving the issue like others have reported (in XP)…

At this point, I'd really like to be able to get this config running under VISTA OR XP! This project is actually a baseline configuration for a very specialized product we will be offering to our customers in a pretty unique field (that's why we have so much money invested in hardware at this point). As an OEM equipment manufacturer, I CAN NOT continue to sell XP on our machines. (Although it went EOL for OEMs last January, there are still a few copies in the pipeline.) NVIDIA needs to recognize that they need to repair the issues with the drivers and CUDA under the current operating systems, as that is what OEMs must sell to their customers. While it may be OK for the customer base to continue to use existing copies of XP to run CUDA stuff, it is not appropriate (as far as M$ is concerned) for OEM manufacturers to do so. (Although there are some loopholes an OEM might exploit via XP downgrade rights from Vista Ultimate, it's a grey area and needlessly complicated.) NVIDIA needs to get their act together if they wish CUDA to move beyond the individual experimenter customer base and into OEM mainstream production!

UPDATE: The 1500W power supply did NOT resolve the blue screen issue when adding the 4th 295 to an XP system! Trying another MB tomorrow… (This is getting old.)

I upgraded to the new 190.38 beta drivers today. At first I didn’t see any change. Then I re-read this statement from the CUDA 2.3 changes: “GPUs in an SLI group are now enumerated individually, so you can achieve multi-GPU performance even when SLI is enabled for graphics.” I started to wonder what would happen if I turned SLI on in the NVIDIA control panel. So, after I did that (and waited a while for the changes to be saved or whatever), I fired up BOINC and saw five CUDA devices instead of four!

I don’t know if I’m still hitting a hard 5-GPU limit somewhere or not. I suspect that the control panel might just be expecting a maximum of four GPUs (for quad SLI) and so it doesn’t do the proper registry settings for the third card/sixth GPU. I haven’t dug around in the registry yet, but I’ll do that later this weekend. I just started up GPUGRID to confirm that all five GPUs actually do work. So far so good – five CUDA tasks are crunching away.
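If anyone wants a faster smoke test than waiting on GPUGRID work units, something like this sketch should exercise every enumerated device. Note that the current CUDA runtime ties a host thread to one device once a context exists on it, so each device gets its own worker thread (Win32 threads assumed here):

```cpp
// gpu_smoke.cu - sketch: run a trivial kernel on every enumerated device.
// The runtime binds a host thread to the first device it touches, so each
// device gets its own worker thread.
#include <cstdio>
#include <windows.h>
#include <cuda_runtime.h>

__global__ void bump(int *x) { *x += 1; }

static DWORD WINAPI worker(LPVOID arg)
{
    int dev = (int)(INT_PTR)arg;
    int *d = NULL, h = 0;

    cudaSetDevice(dev);   // must precede any other CUDA call in this thread
    cudaMalloc((void **)&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
    bump<<<1, 1>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("device %d: %s\n", dev, (h == 1) ? "OK" : "FAILED");
    return 0;
}

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count == 0) {
        printf("no CUDA devices\n");
        return 1;
    }
    HANDLE *t = new HANDLE[count];
    for (int i = 0; i < count; ++i)
        t[i] = CreateThread(NULL, 0, worker, (LPVOID)(INT_PTR)i, 0, NULL);
    WaitForMultipleObjects((DWORD)count, t, TRUE, INFINITE);
    for (int i = 0; i < count; ++i)
        CloseHandle(t[i]);
    delete[] t;
    return 0;
}
```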

I notice that I still have six GTX 295 devices listed in Device Manager, and Windows Display properties shows three monitors, as you would expect with SLI enabled. However, the desktop was only extended across two monitors. Also, EVGA Precision is only detecting five cards, and only showing partial properties for those cards. I tried extending the desktop onto the third monitor, and that seemed to cause all sorts of problems, with Windows entering an endless loop displaying a message “Display driver stopped responding and has recovered.” Unchecking the “Extend monitor” box fixed it. Also, now I see that SLI is not enabled in the NVIDIA control panel, but I restarted BOINC and it still detects five GPU devices, so I suspect SLI is still enabled in the registry somewhere.

I’m guessing the drivers can handle bi-, tri-, and quad-SLI, but freak out when you try sex-SLI. I think we need to demand sex-SLI, and I think NVIDIA marketing should see the obvious advantages in making it available. :P Seriously, it seems like NVIDIA can fix this by investing a little more effort in testing systems with 3 or 4 GTX 295 cards installed. I think my next test will be enabling SLI on Windows 7 with the 190.38 drivers and seeing what happens.

BTW, my system is drawing about 600W from the wall with five CUDA processes running and the CPU idle. With the CPU under load, it’s drawing about 660W.

I just installed GPU-Z. It confirms that SLI is enabled on four of the GPUs, and disabled on two of them. I assume that CUDA is working on all four GPUs on the two SLI cards, and only detecting the remaining card as a single GPU (as is typical under Vista). I could be wrong, so I’ll have to experiment with removing a card and seeing if I still have four CUDA devices. We definitely need sex-SLI support in the drivers, if not in the hardware.

I’d be interested to see what happened if I tried to actually run a game with SLI support, since I don’t have the quad-SLI hardware bridge installed, and I think the card that is connected to my monitor is the one that doesn’t have SLI enabled (which is why I currently have two desktops in the Windows control panel). For some reason, with my DFI motherboard, the card in the middle PCI-Express x16 slot is detected as the primary card by BIOS. I never know which card or video out Windows will think is primary, but currently they match.

I removed one video card, re-enabled SLI, and I could use all four GPUs from both GTX 295s in CUDA applications. I also installed Windows Server 2008 R2 (RC) on this machine in a dual-boot setup, and installing the 190.38 drivers and enabling SLI there also allowed me to use all four GPUs. I assume it will work with Windows 7 as well. No dummy plugs are needed, because with quad-SLI the four GPUs are viewed as one video device by Windows. I still have not installed the hardware bridge across the two cards to enable true quad-SLI.

I don’t know what the actual number is, but I wouldn’t be surprised if there is a 5 GPU limit on WDDM-based platforms. Let me ask around.

I got all four cards (8 GPUs) running nicely under XP. The ASUS board would simply crash when adding the third card; the ASRock board does not have these issues, so the hardware platform has been validated.

HOWEVER, this does us little good as an OEM, as we can't sell XP to our customers! This MUST work under Vista (and then Win7 when it becomes available). I'm loading a fresh Vista install on the platform right now. Based on my past experience, I'm guessing I'll only see 5 GPUs with all four cards running though… I do have an incident open with NVIDIA. I'll call them again once I confirm my findings…

UPDATE: Even with the new ASRock MB, Vista64 still displays exactly the same issue we have always been seeing. Only 5 GPUs max are enumerated (even with 4 GTX-295 cards installed): 2 GPUs for the first card and 1 for each additional card… This is a fresh Vista64 Ultimate install (SP2) using the current driver set, v186.18. I'll try the new beta 190.38 drivers shortly, once I make an image of the existing system…

evanevery, I’m interested to see what happens when you turn on SLI on your four-card system with the latest beta drivers. Either you’ll have six CUDA devices available (the four from quad-SLI plus one from each of the remaining two cards) or you’ll only have five CUDA devices, which will confirm there’s a five-GPU limit buried somewhere (but I honestly don’t know why that would be the case).

BTW, NVIDIA Control Panel in Windows Server 2008 R2 was smart enough to detect that I had enabled SLI but not installed the hardware bridge, so it was giving me little popup messages when I logged in. I have no idea if these messages are logged anywhere; I couldn’t find them if they are. Anyway, I went ahead and installed the bridge and got a popup message “SLI Enabled” so I’m running true quad-SLI on this machine now. I still have four CUDA devices. I’m about to install a GTX 285 to see if I get five CUDA devices. This should work, since it’s basically a “quad-SLI plus PhysX card” configuration.

Edit: I installed the GTX 285, and I had to play around for a while but I got all five CUDA devices to show up. The GTX 285 is monitor 1, and I had to install a dummy plug on that card. My real monitor is hooked to the first GTX 295 in quad-SLI (monitor 2), SLI is enabled and the bridge across the two cards is installed. BOINC looks fine, but now I’m going to try to install five instances of Folding@home and see how that goes.

The 190.38 beta drivers will not even load with all four cards in the system! I can get 190.38 to load with a single card but once it starts card detection for the additional cards, all I get is a black screen. I am going to reload my baseline image (with the shipping drivers) and try reloading the beta drivers once again…

Sometimes when I see a black screen, it means that Windows Vista has chosen a different video out as the “default” (monitor 1) so you may have to just try hooking your monitor to the various DVI outputs (or switching on your KVM) until you get a signal. The other option is to install LogMeIn and connect from a different machine. LogMeIn shows you monitor 1 by default and also lets you switch monitors like a virtual KVM. I’ve used both methods before to get things working.

It’s actually much easier in Win7 compared to Vista. With Vista, your monitor just goes dead. Win7 extends desktops by default, so you usually get a blank desktop where you can right-click to get to display properties. There the Identify button will tell you you’re now on monitor 5 (or whatever) and you can choose to make that the default desktop. Because I had seen this weird behavior in Win7, I was able to figure out what was happening when the same thing happened in Vista.