CUDA slower on Windows 7 than on Windows XP (same computer, two OSs, different run times)

Hi.
I’m using GTX 295 (Gigabyte) for my work. Here are the specs of my PC:

MOBO: Gigabyte GA-MA790XT-UD4P
CPU: AMD Phenom II X4 965
Mem: OCZ 4GB DDR3 1066

I have a dual-boot system consisting of Windows XP (SP2) and Windows 7 (Ultimate).
When I run my program on XP, each iteration takes ~350 ms (CUDA 2.2).

But when I tried it on Windows 7, each iteration took ~800 ms, and every 10 (or so) iterations it jumped up to ~2000 ms.
I’ve tried updating CUDA to 2.3 and Visual Studio from 2005 to 2008 (Express). Nothing I do gets me the result I get in XP.
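
In case it matters, this is roughly how each iteration is timed: host wall-clock time around a kernel launch plus a synchronize (a simplified sketch with a placeholder kernel and sizes, not my actual code).

```
// Simplified timing harness (placeholder kernel; CUDA 2.x-era API).
#include <cstdio>
#include <cuda_runtime.h>
#include <windows.h>   // QueryPerformanceCounter for wall-clock timing

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    for (int iter = 0; iter < 100; ++iter) {
        QueryPerformanceCounter(&t0);

        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaThreadSynchronize();   // wait for the GPU (CUDA 2.x API)

        QueryPerformanceCounter(&t1);
        double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
        printf("iteration %d: %.1f ms\n", iter, ms);
    }

    cudaFree(d_data);
    return 0;
}
```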

I installed the latest updates from Microsoft (even the one that came out on Nov 5th).

Is there any solution for Windows 7 users?

Thank you,
Gadi

You are not alone. I observed that my program runs about 50% slower on Windows 7.

Welcome to the fabulous world of WDDM launch overhead.

Does it really have to be WDDM? Why can’t the CUDA specific pieces of the driver use a different kernel level interface?

You’d still get your WHQL if the graphics driver bits remain WDDM, right?

Christian

Tim, that overhead applies just to the CUDA init time, right? Not every kernel launch?

Toolkit 3.0 helps a lot with init time overhead… does that help even more in W7?

I’m still based in Linux/XP … but dreading W7 because of WDDM. But won’t Nexus require WDDM and VS08? Argh!

That’s a good question. In fact do Tesla cards need to use WDDM? There’s not even any display hardware on the cards.

I like that idea, to allow CUDA-only cards to be classified as non-video and therefore exempt from the WDDM abstraction.

I suspect this would make drivers ugly, though.

I wouldn’t even mind if there were a hardware jumper or a different BIOS on the card, so that even the board ID could change and the card would look like a different class of hardware to Windows when initially queried.

No, the overhead is for every kernel launch (same in Windows Vista). I believe it also depends on how many memory allocations you’re using. We’re working with Microsoft to improve this.

That’s actually dramatically better in Win7 versus Vista; I measured it recently and the per-allocation hit seems to be about 100x smaller (so it’s negligible now). The flat-rate overhead is about the same, though.
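
If you want to see the flat per-launch cost on your own machine, something like this gives a rough number (a sketch only, not our internal benchmark; the empty kernel and the allocation count are arbitrary):

```
// Rough microbenchmark for per-launch overhead. The kernel does no work, so
// nearly all of the measured time is driver/OS launch overhead.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    // A pile of small allocations, to see whether the allocation count
    // affects the per-launch cost on your OS.
    const int numAllocs = 256;
    float *bufs[numAllocs];
    for (int i = 0; i < numAllocs; ++i)
        cudaMalloc(&bufs[i], 4096);

    const int launches = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();
        cudaThreadSynchronize();   // defeat batching so each launch pays full price
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per-launch overhead: %.3f ms\n", ms / launches);

    for (int i = 0; i < numAllocs; ++i)
        cudaFree(bufs[i]);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Run it on XP and on Win7 on the same box to compare.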

As for why we can’t use some other interface alongside WDDM:

WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers! As a result, you can’t really do paging in a CUDA app, so you get zero benefit from WDDM. However, because it’s the memory manager, we can’t just go around it for CUDA because WDDM will assume it owns the card completely, start moving memory, and whoops your CUDA app just exploded. So no, there’s not really some magic workaround for cards that can also be used as display.

I like the way you think. Wouldn’t that also mean Remote Desktop just works with CUDA, then? And maybe no TDR timeouts that you can only disable with a system-wide registry key!

edit: also, just so I don’t sound like I’m preaching the end of everything, this varies a lot based on your usage pattern. We batch kernel launches to try to amortize as much of the WDDM overhead as possible. The problem comes in when you can’t really batch things–you do a kernel, wait for its result, and then conditionally do something else. At that point, no batching, significant launch overhead penalties (especially if you have a short kernel), and poor performance compared to XP/Linux.

So, uh, don’t write your apps that way if you can avoid it…
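
To make the two patterns concrete, here is a minimal sketch (placeholder kernels and sizes, not anyone’s real application). The commented-out variant reads a flag back after every step, which synchronizes and kills batching; the variant that keeps the conditional on the device lets the driver queue the whole sequence and synchronize once at the end.

```
#include <cuda_runtime.h>

__global__ void stepKernel(float *state, int *flag, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        state[i] += 1.0f;
        if (state[i] > 100.0f) *flag = 1;   // signal that a fixup is needed
    }
}

__global__ void fixupIfNeededKernel(float *state, const int *flag, int n)
{
    if (*flag == 0) return;                 // the decision stays on the GPU
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] = 0.0f;
}

int main()
{
    const int n = 1 << 20, steps = 1000;
    float *d_state; int *d_flag;
    cudaMalloc(&d_state, n * sizeof(float));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_state, 0, n * sizeof(float));
    cudaMemset(d_flag, 0, sizeof(int));
    dim3 block(256), grid((n + 255) / 256);

    // WDDM-unfriendly pattern: read the flag back after every step, so every
    // launch is followed by a synchronizing copy and nothing can be batched.
    //   stepKernel<<<grid, block>>>(d_state, d_flag, n);
    //   cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    //   if (h_flag) { /* conditionally launch a fixup kernel */ }

    // Batching-friendly pattern: keep the conditional on the device, queue the
    // whole sequence, and synchronize once at the end.
    for (int i = 0; i < steps; ++i) {
        stepKernel<<<grid, block>>>(d_state, d_flag, n);
        fixupIfNeededKernel<<<grid, block>>>(d_state, d_flag, n);
    }
    cudaThreadSynchronize();

    cudaFree(d_state);
    cudaFree(d_flag);
    return 0;
}
```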

Thanks for the technical insight, guys. Appreciate it.

Many thanks for the explanation :)
I guess there is also a reason why you cannot tell WDDM to allocate (nearly) all GPU memory to the CUDA application and then manage it internally, without the extra overhead?

Seems the MS folks need to add “non-paged” memory for GPUs: tell the OS not to mess with that chunk of memory, and then not check anything when a kernel or shader is launched.

Oh, and my app is about 50% slower on Vista and “only” 25% slower on 7. Weeeeeee.

So if I understand correctly, neither Windows Vista nor Windows 7 will give me the entire RAM on the Tesla, nor all the speedup that Tesla could deliver! (So I first pay for the awesome hardware and then pay for the OS to make it suck!) Further, if I need to use Nexus, I HAVE to use either Vista or 7.
Nice going.

Does CUDA 3.0 help in this matter? Is the next version of WDDM going to address this?

I understand that NVIDIA is not the one pulling the strings here, but some indication of whether this issue will be resolved soon would help developers decide whether to shift to these OSes or take a different path. Any pointers would be appreciated!

Man, it would be really nice if we wrote a driver that worked with Remote Desktop, didn’t have these launch overhead problems, and had no timeout, because that would be great, wouldn’t it? Well, I beat you to it.

(I wouldn’t have moved to software if I couldn’t actually solve problems, guys :) )

edit: that’s a screenshot from my Mac connected to my dev machine connected to my test machine, just in case you were skeptical. Xzibit would be proud.

Don’t you know it’s not nice to tease?!

Patience is a virtue… :) (would I be talking about it if it were six months out?)

Hi tmurray,

I didn’t understand the reason you gave for why CUDA cannot make use of the “virtual memory” feature of WDDM (at least on Windows systems).

You mentioned something about:

WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, we get zero benefit from it in CUDA, because we have pointers!

Can you explain this in more detail?

Under WDDM, you can have more GPU memory allocated than can fit in its physical memory so long as the working set of a given rendering call is not greater than physical memory. Pretty straightforward–it tracks what resources a rendering call will use, pages in and out as necessary, no problem. This is a good thing for display cards, especially when the UI is 3D accelerated.

However, in CUDA, you can use pointers in device code, which means it’s completely impossible for us to tell what memory you’re actually using since the data structure you pass may include pointers to 5000 other regions in various places that have not been referenced since they were allocated. As a result, all of the memory allocated by that CUDA context must be present on the GPU whenever you run a kernel since the driver can’t tell what memory you plan on using.
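
As an illustration (a hypothetical data structure, nothing from a real app), consider a device-side linked list: the only thing the driver sees at launch time is the head pointer and an output pointer, while every node points at further allocations that only get discovered by chasing pointers at run time. So the driver has no choice but to keep everything the context allocated resident.

```
// Why the driver can't compute a working set: pointer chasing in device code.
#include <cuda_runtime.h>

struct Node {
    float *payload;   // device pointer to a separate allocation
    Node  *next;      // device pointer to another node
};

__global__ void walkList(Node *head, float *out)
{
    // Which allocations get touched depends entirely on the data.
    float sum = 0.0f;
    for (Node *n = head; n != 0; n = n->next)
        sum += n->payload[0];
    *out = sum;
}

int main()
{
    const int count = 4;
    Node  h_nodes[count];
    Node *d_nodes;
    cudaMalloc(&d_nodes, count * sizeof(Node));

    // Each node gets its own payload allocation; the links are set up so the
    // list lives entirely in device memory.
    for (int i = 0; i < count; ++i) {
        cudaMalloc(&h_nodes[i].payload, 256 * sizeof(float));
        cudaMemset(h_nodes[i].payload, 0, 256 * sizeof(float));
        h_nodes[i].next = (i + 1 < count) ? d_nodes + i + 1 : 0;
    }
    cudaMemcpy(d_nodes, h_nodes, count * sizeof(Node), cudaMemcpyHostToDevice);

    float *d_out;
    cudaMalloc(&d_out, sizeof(float));

    // The launch only exposes d_nodes and d_out to the driver; the payload
    // allocations are reachable only through pointers stored in device memory.
    walkList<<<1, 1>>>(d_nodes, d_out);
    cudaThreadSynchronize();
    return 0;
}
```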

Hi guys.
I didn’t really get an answer to my question.

Is there some solution for making CUDA run as fast on Windows 7 as on Windows XP?

(BTW, has anyone compared running CUDA on Linux vs. on XP?)

Thanks.

There should be a way to expose Tesla as a non-graphics card… That might help…