CUDA slower in Windows 7 than in Windows XP: same computer, two OSes, different run times
Hi.
I'm using GTX 295 (Gigabyte) for my work. Here are the specs of my PC:

MOBO: Gigabyte GA-MA790XT-UD4P
CPU: AMD Phenom II X4 965
Mem: OCZ 4GB DDR3 1066

I have a dual-boot system with Windows XP (SP2) and Windows 7 (Ultimate).
When I run my program on XP, each iteration takes ~350 ms (CUDA 2.2).

But when I tried it on Windows 7, each iteration took ~800 ms, and roughly every tenth iteration it jumped up to ~2000 ms.
I've tried updating CUDA to 2.3 and Visual Studio from 2005 to 2008 (Express). Nothing I do gets me the times I see on XP.
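For reference, this is roughly how I time each iteration (a minimal sketch using CUDA events; myKernel and its launch configuration are placeholders, not my actual code):

[code]
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel - stands in for the real per-iteration work.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 100; ++iter) {
        cudaEventRecord(start, 0);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
        printf("iteration %d: %.1f ms\n", iter, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
[/code]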

I installed the latest updates from Microsoft (even the one that came out on Nov 5th).

Is there any solution for Windows 7 users?

Thank you,
Gadi

#1
Posted 11/08/2009 05:27 PM   
You are not alone. I observed that my program runs about 50% slower on Windows 7.

#2
Posted 11/08/2009 06:10 PM   
Welcome to the fabulous world of WDDM launch overhead.

#3
Posted 11/08/2009 07:11 PM   
[quote name='tmurray' post='947581' date='Nov 8 2009, 08:11 PM']Welcome to the fabulous world of WDDM launch overhead.[/quote]

Does it really have to be WDDM? Why can't the CUDA-specific pieces of the driver use a different kernel-level interface?
You'd still get your WHQL certification if the graphics driver bits remain WDDM, right?

Christian

#4
Posted 11/08/2009 07:29 PM   
[quote name='tmurray' post='947581' date='Nov 8 2009, 12:11 PM']Welcome to the fabulous world of WDDM launch overhead.[/quote]

Tim, that overhead applies just to the CUDA init time, right? Not every kernel launch?
Toolkit 3.0 helps a lot with init-time overhead... does that help even more in W7?


I'm still on Linux/XP... but dreading W7 because of WDDM. And won't Nexus require WDDM and VS08? Argh!

#5
Posted 11/08/2009 08:47 PM   
[quote name='cbuchner1' post='947589' date='Nov 8 2009, 12:29 PM']Does it really have to be WDDM? Why can't the CUDA specific pieces of the driver use a different kernel level interface?
You'd still get your WHQL if the graphics driver bits remains WDDM, right?[/quote]

That's a good question. In fact, do Tesla cards need to use WDDM? There's no display hardware on those cards at all.

I like that idea: allow CUDA-only cards to be classified as non-video and therefore exempt from the WDDM abstraction.
I suspect this would make the drivers ugly, though.

I wouldn't even mind a hardware jumper or an alternate BIOS on the card that changes the board ID, so it looks like a different class of hardware to Windows when it's first queried.

#6
Posted 11/08/2009 08:54 PM   
[quote name='SPWorley' post='947617' date='Nov 8 2009, 01:47 PM']Tim, that overhead applies just to the CUDA init time, right? Not every kernel launch?[/quote]

No, the overhead is for every kernel launch (same in Windows Vista). I believe it also depends on how many memory allocations you're using. We're working with Microsoft to improve this.
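To illustrate the allocation point: a single large cudaMalloc up front that you carve up yourself leaves WDDM only one allocation to track, instead of one per buffer. A rough sketch (hypothetical helper names, nothing from the toolkit):

[code]
#include <cuda_runtime.h>
#include <cstddef>

// One big device allocation, handed out in aligned slices.
struct DevicePool {
    char  *base;
    size_t used;
    size_t capacity;
};

bool poolInit(DevicePool &p, size_t bytes)
{
    p.used = 0;
    p.capacity = bytes;
    return cudaMalloc((void **)&p.base, bytes) == cudaSuccess;
}

void *poolAlloc(DevicePool &p, size_t bytes)
{
    // Keep sub-allocations 256-byte aligned, which is safe for all CUDA types.
    size_t aligned = (bytes + 255) & ~(size_t)255;
    if (p.used + aligned > p.capacity) return 0;  // out of pool space
    void *ptr = p.base + p.used;
    p.used += aligned;
    return ptr;
}

void poolDestroy(DevicePool &p)
{
    cudaFree(p.base);  // one free releases every sub-allocation
    p.base = 0;
    p.used = p.capacity = 0;
}
[/code]

Each buffer in the app then comes from poolAlloc() instead of its own cudaMalloc(), and only poolInit()/poolDestroy() ever touch the driver's allocator.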

#7
Posted 11/08/2009 09:11 PM   
[quote name='Simon Green' post='947628' date='Nov 8 2009, 01:11 PM']No, the overhead is for every kernel launch (same in Windows Vista). I believe it also depends on how many memory allocations you're using. We're working with Microsoft to improve this.[/quote]
That's actually dramatically better in Win7 versus Vista--I measured it recently and the per-allocation hit seems to be roughly 100x faster (so it's negligible now). The flat-rate overhead is about the same, though.

Why can't we use some other interface alongside WDDM:

WDDM is a lot more than just a rendering interface. It manages all the memory on the device so it can page it in and out as necessary, which is a good thing for display cards. However, CUDA gets zero benefit from it, because we have pointers: you can't really page memory underneath a CUDA app. And because WDDM is the memory manager, we can't just go around it for CUDA--it will assume it owns the card completely, start moving memory, and whoops, your CUDA app just exploded. So no, there's not really some magic workaround for cards that can also be used as a display.

[quote]That's a good question. In fact do Tesla cards need to use WDDM? There's not even any display hardware on the cards.

I like that idea, to allow CUDA-only cards to be classified as non-video and therefore exempt from the WDDM abstraction.[/quote]
I like the way you think. Wouldn't that also mean Remote Desktop just works with CUDA, then? And maybe no TDR timeouts that you can only disable with a system-wide registry key!

edit: also, just so I don't sound like I'm preaching the end of everything, this varies a lot based on your usage pattern. We batch kernel launches to try to amortize as much of the WDDM overhead as possible. The problem comes in when you can't really batch things--you do a kernel, wait for its result, and then conditionally do something else. At that point, no batching, significant launch overhead penalties (especially if you have a short kernel), and poor performance compared to XP/Linux.

So, uh, don't write your apps that way if you can avoid it...
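For illustration, here's roughly what the bad pattern versus the batchable pattern looks like (hypothetical kernels, not code from any real app):

[code]
#include <cuda_runtime.h>

// Hypothetical kernels, just to show the two launch patterns.
__global__ void stepKernel(float *data, int *flag) { /* ... work ... */ }
__global__ void fixupKernel(float *data)           { /* ... work ... */ }

// The pattern that hurts under WDDM: launch, read the result back (which
// synchronizes), then branch on it.  Every short launch pays the full
// per-launch overhead and nothing can be batched.
void launchWaitBranch(float *d_data, int *d_flag, dim3 grid, dim3 block)
{
    int h_flag = 0;
    stepKernel<<<grid, block>>>(d_data, d_flag);
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    if (h_flag)
        fixupKernel<<<grid, block>>>(d_data);
}

// The friendlier pattern: queue a batch of launches back to back and only
// synchronize once at the end, so the driver can amortize the overhead.
void launchBatched(float *d_data, int *d_flag, dim3 grid, dim3 block, int steps)
{
    for (int i = 0; i < steps; ++i)
        stepKernel<<<grid, block>>>(d_data, d_flag);
    cudaThreadSynchronize();
}
[/code]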

#8
Posted 11/08/2009 09:38 PM   
Thanks for the technical insight, guys. Appreciate it.

#9
Posted 11/08/2009 11:34 PM   
Many thanks for the explanation :)
I guess there is also a reason why you can't tell WDDM to allocate (nearly) all of the GPU memory to the CUDA application and then manage it internally, without the useless overhead?

#10
Posted 11/09/2009 01:58 AM   
It seems the MS folks need to add "non-paged" memory for GPUs: tell the OS not to mess with that chunk of memory, and then not to check anything when a kernel or shader is launched.

#11
Posted 11/09/2009 02:48 PM   
Oh, and my app is about 50% slower on Vista and "only" 25% slower on 7. Weeeeeee.

#12
Posted 11/09/2009 02:50 PM   
So if I understand correctly, neither Windows Vista nor Windows 7 will give me all of the RAM on the Tesla, nor all of the speedup the Tesla is capable of! (So I first pay for the awesome hardware, and then the OS makes it suck!) Further, if I need to use Nexus, I HAVE to use either Vista or 7.
Nice going. :yucky:

Does CUDA 3.0 help in this matter? Is the next version of WDDM going to address this?

I understand that NVIDIA is not the one pulling the strings here, but some indication of whether this issue will be resolved soon would help developers decide whether to move to these OSes or take a different path. Any pointers would be appreciated!

#13
Posted 11/09/2009 06:36 PM   
Man, it would be really nice if we wrote a driver that worked with Remote Desktop, didn't have these launch-overhead problems, and had no timeout, wouldn't it? Well, I beat you to it.

(I wouldn't have moved to software if I couldn't actually solve problems, guys :) )

edit: that's a screenshot from my Mac connected to my dev machine connected to my test machine, just in case you were skeptical. Xzibit would be proud.

#14
Posted 11/09/2009 07:21 PM   
Don't you know it's not nice to tease?!

[quote name='tmurray' post='948119' date='Nov 9 2009, 09:21 PM']man, it would be really nice if we wrote a driver that worked with Remote Desktop and didn't have these launch overhead problems and no timeout because that would be great, wouldn't it? well, I beat you to it.

(I wouldn't have moved to software if I couldn't actually solve problems, guys :) )

edit: that's a screenshot from my Mac connected to my dev machine connected to my test machine, just in case you were skeptical. Xzibit would be proud.[/quote]

#15
Posted 11/09/2009 08:36 PM   