Windows 7 vs Windows XP cudaMalloc performance difference
Hello Everyone,

I have a quick question about the performance of the cudaMalloc operation on a Windows 7 machine versus an XP machine.
For an allocation of about 3.5 KB, it takes about 1.4 ms on a Win7 machine and less than 0.1 ms on an XP machine.
Is it a known fact that cudaMalloc is a lot slower on a Windows 7 machine than on an XP machine? If so, what is the reason, and is there any possible workaround?

Any input is greatly appreciated.

Thanks
--randal

#1
Posted 01/07/2011 06:47 PM   
[quote name='randal' date='07 January 2011 - 06:47 PM' timestamp='1294426059' post='1173096']
Hello Everyone,

I have a quick question about the performance of the cudaMalloc operation on a Windows 7 machine versus an XP machine.
For an allocation of about 3.5 KB, it takes about 1.4 ms on a Win7 machine and less than 0.1 ms on an XP machine.
Is it a known fact that cudaMalloc is a lot slower on a Windows 7 machine than on an XP machine? If so, what is the reason, and is there any possible workaround?

Any input is greatly appreciated.

Thanks
--randal
[/quote]

Anyone?

#2
Posted 01/10/2011 09:13 PM   
[quote name='randal' date='07 January 2011 - 02:47 PM' timestamp='1294426059' post='1173096']
Hello Everyone,

I have a quick question about the performance of the cudaMalloc operation on a Windows 7 machine versus an XP machine.
For an allocation of about 3.5 KB, it takes about 1.4 ms on a Win7 machine and less than 0.1 ms on an XP machine.
Is it a known fact that cudaMalloc is a lot slower on a Windows 7 machine than on an XP machine? If so, what is the reason, and is there any possible workaround?

[/quote]
I just got this from one of our users:

[quote]
We have seen some major slowdowns (2x on my code) since moving from Windows XP to Windows 7. I looked at the profile and there seem to be large amounts of idle time associated with calls to cudaThreadSynchronize(), cudaMalloc(), and cudaFree().
[/quote]
Any insight would be appreciated.

#3
Posted 01/19/2011 09:24 PM   
[quote name='Tom Milledge' date='19 January 2011 - 09:24 PM' timestamp='1295472299' post='1180050']
I just got this from one of our users:


Any insight would be appreciated.
[/quote]

Thanks, Tom! I have been facing exactly the same problems. cudaThreadSynchronize() is ridiculously slow, as is cudaMalloc().

Is there any workaround for this?

Thanks
--Randal

#4
Posted 01/20/2011 07:17 PM   
This is expected due to the overhead of having to interact with the Windows display driver scheduler on Vista/Win7. TCC mode doesn't have this performance impact.

#5
Posted 01/20/2011 09:00 PM   
Allocate one large block of device memory up front, then manage it as a heap from your CPU program. This avoids the per-allocation latency and synchronization overhead of repeated cudaMalloc() and cudaFree() calls.
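
To make the idea concrete, here is a rough host-side sketch of such a heap. It is only one possible design, not a tested production allocator: the class name `DevicePool`, the first-fit strategy, and the 256-byte alignment granularity are all assumptions made for illustration. The pool hands out byte offsets rather than pointers, so the bookkeeping itself never touches the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical host-side sub-allocator: hands out byte offsets into one big
// device buffer that is assumed to come from a single cudaMalloc() call.
class DevicePool {
public:
    static constexpr std::size_t kAlign = 256;  // assumed alignment granularity
    static constexpr std::size_t kFail = static_cast<std::size_t>(-1);

    explicit DevicePool(std::size_t capacity) {
        assert(capacity > 0);
        free_[0] = capacity;  // the whole buffer starts as one free block
    }

    // First-fit allocation; returns an offset into the pool, or kFail.
    std::size_t alloc(std::size_t bytes) {
        if (bytes == 0) return kFail;
        const std::size_t need = (bytes + kAlign - 1) / kAlign * kAlign;
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < need) continue;
            const std::size_t off = it->first;
            const std::size_t rest = it->second - need;
            free_.erase(it);
            if (rest > 0) free_[off + need] = rest;  // keep the unused tail
            used_[off] = need;
            return off;
        }
        return kFail;  // no free block is large enough
    }

    // Returns a block to the free list and merges it with its neighbours.
    void release(std::size_t off) {
        auto u = used_.find(off);
        if (u == used_.end()) return;  // unknown offset: ignore
        std::size_t size = u->second;
        used_.erase(u);
        auto next = free_.lower_bound(off);
        if (next != free_.end() && off + size == next->first) {
            size += next->second;       // merge with the following block
            next = free_.erase(next);
        }
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == off) {
                prev->second += size;   // merge with the preceding block
                return;
            }
        }
        free_[off] = size;
    }

    std::size_t free_blocks() const { return free_.size(); }

private:
    std::map<std::size_t, std::size_t> free_;  // offset -> size of free block
    std::map<std::size_t, std::size_t> used_;  // offset -> size of live block
};
```

In a real program the base pointer would come from one cudaMalloc() at start-up, and the device pointer for an allocation would be `static_cast<char*>(base) + offset`. The single cudaFree() happens at shutdown, so the slow driver calls are paid once instead of per allocation.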

#6
Posted 01/27/2011 06:32 PM   
[quote name='Oxydius' date='27 January 2011 - 06:32 PM' timestamp='1296153146' post='1184629']
Allocate a large block of device memory then manage it as a heap from your CPU program. This eliminates latency and synchronization overhead.
[/quote]

Hmm, yeah, I am in the process of implementing that.

I had another question regarding CUDA and the OS. Does the OS play any role once the kernel is launched, for example in thread scheduling or memory management? This may sound stupid, but the reason I am asking is that I have a kernel which takes about 700 ms on XP and 1.4 seconds on Win 7, and this is only the kernel execution time. I have a GTX 285 on both machines. This seems to be an issue only when different threads work on memory areas that are far apart.

Thanks
--Randal

#7
Posted 01/28/2011 09:09 PM   