memcpyDtoH speed is much slower than memcpyHtoD using GeForce 8400M GS on Vista

I have CUDA code that I run on several platforms (Vista, XP, and Linux, each with a different GPU).

I have a loop that iterates 10 times, and in each iteration I transfer the same amount of data back and forth between CPU and GPU (25 MB from CPU → GPU and 25 MB from GPU → CPU). Since the amount of data is the same in both directions, I expect the host->device and device->host copy times to be roughly the same.

Host memory is pinned (always using cudaMallocHost).

The Vista platform has a GeForce 8400M GS, the XP platform has a GeForce 8800 GTX, and the Linux platform has a GeForce GTX 280.

I noticed that, for all iterations, the CUDA Compute Profiler reports almost the same timings for memcpyHtoD and memcpyDtoH on the XP and Linux platforms, as expected.

But on the Vista platform the profiler reports unstable results for memcpyDtoH: memcpyDtoH (downloading data from the GPU) is always slower than memcpyHtoD (uploading data to the GPU), and the slowdown factor varies between 2x and 5x from one iteration to the next. The memcpyHtoD results, on the other hand, are stable: the profiler reports almost the same speed for each iteration.

To double-check, I used CUDA events to time these operations in my code and got the same results as above: no problems on XP and Linux, and unstable memcpyDtoH results on Vista.
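
For reference, the event-based timing looks roughly like the sketch below. This is a minimal, stripped-down illustration rather than my exact code: the helper names (timedCopy, mbPerSec) and buffer names are placeholders, error checking is mostly omitted, and the buffer contents are left uninitialized since only the transfer time matters here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder helper: convert a byte count and elapsed milliseconds to MB/s.
static float mbPerSec(size_t bytes, float ms) {
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

// Time a single blocking cudaMemcpy with CUDA events; returns milliseconds.
static float timedCopy(void *dst, const void *src, size_t bytes,
                       cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 25u * 1024u * 1024u;  // 25 MB per direction
    void *h_buf = NULL;
    void *d_buf = NULL;

    // Pinned host memory, as in the original setup.
    if (cudaMallocHost(&h_buf, bytes) != cudaSuccess ||
        cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (int i = 0; i < 10; ++i) {
        float up   = timedCopy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        float down = timedCopy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
        printf("iter %d: HtoD %.2f ms (%.0f MB/s), DtoH %.2f ms (%.0f MB/s)\n",
               i, up, mbPerSec(bytes, up), down, mbPerSec(bytes, down));
    }

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

On the XP and Linux machines the two columns printed by a loop like this are nearly identical in every iteration; on the Vista machine the DtoH column is the one that fluctuates.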

Do you think this problem stems from the different GPU or from the operating system? What can I do to fix it?

Thanks!