Is there any way to copy data from device to host more efficiently in this case?

My code sample is below. The kernel seems to run very efficiently on the GPU, but copying the result back to host memory takes much longer. Also, whether I copy 1 double or 24 doubles, it takes around 88 seconds. Such a long time removes any advantage of computing on the GPU. Is there any way to improve this?

__global__ void RunSTH_OnGPU(double* X_d, otherarguments) {
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int Id_t = blockDim.x * bx + tx;
    if (Id_t < nset) {
        // do a lot of things via dynamic memory allocation
        X_d[Id_t] = something;
    }
}

GPUbegin = clock();
RunSTH_OnGPU<<<nblocks, nthreads>>>(X_d, otherarguments);
GPUend = clock();
timeSec = (float(GPUend) - float(GPUbegin)) / 1000.;  // takes 0.000 seconds

int nc = 24;
GPUbegin = clock();
error = cudaMemcpy(X_h, X_d, sizeof(double) * nc, cudaMemcpyDeviceToHost);
GPUend = clock();
timeSec = (float(GPUend) - float(GPUbegin)) / 1000.;  // takes 88 seconds

Is the timer saying that this cudaMemcpy portion took 88 seconds?!?!

What does nvprof say about the time spent in cudaMemcpy operations? NVVP also shows the individual calls along the timeline, giving you better visual feedback.
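For instance, assuming the application binary is called app (a placeholder name), running it under nvprof prints a per-call summary that splits time between kernels and memcpy operations:

nvprof ./app

That summary would show immediately whether the 88 seconds are really spent in cudaMemcpy or somewhere else.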

Check the considerations in this thread, it may be of interest:
https://devtalk.nvidia.com/default/topic/1019140/jetson-tx1/zero-copy-memory-vs-unified-memory-cuda-processing/

Also look for Njuffa’s zero-copy code around the forum (old thread, I have it at home, but I’m not at home); it provides good information.

Ok, I’ll bite.

Your problem is not the time spent in cudaMemcpy(), but the time it takes to execute your kernel.

CUDA kernel launches are asynchronous, so you are only measuring the time it takes to launch the kernel.
Your cudaMemcpy() appears slow because it has to wait for the kernel to finish first. No amount of optimisation of the copy operation will speed up your program; instead, you need to optimise the kernel you are launching to run faster.
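For example, here is a minimal sketch of the corrected measurement, reusing the variable names from the question (otherarguments remains a placeholder):

GPUbegin = clock();
RunSTH_OnGPU<<<nblocks, nthreads>>>(X_d, otherarguments);
cudaDeviceSynchronize();  // block the host until the kernel has actually finished
GPUend = clock();
timeSec = (float(GPUend) - float(GPUbegin)) / 1000.;  // now includes the kernel execution time

With the synchronisation in place the kernel time shows up here, and the subsequent cudaMemcpy() should take only as long as the actual transfer.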

The CUDA profiler is still the tool to turn to. Run nvvp, take a look at the timeline to see how time is spent in the kernel, not the memcpy, and then let it guide you through the necessary analysis.

Yes. I need to add cudaDeviceSynchronize() after the kernel call if I want to measure the time spent.
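For what it’s worth, CUDA events are a more robust way to time GPU work, since they record timestamps on the GPU’s own timeline; a minimal sketch, again reusing the kernel launch from the question:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
RunSTH_OnGPU<<<nblocks, nthreads>>>(X_d, otherarguments);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the stop event has been reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time between the two events, in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);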

Please study materials about cudaMemcpyAsync().
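A sketch of what that could look like: cudaMemcpyAsync() only runs asynchronously if the host buffer is page-locked (pinned); with a pageable buffer it falls back to a synchronous copy. The names below reuse those from the question, and nc and otherarguments are still placeholders:

double* X_h = nullptr;
cudaMallocHost(&X_h, sizeof(double) * nc);  // pinned host allocation, required for a truly asynchronous copy

cudaStream_t stream;
cudaStreamCreate(&stream);

RunSTH_OnGPU<<<nblocks, nthreads, 0, stream>>>(X_d, otherarguments);
cudaMemcpyAsync(X_h, X_d, sizeof(double) * nc, cudaMemcpyDeviceToHost, stream);
// the host thread is free to do other work here while the kernel and the copy proceed

cudaStreamSynchronize(stream);  // X_h is only valid once the stream has drained

cudaStreamDestroy(stream);
cudaFreeHost(X_h);

Note that the copy still has to wait for the kernel in the same stream, since it reads the kernel’s output; the benefit is that the host thread is not blocked in the meantime.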