Sorry, I passed incorrect command-line flags to my test app (which I hadn't used in years), so I did not measure what I thought I was measuring in #12. With the correct settings, I measure the following on a K20c:
1 x cudaMemcpyAsync()
h2d: bytes= 1572864  time= 270.84 usec  rate= 5807.28 MB/sec
2 x cudaMemcpy2DAsync(), destination stride of two floats
h2d: bytes= 1572864  time= 26398.90 usec  rate= 59.58 MB/sec
3 x cudaMemcpy2DAsync(), destination stride of three floats
h2d: bytes= 1572864  time= 26395.80 usec  rate= 59.59 MB/sec
Based on this, the best strategy appears to be performing straight copies of the three arrays to the device, followed by a kernel that rearranges the data from SOA into AOS format (a sketch of such a kernel follows the copy code below). I do not understand the low performance of cudaMemcpy2DAsync() at this time.
The code for the three copy variants listed above is:
#if 0
    /* variant 1: single contiguous copy of the whole buffer */
    cudaStat = cudaMemcpyAsync (devBuf, hostBuf, curBufSize, cudaMemcpyHostToDevice, s0);
#elif 0
    /* variant 2: two strided copies; dst pitch = 8 bytes (two floats),
       src pitch = width = 4 bytes (one float), curBufSize/8 rows each;
       devBuf and hostBuf are byte pointers, curBufSize is in bytes */
    cudaStat = cudaMemcpy2DAsync (devBuf+0, 8, hostBuf + 0,            4, 4, curBufSize/8,
                                  cudaMemcpyHostToDevice, s0);
    cudaStat = cudaMemcpy2DAsync (devBuf+4, 8, hostBuf + curBufSize/2, 4, 4, curBufSize/8,
                                  cudaMemcpyHostToDevice, s0);
#else
    /* variant 3: three strided copies; dst pitch = 12 bytes (three floats),
       src pitch = width = 4 bytes (one float), curBufSize/12 rows each */
    cudaStat = cudaMemcpy2DAsync (devBuf+0, 12, hostBuf + 0,              4, 4, curBufSize/12,
                                  cudaMemcpyHostToDevice, s0);
    cudaStat = cudaMemcpy2DAsync (devBuf+4, 12, hostBuf + curBufSize/3,   4, 4, curBufSize/12,
                                  cudaMemcpyHostToDevice, s0);
    cudaStat = cudaMemcpy2DAsync (devBuf+8, 12, hostBuf + 2*curBufSize/3, 4, 4, curBufSize/12,
                                  cudaMemcpyHostToDevice, s0);
#endif
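
For the copy-then-rearrange strategy, the device-side kernel could look roughly like the sketch below. This is only an illustration, not code from my app: the names soaToAos, devSoA, devAos, and n are placeholders, with n being the number of floats per source array, devSoA the buffer holding the three arrays back to back after the straight copies, and devAos the interleaved destination.

// Sketch of an SOA-to-AOS rearrangement kernel. devSoA holds the three
// arrays back to back (x[0..n-1], y[0..n-1], z[0..n-1]); devAos receives
// the interleaved triples {x[i], y[i], z[i]}.
__global__ void soaToAos (const float *devSoA, float *devAos, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        devAos[3*i+0] = devSoA[0*n+i];
        devAos[3*i+1] = devSoA[1*n+i];
        devAos[3*i+2] = devSoA[2*n+i];
    }
}

// example launch, e.g. with n = curBufSize / (3 * sizeof(float)):
// soaToAos<<<(n + 255) / 256, 256, 0, s0>>>((const float*)devBuf, devAos, n);

Each of the three loads per thread is contiguous across the warp, and the interleaved stores cover a dense range of addresses, so this kernel should run close to device memory bandwidth.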