Transferring chunks of one matrix to the GPU

Hello everyone,

I am new to CUDA programming (and also to this forum). Let me explain my problem.

Let’s assume that I have a program (CPU side) that generates a huge matrix, e.g. 6400x30. What I want to do is transfer chunks of that matrix (groups of 50x30) to the GPU for computation, so that each GPU thread runs a kernel on one 50x30 chunk. When the kernel finishes, all chunks should come back in one matrix so that the CPU can perform the post-computation.

NB: those groups of 50x30 are independent.

Thanks in advance.

  1. Allocate device array of desired size
  2. Copy slice of host array to device array
  3. Run kernel on device data
  4. Copy device data back to host
  5. Repeat until all sections of host data are processed
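
The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming a row-major float matrix that is processed in place; the kernel `process_chunk` (which just doubles each element) and the function `process_in_chunks` are hypothetical names standing in for the real computation.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical stand-in kernel: doubles every element of one chunk.
__global__ void process_chunk(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes a rows x cols row-major host matrix in place,
// chunk_rows rows at a time. Returns the first CUDA error seen.
cudaError_t process_in_chunks(float* h_matrix, int rows, int cols, int chunk_rows)
{
    // Step 1: allocate one device buffer big enough for a full chunk.
    float* d_chunk = nullptr;
    size_t max_bytes = (size_t)chunk_rows * cols * sizeof(float);
    cudaError_t err = cudaMalloc((void**)&d_chunk, max_bytes);
    if (err != cudaSuccess) return err;

    // Step 5: repeat until every slice of the host matrix is processed.
    for (int r0 = 0; r0 < rows; r0 += chunk_rows) {
        int this_rows = (rows - r0 < chunk_rows) ? (rows - r0) : chunk_rows;
        int n = this_rows * cols;
        size_t bytes = (size_t)n * sizeof(float);
        float* h_chunk = h_matrix + (size_t)r0 * cols;

        // Step 2: copy the slice of the host array to the device.
        err = cudaMemcpy(d_chunk, h_chunk, bytes, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) break;

        // Step 3: run the kernel on the device data.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        process_chunk<<<blocks, threads>>>(d_chunk, n);
        err = cudaGetLastError();
        if (err != cudaSuccess) break;

        // Step 4: copy the result back (this call waits for the kernel).
        err = cudaMemcpy(h_chunk, d_chunk, bytes, cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) break;
    }

    cudaFree(d_chunk);
    return err;
}
```

For the matrix in the question, the call would be `process_in_chunks(h_matrix, 6400, 30, 50)`, which makes 128 round trips, one per 50x30 chunk.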

You can easily use Thrust’s device_vector and host_vector for this task, and even Thrust routines such as thrust::copy to move data between the host and the device.

Oh wait, you said 2D arrays. That makes it a little bit trickier, but all storage is 1D anyway, so that shouldn’t be a problem.
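
To illustrate the point about 1D storage: if the matrix is stored row-major (an assumption here), element (r, c) sits at index r * cols + c, so a block of 50 consecutive rows is one contiguous range and can be copied in a single call. Below is a rough Thrust sketch under that assumption; the `Doubler` functor and the function `process_with_thrust` are hypothetical placeholders for the real per-chunk work.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <cstddef>

// Hypothetical per-element operation standing in for the real computation.
struct Doubler {
    __host__ __device__ float operator()(float x) const { return 2.0f * x; }
};

// h holds a rows x cols matrix in row-major order and is updated in place.
void process_with_thrust(thrust::host_vector<float>& h,
                         int rows, int cols, int chunk_rows)
{
    // One device buffer, sized for a full chunk and reused every iteration.
    thrust::device_vector<float> d((size_t)chunk_rows * cols);

    for (int r0 = 0; r0 < rows; r0 += chunk_rows) {
        int this_rows = (rows - r0 < chunk_rows) ? (rows - r0) : chunk_rows;
        size_t begin = (size_t)r0 * cols;        // index of element (r0, 0)
        size_t n = (size_t)this_rows * cols;     // elements in this chunk

        // Host -> device: rows [r0, r0 + this_rows) are contiguous.
        thrust::copy(h.begin() + begin, h.begin() + begin + n, d.begin());

        // Apply the operation to the chunk on the device.
        thrust::transform(d.begin(), d.begin() + n, d.begin(), Doubler());

        // Device -> host, back into the same position in the matrix.
        thrust::copy(d.begin(), d.begin() + n, h.begin() + begin);
    }
}
```

If the matrix were stored column-major instead, a 50x30 block would not be contiguous and would need a strided copy (for example `cudaMemcpy2D`) rather than a single range copy.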

John le Mutant,

Perhaps introduce streams and events into your highly eloquent program flow/pseudocode.
Take the analogy of a juggler: instead of throwing and catching each ball one at a time until all balls are thrown, the host can outdo itself by first throwing all the balls and then catching them one by one as they come down.
Streams and events would permit this.
The host can issue all sections up front in different streams, tag each section with an event recorded in its stream, and then start waiting for them to finish, ‘catching’ them one by one as they do.
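
A rough sketch of the juggler pattern, under several assumptions that are not from the thread: the matrix is row-major floats, the whole matrix fits in device memory at once, and the host buffer is pinned (allocated with `cudaMallocHost` or registered with `cudaHostRegister`), since pageable memory generally prevents the copies from overlapping. The kernel `scale_chunk` and the function `process_with_streams` are hypothetical names, and most error checks are omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <cstddef>

// Hypothetical stand-in kernel: doubles every element of one chunk.
__global__ void scale_chunk(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// h_matrix: rows x cols, row-major, ideally pinned host memory.
cudaError_t process_with_streams(float* h_matrix, int rows, int cols,
                                 int chunk_rows, int num_streams)
{
    const int num_chunks = (rows + chunk_rows - 1) / chunk_rows;

    float* d_matrix = nullptr;
    cudaError_t err = cudaMalloc((void**)&d_matrix,
                                 (size_t)rows * cols * sizeof(float));
    if (err != cudaSuccess) return err;

    std::vector<cudaStream_t> streams(num_streams);
    std::vector<cudaEvent_t> done(num_chunks);
    for (auto& s : streams) cudaStreamCreate(&s);
    for (auto& e : done) cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

    // "Throw all the balls": queue copy-in, kernel, copy-out for every
    // chunk, cycling through the streams, and tag each chunk with an event.
    for (int c = 0; c < num_chunks; ++c) {
        int r0 = c * chunk_rows;
        int this_rows = (rows - r0 < chunk_rows) ? (rows - r0) : chunk_rows;
        int n = this_rows * cols;
        size_t off = (size_t)r0 * cols;
        size_t bytes = (size_t)n * sizeof(float);
        cudaStream_t s = streams[c % num_streams];

        cudaMemcpyAsync(d_matrix + off, h_matrix + off, bytes,
                        cudaMemcpyHostToDevice, s);
        scale_chunk<<<(n + 255) / 256, 256, 0, s>>>(d_matrix + off, n);
        cudaMemcpyAsync(h_matrix + off, d_matrix + off, bytes,
                        cudaMemcpyDeviceToHost, s);
        cudaEventRecord(done[c], s);
    }

    // "Catch them one by one": wait on each chunk's event in issue order.
    for (int c = 0; c < num_chunks; ++c) {
        cudaError_t e = cudaEventSynchronize(done[c]);
        if (err == cudaSuccess) err = e;
        // Chunk c of h_matrix is now ready for host post-processing.
    }

    for (auto& e : done) cudaEventDestroy(e);
    for (auto& s : streams) cudaStreamDestroy(s);
    cudaFree(d_matrix);
    return err;
}
```

This version catches the chunks in the order they were issued; to catch them strictly in completion order, the host could poll the events with `cudaEventQuery` instead of blocking on each one in turn.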

MutantJohn, little_jimmy, thanks for your advice, guys. I will go down the streams path; I think it will suit my problem.

Thanks again.