Hi,
I am using CUDA streams to launch kernels in parallel (CUDA version 7.5).
My code looks like this:
//create stream array
//loop
//launch kernel one with stream[0]
//launch kernel two with stream[1]
…
//cudaDeviceSynchronize();
//end loop
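For reference, the pattern above might be written out roughly as follows. This is only a minimal sketch: the kernel names `kernelOne` and `kernelTwo`, their bodies, the buffers, and the iteration count are hypothetical placeholders rather than the actual code, and the grid and block sizes are simply the ones that appear in the profiler output in this post.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernels; the real kernel bodies are not shown in this post.
__global__ void kernelOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void kernelTwo(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int kNumStreams = 2;
    const int kBlock = 128;      // threads per block, as in the profile
    const int kGridOne = 37;     // blocks for kernel one, as in the profile
    const int kGridTwo = 72;     // blocks for kernel two, as in the profile
    const int kIterations = 10;  // assumed loop count, not from the post
    const int nOne = kGridOne * kBlock;
    const int nTwo = kGridTwo * kBlock;

    // Separate buffers so the two kernels have no data dependency on each other.
    float *dOne = nullptr, *dTwo = nullptr;
    cudaMalloc(&dOne, nOne * sizeof(float));
    cudaMalloc(&dTwo, nTwo * sizeof(float));
    cudaMemset(dOne, 0, nOne * sizeof(float));
    cudaMemset(dTwo, 0, nTwo * sizeof(float));

    // Create the stream array.
    cudaStream_t stream[kNumStreams];
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&stream[s]);
    }

    for (int it = 0; it < kIterations; ++it) {
        // Each kernel goes to its own non-default stream.
        kernelOne<<<kGridOne, kBlock, 0, stream[0]>>>(dOne, nOne);
        kernelTwo<<<kGridTwo, kBlock, 0, stream[1]>>>(dTwo, nTwo);
        // Wait for all streams before the next iteration.
        cudaDeviceSynchronize();
    }

    // Report any launch or execution error from the loop above.
    cudaError_t err = cudaGetLastError();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamDestroy(stream[s]);
    }
    cudaFree(dOne);
    cudaFree(dTwo);
    return err == cudaSuccess ? 0 : 1;
}
```

The fourth launch parameter is what selects the stream; a launch that omits it goes to the default stream instead.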
nvprof result:
launch kernel 1: 90.76 duration: .02 gpu occupancy: 12% (37 grid, 128 block)
launch kernel 2: 90.78 duration: .02 gpu occupancy: 27% (72 grid, 128 block)
So the kernels do not run concurrently, even though nvprof shows them in different streams. I assume I have enough resources to run multiple kernels in parallel, but I don't see any performance improvement either. I am not sure what might cause this behavior.
Thank you in advance.