I am running benchmarks on a number of CUDA applications, and I haven’t been able to find any non-trivial examples (i.e. anything beyond the simpleStreams SDK example) of asynchronous transfers that overlap computation with memory transfer.
If there are lots of examples that I’m missing, I’d love it if someone could point me towards a few of them. If there aren’t many examples, is this something that isn’t done often because of the difficulty in programming it? Thanks!
-Chris
P.S. Am I right in assuming that the CUBLAS library still does not use streams, as discussed here: http://forums.nvidia.com/index.php?act=ST&…=71&t=61179 ?