More robust async streaming examples?

I am running benchmarks on a number of CUDA applications, and I haven’t been able to find any examples of asynchronous transfer that go beyond the trivial ones (e.g. the simpleStreams SDK example) and actually overlap computation with memory transfer.

If there are lots of examples that I’m missing, I’d love it if someone could point me towards a few of them. If there aren’t many examples, is this something that isn’t done often because of the difficulty in programming it? Thanks!

-Chris

p.s. Am I right in assuming that the CUBLAS library still does not utilize streams, as discussed here: http://forums.nvidia.com/index.php?act=ST&…=71&t=61179 ?

CUBLAS 3.1 has a streams interface. There are asynchronous versions of the set/get calls, and cublasSetKernelStream() has been added to set which stream kernels will be launched into.
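
For the more general overlap pattern asked about above, here is a minimal sketch using the plain CUDA runtime API rather than CUBLAS. The kernel `scale`, the helper `run_pipeline`, and the chunking scheme are all made up for illustration. The idea is to split a pinned host buffer into chunks and issue copy-in, kernel, and copy-out for each chunk in its own stream, so that on hardware with a copy engine the transfer of one chunk can overlap with the kernel of another. Whether any overlap actually happens depends on the device and should be checked with a profiler.

```cuda
#include <vector>
#include <cuda_runtime.h>

// On any CUDA error, bail out with -1 (a sketch: resources are not released on this path).
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Hypothetical kernel: multiplies each element of a chunk by `factor`, in place.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Processes n floats in nStreams chunks, one stream per chunk.
// Returns the number of wrong results (0 on success), or -1 on a CUDA error.
int run_pipeline(int n, int nStreams)
{
    float *h = nullptr, *d = nullptr;
    // Asynchronous copies need page-locked (pinned) host memory to be truly async.
    CHECK(cudaMallocHost((void **)&h, n * sizeof(float)));
    CHECK(cudaMalloc((void **)&d, n * sizeof(float)));
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    std::vector<cudaStream_t> streams(nStreams);
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    int chunk = (n + nStreams - 1) / nStreams;  // ceiling division
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        int count = (n - offset < chunk) ? (n - offset) : chunk;
        if (count <= 0) break;
        size_t bytes = count * sizeof(float);

        // Work within one stream runs in order; different streams may overlap.
        CHECK(cudaMemcpyAsync(d + offset, h + offset, bytes,
                              cudaMemcpyHostToDevice, streams[s]));
        scale<<<(count + 255) / 256, 256, 0, streams[s]>>>(d + offset, count, 2.0f);
        CHECK(cudaMemcpyAsync(h + offset, d + offset, bytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * (float)i) ++mismatches;

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d);
    cudaFreeHost(h);
    return mismatches;
}
```

Calling `run_pipeline(1 << 20, 4)` from `main` and comparing timings against `run_pipeline(1 << 20, 1)` gives a rough idea of how much the overlap buys on a given card.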


Ah – great. Thanks!
