Hint: Keep all CUBLAS functions in one thread

I spent some time working this one out. My main thread called cublasInit() and did all the memory allocation on the GPU. A second thread then did the actual copying to GPU memory and the CUBLAS matrix multiplies.

Kaboom! You can’t copy to GPU memory from one thread when the allocations were made in a different thread.

Doing the lot in one thread runs just fine. Hope this helps someone, as the error messages don’t identify the problem.

PS Love the speedups CUDA brings to matrix multiplies!

Yeah, it’s very confusing and non-intuitive that different CPU threads share the same CPU memory space but each have their own GPU memory space.