@txbob: Thanks for reporting it, it seems Nvidia fixed it!
Just to add some information for googlers from the future: We apparently got trapped by the same bug in CUBLAS (as documented here: https://github.com/Theano/Theano/pull/2033#issuecomment-53473673). A particular cublasSgemm call always failed with CUBLAS_STATUS_EXECUTION_FAILED (when run with CUDA_LAUNCH_BLOCKING=1), although all arguments seemed correct. Investigating the matter, we found another cublasSgemm call that gives the correct result but performs invalid reads when run under cuda-memcheck, just as AlRich observed.
In our case, the call was:
cublasSgemm(handle=0x75f1bc0, transa='N', transb='T',
m=4096, n=512, k=5,
alpha=1.0f,
A=0xb00700000, lda=4096,
B=0xb00504800, ldb=512,
beta=0.0f,
C=0xb00ac0000, ldc=4096)
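
In case it is useful, here is roughly what that call looks like as a standalone program with the same shapes and leading dimensions (just a sketch, not our actual code: the buffers are zero-initialized dummies and error checking is omitted):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    /* Same shapes as the failing call: C (m x n) = A (m x k) * B^T, with B stored as n x k. */
    const int m = 4096, n = 512, k = 5;
    float *A, *B, *C;
    cudaMalloc((void**)&A, sizeof(float) * (size_t)m * k);   /* lda = m = 4096 */
    cudaMalloc((void**)&B, sizeof(float) * (size_t)n * k);   /* ldb = n = 512  */
    cudaMalloc((void**)&C, sizeof(float) * (size_t)m * n);   /* ldc = m = 4096 */
    cudaMemset(A, 0, sizeof(float) * (size_t)m * k);
    cudaMemset(B, 0, sizeof(float) * (size_t)n * k);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T,
                                        m, n, k,
                                        &alpha, A, m, B, n,
                                        &beta, C, m);
    cudaDeviceSynchronize();
    printf("cublasSgemm returned status %d\n", (int)status);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

(Compile with something like "nvcc repro.cu -lcublas -o sgemm_repro".)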
And the first error message from cuda-memcheck reads (to help anybody googling for it):
========= Invalid __global__ read of size 4
========= at 0x000000e0 in sgemm_sm_heavy_nt_ldg
========= by thread (255,0,0) in block (0,0,0)
========= Address 0xb0050807c is out of bounds
Doing the math, this access is clearly out of range of the 512x5 matrix B.
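For anybody who wants to check the numbers, with the pointer and sizes from the call above:

size of B:  512 * 5 * sizeof(float) = 10240 bytes = 0x2800
end of B:   0xb00504800 + 0x2800 = 0xb00507000
fault:      0xb0050807c, i.e. 0x107c bytes (about 4 KiB) past the end of B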
With some more testing, we found that this only occurs on 700-series GPUs (including Tesla and TITAN) with Compute Capability 3.5. With nvprof we could see that this is because those GPUs use sgemm_sm35_ldg_nt_128x16x64x16x16 and sgemm_sm_heavy_nt_ldg for this specific dot product, while other GPUs use gemm_kernel1x1_core (to throw in two more keywords for search engines).
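If you want to check your own setup, running the reproducer sketched above (the hypothetical sgemm_repro binary) under both tools should show which kernel cuBLAS dispatches and whether it reads out of bounds:

nvprof ./sgemm_repro          # kernel summary shows which sgemm kernel was used
cuda-memcheck ./sgemm_repro   # reports the invalid __global__ reads, if any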
Testing further, we found that the bug occurs in CUDA 5.0, 5.5, and 6.0, independently of the driver version. It does not occur in CUDA 4.2 (before Compute Capability 3.5 was introduced), and it seems to have been fixed in the recently released CUDA 6.5.