I have an app which runs fine and can be profiled, so long as “Kernel Launches and Memory Operations” is not checked. When it is, the app crashes after working ok for a few seconds (including kernel launches), with the following:
Problem Event Name: APPCRASH
Application Name: rmsi.exe
Application Version: 0.0.0.0
Application Timestamp: 54aea308
Fault Module Name: Nvda.Cuda.Injection.dll
Fault Module Version: 4.2.0.14314
Fault Module Timestamp: 54616179
Exception Code: c0000005
Exception Offset: 000000000027a1ab
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 6153
Additional Information 1: a282
Additional Information 2: a282d7fcfaec81da43b5940b95b62440
Additional Information 3: a181
Additional Information 4: a181d847302d43c85dd6d726f2c9cfec
and I get no kernel launch info (though all other info is available)
If I attempt to debug, I get the following exception at a call to cudaMemcpy3D (inside a loop which by then has already run successfully a number of times):
Unhandled exception at 0x000007FECD6FA1AB (Nvda.Cuda.Injection.dll) in rmsi.exe: 0xC0000005: Access violation writing location 0x0000000000000020.
I guess by “profiling” you mean Trace Application rather than Profile CUDA Application in Nsight VSE, right? I have tried the layered texture sample and the cudaMemcpy3D sample on my side, but could not reproduce the crash. Would you mind providing more information about your GPU and NVIDIA driver version? If possible, attaching your exe file would help us reproduce and investigate the problem much faster.
Thanks for looking into it. Yes, I do mean “Trace Application”, sorry about that. Profile CUDA Application does not crash. I also tried starting collection manually: the app ran fine until I started collection, then crashed a few seconds after I did.
Here are the details of three machines where I can reproduce the problem:
Desktop:
Windows 7 64-bit
VS 2012 11.0.61030.00 Update 4
CUDA 6.5
Nsight 4.1.0.14204
GeForce GTX 780 Ti (GK110B)
driver 344.11
Laptop 1:
Windows 7 64-bit
VS 2012 11.0.61030.00 Update 4
CUDA 6.0
Nsight 4.2.0.14314
GeForce GT 730M (GK208)
driver 347.09
Laptop 2:
Windows 7 64-bit
VS 2012 11.0.61030.00 Update 4
Nsight 4.1.0.14204
CUDA 6.5
Quadro K2100M (GK106)
driver 340.62
Unfortunately, I can’t share the exe file. We’re trying to build a minimal example, but it’s difficult to isolate the problem.
We managed to create a minimal example out of the simpleLayeredTexture sample, attached (replace the existing simpleLayeredTexture.cu).
Basically, we call cudaMemcpy3D in a loop, apparently too fast for Nsight. Adding a call to cudaDeviceSynchronize on every loop iteration fixes the crash.
for (unsigned int iter = 0; iter < 1000; iter++) {
    // here come kernel calls modifying the buffer
    cudaMemcpy3D(&cpyToDemoParams);
    //cudaDeviceSynchronize();  // uncommenting this call avoids the crash
}
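For readers without the attachment, a self-contained sketch of the same pattern with the workaround enabled might look like the following. It is not the attached file: the kernel fillBuffer, the CHECK macro, and the buffer sizes are made up for illustration, and the copy parameters are only assumed to resemble those in simpleLayeredTexture.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for the "kernel calls modifying the buffer".
__global__ void fillBuffer(float *buf, size_t n, float value)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) buf[i] = value;
}

int main()
{
    const unsigned int width = 64, height = 64, layers = 4;  // assumed sizes
    const size_t count = (size_t)width * height * layers;

    // Linear device buffer that the kernel writes to.
    float *dBuf = nullptr;
    CHECK(cudaMalloc(&dBuf, count * sizeof(float)));

    // Layered CUDA array that receives the buffer on every iteration.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t layered = nullptr;
    CHECK(cudaMalloc3DArray(&layered, &desc,
                            make_cudaExtent(width, height, layers),
                            cudaArrayLayered));

    // A CUDA array takes part in the copy, so the extent is in elements.
    cudaMemcpy3DParms cpyParams = {0};
    cpyParams.srcPtr = make_cudaPitchedPtr(dBuf, width * sizeof(float),
                                           width, height);
    cpyParams.dstArray = layered;
    cpyParams.extent = make_cudaExtent(width, height, layers);
    cpyParams.kind = cudaMemcpyDeviceToDevice;

    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((count + threads - 1) / threads);

    for (unsigned int iter = 0; iter < 1000; iter++) {
        fillBuffer<<<blocks, threads>>>(dBuf, count, (float)iter);
        CHECK(cudaGetLastError());          // catch launch errors
        CHECK(cudaMemcpy3D(&cpyParams));
        CHECK(cudaDeviceSynchronize());     // the workaround described above
    }

    CHECK(cudaFreeArray(layered));
    CHECK(cudaFree(dBuf));
    CHECK(cudaDeviceReset());
    return 0;
}
```

Checking the return value of every call also shows whether an earlier failure is being reported at the cudaMemcpy3D call rather than originating there.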
Thanks for providing the sample code; it describes the problem much more clearly. As you observed, cudaDeviceSynchronize() is required for Nsight Analysis: it causes all profile data to be flushed before the application exits. An equivalent function is cudaDeviceReset(). Without one of them, Nsight Analysis cannot work correctly.
Although there is a small flaw in your program, we are sorry for the inconvenience caused by our tool. We have reproduced your problem with Nsight 4.2 and verified that it is fixed in Nsight 4.5. Could you please try the newer version at https://developer.nvidia.com/nsight-visual-studio-edition-early-access? Note that it is an early access edition, not the final release. Thanks.
You can’t use the results of the GPU work if you don’t wait for it to complete. Many CUDA functions launch CUDA kernels or initiate memory transfers asynchronously, so the CPU function may (and probably will) return before the GPU work is complete. Take a look at the CUDA programming guide to see which functions behave like this. For those functions, the data you are expecting to be written by the GPU will not be in a determinate state until it finishes, so you must call one of the *Synchronize or *Query API functions to wait for the GPU work to finish before accessing the output. It makes no sense to have a CUDA program that launches work and exits without synchronizing, because nothing could safely be done with the results of the unfinished GPU work.
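As a minimal sketch of that rule (not taken from the program discussed above), the example below queues a copy, a kernel, and a copy back on a stream, and only reads the host buffer after cudaStreamSynchronize returns. The kernel square, the stream, and the sizes are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: squares each element in place.
__global__ void square(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main()
{
    const int n = 1024;                       // assumed problem size
    const size_t bytes = n * sizeof(int);

    // Pinned host memory so the async copies can overlap with CPU work.
    int *hData = nullptr;
    if (cudaMallocHost(&hData, bytes) != cudaSuccess) return 1;
    for (int i = 0; i < n; i++) hData[i] = i;

    int *dData = nullptr;
    if (cudaMalloc(&dData, bytes) != cudaSuccess) return 1;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return 1;

    // All three calls only enqueue work; they may return before it runs.
    cudaMemcpyAsync(dData, hData, bytes, cudaMemcpyHostToDevice, stream);
    square<<<(n + 255) / 256, 256, 0, stream>>>(dData, n);
    cudaMemcpyAsync(hData, dData, bytes, cudaMemcpyDeviceToHost, stream);

    // hData is not in a determinate state here; reading it would be a bug.

    // Block until everything queued on the stream has finished.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        fprintf(stderr, "GPU work failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Only now is it safe to use the results.
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (hData[i] != i * i) mismatches++;
    printf("mismatches: %d\n", mismatches);

    cudaStreamDestroy(stream);
    cudaFree(dData);
    cudaFreeHost(hData);
    cudaDeviceReset();   // lets profiling tools flush their data before exit
    return mismatches == 0 ? 0 : 1;
}
```

cudaStreamQuery could be polled instead of blocking; it returns cudaErrorNotReady while work is still outstanding.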
Also, the current NVIDIA tools can only guarantee complete and accurate results if the program waits for outstanding work to finish with the cudaDeviceReset or cudaDeviceSynchronize functions. See this post on the Parallel For-All Blog: