I posted another problem very similar to this one. This time we all have access to the source code in the SAMPLES folder.
I am running VS 2015 SP3, Win7/64 SP1, CUDA 9.0, and dev driver 388.59.
I have a Quadro K620 and Titan XP in my machine.
I am using the matrixMul sample without modification.
When I run it from the IDE with wA=2048, hA=2048, wB=2048, hB=2048 it works just fine.
If you launch it from CMD.exe or from a VS 2015 command prompt, matrixMulDrv again works just fine.
But if I run the following command, my machine reboots:
nvprof --kernels matrixMul --metrics all matrixMul -wA=2048 …
The --metrics all switch seems to be the problem. I’ve tried another Titan XP board and it also fails.
I’ve tried a Titan X board and it was fine. If I do something like --metrics ipc it does not fail.
It happens 6 or 7 seconds into the profiling. You can see some of the test results scrolling by. Then it hangs and reboots. Just try the command that I posted above with a Titan XP board.
You said your system didn’t crash, but do you have a K620 AND a Titan XP in your machine?
My machine still crashes with CUDA_VISIBLE_DEVICES=0 set.
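For anyone repeating this test, here is a minimal sketch of how the variable could be set for a single profiling run without changing the whole session. The helper name `profiling_env` is hypothetical. It also sets `CUDA_DEVICE_ORDER=PCI_BUS_ID`, because CUDA's default ordering is fastest-first, so index 0 is not necessarily the board that nvidia-smi lists as 0. Which index is the Titan XP on this machine is an assumption to check, not something the sketch determines.

```python
import os

def profiling_env(visible_devices="0", base_env=None):
    """Return a copy of the environment restricted to the given CUDA devices."""
    env = dict(os.environ if base_env is None else base_env)
    # Enumerate GPUs in PCI bus order so CUDA indices line up with nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Only the listed device indices are visible to the CUDA runtime.
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return env

if __name__ == "__main__":
    env = profiling_env("0")
    print(env["CUDA_DEVICE_ORDER"], env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.call` when launching the profiler.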
I have no idea what you mean by “If still not work, can you check other driver also ?”.
It took me quite a while to come up with a configuration and an application (that you have the source code to) that would allow you to reproduce the bug.
Could you please put a K620 and a Titan XP in your machine and run the tests too?
I did it and, as I said, my machine still crashed. Could you please be more explicit with the commands you want me to run? Tell me exactly what you want me to enter. “--replay-mode application” is not a command.
nvprof -m all --replay-mode application matrixMul.exe
I put it in a batch file 50 times and then executed the batch. Without the --replay-mode switch, the machine would crash the first time the program executed. However, with --replay-mode the program could run 8 or 9 times before the crash. It took a couple of hours before the crash … but it still happens.
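The repeated run described above can be sketched roughly as follows. This is not the batch file that was actually used, only an equivalent loop. The function names and the injectable `runner` parameter are hypothetical, and the command line is the one quoted in this thread.

```python
import subprocess

def nvprof_command(replay_mode=None, app="matrixMul.exe"):
    """Build the nvprof command line, optionally with --replay-mode."""
    cmd = ["nvprof", "-m", "all"]
    if replay_mode is not None:
        cmd += ["--replay-mode", replay_mode]
    cmd.append(app)
    return cmd

def run_repeatedly(cmd, runs=50, runner=subprocess.call):
    """Run cmd `runs` times and return how many runs exited with code 0."""
    succeeded = 0
    for i in range(1, runs + 1):
        # Flush so the last line on screen shows which run was in flight.
        print(f"run {i}/{runs}: {' '.join(cmd)}", flush=True)
        if runner(cmd) == 0:
            succeeded += 1
    return succeeded
```

Calling `run_repeatedly(nvprof_command("application"))` would correspond to the 50-run batch. Leaving the argument out of `nvprof_command()` gives the variant without the switch.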
The K620 is my display GPU. I am running all of my code on the Titan XP. The output of nvidia-smi (shown earlier in this thread) shows that they are both in WDDM mode. My Titan XP does not have a video cable attached to it.
But since we can’t reproduce the issue and there is no log to check, there is no progress for now.
Can you use cuda-memcheck to check whether the sample has memory issues?
Also, at the current stage, does this block your work? Since this is an SDK sample, does your own application hit the issue too?
I think the current workaround (WAR) is to avoid --metrics all and collect only the metrics you want.
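One way to apply that workaround, and to narrow down which metric triggers the reboot, might be to collect metrics one at a time and record each name on disk before the run starts. The sketch below is only an illustration under that assumption. The function names, the progress file, and the `runner` parameter are hypothetical, and the application arguments elided earlier in the thread are left out.

```python
import os
import subprocess

def metric_command(metric, kernel="matrixMul", app="matrixMul.exe"):
    """Build an nvprof command that collects a single metric for one kernel."""
    return ["nvprof", "--kernels", kernel, "--metrics", metric, app]

def profile_one_by_one(metrics, progress_path, runner=subprocess.call):
    """Profile each metric separately and return {metric: exit code}."""
    results = {}
    for metric in metrics:
        # Record the metric before running it, and force it to disk, so the
        # last line of the file names the metric in flight if the machine reboots.
        with open(progress_path, "a") as log:
            log.write(f"starting {metric}\n")
            log.flush()
            os.fsync(log.fileno())
        results[metric] = runner(metric_command(metric))
    return results
```

For example, `profile_one_by_one(["ipc", "achieved_occupancy", "sm_efficiency"], "progress.log")` would run three separate profiling passes. Only `ipc` comes from this thread; the other names are examples, and `nvprof --query-metrics` lists what a given device supports.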