matrixMul crashes pc with Titan XP using nvprof --metrics all switch

Hi

I posted another problem very similar to this one. This time we all have access to the source code in the SAMPLES folder.

I am running VS 2015 SP3, Win7/64 SP1, cuda 9.0, and dev driver 388.59.
I have a Quadro K620 and Titan XP in my machine.
I am using the matrixMul sample without modification.

When I run it from the IDE with wA=2048, hA=2048, wB=2048, hB=2048 it works just fine.

If I launch it from CMD.exe or from a VS 2015 command prompt, matrixMulDrv again works just fine.

But if I run the following, my machine reboots:

nvprof --kernels matrixMul --metrics all matrixMul -wA=2048 …

The --metrics all switch seems to be the problem. I’ve tried another Titan XP board and it also fails.
I’ve tried a Titan X board and it was fine. If I do something like --metrics ipc it does not fail.

NVidia, can you help me out on this one?

–Bob

Here is the exact command line call that reboots my machine.

nvprof --devices 0 --kernels MatrixMulCUDA --metrics all matrixMul.exe -wA=2048 -hA=2048 -wB=2048 -hB=2048

–Bob

Hi, bz

How long does it take before the crash happens?

At the beginning, or after a long period of profiling?

It happens 6 or 7 seconds into the profiling. You can see some of the test results scrolling by. Then it hangs and reboots. Just try the command that I posted above with a Titan XP board.

Hi, bz

It sounds like an error.
I will find a GP102 to check.

Hi, bz

I ran a test with a Titan Xp on Win7 + CUDA 9.0 + driver 388.59.
Although the profiling took a long time to finish, the system didn’t crash.

One suspicion is that this is caused by the multi-GPU environment. Can you run

  1. set CUDA_VISIBLE_DEVICES=0
  2. nvprof -m all matrixMul.exe

to check?

If still not work, can you check other driver also ?
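
A minimal sketch of the same check driven from Python, for anyone who wants to script it. It assumes nvprof and matrixMul.exe are reachable from the working directory; the helper names profiler_env and run_profile are made up for illustration. One thing worth keeping in mind: by default CUDA enumerates devices fastest-first, so index 0 is not necessarily the GPU that nvidia-smi lists as 0. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID (done below as an assumption about what is wanted) makes the two orderings agree.

```python
import os
import subprocess


def profiler_env(visible_devices, base_env=None):
    """Return a copy of the environment restricted to the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: we want CUDA indices to follow the PCI bus order.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in visible_devices)
    return env


def run_profile(visible_devices):
    """Run the profiling command from the post with a restricted device list."""
    cmd = ["nvprof", "-m", "all", "matrixMul.exe"]
    return subprocess.call(cmd, env=profiler_env(visible_devices))


# Example (not executed here): run_profile([1]) exposes only the GPU at PCI index 1.
```

With the nvidia-smi listing in this thread, PCI index 1 would be the Titan Xp and index 0 the K620.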

You said your system didn’t crash, but do you have a K620 AND a Titan XP in your machine?

My machine is still crashing with the CUDA_VISIBLE_DEVICES=0.

I have no idea what you mean by “If still not work, can you check other driver also ?”.

It took me quite a while to come up with a configuration and an application (that you have the source code to)
that would allow you to reproduce the bug.

Could you please put a K620 and a Titan XP in your machine and run the tests too?

By the way, here is the output of nvidia-smi.

Thu Dec 14 00:54:57 2017

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.13                 Driver Version: 388.13                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K620        WDDM  | 00000000:03:00.0  On |                  N/A |
| 39%   50C    P8     1W /  30W |    309MiB /  2048MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp           WDDM  | 00000000:04:00.0 Off |                  N/A |
| 23%   38C    P8    11W / 250W |    128MiB / 12288MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       260    C+G   ...6)\Google\Chrome\Application\chrome.exe N/A      |
|    0      3564    C+G   ...ogram Files\Windows Sidebar\sidebar.exe N/A      |
|    0      3928    C+G   C:\Windows\system32\Dwm.exe                N/A      |
+-----------------------------------------------------------------------------+

Hi, bz

Can you check with “--replay-mode application”?

PS: What is the result of “nvprof -m all matrixMul.exe” without input parameters?

You asked me to do this.

  1. set CUDA_VISIBLE_DEVICES=0
  2. nvprof -m all matrixMul.exe

I did it and as I said, my machine still crashed. Could you please be more explicit with the commands you want me to run? Tell me exactly what you want me to enter. “--replay-mode application” is not a command.

nvprof -m all --replay-mode application matrixMul.exe

This command was better … but not perfect.

nvprof -m all --replay-mode application matrixMul.exe

I put it in a batch file 50 times and then executed the batch. Without the --replay-mode switch the machine would crash the first time the program executed. However, with --replay-mode the program could execute 8 or 9 times before the crash. It took a couple of hours before the crash … but it still happens.
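
The batch-file experiment above could also be scripted so that the last iteration reached is still on disk after a reboot. This is only a sketch under assumptions: the log file name and the helper names record and repeat are arbitrary, and the runner parameter exists only so the loop can be tried without a GPU.

```python
import os
import subprocess


def record(path, text):
    """Append a line and force it to disk so it survives a hard reboot."""
    with open(path, "a") as f:
        f.write(text + "\n")
        f.flush()
        os.fsync(f.fileno())


def repeat(cmd, times, log_path, runner=subprocess.call):
    """Run cmd `times` times, logging before and after each run."""
    codes = []
    for i in range(1, times + 1):
        record(log_path, f"start {i}")
        code = runner(cmd)
        record(log_path, f"done {i} exit={code}")
        codes.append(code)
    return codes


# Command taken from the post above.
CMD = ["nvprof", "-m", "all", "--replay-mode", "application", "matrixMul.exe"]
# Example (not executed here): repeat(CMD, 50, "repeat_progress.log")
```

After a crash, a trailing "start N" line with no matching "done N" shows which run was in progress.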

Hello? Can you tell me where we stand with respect to this issue?

Hi, bz

We have reproduced a profiling error locally using your command, but we still fail to reproduce the crash.

The dev team is now working on the error.
We are not sure whether this error causes the crash on your machine. We hope they share the same root cause.

Hi, bz

What’s your display GPU? Quadro K620 or Titan Xp?

The K620 is my display GPU. I am running all of my code on the Titan XP. The output of nvidia-smi (shown earlier in this thread) shows that they are both in WDDM mode. My Titan XP does not have a video cable attached to it.

veraj,

How is it coming? Have you solved the problem?

–bz

Hi, bz

We can’t reproduce the issue you described, so basically there is no update.

As for the problem we found, it is related to using the display GPU, so it does not match your problem either.

What is the status of this issue? Have you given up on it?

Hi, bz

There is an internal bug tracking this.

But since we can’t reproduce the issue and have no log to check, there is no progress for now.

Can you use cuda-memcheck to check whether the sample has memory issues?
Also, at the current stage, does it block your work? This is an SDK sample; does your own app hit the issue as well?

I think the current workaround (WAR) is to not use --metrics all, and to collect only the metrics you need.
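
A rough sketch of that workaround: build one nvprof command per metric (or per small group of metrics) instead of using --metrics all. The device, kernel name and matrix sizes are copied from the command earlier in this thread. The metric list is only a placeholder, and metric_commands is a made-up helper; nvprof --query-metrics lists the metrics actually available on a given GPU.

```python
def metric_commands(metrics, group_size=1):
    """Build nvprof command lines that collect `group_size` metrics at a time."""
    base = ["nvprof", "--devices", "0", "--kernels", "MatrixMulCUDA"]
    app = ["matrixMul.exe", "-wA=2048", "-hA=2048", "-wB=2048", "-hB=2048"]
    cmds = []
    for i in range(0, len(metrics), group_size):
        # nvprof accepts a comma-separated list after --metrics.
        group = ",".join(metrics[i:i + group_size])
        cmds.append(base + ["--metrics", group] + app)
    return cmds


# Placeholder selection; replace with the metrics you actually need.
METRICS = ["ipc", "achieved_occupancy", "sm_efficiency"]

for cmd in metric_commands(METRICS):
    print(" ".join(cmd))
```

Running the groups one at a time may also help narrow down which metric triggers the reboot, since --metrics ipc on its own was reported to work.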