matrixMul crashes PC with Titan XP when using the nvprof --metrics all switch
Hi

I posted another problem very similar to this one. This time we all have access to the source code in the SAMPLES folder.

I am running VS 2015 SP3, Win7/64 SP1, CUDA 9.0, and dev driver 388.59.
I have a Quadro K620 and Titan XP in my machine.
I am using the matrixMul sample without modification.

When I run it from the IDE with wA=2048, hA=2048, wB=2048, hB=2048 it works just fine.

If you launch a CMD.exe or a VS 2015 CMD.EXE, the matrixMulDrv will again work just fine.

But if I run the following, my machine reboots:

nvprof --kernels matrixMul --metrics all matrixMul -wA=2048 ....

The --metrics all switch seems to be the problem. I've tried another Titan XP board and it also fails.
I've tried a Titan X board and it was fine. If I do something like --metrics ipc it does not fail.
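
Since a single metric such as ipc works while --metrics all reboots the machine, one way to narrow this down is to collect one metric per nvprof run and write the metric name to disk before each run, so the log survives the reboot. The following is only a minimal Python sketch under stated assumptions: the METRICS list is a hypothetical subset (the full list for a device comes from nvprof --query-metrics), and build_command, run_one_metric_at_a_time, and the log file name are illustrative names that do not come from this thread.

```python
import os
import subprocess

# Hypothetical subset of metric names, chosen for illustration only.
# The full list for a given GPU is printed by `nvprof --query-metrics`.
METRICS = ["ipc", "achieved_occupancy", "sm_efficiency"]

# Application and arguments taken from the command line posted in this thread.
BASE_ARGS = ["matrixMul.exe", "-wA=2048", "-hA=2048", "-wB=2048", "-hB=2048"]


def build_command(metric, kernel="MatrixMulCUDA", device="0"):
    """Return the nvprof argument list that collects a single metric."""
    return ["nvprof", "--devices", device, "--kernels", kernel,
            "--metrics", metric] + BASE_ARGS


def run_one_metric_at_a_time(metrics, log_path="metric_bisect.log",
                             runner=subprocess.call):
    """Profile each metric in its own nvprof run.

    The metric name is flushed to disk before the run starts, so that after
    a reboot the last "starting" line without a matching "finished" line
    names the metric that was being collected.
    """
    for metric in metrics:
        with open(log_path, "a") as log:
            log.write("starting %s\n" % metric)
            log.flush()
            os.fsync(log.fileno())  # push the line to disk before profiling
        status = runner(build_command(metric))
        with open(log_path, "a") as log:
            log.write("finished %s (exit code %s)\n" % (metric, status))
```

Calling run_one_metric_at_a_time(METRICS) from the directory that contains matrixMul.exe would start the runs. The runner parameter exists only so the loop can be exercised without nvprof installed.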

NVIDIA, can you help me out on this one?

--Bob

#1
Posted 12/09/2017 07:49 PM   
Here is the exact command line call that reboots my machine.

nvprof --devices 0 --kernels MatrixMulCUDA --metrics all matrixMul.exe -wA=2048 -hA=2048 -wB=2048 -hB=2048

--Bob

#2
Posted 12/10/2017 03:26 PM   
Hi, bz

How long does it take for the crash to happen?

At the beginning, or after a long period of profiling?

#3
Posted 12/11/2017 10:22 AM   
It happens 6 or 7 seconds into the profiling. You can see some of the test results scrolling by. Then it hangs and reboots. Just try the command that I posted above with a Titan XP board.

#4
Posted 12/11/2017 01:29 PM   
Hi, bz

It sounds like an error.
I will find a GP102 to check.

#5
Posted 12/12/2017 02:36 AM   
Hi, bz

I tested with a Titan Xp on Win7 + CUDA 9.0 + 388.59.
Although the profile takes a long time to finish, the system didn't crash.

One suspicion is that the multi-GPU environment is involved. To check, can you
1. set CUDA_VISIBLE_DEVICES=0
2. nvprof -m all matrixMul.exe

If still not work, can you check other driver also ?

#6
Posted 12/13/2017 10:24 AM   
You said your system didn't crash, but do you have a K620 AND a Titan XP in your machine?

My machine still crashes with CUDA_VISIBLE_DEVICES=0 set.

I have no idea what you mean by "If still not work, can you check other driver also ?".

It took me quite a while to come up with a configuration and an application (that you have the source code to)
that would allow you to reproduce the bug.

Could you please put a K620 and a Titan XP in your machine and run the tests too?

#7
Posted 12/14/2017 05:51 AM   
By the way, here is the output of nvidia-smi.

Thu Dec 14 00:54:57 2017
[code]
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 388.13                 Driver Version: 388.13                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K620        WDDM  | 00000000:03:00.0  On |                  N/A |
| 39%   50C    P8     1W /  30W |    309MiB /  2048MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp           WDDM  | 00000000:04:00.0 Off |                  N/A |
| 23%   38C    P8    11W / 250W |    128MiB / 12288MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       260    C+G   ...6)\Google\Chrome\Application\chrome.exe N/A      |
|    0      3564    C+G   ...ogram Files\Windows Sidebar\sidebar.exe N/A      |
|    0      3928    C+G   C:\Windows\system32\Dwm.exe                N/A      |
+-----------------------------------------------------------------------------+
[/code]
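
One thing that may be worth ruling out is which physical board "device 0" refers to. nvidia-smi lists the Quadro K620 as GPU 0 and the TITAN Xp as GPU 1, but as far as I know the CUDA runtime defaults to CUDA_DEVICE_ORDER=FASTEST_FIRST, so its device 0 is not necessarily the same board. The sketch below builds an environment in which CUDA indices follow PCI bus order and only one board is visible. It is an illustration under those assumptions, not a verified fix for the crash, and profiling_env is a hypothetical helper name.

```python
import os


def profiling_env(visible_devices, base_env=None):
    """Return a copy of the environment with explicit CUDA device selection.

    CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA number devices by PCI bus ID,
    so the indices given in CUDA_VISIBLE_DEVICES refer to the same boards
    as the GPU column in the nvidia-smi listing.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return env


# Index "1" is the TITAN Xp in the nvidia-smi listing in this post.
# base_env={} is used here only to keep the printed result short.
titan_only = profiling_env("1", base_env={})
print(titan_only)
```

In a real run the helper would be called without base_env and the result passed as the env argument of subprocess.call, so that nvprof and matrixMul.exe inherit the rest of the normal environment.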

#8
Posted 12/14/2017 05:55 AM   
Hi, bz


Can you use "--replay-mode application" to check?

PS: What is the result of "nvprof -m all matrixMul.exe" without input parameters?

#9
Posted 12/14/2017 06:00 AM   
You asked me to do this.

1. set CUDA_VISIBLE_DEVICES=0
2. nvprof -m all matrixMul.exe

I did it and as I said, my machine still crashed. Could you please be more explicit with the commands you want me to run? Tell me exactly what you want me to enter. "--replay-mode application" is not a command.

#10
Posted 12/14/2017 06:22 AM   
nvprof -m all --replay-mode application matrixMul.exe

#11
Posted 12/14/2017 06:25 AM   
This command was better, but not perfect.

nvprof -m all --replay-mode application matrixMul.exe

I put it in a batch file 50 times and then executed the batch. Without the --replay-mode switch, the machine would crash the first time the program executed. However, with --replay-mode application, the program could execute 8 or 9 times before the crash. It took a couple of hours, but the crash still happens.
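
For anyone trying to reproduce this, the 50-line batch file can be approximated with a loop that also records how many runs completed before the machine went down. This is a rough Python sketch rather than the batch file actually used: run_repeatedly and the log file name are made up for illustration, and only the nvprof command line comes from this thread.

```python
import os
import subprocess
import time

# Command line taken from this thread.
COMMAND = ["nvprof", "-m", "all", "--replay-mode", "application", "matrixMul.exe"]


def run_repeatedly(command, repetitions=50, log_path="repeat_runs.log",
                   runner=subprocess.call, clock=time.time):
    """Run the same command several times, logging each start and finish.

    Each "started" line is forced to disk before the run begins, so after a
    reboot the log shows how many runs completed and when the last began.
    Returns the number of runs that finished.
    """
    completed = 0
    for run in range(1, repetitions + 1):
        with open(log_path, "a") as log:
            log.write("run %d started at %d\n" % (run, int(clock())))
            log.flush()
            os.fsync(log.fileno())  # make sure the line survives a crash
        status = runner(command)
        with open(log_path, "a") as log:
            log.write("run %d finished with exit code %s\n" % (run, status))
        completed += 1
    return completed
```

Calling run_repeatedly(COMMAND) from the sample's output directory would mirror the batch file. The timestamps in the log give the time to failure directly instead of an estimate.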

#12
Posted 12/14/2017 02:10 PM   
Hello? Can you tell me where we stand with respect to this issue?

#13
Posted 12/15/2017 09:36 PM   
Hi, bz

We have reproduced a profiling error locally using your command, but we still fail to reproduce the crash.

The dev team is now working on the error.
We are not sure whether this error causes the crash on your machine. Hopefully they share the same root cause.

#14
Posted 12/18/2017 02:32 AM   
Hi, bz

What's your display GPU? The Quadro K620 or the Titan Xp?

#15
Posted 12/19/2017 03:01 AM   