matrixMul crashes pc with Titan XP using nvprof --metrics all switch
The K620 is my display GPU. I am running all of my code on the Titan XP. The output of nvidia-smi (shown earlier in this thread) shows that they are both in WDDM mode. My Titan XP does not have a video cable attached to it.

#16
Posted 12/19/2017 06:31 AM   
veraj,

How is it coming? Have you solved the problem?

--bz

#17
Posted 12/22/2017 04:19 AM   
Hi, bz


We can't reproduce the issue you described, so there is basically no update.

As for the problem we did find, it is related to using the display GPU, so it does not match your problem.

#18
Posted 12/22/2017 06:00 AM   
What is the status of this issue? Have you given up on it?

#19
Posted 01/04/2018 12:31 AM   
Hi, bz


There is an internal bug tracking this.

But since we can't reproduce the issue and have no log to check, there is no progress for now.


Can you use cuda-memcheck to check whether the sample has memory issues?
Also, at the current stage, does this block your work? Since this is an SDK sample, does your own app hit the issue as well?


I think the current WAR is not using --metrics all , but only collect the metrics you want.
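As a concrete sketch of that suggestion (the metric names below are only examples of valid nvprof metrics, not a specific recommendation):

```shell
:: Check the SDK sample for memory errors first
cuda-memcheck matrixMul.exe

:: Collect only a few metrics instead of --metrics all
nvprof --metrics achieved_occupancy,sm_efficiency,dram_read_throughput matrixMul.exe
```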

#20
Posted 01/04/2018 03:14 AM   
4 weeks ago I described the original rebooting problem in another note to this forum. No one from NVidia responded (probably because I couldn't make the code available). So, I took the time to find code in the
NVidia SDK that DOES have the problem. I did this so that you could run tests.

You said in your response that you have no log to check. You haven't asked me for a log file.
What does your log file say?

No, I didn't run cuda-memcheck. You have access to the same code that I do. Did you run cuda-memcheck on the SDK
matrix multiplication sample that I reported to you?

Yes, it does block my work. That has been going on for 4 weeks now. I already explained, in another bug I posted to this forum, that my app has the problem.

I have no idea what you are saying or asking with this sentence ...

-- I think the current WAR is not using --metrics all , but only collect the metrics you want. --

#21
Posted 01/04/2018 07:35 AM   
Hi bz,

First of all, I am extremely sorry that you have to go through all this trouble.

The issue we could reproduce locally on a GTX 1070 was due to TDR. After increasing the TDR timeout, we no longer see the issue.

But since you are not driving the display through the Titan XP, it is unlikely that increasing TDR will solve your issue, but you can still try increasing TDR to a very high value. You can refer to the following link to increase it: https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
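As a sketch, increasing TdrDelay from an elevated command prompt might look like this (the 60-second value is only an example, and a reboot is needed for it to take effect):

```shell
:: Set the GPU watchdog delay to 60 seconds (example value)
reg add "HKLM\System\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 60 /f
```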


Quoting your sentence
I have no idea what you are saying or asking with this sentence ...

-- I think the current WAR is not using --metrics all , but only collect the metrics you want. --


We tried to reproduce the issue with the exact same setup, driver, GPU, and command, but we couldn't reproduce it locally, so we are finding it difficult to progress further.

Can you tell us which specific metrics you are looking to profile? If you do not see the issue while profiling only those metrics, then you will be unblocked.

#22
Posted 01/12/2018 02:08 PM   
I am interested in all 113 performance counters for the Titan XP. That's why I use the --metrics all switch.

If you review this thread from the beginning, you will see that I have the problem on each of 2 Titan XP boards.

If you look at my initial contact with you, you will see that I described the configuration of my machine.

"I am running VS 2015 SP3, Win7/64 SP1, cuda 9.0, and dev driver 388.59.
I have a Quadro K620 and Titan XP in my machine.
I am using the matrixMul sample without modification."

So, to be clear, you have set up a PC with Win7/64 SP1, cuda 9.0, and dev driver 388.59, a K620 and a Titan XP?

Why are you testing on a GTX 1070? A GTX 1070 isn't a Titan XP.

#23
Posted 01/12/2018 08:03 PM   
Hi bz,

First of all, I have gone through the whole thread and am well aware of your configuration.

We tried to set up the exact same configuration on our end to reproduce the issue.

Configuration: VS 2015 SP3, Win7/64 SP1, CUDA 9.0, and dev driver 388.59 with a K620 and a Titan XP

But we still couldn't reproduce the issue on our end.

Have you tried increasing TDR to a high value?

#24
Posted 01/13/2018 05:14 AM   
I apologize for misinterpreting your statement. I thought your earlier response indicated you ran your tests on the GTX 1070 and not the Titan XP.

As for the TDR experiment, I ran it 3 times. I added TdrDelay to the registry, then used the values 60, 300, and 3600. As I understand it, the unit of time is seconds. In all 3 cases my machine rebooted after about 20 seconds of run time.

Was that the correct key? My registry already had the TdrLevel = 0 key.
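For reference, the TDR values currently set can be listed like this (a sketch):

```shell
:: List the TDR-related values under the GraphicsDrivers key
reg query "HKLM\System\CurrentControlSet\Control\GraphicsDrivers"
```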

--Bob

#25
Posted 01/13/2018 06:17 AM   
Bob,

It is unfortunate that you can still reproduce the issue even after increasing TDR. There is one more thing you can do to unblock yourself: can you profile your app on a Linux platform? That way you can profile all performance metrics. Meanwhile, we will also try to reproduce the issue on our end. As mentioned earlier, since we don't have a local repro of your issue, we are finding it difficult to proceed further.

#26
Posted 01/15/2018 08:17 AM   
Bob,

Sorry to trouble you by asking you to try so many experiments. You can also try changing the Titan XP's driver mode to TCC.

Here is the link that explains how to do it:

http://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/tesla_compute_cluster.htm



To change the TCC mode, use the NVIDIA SMI utility. This is located by default at C:\Program Files\NVIDIA Corporation\NVSMI. Use the following syntax to change the TCC mode:

nvidia-smi -g {GPU_ID} -dm {0|1}

0 = WDDM
1 = TCC
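For example, assuming the Titan XP is GPU 1 (check the actual ID with nvidia-smi -L first), the switch might look like this; it needs an elevated prompt and a reboot to take effect:

```shell
:: List GPUs to find the Titan XP's ID (assumed to be 1 here)
nvidia-smi -L

:: Switch GPU 1 from WDDM to TCC mode
nvidia-smi -g 1 -dm 1
```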

#27
Posted 01/15/2018 08:26 AM   
Hi, I don't have a Linux machine to profile on.
I did change it to TCC. It still crashes.

#28
Posted 01/16/2018 07:31 AM   