Kernel profiling missing
Hello,
First of all, I am a big fan of Nsight for Visual Studio. I am currently using Nsight Visual Studio Edition 4.5 (.0.15036) in VS 2012, running a series of compute-only kernels (no interop whatsoever) on a Quadro K4000.

My application has a series of 6 kernels that run in sequence (with events to protect each stage from the previous one). The results are returned to the CPU at the end of the sequence through a set of several dozen buffers of varying size. The cost of returning the results ended up being 75% of the entire round-trip cost.
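For context, the event-protected sequencing I mean looks roughly like this. This is just an illustrative sketch, not my actual code — the kernel, stream, and buffer names are invented:

```cuda
// Hypothetical sketch: stage 1 runs in a different stream, so an event
// guards it against stage 0 still being in flight.
cudaStream_t s0, s1;
cudaEvent_t stage0Done;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
cudaEventCreateWithFlags(&stage0Done, cudaEventDisableTiming);

stage0Kernel<<<grid, block, 0, s0>>>(d_in, d_tmp);
cudaEventRecord(stage0Done, s0);            // mark stage 0 complete

cudaStreamWaitEvent(s1, stage0Done, 0);     // stage 1 waits for stage 0
stage1Kernel<<<grid, block, 0, s1>>>(d_tmp, d_out);
```

Stages 2 through 5 follow the same record/wait pattern.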

Everything works great up to this point.

I added a seventh kernel that combines the various buffers into a minimal set of buffers (so the number of cudaMemcpyAsync calls is minimized). The kernel simply multiplexes a set of buffers into a single buffer used for transferring back to the CPU.

I call cudaMemcpyAsync to get the multiplexed buffer back to the CPU, demultiplex it there, and everything works great.
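The multiplexing idea, very roughly sketched (buffer layout, names, and the offsets/sizes tables are all made up for illustration — my real kernel is more involved):

```cuda
// Hypothetical sketch: pack several device buffers back-to-back into one
// staging buffer, then issue a single cudaMemcpyAsync instead of dozens
// of small ones.
__global__ void muxKernel(const float* const* srcs, const int* sizes,
                          const int* offsets, int numBufs, float* packed)
{
    int buf = blockIdx.y;                          // one source buffer per block row
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    if (buf < numBufs && i < sizes[buf])
        packed[offsets[buf] + i] = srcs[buf][i];   // copy element into its slot
}

// Host side: one transfer, then demultiplex on the CPU using the same
// offsets/sizes tables.
muxKernel<<<dim3(blocksPerBuf, numBufs), 256, 0, stream>>>(
    d_srcs, d_sizes, d_offsets, numBufs, d_packed);
cudaMemcpyAsync(h_packed, d_packed, totalBytes,
                cudaMemcpyDeviceToHost, stream);
```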

The problem is that the (very useful) profiling data in the "Timeline->Compute" and "Timeline->Memory" sections of an nvreport no longer reports any kernel activity. It still reports the API calls. The results are the same, and the transfer cost has gone down, but I am simply not able to see where the time is going with regard to kernel activity and memory accesses.

Note: The only changes are the addition of the multiplexing kernel, the change to how the data is transferred back to the CPU, and the demultiplexing on the CPU.

Any suggestions on how to resolve this?

#1
Posted 05/08/2015 08:28 PM   
Hi krazanmp,

Could you send your nvreport file to us? Its default location is C:\temp. Compress the problem report folder and email it to qzhang@nvidia.com. If there is any chance we can get your project, that would be greatly useful for reproducing and resolving your problem.

Thanks, Qian

#2
Posted 05/12/2015 06:08 AM   
Is there any news on this? I have a similar problem with Nsight in Visual Studio that I am currently trying to work around: VS2010, Nsight 4.6.

Running Profile CUDA Application displays all kernels correctly, giving execution timings and all relevant parameters. Running Trace Application to get a timeline of how everything lines up, however, does not display anything in the Compute section. None of the kernels are shown (there are some cudaLaunch runtime API calls, but their duration does not seem related to the kernel execution time, so it is hard to line them all up).

Does the profiler need some of the build files in the working directory? I have it set to launch an executable from a deployed folder, so there are no temporary build files there. The only other thing is that my CUDA calls are made from inside a DLL, which the executable uses.

#3
Posted 05/26/2015 02:07 PM   
Hi Tiomat,

We're still working on krazanmp's issue. Regarding your problem, Trace should work as long as the Profiler works; no other settings are needed. Could you please send your nvreport to qzhang@nvidia.com? The default location is C:\temp. It would be best for us to receive both the profile nvreport (which works fine) and the trace nvreport (which doesn't). Thanks a lot.

Qian

#4
Posted 05/27/2015 03:27 AM   
Qian,

I have sent those reports.

Tiomat

#5
Posted 05/27/2015 08:25 AM   
For reference, my issue is fixed. I had not put a cudaDeviceSynchronize call in my application, so the trace buffers were not flushed correctly, leading to the missing information. Adding this call at the end of my functions solved it instantly.

Tiomat
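For anyone hitting the same thing, the change amounted to something like this (function and variable names are made up for illustration):

```cuda
// Hypothetical sketch: the async work is unchanged; the only addition is
// the cudaDeviceSynchronize at the end, which gives the tracer a
// CPU/GPU sync point at which to flush its buffers.
void runPipeline(cudaStream_t stream, float* h_out, const float* d_out,
                 size_t bytes)
{
    launchKernels(stream);  // the existing async launches and copies
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaDeviceSynchronize();  // added: lets Nsight flush its trace data
}
```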

#6
Posted 05/27/2015 09:17 AM   
Has there been any news on this?

I added cudaDeviceSynchronize to the end of my functions. Unfortunately, only some of the profile data is being displayed (I am guessing around 5-10%). As you can imagine, this is EXTREMELY frustrating, since I cannot make any coding decisions based on feedback that does not give me the whole picture.

For the record, I am not convinced the data isn't being recorded; I suspect it is simply not being displayed. I have tried to capture video of what is going on: as I scroll the timeline, what looks like the correct data flashes into existence on the extreme right side of the timeline panes and disappears again. This leads me to believe the data exists but I am unable to actually view it (or at least view all of it at the same time).

This issue is becoming a severe roadblock to my ability to improve the performance of the application as a whole, or even to debug issues. I have been able to reduce kernel execution time fairly well with the current configuration by looking at the data I am given and assuming it is representative of the majority of the kernel executions and their IO. I am now trying to ensure I get as much computation/IO overlap as possible, and without being able to see adjacent kernel execution blocks (and their associated IO), it is impossible to verify that I am getting that overlap.

#7
Posted 07/09/2015 09:26 PM   
I believe I have solved my problem with Nsight only showing one stream (Stream 0).

The application I am running takes a LONG time to start up, so my profiling workflow is to:
  1. Launch the app for tracing through the visual studio Activity page
  2. In the capture control pane, select cancel to stop profiling
  3. Navigate in the application to the point of interest
  4. In the capture control pane, select start
  5. Execute the kernels at the point of interest
  6. In the capture control pane, select stop


My conjecture is that if Nsight isn't actively capturing when the streams are created, it has no idea they exist, even though the kernels in those streams execute as expected.

I still think this is a problem because (as in my case) the application may have MUCH else going on, and the behaviour in need of profiling may NOT come immediately after startup. This means the profiler may run for many seconds or minutes before anything interesting actually occurs on the GPU, resulting in large, empty charts of mostly useless data.
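The timing issue as I understand it, sketched with made-up names:

```cuda
// Hypothetical sketch of my conjecture: a stream created during startup,
// while capture is cancelled, is invisible to Nsight later on.
cudaStream_t workStream;
cudaStreamCreate(&workStream);   // capture is paused: Nsight misses this

// ...much later, after clicking "start" in the capture control pane:
myKernel<<<grid, block, 0, workStream>>>(d_data);
// The kernel runs fine, but the timeline shows only Stream 0, because
// workStream's creation was never recorded.
```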

#8
Posted 07/23/2015 09:46 PM   
I finally have an update for this case. There are a few issues at play here.

First of all, I just fixed a bug which could cause erratic behavior on the timeline. Some of the reports krazanmp sent me exhibited behavior where CUDA memcpy events would appear in the memcpy row, but would be unexpectedly missing from the stream row. Also, the memcpy and compute rows would appear as stacked rows (for displaying overlapping ranges) even though nothing ever overlapped. This was all due to events in the log appearing in an order which we assumed was impossible, and our row renderers didn't handle that case correctly. That is fixed now, so you should see the rendering problems go away in our next release (we expect Nsight 5.1 to have a beta release in January).

That doesn't fix the other problem here, which is Nsight failing to log events. The problem is that waiting for asynchronous GPU events (like CUDA launches/memcpys/memsets) to complete and flushing their records to the log files requires some minimal CPU/GPU synchronization. There is a fine balance between flushing events too often and not flushing often enough: Too often slows down the app and makes the tool's measurements untrustworthy, and not often enough means stopping the capture by clicking Stop results in losing a lot more unflushed data.

The obvious solution would be for Nsight to force a flush when capture stops, but this requires Nsight to create a background thread inside the app, which can react to the user clicking Stop. We never liked the idea of creating a thread and introducing more nondeterminism in how the tool affects the app's performance, but we have decided the benefits of the background thread greatly outweigh the costs. Unfortunately it's a lot of work to fix this and won't be done in January, but it is my highest priority feature for the release after that.

In the meantime, there are ways to work around this. If an app calls cudaDeviceSynchronize, Nsight will immediately take advantage of the CPU/GPU sync point and flush all records. Calling cudaStreamSynchronize will force Nsight to flush events for just that stream. If an app only uses cudaEventSynchronize, as is the case here (at least from what I see in the reports), Nsight only flushes when the buffers fill up and need more space, which may not happen often (or ever). Clicking "Stop" in the UI's capture control will cause all unflushed data to be lost. Just for the purpose of working around that deficiency in Nsight, try adding some occasional calls to cudaDeviceSynchronize and make sure to wait for one of those before clicking Stop. Then the report will contain all events up to that sync call.
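A rough sketch of that workaround, with invented names for the per-iteration work and flush interval:

```cuda
// Hypothetical sketch: sprinkle occasional sync points so Nsight can
// flush its records, and wait for one before clicking Stop in the
// capture control pane.
for (int frame = 0; frame < numFrames; ++frame) {
    runFrame(stream);                 // the app's normal async work
    if (frame % 100 == 0)             // flush interval is arbitrary
        cudaDeviceSynchronize();      // forces Nsight to flush all records
}
cudaDeviceSynchronize();              // final flush before stopping capture
```

Flushing too often perturbs the timings being measured, so the interval is a trade-off, as described above.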

Also, in Nsight 5.0, I at least fixed the case where the app exits normally. Now, Nsight adds a hook to ensure all unflushed data gets flushed at exit. This does not help the case described here (clicking Stop while the app is still running), but it does mean CUDA apps no longer need to call cudaDeviceSynchronize or cudaDeviceReset at the end to ensure all the data gets flushed -- now this happens automatically when tracing the app to completion.

I will reply to this post again when we release the version that forces flushing all records when Stop is clicked in the UI.

#9
Posted 12/04/2015 06:35 PM   
I've run into a similar issue: my kernel wouldn't show up when I ran memory experiments ('Memory Statistics - Local' for example), while the 'Overview' set worked fine. Turns out that all experiments work when I remove '--ptxas-options=-abi=no' from the compiler options.

#10
Posted 04/14/2017 05:15 AM   
Hi csigg,

Your issue is different from the ones above: you are seeing a problem in CUDA Profiling (that's where all the experiments are), while the discussion above applied only to Trace. The code for those two modes is separate.

I'll file a bug for your issue. I'd appreciate it if you could provide as much info as possible to help us reproduce your problem quickly. Best would be source for a simplified app that exhibits it. If that's too much to ask, can you describe your kernel? Does it call __device__ functions or other __global__ functions? If so, does it make only a few calls, or a large number? Does your kernel launch other kernels (i.e. CDP)? Does it use any device-side libraries? With info like this, I can try to create a simple program to reproduce the problem. The fact that removing -abi=no makes it go away certainly narrows it down.

Thanks!

#11
Posted 04/14/2017 04:51 PM   