I finally have an update for this case. There are a few issues at play here.
First of all, I just fixed a bug which could cause erratic behavior on the timeline. Some of the reports krazanmp sent me exhibited behavior where CUDA memcpy events would appear in the memcpy row, but would be unexpectedly missing from the stream row. Also, the memcpy and compute rows would appear as stacked rows (for displaying overlapping ranges) even though nothing ever overlapped. This was all due to events in the log appearing in an order which we assumed was impossible, and our row renderers didn’t handle that case correctly. That is fixed now, so you should see the rendering problems go away in our next release (we expect Nsight 5.1 to have a beta release in January).
That doesn’t fix the other problem here, which is Nsight failing to log events. The problem is that waiting for asynchronous GPU events (like CUDA launches/memcpys/memsets) to complete and flushing their records to the log files requires some minimal CPU/GPU synchronization. There is a fine balance between flushing events too often and not flushing often enough: flushing too often slows down the app and makes the tool’s measurements untrustworthy, while not flushing often enough means that stopping the capture by clicking Stop loses a lot more unflushed data.
The obvious solution would be for Nsight to force a flush when capture stops, but this requires Nsight to create a background thread inside the app, which can react to the user clicking Stop. We never liked the idea of creating a thread and introducing more nondeterminism in how the tool affects the app’s performance, but we have decided the benefits of the background thread greatly outweigh the costs. Unfortunately, this fix is a lot of work and won’t be done in January, but it is my highest priority feature for the release after that.
In the meantime, there are ways to work around this. If an app calls cudaDeviceSynchronize, Nsight will immediately take advantage of the CPU/GPU sync point and flush all records. Calling cudaStreamSynchronize will force Nsight to flush events for just that stream. If an app only uses cudaEventSynchronize, as is the case here (at least from what I see in the reports), Nsight only flushes when the buffers fill up and need more space, which may not happen often (or ever). Clicking “Stop” in the UI’s capture control then causes all unflushed data to be lost. To work around that deficiency in Nsight, try adding some occasional calls to cudaDeviceSynchronize, and make sure to wait for one of those before clicking Stop. Then the report will contain all events up to that sync call.
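To make the workaround concrete, here is a minimal sketch of what I mean by occasional calls to cudaDeviceSynchronize. Everything app-specific in it is an assumption made up for illustration: the scale kernel, the CHECK macro, the iteration count, and the kSyncInterval value are placeholders, not anything taken from the reports. The only part that matters is the periodic cudaDeviceSynchronize inside a loop that otherwise waits only on cudaEventSynchronize.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for the app's real work.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const int kIterations = 1000;
    // Assumed value: pick an interval that is rare enough not to
    // distort the timings you are trying to measure.
    const int kSyncInterval = 100;

    float* d_data = nullptr;
    CHECK(cudaMalloc(&d_data, n * sizeof(float)));
    CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

    cudaStream_t stream;
    cudaEvent_t done;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaEventCreate(&done));

    for (int iter = 0; iter < kIterations; ++iter) {
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 1.0f);
        CHECK(cudaGetLastError());

        // The app's existing synchronization: waiting on an event only.
        CHECK(cudaEventRecord(done, stream));
        CHECK(cudaEventSynchronize(done));

        // Workaround: an occasional full device sync gives the tool a
        // CPU/GPU sync point at which it can flush its records.
        if ((iter + 1) % kSyncInterval == 0) {
            CHECK(cudaDeviceSynchronize());
            printf("device sync after iteration %d\n", iter + 1);
            fflush(stdout);
        }
    }

    CHECK(cudaEventDestroy(done));
    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_data));
    return 0;
}
```

The printed line is just a convenient signal: wait until you see one before clicking Stop, and the report should contain everything up to that sync call. If the app does all of its work on a single stream, cudaStreamSynchronize(stream) in the same spot should be enough to flush the events for that stream.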
Also, in Nsight 5.0, I at least fixed the case where the app exits normally. Now, Nsight adds a hook to ensure all unflushed data gets flushed at exit. This does not help the case described here (clicking Stop while the app is still running), but it does mean CUDA apps no longer need to call cudaDeviceSynchronize or cudaDeviceReset at the end to ensure all the data gets flushed – now this happens automatically when tracing the app to completion.
I will reply to this post again when we release the version that forces flushing all records when Stop is clicked in the UI.