Why would code run 1.7x faster when run with nvprof than without?

This is quite a conundrum. I (obviously) can’t get profiling information in the slow case to see what the difference is…

You are sure it’s not the other way around, i.e. 1.7x slower with the profiler? Because that might make some sense.

Is this on a Windows or a Linux platform?

I am totally certain. This is on Ubuntu.

I know this is a tough question to answer without code, but I of course can’t just paste 5000 lines of code - I have no idea what could be causing this, so I wouldn’t even know where to look…

Try doing some limited manual instrumentation of the code on the host side, using high-resolution timers. The question you would want to answer is where the additional time goes in the slow case. The slowdown may have nothing to do with the GPU portion of the code.
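
For example, a minimal sketch of such host-side instrumentation using std::chrono (the section names below are just placeholders for whatever parts of your code you suspect):

#include <chrono>
#include <cstdio>

// Scope-based timer: prints the elapsed wall-clock time of a named section
// when it goes out of scope. Compare per-section totals with and without nvprof.
struct SectionTimer {
    using clock = std::chrono::high_resolution_clock;
    const char* name;
    clock::time_point start;
    explicit SectionTimer(const char* n) : name(n), start(clock::now()) {}
    ~SectionTimer() {
        double ms = std::chrono::duration<double, std::milli>(clock::now() - start).count();
        std::printf("[timing] %-24s %10.3f ms\n", name, ms);
    }
};

void timestep() {
    { SectionTimer t("host preprocessing");   /* ... CPU-side setup ... */ }
    { SectionTimer t("kernel launch + sync"); /* ... launches + cudaDeviceSynchronize() ... */ }
    { SectionTimer t("host postprocessing");  /* ... CPU-side output, I/O ... */ }
}

int main() { timestep(); return 0; }

Whichever per-section total grows in the slow case tells you whether the extra time is spent in host code or while waiting on the GPU.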

Also: I assume you have double checked that with or without profiling, the application consumes exactly the same input data, and produces exactly the same output data?

Good point, I wouldn't have thought to check the CPU time, though it should be minimal. If push comes to shove, I can just manually time the entire application.

And yes, I have checked that - all the way out to FP64 machine precision.

If anyone happens to encounter the same behavior, please comment, but I suspect it will be a while before I devote the time to thoroughly test this myself.

I don’t have a good hypothesis what could be going on relative to the GPU. It’s possible that use of the profiler could cause additional host/device synchronizations and that this might positively impact some hardware utilization imbalances, but at best I would expect this to be a minor effect on a Linux platform.

My best guess for now is that the time difference occurs in a non-GPU portion of the code. But it will be interesting to hear what you can track the additional time down to. A curious head-scratcher for sure!

This is a tough question to answer as you haven't provided any details. What are you measuring (kernel duration, process duration, etc.)? How are you measuring time? Is this GPU or CPU time? What are the reported times? What GPU is the code running on?

The profilers do change a few behaviors:

  • disable some power management when capturing PM counters
  • increase the GPU timer frequency from 1 MHz to 31.25 MHz
  • measure kernel execution time more precisely than is possible with CUDA events
  • increase CPU overhead
  • flush work to the GPU faster (this is a Windows-specific difference; it does not apply on Linux)

Hi Greg, this is the total time of the program from start to finish. My program outputs the current “program time” every 60 seconds, so this is the number I’m comparing. Running on a P100. But I think I know the source of the issue.

There is either something deeply and subtly wrong with my code, or with CUDA. I suspect that this is connected to another issue I have encountered in the past, and am encountering again now. I have a kernel which does some double-precision atomic addition and which, in a "normal" run, takes an average of 630 microseconds. It seems to spontaneously (details in a moment) jump up to an average of 145.03 ms, where the minimum time is still 620.19 microseconds but the maximum is 180.41 ms. The kernel reads some data from a large cudaMallocManaged array, squares and adds a couple of values, and atomically adds the result to a small (~1 KB) managed array.
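
In rough outline, the kernel and its allocations look something like this (names and sizes are illustrative only, not my actual code):

#include <cuda_runtime.h>

// Illustrative sketch: read from a large managed array, square and add a
// couple of values, and accumulate into a small managed array using
// double-precision atomicAdd (requires sm_60+, e.g. compile with -arch=sm_60).
__global__ void accumulate(const double* data, size_t n, double* bins, int nbins)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double x = data[i];
    double y = data[(i + 1) % n];
    atomicAdd(&bins[i % (size_t)nbins], x * x + y * y);
}

int main()
{
    const size_t n     = 1 << 26;  // large managed array: ~512 MB of doubles
    const int    nbins = 128;      // small managed accumulation array: ~1 KB
    double *data = nullptr, *bins = nullptr;
    cudaMallocManaged(&data, n * sizeof(double));
    cudaMallocManaged(&bins, nbins * sizeof(double));
    // ... initialize data and zero bins on the host ...
    accumulate<<<(unsigned)((n + 255) / 256), 256>>>(data, n, bins, nbins);
    cudaDeviceSynchronize();
    cudaFree(data);
    cudaFree(bins);
    return 0;
}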

This kernel, as well as its entire compilation unit, is always the same. I have four versions of similar programs which each run slightly different simulations, but all make use of the problematic kernel in exactly the same way. Occasionally one of these four versions exhibits the erratic slowness. These versions require different total amounts of memory (managed allocations via cudaMallocManaged), but the issue manifests at a certain problem size which is much smaller than the 16 GB available on the P100 (on the order of one GB total).

I thought I had resolved the issue when, instead of cudaMallocManaged'ing the array which is being atomically added to, I declared it as a statically allocated managed array (it's only ~1 KB in size), but the issue has evidently reappeared. What made it reappear was once again totally unrelated (at face value): all I did was add/remove some global __device__ __managed__ variables (as in, one or two double variables). Again, the slowness affects only one of the four code versions.
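
In code terms, the change I had made was roughly the following (again with illustrative names, not my actual declarations):

// Before: the small accumulation array came from cudaMallocManaged, e.g.
//   double* bins = nullptr;
//   cudaMallocManaged(&bins, 128 * sizeof(double));

// After: the same ~1 KB array as a statically declared managed variable,
// visible to both host and device without an explicit allocation.
__device__ __managed__ double bins[128];

__global__ void accumulate_static(const double* data, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[i % 128], data[i] * data[i]);
}

// Host code can read bins[] directly after cudaDeviceSynchronize(); no cudaMemcpy needed.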

I would have to imagine that this is the issue which nvprof is having an effect on as well, but that would of course require a manual check.

In sum: it seems to be one kernel doing double-precision atomics, and the factor-of-300 slowdown is very erratically sensitive to seemingly unrelated things, like the addition of a global __device__ __managed__ double variable, or the use of nvprof, etc.

This suggests to me that these things are not the root cause of the issue observed, but that they add enough noise to the execution environment that they may trigger it.

Based on the bits of information revealed so far I am unable to develop a clear mental model as to what the root cause could be, some sort of thrashing behavior maybe? Just as a sanity check: When you run the application under control of cuda-memcheck, are there any issues reported? Also, is the host code free of use of uninitialized data and out-of-bounds accesses (use valgrind or similar to check)?

One of my favorite war stories is how an application would occasionally misbehave, depending on what totally unrelated nonsense environment variables I would set, e.g. failed with FOO=BARBAR, but worked with FOO=FOOFOO. After a lengthy debugging process, it turned out there was an out-of-bounds access in the host code, which by pure chance would hit the process’s own environment, and the subtly incorrect information picked up from there would be passed down various software layers to the point where it caused trouble.

No errors with cuda-memcheck. I should check the host code as you advise - will valgrind work with CUDA code out-of-the-box?

One other thing, which I may have been negligent in not mentioning before: nvprof does report some amount of memory thrashing, 1 MB total consisting of 262 4 KB instances in the "normal" case, and 50 MB total consisting of 13,000 4 KB instances in the problematic case. I've disregarded this before simply out of ignorance; what is memory thrashing, and could it be relevant?

There’s a possibility of false positives for data that is transferred by DMA to the host memory. At least that used to be an issue when I last used valgrind on a CUDA-enabled app five years ago.

I have also seen reports in these forums of harmless "overreads" (read access out of bounds) in some of the libraries that ship with CUDA. I seem to recall CUFFT, CUBLAS, and NPP might have had such issues at one time or other; current status unknown.

This definitely looks worthy of a follow-up. Offhand I don’t know what the profiler may refer to as “memory thrashing”, have you checked the built-in help to see whether it gives a crisp definition?

Since you are using UM on Linux on a P100, UM demand paging is in effect, and "thrashing" may refer to excessive page faults. This is easy to see with the visual profiler.

If it were me I would experiment by taking managed memory out of the picture - convert to ordinary cudaMalloc and see if things stabilize. Alternatively, the cudaMemAdvise hint system may help stabilize the thrashing.
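
For example, something along these lines on the managed allocations (a sketch only, using the illustrative names from your snippet above; which advice actually helps depends on the access pattern):

#include <cuda_runtime.h>

// Sketch: unified memory hints intended to reduce page migration/thrashing.
// 'data' is the large, mostly-GPU-read managed array; 'bins' is the small
// managed array the kernel atomically updates.
void apply_um_hints(double* data, size_t n, double* bins, int nbins)
{
    int dev = 0;
    cudaGetDevice(&dev);

    // Prefer to keep the large input array resident on the GPU, and mark it
    // read-mostly so the driver can keep read-only copies rather than
    // migrating pages back and forth.
    cudaMemAdvise(data, n * sizeof(double), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemAdvise(data, n * sizeof(double), cudaMemAdviseSetReadMostly, dev);

    // Prefer the GPU for the small accumulation array as well, and advise the
    // runtime that the CPU will also access it.
    cudaMemAdvise(bins, nbins * sizeof(double), cudaMemAdviseSetPreferredLocation, dev);
    cudaMemAdvise(bins, nbins * sizeof(double), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
}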

Are you hammering on the data from both host and device code at the same time?

Should page faulting occur if the total memory footprint is far smaller than the RAM available on the GPU?

I am unfortunately running on a remote server and don't know how to get the visual profiler working, let alone working remotely (that is, I'm not sure I have the time to figure it out). I'm also wary that converting to cudaMalloc, as you suggest, could become very time consuming (I'm a scientist, not a computer scientist; I don't mean to be difficult, I just have to manage my efforts appropriately).

I don’t believe I ever simultaneously access data from both host code and device code. I certainly have never intended to.

Valgrind results momentarily…

Valgrind returned 35 errors of “possible” leaks, mostly saying

==56560== 136 bytes in 1 blocks are possibly lost in loss record 97 of 144
==56560==    at 0x4C2F988: calloc (vg_replace_malloc.c:711)
==56560==    by 0xE0341F2: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.387.26)
==56560==    by 0xE034849: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.387.26)

etc., or referring to things like cudart::globalState::loadDriverInternal(). The summary:

==56560== LEAK SUMMARY:
==56560==    definitely lost: 0 bytes in 0 blocks
==56560==    indirectly lost: 0 bytes in 0 blocks
==56560==      possibly lost: 18,960 bytes in 104 blocks
==56560==    still reachable: 476,827 bytes in 475 blocks
==56560==         suppressed: 0 bytes in 0 blocks

I see no references to locations in my code in the output. Nor does this mean anything to me :)

Page faulting will occur(*) anytime the host requests data resident on the device, or the device requests data resident on the host. If host and device are not simultaneously hammering on the same page, I don’t expect thrashing to occur. But nevertheless it is (or may be) occurring.

I can’t conclusively tell you exactly what is happening in this case.

(*) The exception would be the case where the UM runtime system has decided to convert the allocation to a pinned allocation, due to observation of previous thrashing. It might be that this has happened in the stable cases. That is just conjecture.
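
One way to take the demand-faulting heuristics out of the equation entirely is to prefetch the managed allocations explicitly; a sketch (again using the illustrative names from earlier in the thread):

#include <cuda_runtime.h>

// Sketch: explicitly migrate managed pages so neither side has to demand-fault.
void run_step_with_prefetch(double* data, size_t n, double* bins, int nbins)
{
    int dev = 0;
    cudaGetDevice(&dev);

    // Move both arrays to the GPU before launching the kernel.
    cudaMemPrefetchAsync(data, n * sizeof(double), dev, 0);
    cudaMemPrefetchAsync(bins, nbins * sizeof(double), dev, 0);

    // ... launch the kernel that atomically updates bins ...

    // Bring the small result array back before the host reads it.
    cudaMemPrefetchAsync(bins, nbins * sizeof(double), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();   // bins is now resident in host memory
}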

The thrashing seems to be disconnected, maybe, or at least not the definite issue. The array which is being atomically added to was previously declared just as a managed double array; making it a __device__ __managed__ double array resolves the speed issue without affecting the discrepancy in thrashing counts. However, running without nvprof still incurs the slowdown. So bizarre…

If it makes a difference, I also observe that the problem cases can slow down in the middle of execution, meaning the first half of the runtime of my program proceeds at "normal" speed only to subsequently slow down (cf. the huge discrepancy between the minimum and maximum durations of the problem kernel).

If the valgrind output snippets you posted are representative of the total output, I’d say there is nothing to worry about, especially since we are not looking for (potential) memory leaks right now. Even then I wouldn’t jump into action until it says non-zero bytes have been “definitely lost”.

Use of cuda-memcheck and valgrind is a good sanity check when “strange” things are going on in an application for which there is no ready explanation.