Performance problem with different Visual Studio/CUDA versions

Dear All

  Suppose the following scenarios for the runtime of a program (GeForce 740M, dedicated to processing, base frequency 810 MHz, max 890 MHz):

Visual Studio 2012, CUDA 6.5: 72 milliseconds runtime
Visual Studio 2013, CUDA 7.5: 78 milliseconds runtime

Driver Version 359
Same settings in both compilers.

Both scenarios were run on the same computer.

My suspicion:


Perhaps in the first case it is running at 890 MHz and in the second case at the base frequency of 810 MHz. For some reason, in the second case it is not switching to the higher frequency.


I am not doing anything to configure the clock frequencies.

Can someone tell me how to solve the problem if my suspicion turns out to be right, or suggest another explanation?

Thanks

Luis Gonçalves


Welcome to the brave new world of dynamic device clocking. 78 usec vs 72 usec is a noticeable, but not massive difference in performance. Since different toolchains are being used, the most straightforward working hypothesis is that there are differences in the generated machine code. Have you examined it with cuobjdump --dump-sass? You can use nvidia-smi to examine whether different clock boosting is responsible for the performance difference. Probably less likely, but a valid hypothesis.
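
If it helps, the clock information nvidia-smi reports can also be queried programmatically through the NVML library (nvml.h, linking against nvml.lib on Windows or libnvidia-ml on Linux). A minimal sketch along those lines, with the caveat that some GeForce mobile parts report these queries as not supported:

#include <cstdio>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int cur = 0, max = 0;

    if (nvmlInit() != NVML_SUCCESS) {
        printf("nvmlInit failed\n");
        return 1;
    }
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        /* current and maximum SM (shader) clock in MHz; on a 740M one would
           expect roughly 810 MHz base and up to ~890 MHz when boosted */
        if (nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &cur) == NVML_SUCCESS &&
            nvmlDeviceGetMaxClockInfo(dev, NVML_CLOCK_SM, &max) == NVML_SUCCESS) {
            printf("SM clock: %u MHz (max %u MHz)\n", cur, max);
        } else {
            printf("SM clock query not supported on this GPU/driver\n");
        }
    }
    nvmlShutdown();
    return 0;
}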

Is the comparison using the physically identical GeForce 740M in the same physically identical system, or are we talking about two physically separate GeForce 740M cards here, or one GPU placed in two different systems? Physically separate GPUs of the same type could show different clock boosting behavior even under identical environmental conditions, either due to manufacturing variations or VBIOS versions. A specific given GPU may show different clock boosting behavior under different environmental conditions (e.g. ambient air temperature, air pressure, humidity, which affect cooling and thus device temperature) or across driver changes. I am not sure whether clock boosting could be affected by the electrical supply delivered by a system’s PSU, but I think it is at least possible in theory.

It is also possible that device temperature and power consumption are affected by different machine code being generated, which can in turn impact clock boosting. Research on the interaction between code execution and power draw and temperature (including effects like local voltage droop and thermal hot spots) and how this can be addressed by compiler code generation is still in its infancy. To my knowledge, today’s compilers usually only include simple heuristics to choose an energetically more efficient instruction when multiple equivalent operations are possible (e.g. shift versus add, multiply versus shift-add). Clearly such static analysis doesn’t cover the complex interactions of a code stream with all relevant hardware components and thus electrical and thermal characteristics.
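
As a concrete (made-up) illustration of that last point, here is a pair of trivially equivalent kernels one could build with each toolchain and then disassemble with cuobjdump --dump-sass to see which instructions each compiler actually emits; the file and kernel names are just placeholders:

// scale.cu -- e.g.: nvcc -arch=sm_35 -cubin scale.cu
//             then: cuobjdump --dump-sass scale.cubin
__global__ void scale_mul(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2;   // compiler may pick a multiply, add or shift
}

__global__ void scale_shl(int *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] << 1;  // explicitly written as a shift
}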

The scenarios are identical except for the differences I mentioned (same dedicated GPU, which is not used for graphics display, and it sits in an ASUS K56C laptop). The runtimes are in milliseconds, not microseconds.

I found several copies of nvidia-smi.exe on my laptop and chose the newest one. I ran an intensive computation on the GPU for 10 minutes and the maximum temperature was 73 degrees Celsius.

I think that within 78 ms the GPU does not reach a temperature that would cause the clock to be lowered.

“nvidia-smi.exe -q” did not give clock information.

I set

Nvidia Control Panel → 3D-Settings → Power Management → Maximum Performance

10 min runtime - 76 degrees Celsius max

Visual Studio 2012, CUDA 6.5: 72 milliseconds runtime
Visual Studio 2013, CUDA 7.5: 78 milliseconds runtime

It seems that the short programs were already running at high performance. Note that those runtimes are measured after several program iterations, when the kernels are already in GPU memory.

Make sure you use the nvidia-smi from CUDA 7.5, it is weird that nvidia-smi -q is not showing clock information. There should be a “Clocks” and a “Max Clocks” section. As far as I know, that information should be available with all GPUs. A temperature of 76 degrees Celsius seems very reasonable. Note that clock throttling in response to overheating (usually around 95 degrees Celsius) is different from a lack of clock boosting.
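
As an alternative when nvidia-smi will not show the clocks, the same NVML calls as in the earlier sketch can be polled while the GPU is under load, together with the temperature, to separate “never boosting” from “throttling because of heat”. A rough sketch; the burn kernel, its launch configuration and the iteration counts are arbitrary and may need tuning to stay under the Windows watchdog timeout:

#include <cstdio>
#include <cuda_runtime.h>
#include <nvml.h>

/* arbitrary busy-work kernel, only there to keep the SMs loaded */
__global__ void burn(float *out, int iters)
{
    float x = threadIdx.x * 0.001f + 1.0f;
    for (int i = 0; i < iters; i++) x = x * 1.000001f + 0.000001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

int main(void)
{
    float *d_out;
    cudaMalloc(&d_out, 128 * 1024 * sizeof(float));

    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int rep = 0; rep < 50; rep++) {
        burn<<<128, 1024>>>(d_out, 1000000);
        cudaDeviceSynchronize();

        unsigned int mhz = 0, tempC = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &mhz);   /* may be unsupported */
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
        /* boosted clock at moderate temperature = normal boosting;
           clock dropping as temperature rises = thermal throttling */
        printf("rep %3d: SM clock %u MHz, temperature %u C\n", rep, mhz, tempC);
    }

    cudaFree(d_out);
    nvmlShutdown();
    return 0;
}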

Based on what is known so far, it would seem the most likely cause of the performance difference is different code generated by the two tool chains. If this is code you build yourself, you may want to inspect whether there are salient differences in the statistics produced with -Xptxas -v. Also, you may want to compare the SASS (machine code) by disassembling the kernel with cuobjdump --dump-sass. Running both versions with the CUDA profiler may point to the approximate reason for the performance difference (e.g. higher number of replays, higher cache miss ratio).
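
On the measurement side, it may also be worth confirming that the 72/78 ms figures cover only the kernels and not host-side overhead. A minimal sketch of cudaEvent-based timing with warm-up launches; yourKernel and its launch configuration are placeholders for the actual code:

#include <cstdio>
#include <cuda_runtime.h>

// placeholder for the actual kernel under investigation
__global__ void yourKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // warm-up launches so one-time costs (module loading, clock ramp-up)
    // do not end up in the measurement
    for (int i = 0; i < 10; i++)
        yourKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // time a batch of launches with events and report the average
    const int reps = 100;
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; i++)
        yourKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel time: %.3f ms\n", ms / reps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}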

I think that, at least on Windows, nvidia-smi.exe comes from the driver, and its version is 359.00. It is not from CUDA, even though the CUDA package also installs a driver, and it continues to be updated independently of CUDA.

Thanks for the clues.

Yes, nvidia-smi comes with the driver. My fairly recent version (on Windows 7) identifies itself as follows:

NVIDIA-SMI 354.42 Driver Version: 354.42

As I said, I am surprised that when you run nvidia-smi -q there are no sections “Clocks” and “Max Clocks”. Seems very odd.

Perhaps it is because I have Windows 10 instead of Windows 7 and therefore a different driver version, and because of that I also have some limitations on monitoring with nvidia-smi.

Also, I saw somewhere that the base frequency is 810 MHz (also shown below) and the boost frequency is 890 MHz.

See output of “nvidia-smi.exe -q”

[url]http://luisgo.dyndns.org/nv/nvidia-smi.txt[/url]

You can see that the Performance State is P8 (perhaps an idle state). When I run the 10-minute program it shows P0, which I suppose is the highest performance state.

See below the System Information on the Nvidia Control Panel

[Screenshots: NVIDIA Control Panel → System Information]

As far as I understand, power management and clock boosting are not the same thing: GPUs must enter the P0 (“maximum 3D/compute power”) state before clock boosting can kick in. It used to be the case that a GPU enters P0 as soon as a CUDA context is established; I am not sure whether that is still the case.

In any event, there should not be much delay between your CUDA app starting to run and the card going into the P0 state. Below P0 there are various lower-power states, but I am not very familiar with their structure. P8 is definitely one of the lower-power states, maybe the 2D display-only state.
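
For what it is worth, the P-state nvidia-smi prints can also be read with a couple of NVML calls, which makes it easy to check which state the card is in while your application runs (again, I am not certain every NVML query is supported on a GeForce 740M):

#include <cstdio>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlPstates_t pstate;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetPerformanceState(dev, &pstate) == NVML_SUCCESS) {
        /* P0 is the highest-performance state; larger numbers are
           progressively lower-power states (P8 is one of the idle states) */
        printf("current performance state: P%d\n", (int)pstate);
    }
    nvmlShutdown();
    return 0;
}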

Based on the discussion so far I think it is a pretty safe bet that the observed performance differences have nothing to do with differences in clock boosting, but are driven by code generation differences. I would consider performance regressions > 5% in application-critical kernels actionable, that is, worth filing a bug report for. So I would focus on demonstrating that machine code changes are indeed at the root of the observed behavior.