Titan XP (Pascal) warms up for double precision despite the clock frequency set
Hello,


I have observed behaviour that I am not able to explain or avoid.
Since I am trying to measure performance correctly, I wonder if there are any settings or techniques that would eliminate this phenomenon.

For double-precision computations with floating-point divisions, the time of subsequent kernel runs gradually decreases during the first 300-400 ms. At the beginning the kernel takes about 3.7 ms, at the end about 3.2 ms. After this warm-up the performance stays close to the maximum.

For the same kernels in single precision, and for similar double-precision kernels but without division, the performance is at its maximum from the beginning.

I measure each kernel separately using the C++ std::chrono library.
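
Roughly, a single measurement looks like the sketch below (the kernel and names are illustrative placeholders, not my actual code; the essential part is synchronizing the device before reading the clock):

#include <chrono>
#include <cuda_runtime.h>

__global__ void dummyKernel (double * data, unsigned n)        // illustrative placeholder kernel
{
    unsigned i = blockDim.x * blockIdx.x + threadIdx.x ;
    if (i < n) data [i] = data [i] / 3.0 ;
}

double timeOneLaunchMs (double * dData, unsigned n)
{
    cudaDeviceSynchronize () ;                                  // make sure previous work has finished
    auto t0 = std::chrono::steady_clock::now () ;
    dummyKernel <<< (n + 63) / 64, 64 >>> (dData, n) ;
    cudaDeviceSynchronize () ;                                  // wait for this kernel to finish
    auto t1 = std::chrono::steady_clock::now () ;
    return std::chrono::duration<double, std::milli> (t1 - t0).count () ;
}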

All double-precision kernels are compute bound on the Titan Xp; the single-precision kernels are bandwidth bound.

This phenomenon occurs only on the Titan Xp.
Previously I ran the same code on a GTX Titan with a Kepler GPU, and the kernel times were almost constant after fixing the GPU clocks.

I am using the following settings for Titan XP:

nvidia-smi -pm 1
nvidia-smi --power-limit=300
nvidia-smi --application-clocks=5705,1911

nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=100"

# 0 - adaptive
# 1 - prefer maximum performance
# 2 - auto
nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"


Operating system: Ubuntu 16.04.3, CUDA 9.0 with driver 384.81


I would be grateful for any advice.

#1
Posted 11/14/2017 01:27 PM   
Since GPUs don't have any division instructions, divisions are just regular code composed of multiplies, adds, and assorted other instructions; some are inlined (like 32-bit integer division) but most are called as subroutines. So that would appear to be a bit of a red herring.

However, the use of divisions typically increases register pressure and dynamic instruction count, and may therefore decrease occupancy and make an application more compute bound.
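
As an aside, whether division-related register pressure is an issue can be checked by compiling with verbose PTXAS output, for example with something like the following (file name and architecture flag are placeholders):

nvcc -Xptxas -v -arch=sm_61 my_kernel.cu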

Kernel execution time decreasing over the millisecond time frame you describe is typically a consequence of the GPU's control software gradually boosting clocks to the maximum possible. However, you are specifying application clocks, so to my understanding that should hold the clocks constant.

Have you verified, with nvidia-smi, that this is actually the case? I wonder whether non-Tesla cards in the Pascal family may simply ignore application clock settings and always use auto boost to maintain maximum power efficiency. Application clocks were introduced for Tesla cards so all GPUs in a cluster can be operated at the same clock; I was under the impression they are not supported for consumer cards. I have no idea what the "power mizer" settings do (e.g. what is the difference between 'auto' and 'adaptive'?).
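
For instance, something along the lines of the following, run in a second terminal while the kernels are executing, should show whether the clocks move around (the one-second loop interval is arbitrary):

nvidia-smi -q -d CLOCK -l 1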

Some "warmup effect" will exist on any complex processor and memory subsystem, but the first few calls to a kernel should train up caches, TLBs, branch predictors, etc., and lead to steady-state performance. It is always a good idea to wait until the code is in steady state before measuring the performance. For example, some benchmarks run ten times, and report only the fastest time.
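
As a sketch of that approach (launchKernelOnce here is a hypothetical stand-in for one timed launch, not an existing function):

#include <algorithm>
#include <cstdio>

double launchKernelOnce () ;     // assumed to time one kernel launch and return milliseconds

// Run the kernel several times and report only the fastest instance,
// so that warm-up iterations do not distort the result.
double bestOfN (int n = 10)
{
    double best = 1e30 ;
    for (int i = 0 ; i < n ; ++i)
        best = std::min (best, launchKernelOnce ()) ;
    printf ("fastest of %d runs: %.3f ms\n", n, best) ;
    return best ;
}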

Note that due to throttling (most frequently thermal throttling), a more common problem when measuring performance of GPUs is that the performance drops after the first few minutes. You appear to counteract that by running the fan at 100% and dialing in the highest supported power limit.

#2
Posted 11/14/2017 03:33 PM   
@njuffa: Thank you very much for many interesting hints. Today I have checked some of them, though only with partial success.


> Have you verified, with nvidia-smi, that this is actually the case?

Yes, but at first I only checked the GPU clocks just after running the script from the first post and got something like this (all clocks at their proper values):

nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Wed Nov 15 10:24:05 2017
Driver Version : 384.81

Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : TITAN Xp
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0323217042377
GPU UUID : GPU-03f651d1-dc62-d37e-6a48-f9c6714f2cf5
Minor Number : 0
VBIOS Version : 86.02.3D.00.01
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : 900-1G611-2530-000
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0210DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x11DF10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 6000 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 100 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 12188 MiB
Used : 460 MiB
Free : 11728 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 1 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 47 C
GPU Shutdown Temp : 99 C
GPU Slowdown Temp : 96 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 90.59 W
Power Limit : 300.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5702 MHz
Video : 1708 MHz
Applications Clocks
Graphics : 1911 MHz
Memory : 5705 MHz
Default Applications Clocks
Graphics : 1404 MHz
Memory : 5705 MHz
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5705 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1269
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 458 MiB

nvidia-settings -q GPUCurrentPerfLevel -q GPUAdaptiveClockState -q GPUCurrentClockFreqs -q GPUCurrentClockFreqsString -q GPUPerfModes -q GPUCoreTemp -q GPUPowerMizerMode -q GPUPowerMizerDefaultMode -q GPUPerfModes

Attribute 'GPUCurrentPerfLevel' (tad:0.0): 3.
'GPUCurrentPerfLevel' is an integer attribute.
'GPUCurrentPerfLevel' is a read-only attribute.
'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.

Attribute 'GPUAdaptiveClockState' (tad:0.0): 1.
'GPUAdaptiveClockState' is a boolean attribute; valid values are: 1 (on/true) and 0 (off/false).
'GPUAdaptiveClockState' is a read-only attribute.
'GPUAdaptiveClockState' can use the following target types: X Screen, GPU.

Attribute 'GPUCurrentClockFreqs' (tad:0.0): 1911,5702.
'GPUCurrentClockFreqs' is a packed integer attribute.
'GPUCurrentClockFreqs' is a read-only attribute.
'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.

Attribute 'GPUCurrentClockFreqsString' (tad:0.0): nvclock=1911, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=5702, memclockmin=5705, memclockmax=5705, memclockeditable=1, memTransferRate=11404, memTransferRatemin=11410, memTransferRatemax=11410, memTransferRateeditable=1

Attribute 'GPUPerfModes' (tad:0.0): perf=0, nvclock=139, nvclockmin=139, nvclockmax=607, nvclockeditable=1, memclock=405, memclockmin=405, memclockmax=405, memclockeditable=1, memTransferRate=810, memTransferRatemin=810, memTransferRatemax=810, memTransferRateeditable=1 ; perf=1, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=810, memclockmin=810, memclockmax=810, memclockeditable=1,
memTransferRate=1620, memTransferRatemin=1620, memTransferRatemax=1620, memTransferRateeditable=1 ; perf=2, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=5505, memclockmin=5505, memclockmax=5505, memclockeditable=1, memTransferRate=11010, memTransferRatemin=11010, memTransferRatemax=11010, memTransferRateeditable=1 ; perf=3, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1,
memclock=5705, memclockmin=5705, memclockmax=5705, memclockeditable=1, memTransferRate=11410, memTransferRatemin=11410, memTransferRatemax=11410, memTransferRateeditable=1

Attribute 'GPUCoreTemp' (tad:0.0): 47.
'GPUCoreTemp' is an integer attribute.
'GPUCoreTemp' is a read-only attribute.
'GPUCoreTemp' can use the following target types: X Screen, GPU.



Attribute 'GPUPerfModes' (tad:0.0): perf=0, nvclock=139, nvclockmin=139, nvclockmax=607, nvclockeditable=1, memclock=405, memclockmin=405, memclockmax=405, memclockeditable=1, memTransferRate=810, memTransferRatemin=810, memTransferRatemax=810, memTransferRateeditable=1 ; perf=1, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=810, memclockmin=810, memclockmax=810, memclockeditable=1,
memTransferRate=1620, memTransferRatemin=1620, memTransferRatemax=1620, memTransferRateeditable=1 ; perf=2, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=5505, memclockmin=5505, memclockmax=5505, memclockeditable=1, memTransferRate=11010, memTransferRatemin=11010, memTransferRatemax=11010, memTransferRateeditable=1 ; perf=3, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1,
memclock=5705, memclockmin=5705, memclockmax=5705, memclockeditable=1, memTransferRate=11410, memTransferRatemin=11410, memTransferRatemax=11410, memTransferRateeditable=1


Today I did a more thorough test and found some strange behaviour, though unfortunately it does not seem to be the cause.

First, I have observed that during heavy computations the GPU *lowers* its "Performance State" from P0 (at idle) to P2 (under heavy load) and decreases the memory clock from 5705 MHz to 5508 MHz. The graphics and video clocks stay the same.
However, this change of clock frequency is probably not the reason for the gradual performance increase, because
the clock frequency drop occurs before the first kernel run. I have checked that using the NVML library:

void printClocks (nvmlDevice_t device, const nvmlClockId_t clockID)
{
    unsigned int clockMHz ;

    NVML_CHECK (nvmlDeviceGetClock (device, NVML_CLOCK_SM, clockID, &clockMHz))
    logger << "SM = " << clockMHz << " " ;

    NVML_CHECK (nvmlDeviceGetClock (device, NVML_CLOCK_MEM, clockID, &clockMHz))
    logger << "Mem = " << clockMHz << " " ;

    NVML_CHECK (nvmlDeviceGetClock (device, NVML_CLOCK_GRAPHICS, clockID, &clockMHz))
    logger << "Graphics = " << clockMHz << " " ;
}

void showParams()
{
    nvmlDevice_t device ;

    NVML_CHECK (nvmlDeviceGetHandleByIndex (0, &device)) ;

    logger << "Clocks current: " ;
    printClocks (device, NVML_CLOCK_ID_CURRENT) ;
    logger << " ; target : " ;
    printClocks (device, NVML_CLOCK_ID_APP_CLOCK_TARGET) ;
    logger << " ; default: " ;
    printClocks (device, NVML_CLOCK_ID_APP_CLOCK_DEFAULT) ;

    logger << "\n" ;
}


and calling showParams() before each kernel run gives

Clocks current: SM = 1911   Mem = 5508   Graphics = 1911    ; target : SM = 1911   Mem = 5705   Graphics = 1911    ; default: SM = 1404   Mem = 5705   Graphics = 1404   
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
...



Second, I was not able to change the memory frequency. I tried both

nvidia-smi --application-clocks=810,139


and NVML

NVML_CHECK( nvmlDeviceSetApplicationsClocks (device, 810, 139) ) ;


This seems consistent with your impression that "application clocks are not supported for consumer cards". However, setting the graphics clock works as expected; only the memory clock remains constant at 5508 MHz.

Thus, since all clocks seem to remain the same for all kernel runs, I still do not understand why the performance of my kernels increases up to about the 120th launch (the third column contains the kernel time in ns).

fastKernel : 0 : 3.328e+06
fastKernel : 1 : 3.321e+06
fastKernel : 2 : 3.314e+06
fastKernel : 3 : 3.311e+06
fastKernel : 4 : 3.311e+06
fastKernel : 5 : 3.311e+06
fastKernel : 6 : 3.307e+06
fastKernel : 7 : 3.307e+06
fastKernel : 8 : 3.312e+06
fastKernel : 9 : 3.308e+06
fastKernel : 10 : 3.313e+06
fastKernel : 11 : 3.304e+06
fastKernel : 12 : 3.305e+06
fastKernel : 13 : 3.306e+06
fastKernel : 14 : 3.301e+06
fastKernel : 15 : 3.303e+06
fastKernel : 16 : 3.3e+06
fastKernel : 17 : 3.301e+06
fastKernel : 18 : 3.297e+06
fastKernel : 19 : 3.3e+06
fastKernel : 20 : 3.295e+06
fastKernel : 21 : 3.294e+06
fastKernel : 22 : 3.294e+06
fastKernel : 23 : 3.295e+06
fastKernel : 24 : 3.289e+06
fastKernel : 25 : 3.289e+06
fastKernel : 26 : 3.286e+06
fastKernel : 27 : 3.284e+06
fastKernel : 28 : 3.29e+06
fastKernel : 29 : 3.285e+06
fastKernel : 30 : 3.283e+06
fastKernel : 31 : 3.282e+06
fastKernel : 32 : 3.283e+06
fastKernel : 33 : 3.284e+06
fastKernel : 34 : 3.279e+06
fastKernel : 35 : 3.281e+06
fastKernel : 36 : 3.282e+06
fastKernel : 37 : 3.276e+06
fastKernel : 38 : 3.278e+06
fastKernel : 39 : 3.274e+06
fastKernel : 40 : 3.273e+06
fastKernel : 41 : 3.275e+06
fastKernel : 42 : 3.275e+06
fastKernel : 43 : 3.269e+06
fastKernel : 44 : 3.274e+06
fastKernel : 45 : 3.27e+06
fastKernel : 46 : 3.268e+06
fastKernel : 47 : 3.269e+06
fastKernel : 48 : 3.265e+06
fastKernel : 49 : 3.263e+06
fastKernel : 50 : 3.263e+06
fastKernel : 51 : 3.663e+06
fastKernel : 52 : 3.258e+06
fastKernel : 53 : 3.258e+06
fastKernel : 54 : 3.257e+06
fastKernel : 55 : 3.258e+06
fastKernel : 56 : 3.251e+06
fastKernel : 57 : 3.251e+06
fastKernel : 58 : 3.251e+06
fastKernel : 59 : 3.248e+06
fastKernel : 60 : 3.248e+06
fastKernel : 61 : 3.243e+06
fastKernel : 62 : 3.239e+06
fastKernel : 63 : 3.239e+06
fastKernel : 64 : 3.237e+06
fastKernel : 65 : 3.242e+06
fastKernel : 66 : 3.233e+06
fastKernel : 67 : 3.235e+06
fastKernel : 68 : 3.235e+06
fastKernel : 69 : 3.231e+06
fastKernel : 70 : 3.23e+06
fastKernel : 71 : 3.232e+06
fastKernel : 72 : 3.227e+06
fastKernel : 73 : 3.223e+06
fastKernel : 74 : 3.227e+06
fastKernel : 75 : 3.224e+06
fastKernel : 76 : 3.225e+06
fastKernel : 77 : 3.219e+06
fastKernel : 78 : 3.216e+06
fastKernel : 79 : 3.216e+06
fastKernel : 80 : 3.217e+06
fastKernel : 81 : 3.212e+06
fastKernel : 82 : 3.213e+06
fastKernel : 83 : 3.213e+06
fastKernel : 84 : 3.212e+06
fastKernel : 85 : 3.21e+06
fastKernel : 86 : 3.211e+06
fastKernel : 87 : 3.205e+06
fastKernel : 88 : 3.205e+06
fastKernel : 89 : 3.205e+06
fastKernel : 90 : 3.202e+06
fastKernel : 91 : 3.204e+06
fastKernel : 92 : 3.198e+06
fastKernel : 93 : 3.202e+06
fastKernel : 94 : 3.196e+06
fastKernel : 95 : 3.196e+06
fastKernel : 96 : 3.197e+06
fastKernel : 97 : 3.588e+06
fastKernel : 98 : 3.229e+06
fastKernel : 99 : 3.194e+06
fastKernel : 100 : 3.191e+06
fastKernel : 101 : 3.188e+06
fastKernel : 102 : 3.19e+06
fastKernel : 103 : 3.191e+06
fastKernel : 104 : 3.193e+06
fastKernel : 105 : 3.194e+06
fastKernel : 106 : 3.194e+06
fastKernel : 107 : 3.194e+06
fastKernel : 108 : 3.185e+06
fastKernel : 109 : 3.191e+06
fastKernel : 110 : 3.183e+06
fastKernel : 111 : 3.455e+06
fastKernel : 112 : 3.221e+06
fastKernel : 113 : 3.182e+06
fastKernel : 114 : 3.183e+06
fastKernel : 115 : 3.183e+06
fastKernel : 116 : 3.178e+06
fastKernel : 117 : 3.182e+06
fastKernel : 118 : 3.18e+06
fastKernel : 119 : 3.179e+06
fastKernel : 120 : 3.179e+06
fastKernel : 121 : 3.179e+06
fastKernel : 122 : 3.179e+06
fastKernel : 123 : 3.178e+06
fastKernel : 124 : 3.177e+06
fastKernel : 125 : 3.178e+06
fastKernel : 126 : 3.178e+06
fastKernel : 127 : 3.174e+06
fastKernel : 128 : 3.174e+06
fastKernel : 129 : 3.176e+06
fastKernel : 130 : 3.176e+06
fastKernel : 131 : 3.174e+06
fastKernel : 132 : 3.175e+06
fastKernel : 133 : 3.178e+06
fastKernel : 134 : 3.176e+06
fastKernel : 135 : 3.173e+06
fastKernel : 136 : 3.174e+06
fastKernel : 137 : 3.179e+06
fastKernel : 138 : 3.174e+06
fastKernel : 139 : 3.172e+06
fastKernel : 140 : 3.172e+06
fastKernel : 141 : 3.174e+06
fastKernel : 142 : 3.173e+06
fastKernel : 143 : 3.177e+06
fastKernel : 144 : 3.176e+06
fastKernel : 145 : 3.172e+06
fastKernel : 146 : 3.173e+06
fastKernel : 147 : 3.175e+06
fastKernel : 148 : 3.177e+06
fastKernel : 149 : 3.176e+06
fastKernel : 150 : 3.21e+06
fastKernel : 151 : 3.572e+06
fastKernel : 152 : 3.21e+06
fastKernel : 153 : 3.178e+06
fastKernel : 154 : 3.173e+06
fastKernel : 155 : 3.177e+06
fastKernel : 156 : 3.174e+06
fastKernel : 157 : 3.175e+06
fastKernel : 158 : 3.173e+06
fastKernel : 159 : 3.173e+06
fastKernel : 160 : 3.175e+06
fastKernel : 161 : 3.175e+06
fastKernel : 162 : 3.174e+06
fastKernel : 163 : 3.173e+06
fastKernel : 164 : 3.173e+06
fastKernel : 165 : 3.175e+06
fastKernel : 166 : 3.173e+06
fastKernel : 167 : 3.176e+06
fastKernel : 168 : 3.175e+06
fastKernel : 169 : 3.173e+06
fastKernel : 170 : 3.172e+06
fastKernel : 171 : 3.175e+06
fastKernel : 172 : 3.173e+06
fastKernel : 173 : 3.172e+06
fastKernel : 174 : 3.176e+06
fastKernel : 175 : 3.174e+06
fastKernel : 176 : 3.173e+06
fastKernel : 177 : 3.174e+06
fastKernel : 178 : 3.172e+06
fastKernel : 179 : 3.173e+06
fastKernel : 180 : 3.172e+06
fastKernel : 181 : 3.175e+06
fastKernel : 182 : 3.171e+06
fastKernel : 183 : 3.176e+06
fastKernel : 184 : 3.174e+06
fastKernel : 185 : 3.172e+06
fastKernel : 186 : 3.173e+06
fastKernel : 187 : 3.177e+06
fastKernel : 188 : 3.171e+06
fastKernel : 189 : 3.176e+06
fastKernel : 190 : 3.175e+06
fastKernel : 191 : 3.174e+06
fastKernel : 192 : 3.173e+06
fastKernel : 193 : 3.172e+06
fastKernel : 194 : 3.173e+06
fastKernel : 195 : 3.176e+06
fastKernel : 196 : 3.177e+06
fastKernel : 197 : 3.177e+06
fastKernel : 198 : 3.174e+06
fastKernel : 199 : 3.173e+06


Unless my clock measurements with NVML are wrong, the clocks seem to stay the same for each kernel run. The performance change must then be caused by other factors.


> I have no idea what the "power mizer" settings do (e.g. what is the difference between 'auto' and 'adaptive'?).

In fact, I have not analysed this in detail. I only observed that in the 'auto' and 'adaptive' modes the GPU clocks slow down at idle. Setting 'prefer maximum performance' keeps the clocks high regardless of GPU utilisation.


> Some "warmup effect" will exist on any complex processor

Yes, thank you, I am aware of that. However, on CPUs usually only the first few runs are affected by this phenomenon.
In my case the performance keeps changing for the first 100-150 kernel runs, so some other factor is probably at play.


> You appear to counteract that by running the fan at 100% and dialing in the highest supported power limit.

Yes, the clock frequencies seem to remain stable in the long term during my performance measurements.
However, I previously missed that during kernel runs the memory clock stays at a different value than requested.


Once again, thank you for the advice :)

#3
Posted 11/15/2017 03:48 PM   
I don't have a ready explanation for the minimal speedup (4.9% between the slowest and fastest instances, from what I can see) over the first 120 kernel invocations. I wouldn't consider the observed differences practically relevant.

(1) Are all those kernel instances doing the exact same amount of processing (same amount of data, same sequence of operation)?

(2) Does the code contain data-dependent branches that may impact branch efficiency (mispredicts, thread divergence)?

(3) Does the code contain calls to standard math functions (many of which contain data dependent branches internally)?

For the exact same processing on the exact same data, we would expect to reach steady-state performance after half a dozen invocations or so. If the data differs between invocations, all bets are off and one would have to look at the details of how this potentially impacts code performance. Maybe your code contains some data-dependent control flow, and as you iterate over the data, the data itself converges to a steady state?

#4
Posted 11/15/2017 05:00 PM   
The L2 cache is not flushed between kernel calls, so there could be some caching influence, although I acknowledge it does not fully describe the behavior.
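
If one wanted to rule out that effect, a possible (untested here) approach is to overwrite a scratch buffer larger than the Titan Xp's roughly 3 MB L2 cache between timed launches, for example:

#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: stream through a device buffer much larger than the L2 cache
// so that data cached by the previous kernel launch is evicted before the next timing.
void flushL2 (void * dScratch, size_t bytes)     // e.g. a 64 MiB cudaMalloc'ed buffer
{
    cudaMemset (dScratch, 0, bytes) ;
    cudaDeviceSynchronize () ;                   // finish the flush before the next timed launch
}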

#5
Posted 11/15/2017 06:31 PM   
But reaching steady state on any sort of cache shouldn't take 120 iterations, provided every kernel instance does pretty much the same kind of processing.

Generally speaking, CPUs have more complex control structures than GPUs, which take longer to train and reach a steady state, but even there half a dozen repetitions usually get you to steady state (I have read that AI-type learning is to replace simple state machines in CPUs, in which case I would expect it to take longer to reach steady state; but that approach is not in any shipping parts, as best I know).

#6
Posted 11/15/2017 06:39 PM   
Thank you once again for all hints.

It is possible that this behaviour is caused by some of the points below:

> (1) Are all those kernel instances doing the exact same amount of processing (same amount of data, same sequence of operation)?
>
> (2) Does the code contain data-dependent branches that may impact branch efficiency (mispredicts, thread divergence)?
>
> (3) Does the code contain calls to standard math functions (many of which contain data dependent branches internally)?

In fact, I have not looked into this code for months and completely forgot how to run the performance tests properly.
I apologize for the confusion: I assumed that after the hardware change (I have had the Titan Xp for only a few days) the results were different than expected,
and that nothing else had changed since the last tests, but this may not be true.

I will check this next week and report the results.

Once again, I apologise and thank you for your effort.


PS

I agree that in practice a few percent difference in performance is not especially important, but I am trying to generate results for a scientific paper (available at https://arxiv.org/abs/1611.02445) and was rather expecting results similar to those shown in Fig. 13.

#7
Posted 11/16/2017 08:26 AM   
Well, you can have the same kind of graph, but the little plus-signs depicting data points are going to be distributed somewhat differently with the Titan Xp :-)

I wonder whether the GDDR5X memory on the Titan Xp might have something to do with your observations. Almost all other GPUs use GDDR5 (without the 'X').

I am not a hardware guy, but I seem to recall that with these modern memories the memory controller needs to tune its receivers (reception of a high frequency signal!), and might have to retune them on occasion if signal quality declines due to changing temperatures etc. No data can be received during the re-tuning process. This is a very speculative thought: less frequent re-tuning might be necessary after the program has been hammering the GPU memory for a little while.

#8
Posted 11/16/2017 08:40 AM   
Thank you for your help; problem detected and solved.

The reason is as suggested by njuffa: as I iterate over the data, the solution converges to a steady state.
The problem can be simply avoided by starting the computations from previously computed data.
When I use the data produced by the kernels without divisions as input to the kernels with divisions, the performance stays almost constant.

Unfortunately, saving and restoring the data was time consuming and was removed at some point, which caused the behaviour reported in the first post :(
Once again, sorry for the confusion.


Some details:

The division procedure runs *slower* when a large number of divisions have a dividend of 0.0 (previously I drew the wrong conclusion about this), because at the beginning of the division procedure nvvp reports 6% inactive threads (lines 2-28). When the majority of divisions involve values different from 0.0, no inactive threads are reported and the kernel runs faster.

__cuda_sm20_div_f64_slowpath_v2:
{ LOP32I.AND R47, R43, 0x40000000;
PBK `(.L_135); }
ISETP.LT.U32.AND P0, PT, R47, c[0x2][0x38], PT;
MOV32I R66, 0x1ff00000;
MOV R56, RZ;
SEL R57, R66, c[0x2][0x3c], !P0;
DMUL R64, R42, R56;
{ LOP32I.AND R47, R44, 0x7f800000;
MUFU.RCP64H R58, R65; }
ISETP.LT.U32.AND P0, PT, R47, c[0x2][0x40], PT;
LOP.XOR R45, R45, R44;
MOV R59, R58;
MOV32I R58, 0x1;
LOP.XOR R44, R45, R44;
DFMA R60, R64, -R58, c[0x2][0x0];
SEL R62, R66, c[0x2][0x3c], !P0;
LOP.XOR R45, R45, R44;
MOV R63, R62;
DFMA R60, R60, R60, R60;
MOV R62, RZ;
DMUL R62, R44, R62;
DFMA R58, R58, R60, R58;
DMUL R60, R62, R58;
DFMA R62, R64, -R60, R62;
DFMA R58, R58, R62, R60;
DSETP.LEU.AND P0, PT, |R58|, RZ, PT;
@P0 BRA `(.L_136);
ISETP.GT.U32.AND P0, PT, R47, c[0x2][0x44], PT;
DMUL R62, R56, R58;
SEL R60, R66, c[0x2][0x3c], !P0;
MOV R61, R60;
MOV R60, RZ;
DMUL R58, R58, R60;
DMUL R60, R60, R62;
DMUL R62, R56, R58;
DFMA R58, R42.reuse, R60, -R44.reuse;
DFMA R56, R42, R62, -R44;
DSETP.GT.AND P0, PT, |R58|, |R56|, PT;
SEL R47, R63, R61, P0;
FSETP.GTU.AND P1, PT, |R47|, 1.469367938527859385e-39, PT;
SEL R56, R62, R60, P0;
{ MOV R57, R47;
@P1 BRK; }
FSETP.LT.AND P0, PT, |R45|, 1.5046327690525280102e-36, PT;
MOV32I R58, 0x3ff00000;
LOP32I.AND R62, R56, 0xfffffffe;
SEL R58, R58, c[0x2][0x48], !P0;
MOV R59, R58;
MOV R58, RZ;
LOP32I.OR R60, R56, 0x1;
MOV R56, R62;
MOV R57, R47.reuse;
DMUL R42, R42, R58.reuse;
DMUL R44, R44, R58;
MOV R61, R47;
DFMA R58, R56, R42.reuse, -R44.reuse;
DFMA R56, R60, R42, -R44;
DSETP.GT.AND P0, PT, |R58|, |R56|, PT;
SEL R58, R60, R62, P0;
LOP32I.AND R56, R58, 0x1;
IADD32I R61.CC, R58, 0x1;
ISETP.EQ.U32.AND P0, PT, R56, 0x1, PT;
IADD.X R60, RZ, R47;
IADD32I R56.CC, R58, -0x1;
IADD32I.X R57, R47.reuse, -0x1;
SEL R61, R58, R61, !P0;
SEL R60, R47, R60, !P0;
SEL R47, R57, R47, !P0;
SEL R58, R56, R58, !P0;
MOV R56, R61;
MOV R57, R60;
MOV R59, R47;
DFMA R56, R42.reuse, R56, -R44.reuse;
DFMA R42, R42, R58, -R44;
DSETP.GT.AND P0, PT, |R56|, |R42|, PT;
SEL R56, R58, R61, P0;
{ SEL R57, R47, R60, P0;
BRK; }
.L_136:
DSETP.EQ.AND P0, PT, R58, RZ, PT;
@P0 BRA `(.L_137);
{ MOV R56, RZ;
MUFU.RCP64H R57, R43; }
DSETP.GT.AND P1, PT, |R56|, RZ, PT;
@!P1 DSETP.NEU.AND P0, PT, |R42|, +INF , PT;
@!P1 SEL R42, R42, R56, P0;
@!P1 SEL R47, R43, R57, P0;
@!P1 MOV R56, R42;
@!P1 MOV R57, R47;
DMUL R56, R44, R56;
BRK;
.L_137:
DMUL R56, R44, R42;
BRK;
.L_135:
MOV R42, R56;
{ MOV R43, R57;
RET; }
.L_138:
BRA `(.L_138);
NOP;
NOP;
NOP;


I have not analysed this code in detail; I only observed that it is different from the code generated by CUDA 7.5 for a Kepler device.

#9
Posted 11/22/2017 01:19 PM   
It seems this has been root caused to data-dependent code path selection inside the double-precision division subroutine. Not much you can do about that other than try to avoid any division that may be unnecessary.

Given that on the GPU, all divisions are implemented by software, it is conceivable that implementation details vary between CUDA versions, or between different architectures (compute capabilities). Divisions with a dividend of zero may fall into the slow path (special case handling). Whether that is avoidable, I could not say offhand; it has been almost a decade since I last looked at the details of CUDA's double-precision division.
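
If zero dividends dominate and the special cases (sign of zero, division by zero, NaN propagation) do not matter for the application, one could hypothetically guard the division in the kernel itself; whether this helps depends on entire warps hitting the zero case. A sketch:

__global__ void divKernGuarded (const double * in, double * out, double divisor, unsigned n)
{
    unsigned i = blockDim.x * blockIdx.x + threadIdx.x ;
    if (i < n)
    {
        double x = in [i] ;
        // Hypothetical guard: bypass the division subroutine for zero dividends.
        // NOTE: this changes behaviour for divisor == 0.0, NaNs, and the sign of zero,
        // and only pays off if whole warps take the zero branch.
        out [i] = (x == 0.0) ? 0.0 : x / divisor ;
    }
}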

You may want to consider filing an enhancement request with NVIDIA to improve the division performance for the case of zero dividend. This may or may not be technically feasible, but it would not hurt to file such a request.

#10
Posted 11/22/2017 05:30 PM   
[quote=""]It seems this has been root caused to data-dependent code path selection inside the double-precision division subroutine. Not much you can do about that other than try to avoid any division that may be unnecessary. [/quote] Yes, thank you, since I focus on memory bandwidth optimisations, the computational performance is not so important for me. I only had to know, why the performance of some of my kernels increases at the beginning. [quote=""]Given that on the GPU, all divisions are implemented by software, it is conceivable that implementation details vary between CUDA versions, or between different architectures (compute capabilities). Divisions with a dividend of zero may fall into the slow path (special case handling). Whether that is avoidable, I could not say offhand; it has been almost a decade since I last looked at the details of CUDA's double-precision division. [/quote] Yes, I am aware that this is a feature, not a bug :) [quote=""]You may want to consider filing an enhancement request with NVIDIA to improve the division performance for the case of zero dividend. This may or may not be technically feasible, but it would not hurt to file such a request.[/quote] I will try to do this next week. As for now, I have prepared a short code that illustrates this behaviour: [code] #include <iostream> #include <sstream> #include <cuda_runtime.h> using namespace std ; void cudaCheck( cudaError_t cudaCode, std::string file, size_t line ) { if ( cudaSuccess != (cudaCode) ) { std::stringstream sstr ; sstr << "Error at " << file << ":" << line << " : " << cudaGetErrorString(cudaCode) << "\n" ; cout << sstr.str() ; throw sstr.str() ; } } #define CUDA_CHECK(code) cudaCheck( (code), __FILE__, __LINE__ ) ; typedef double DTYPE ; __global__ void divKern (DTYPE * in, DTYPE * out, DTYPE divisor, unsigned N) { int i = blockDim.x * blockIdx.x + threadIdx.x; if (i < N) { out [i] = in [i] / divisor; } } void perfTest (DTYPE dividend) { cout << "\n\nPerformance measurement for dividend = " << dividend << "\n\n" ; constexpr unsigned N_ELEM = 10000000 ; constexpr unsigned SIZE = N_ELEM * sizeof (DTYPE) ; DTYPE * hIn, * hOut ; hIn = (double*) malloc (SIZE) ; hOut = (double*) malloc (SIZE) ; for (unsigned i=0 ; i < N_ELEM ; i++) { hIn [i] = dividend ; } DTYPE * dIn, * dOut ; CUDA_CHECK (cudaMalloc (&dIn , SIZE)) ; CUDA_CHECK (cudaMalloc (&dOut, SIZE)) ; CUDA_CHECK (cudaMemcpy (dIn, hIn, SIZE, cudaMemcpyHostToDevice)) ; constexpr unsigned threadsPerBlock = 64 ; constexpr unsigned blocksPerGrid = (N_ELEM + threadsPerBlock - 1) / threadsPerBlock ; for (unsigned t=0 ; t < 20 ; t++) { cudaEvent_t start, stop; float time; CUDA_CHECK (cudaEventCreate (&start)) ; CUDA_CHECK (cudaEventCreate (&stop)) ; CUDA_CHECK (cudaEventRecord (start, 0)) ; divKern <<< blocksPerGrid, threadsPerBlock >>> (dIn, dOut, 1.0, N_ELEM) ; CUDA_CHECK (cudaEventRecord (stop, 0)) ; CUDA_CHECK (cudaEventSynchronize (stop)) ; CUDA_CHECK (cudaEventElapsedTime (&time, start, stop)) ; CUDA_CHECK (cudaEventDestroy (start)) ; CUDA_CHECK (cudaEventDestroy (stop)) ; CUDA_CHECK (cudaPeekAtLastError()) ; CUDA_CHECK (cudaDeviceSynchronize()) ; cout << t << " Kernel time : " << 1000 * time << " us \n" ; } CUDA_CHECK (cudaMemcpy(hOut, dOut, SIZE, cudaMemcpyDeviceToHost)) ; for (unsigned i=0 ; i < 10 ; i++) { cout << "hOut [" << i << "] = " << hOut [i] << "\n" ; } CUDA_CHECK (cudaFree (dIn)) ; CUDA_CHECK (cudaFree (dOut)) ; free (hIn) ; free (hOut) ; } int main (int argc, char ** argv) { cout << "Division performance test.\n" ; CUDA_CHECK 
(cudaSetDevice(0)) ; int i = -1 ; CUDA_CHECK (cudaGetDevice(&i)) ; cout << "Using CUDA device #" << i << "\n" ; perfTest (0.01) ; perfTest (0) ; CUDA_CHECK (cudaDeviceSynchronize()) ; cudaDeviceReset() ; return 0 ; } [/code] Results show, that division performance is halved for Titan XP GPU, when dividend is 0: [code] nvcc -std=c++11 -Xcompiler -ggdb -lineinfo --generate-code arch=compute_61,code=sm_61 -O3 --fmad=true --use_fast_math -lcuda divTest.cu -o divTest Division performance test. Using CUDA device #0 Performance measurement for dividend = 0.01 0 Kernel time : 386.208 us 1 Kernel time : 382.528 us 2 Kernel time : 380.928 us 3 Kernel time : 379.744 us 4 Kernel time : 380.928 us 5 Kernel time : 379.904 us 6 Kernel time : 380.544 us 7 Kernel time : 379.904 us 8 Kernel time : 379.904 us 9 Kernel time : 382.592 us 10 Kernel time : 384 us 11 Kernel time : 380.928 us 12 Kernel time : 379.584 us 13 Kernel time : 379.904 us 14 Kernel time : 379.488 us 15 Kernel time : 377.856 us 16 Kernel time : 377.376 us 17 Kernel time : 376.832 us 18 Kernel time : 377.536 us 19 Kernel time : 376.832 us hOut [0] = 0.01 hOut [1] = 0.01 hOut [2] = 0.01 hOut [3] = 0.01 hOut [4] = 0.01 hOut [5] = 0.01 hOut [6] = 0.01 hOut [7] = 0.01 hOut [8] = 0.01 hOut [9] = 0.01 Performance measurement for dividend = 0 0 Kernel time : 795.488 us 1 Kernel time : 756.736 us 2 Kernel time : 756.736 us 3 Kernel time : 758.752 us 4 Kernel time : 756.704 us 5 Kernel time : 755.712 us 6 Kernel time : 755.392 us 7 Kernel time : 755.456 us 8 Kernel time : 755.936 us 9 Kernel time : 755.712 us 10 Kernel time : 755.712 us 11 Kernel time : 755.648 us 12 Kernel time : 755.712 us 13 Kernel time : 755.712 us 14 Kernel time : 753.664 us 15 Kernel time : 752.64 us 16 Kernel time : 756.288 us 17 Kernel time : 756.32 us 18 Kernel time : 755.712 us 19 Kernel time : 754.688 us hOut [0] = 0 hOut [1] = 0 hOut [2] = 0 hOut [3] = 0 hOut [4] = 0 hOut [5] = 0 hOut [6] = 0 hOut [7] = 0 hOut [8] = 0 hOut [9] = 0 [/code]
[quote=""]It seems this has been root caused to data-dependent code path selection inside the double-precision division subroutine. Not much you can do about that other than try to avoid any division that may be unnecessary.[/quote]

Yes, thank you. Since I focus on memory bandwidth optimisations, the computational performance is not so important for me; I only needed to know why the performance of some of my kernels increases at the beginning.

[quote=""]Given that on the GPU, all divisions are implemented by software, it is conceivable that implementation details vary between CUDA versions, or between different architectures (compute capabilities). Divisions with a dividend of zero may fall into the slow path (special case handling). Whether that is avoidable, I could not say offhand; it has been almost a decade since I last looked at the details of CUDA's double-precision division.[/quote]

Yes, I am aware that this is a feature, not a bug :)

[quote=""]You may want to consider filing an enhancement request with NVIDIA to improve the division performance for the case of zero dividend. This may or may not be technically feasible, but it would not hurt to file such a request.[/quote]


I will try to do this next week.
For now, I have prepared a short program that illustrates this behaviour:

#include <iostream>
#include <sstream>
#include <string>
#include <cuda_runtime.h>

using namespace std ;


// Abort with a readable message on any CUDA runtime error.
void cudaCheck( cudaError_t cudaCode, std::string file, size_t line )
{
    if ( cudaSuccess != cudaCode )
    {
        std::stringstream sstr ;
        sstr << "Error at " << file << ":" << line
             << " : " << cudaGetErrorString(cudaCode) << "\n" ;
        cout << sstr.str() ;
        throw sstr.str() ;
    }
}

#define CUDA_CHECK(code) cudaCheck( (code), __FILE__, __LINE__ )


typedef double DTYPE ;


// Each thread divides one input element by the (uniform) divisor.
__global__ void divKern (DTYPE * in, DTYPE * out, DTYPE divisor, unsigned N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < N)
    {
        out [i] = in [i] / divisor;
    }
}


void perfTest (DTYPE dividend)
{
    cout << "\n\nPerformance measurement for dividend = " << dividend << "\n\n" ;

    constexpr unsigned N_ELEM = 10000000 ;
    constexpr unsigned SIZE = N_ELEM * sizeof (DTYPE) ;

    DTYPE * hIn, * hOut ;
    hIn  = (DTYPE*) malloc (SIZE) ;
    hOut = (DTYPE*) malloc (SIZE) ;

    // Fill the input with the dividend under test.
    for (unsigned i = 0 ; i < N_ELEM ; i++)
    {
        hIn [i] = dividend ;
    }

    DTYPE * dIn, * dOut ;
    CUDA_CHECK (cudaMalloc (&dIn , SIZE)) ;
    CUDA_CHECK (cudaMalloc (&dOut, SIZE)) ;

    CUDA_CHECK (cudaMemcpy (dIn, hIn, SIZE, cudaMemcpyHostToDevice)) ;

    constexpr unsigned threadsPerBlock = 64 ;
    constexpr unsigned blocksPerGrid =
        (N_ELEM + threadsPerBlock - 1) / threadsPerBlock ;

    // Time each launch individually with CUDA events.
    for (unsigned t = 0 ; t < 20 ; t++)
    {
        cudaEvent_t start, stop;
        float time;
        CUDA_CHECK (cudaEventCreate (&start)) ;
        CUDA_CHECK (cudaEventCreate (&stop)) ;
        CUDA_CHECK (cudaEventRecord (start, 0)) ;

        divKern <<< blocksPerGrid, threadsPerBlock >>> (dIn, dOut, 1.0, N_ELEM) ;

        CUDA_CHECK (cudaEventRecord (stop, 0)) ;
        CUDA_CHECK (cudaEventSynchronize (stop)) ;
        CUDA_CHECK (cudaEventElapsedTime (&time, start, stop)) ;
        CUDA_CHECK (cudaEventDestroy (start)) ;
        CUDA_CHECK (cudaEventDestroy (stop)) ;

        CUDA_CHECK (cudaPeekAtLastError()) ;
        CUDA_CHECK (cudaDeviceSynchronize()) ;

        cout << t << " Kernel time : " << 1000 * time << " us \n" ;
    }

    CUDA_CHECK (cudaMemcpy (hOut, dOut, SIZE, cudaMemcpyDeviceToHost)) ;

    // Print a few results as a sanity check.
    for (unsigned i = 0 ; i < 10 ; i++)
    {
        cout << "hOut [" << i << "] = " << hOut [i] << "\n" ;
    }

    CUDA_CHECK (cudaFree (dIn)) ;
    CUDA_CHECK (cudaFree (dOut)) ;

    free (hIn) ;
    free (hOut) ;
}


int main (int argc, char ** argv)
{
    cout << "Division performance test.\n" ;

    CUDA_CHECK (cudaSetDevice(0)) ;
    int i = -1 ;
    CUDA_CHECK (cudaGetDevice(&i)) ;
    cout << "Using CUDA device #" << i << "\n" ;

    perfTest (0.01) ;
    perfTest (0) ;

    CUDA_CHECK (cudaDeviceSynchronize()) ;
    cudaDeviceReset() ;

    return 0 ;
}


The results show that on the Titan XP, division throughput roughly halves when the dividend is 0: about 380 us versus about 755 us per launch of 10,000,000 divisions (roughly 26 versus 13 billion divisions per second):

nvcc -std=c++11 -Xcompiler -ggdb -lineinfo --generate-code arch=compute_61,code=sm_61 -O3 --fmad=true --use_fast_math -lcuda  divTest.cu  -o divTest
Division performance test.
Using CUDA device #0


Performance measurement for dividend = 0.01

0 Kernel time : 386.208 us
1 Kernel time : 382.528 us
2 Kernel time : 380.928 us
3 Kernel time : 379.744 us
4 Kernel time : 380.928 us
5 Kernel time : 379.904 us
6 Kernel time : 380.544 us
7 Kernel time : 379.904 us
8 Kernel time : 379.904 us
9 Kernel time : 382.592 us
10 Kernel time : 384 us
11 Kernel time : 380.928 us
12 Kernel time : 379.584 us
13 Kernel time : 379.904 us
14 Kernel time : 379.488 us
15 Kernel time : 377.856 us
16 Kernel time : 377.376 us
17 Kernel time : 376.832 us
18 Kernel time : 377.536 us
19 Kernel time : 376.832 us
hOut [0] = 0.01
hOut [1] = 0.01
hOut [2] = 0.01
hOut [3] = 0.01
hOut [4] = 0.01
hOut [5] = 0.01
hOut [6] = 0.01
hOut [7] = 0.01
hOut [8] = 0.01
hOut [9] = 0.01


Performance measurement for dividend = 0

0 Kernel time : 795.488 us
1 Kernel time : 756.736 us
2 Kernel time : 756.736 us
3 Kernel time : 758.752 us
4 Kernel time : 756.704 us
5 Kernel time : 755.712 us
6 Kernel time : 755.392 us
7 Kernel time : 755.456 us
8 Kernel time : 755.936 us
9 Kernel time : 755.712 us
10 Kernel time : 755.712 us
11 Kernel time : 755.648 us
12 Kernel time : 755.712 us
13 Kernel time : 755.712 us
14 Kernel time : 753.664 us
15 Kernel time : 752.64 us
16 Kernel time : 756.288 us
17 Kernel time : 756.32 us
18 Kernel time : 755.712 us
19 Kernel time : 754.688 us
hOut [0] = 0
hOut [1] = 0
hOut [2] = 0
hOut [3] = 0
hOut [4] = 0
hOut [5] = 0
hOut [6] = 0
hOut [7] = 0
hOut [8] = 0
hOut [9] = 0

#11
Posted 11/23/2017 10:48 AM   
The data from your test app definitely suggest the case of a zero dividend is going down the slow path. I'd say, file the enhancement request and see what NVIDIA comes back with.
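In the meantime, if exact-zero dividends are common in the real data, a source-level guard might keep them out of the division subroutine altogether. Untested sketch only: it ignores the sign of zero results, assumes a finite, nonzero divisor, and whether it actually helps depends on how zero and nonzero dividends are mixed within a warp:

[code]
// Hypothetical guard: bypass the division subroutine's slow path when the
// dividend is exactly zero. Ignores signed zeros; assumes a finite, nonzero divisor.
__global__ void guardedDivKern (const double * in, double * out, double divisor, unsigned N)
{
    unsigned i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
    {
        double x = in [i];
        out [i] = (x == 0.0) ? 0.0 : x / divisor;
    }
}
[/code]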

#12
Posted 11/23/2017 06:11 PM   