Titan XP (Pascal) warms up for double precision despite the clock frequency set
Hello,


I have observed behaviour that I am not able to explain or avoid.
Since I am trying to measure performance correctly, I wonder whether there is any setting or technique that would eliminate this phenomenon.

For double precision computations with floating-point divisions, the time of subsequent kernel runs gradually decreases during the first 300-400 ms. At the beginning a kernel takes about 3.7 ms, at the end about 3.2 ms. After this warm-up the performance stays close to the maximum.

For the same kernels in single precision, and for similar double precision kernels without division, the performance is at its maximum from the beginning.

I measure each kernel separately using the C++ std::chrono library.
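
Roughly, each measurement is taken like this (a simplified sketch; the kernel body and launch configuration here are placeholders, not my actual code):

#include <chrono>
#include <cuda_runtime.h>

// Placeholder body standing in for my real double precision kernel with divisions.
__global__ void fastKernel (double *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x ;
    if (i < n) data[i] = 1.0 / (data[i] + 1.0) ;
}

// One timed launch: synchronise, launch, synchronise, read std::chrono.
double timeKernelNs (double *d_data, int n)
{
    cudaDeviceSynchronize () ;                          // make sure previous work is finished
    auto t0 = std::chrono::steady_clock::now () ;

    fastKernel<<<(n + 255) / 256, 256>>> (d_data, n) ;  // illustrative launch configuration
    cudaDeviceSynchronize () ;                          // wait for this kernel to complete

    auto t1 = std::chrono::steady_clock::now () ;
    return std::chrono::duration_cast<std::chrono::nanoseconds> (t1 - t0).count () ;
}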

All double precision kernels are compute bound on the Titan Xp; the single precision kernels are bandwidth bound.

This phenomenon occurs only on the Titan Xp.
Previously I ran the same code on a GTX Titan (Kepler), and kernel times were almost constant once fixed GPU clocks were set.

I am using the following settings for the Titan Xp:

nvidia-smi -pm 1
nvidia-smi --power-limit=300
nvidia-smi --application-clocks=5705,1911

nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=100"

# 0 - adaptive
# 1 - prefer maximum performance
# 2 - auto
nvidia-settings -a "[gpu:0]/GpuPowerMizerMode=1"


Operating system: Ubuntu 16.04.3, CUDA 9.0 with driver 384.81


I would be grateful for any advice.

#1
Posted 11/14/2017 01:27 PM   
Since GPUs don't have any division instructions, divisions are just regular code composed of multiplies, adds, and assorted other instructions; some are inlined (like 32-bit integer division), but most are called subroutines. So that would appear to be a bit of a red herring.
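
If you want to see what your divisions actually compile to, you can disassemble the machine code. A minimal probe (the file name and build commands below are just one way to do it) would be:

// divprobe.cu -- what do FP64 and 32-bit integer divisions turn into on sm_61 (Pascal)?
// Build and inspect, for example:
//   nvcc -arch=sm_61 -cubin divprobe.cu -o divprobe.cubin
//   cuobjdump -sass divprobe.cubin
__global__ void divprobe (double *d, int *i)
{
    d[0] = d[1] / d[2] ;   // double precision division: expanded by the compiler
    i[0] = i[1] / i[2] ;   // 32-bit integer division: also emulated, typically inlined
}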

However, the use of divisions typically increases register pressure and dynamic instruction count, and may therefore decrease occupancy and make an application more compute bound.

Kernel execution time decreasing over the millisecond time frame you describe is typically a consequence of the GPU's control software gradually boosting clocks to the maximum possible. However, you are specifying application clocks, which to my understanding should hold the clocks constant.

Have you verified, with nvidia-smi, that this is actually the case? I wonder whether non-Tesla cards in the Pascal family may simply ignore application clock settings and always use auto boost to maintain maximum power efficiency. Application clocks were introduced for Tesla cards so all GPUs in a cluster can be operated at the same clock; I was under the impression they are not supported for consumer cards. I have no idea what the "power mizer" settings do (e.g. what is the difference between 'auto' and 'adaptive'?).

Some "warmup effect" will exist on any complex processor and memory subsyetm, but the first few calls to a kernel should train up caches, TLBs, branch predictors, etc, and lead to steady-state performance. It is always a good idea to wait until the code is in steady state before measuring the performance. For example, some benchmarks run ten times, and report only the fastest time.

Note that due to throttling (most frequently thermal throttling), a more common problem when measuring performance of GPUs is that the performance drops after the first few minutes. You appear to counteract that by running the fan at 100% and dialing in the highest supported power limit.

#2
Posted 11/14/2017 03:33 PM   
@njuffa: Thank you very much for the many interesting hints. Today I checked some of them, though only with partial success.


> Have you verified, with nvidia-smi, that this is actually the case?

Yes, but at first I only checked the GPU clocks right after running the script from the first post and got something like this (all clocks at the expected values):

nvidia-smi -q

==============NVSMI LOG==============

Timestamp : Wed Nov 15 10:24:05 2017
Driver Version : 384.81

Attached GPUs : 1
GPU 00000000:02:00.0
Product Name : TITAN Xp
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0323217042377
GPU UUID : GPU-03f651d1-dc62-d37e-6a48-f9c6714f2cf5
Minor Number : 0
VBIOS Version : 86.02.3D.00.01
MultiGPU Board : No
Board ID : 0x200
GPU Part Number : 900-1G611-2530-000
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0210DE
Bus Id : 00000000:02:00.0
Sub System Id : 0x11DF10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 6000 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 100 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
FB Memory Usage
Total : 12188 MiB
Used : 460 MiB
Free : 11728 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 1 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 47 C
GPU Shutdown Temp : 99 C
GPU Slowdown Temp : 96 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 90.59 W
Power Limit : 300.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 300.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5702 MHz
Video : 1708 MHz
Applications Clocks
Graphics : 1911 MHz
Memory : 5705 MHz
Default Applications Clocks
Graphics : 1404 MHz
Memory : 5705 MHz
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5705 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1269
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 458 MiB

nvidia-settings -q GPUCurrentPerfLevel -q GPUAdaptiveClockState -q GPUCurrentClockFreqs -q GPUCurrentClockFreqsString -q GPUPerfModes -q GPUCoreTemp -q GPUPowerMizerMode -q GPUPowerMizerDefaultMode -q GPUPerfModes

Attribute 'GPUCurrentPerfLevel' (tad:0.0): 3.
'GPUCurrentPerfLevel' is an integer attribute.
'GPUCurrentPerfLevel' is a read-only attribute.
'GPUCurrentPerfLevel' can use the following target types: X Screen, GPU.

Attribute 'GPUAdaptiveClockState' (tad:0.0): 1.
'GPUAdaptiveClockState' is a boolean attribute; valid values are: 1 (on/true) and 0 (off/false).
'GPUAdaptiveClockState' is a read-only attribute.
'GPUAdaptiveClockState' can use the following target types: X Screen, GPU.

Attribute 'GPUCurrentClockFreqs' (tad:0.0): 1911,5702.
'GPUCurrentClockFreqs' is a packed integer attribute.
'GPUCurrentClockFreqs' is a read-only attribute.
'GPUCurrentClockFreqs' can use the following target types: X Screen, GPU.

Attribute 'GPUCurrentClockFreqsString' (tad:0.0): nvclock=1911, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=5702, memclockmin=5705, memclockmax=5705, memclockeditable=1, memTransferRate=11404, memTransferRatemin=11410, memTransferRatemax=11410, memTransferRateeditable=1

Attribute 'GPUPerfModes' (tad:0.0): perf=0, nvclock=139, nvclockmin=139, nvclockmax=607, nvclockeditable=1, memclock=405, memclockmin=405, memclockmax=405, memclockeditable=1, memTransferRate=810, memTransferRatemin=810, memTransferRatemax=810, memTransferRateeditable=1 ; perf=1, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=810, memclockmin=810, memclockmax=810, memclockeditable=1,
memTransferRate=1620, memTransferRatemin=1620, memTransferRatemax=1620, memTransferRateeditable=1 ; perf=2, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=5505, memclockmin=5505, memclockmax=5505, memclockeditable=1, memTransferRate=11010, memTransferRatemin=11010, memTransferRatemax=11010, memTransferRateeditable=1 ; perf=3, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1,
memclock=5705, memclockmin=5705, memclockmax=5705, memclockeditable=1, memTransferRate=11410, memTransferRatemin=11410, memTransferRatemax=11410, memTransferRateeditable=1

Attribute 'GPUCoreTemp' (tad:0.0): 47.
'GPUCoreTemp' is an integer attribute.
'GPUCoreTemp' is a read-only attribute.
'GPUCoreTemp' can use the following target types: X Screen, GPU.



Attribute 'GPUPerfModes' (tad:0.0): perf=0, nvclock=139, nvclockmin=139, nvclockmax=607, nvclockeditable=1, memclock=405, memclockmin=405, memclockmax=405, memclockeditable=1, memTransferRate=810, memTransferRatemin=810, memTransferRatemax=810, memTransferRateeditable=1 ; perf=1, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=810, memclockmin=810, memclockmax=810, memclockeditable=1,
memTransferRate=1620, memTransferRatemin=1620, memTransferRatemax=1620, memTransferRateeditable=1 ; perf=2, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1, memclock=5505, memclockmin=5505, memclockmax=5505, memclockeditable=1, memTransferRate=11010, memTransferRatemin=11010, memTransferRatemax=11010, memTransferRateeditable=1 ; perf=3, nvclock=139, nvclockmin=139, nvclockmax=1911, nvclockeditable=1,
memclock=5705, memclockmin=5705, memclockmax=5705, memclockeditable=1, memTransferRate=11410, memTransferRatemin=11410, memTransferRatemax=11410, memTransferRateeditable=1


Today I did a more thorough test and found some strange behaviour, though unfortunately it does not seem to be the cause.

First, I have observed that during heavy computations the GPU actually decreases the "Performance State" from P0 (at idle) to P2 (under heavy load) and lowers the memory clock from 5705 MHz to 5508 MHz. The graphics and video clocks stay the same.
However, this change of clock frequency is probably not the reason for the gradual performance increase, because
the memory clock drop occurs before the first kernel run. I have checked that using the NVML library:

void printClocks (nvmlDevice_t device, const nvmlClockId_t clockID)
{
    unsigned int clockMHz ;

    NVML_CHECK (nvmlDeviceGetClock (device, NVML_CLOCK_SM, clockID, &clockMHz))
    logger << "SM = " << clockMHz << " " ;

    NVML_CHECK (nvmlDeviceGetClock (device, NVML_CLOCK_MEM, clockID, &clockMHz))
    logger << "Mem = " << clockMHz << " " ;

    NVML_CHECK (nvmlDeviceGetClock (device, NVML_CLOCK_GRAPHICS, clockID, &clockMHz))
    logger << "Graphics = " << clockMHz << " " ;
}

void showParams()
{
    nvmlDevice_t device ;

    NVML_CHECK (nvmlDeviceGetHandleByIndex (0, &device)) ;

    logger << "Clocks current: " ;
    printClocks (device, NVML_CLOCK_ID_CURRENT) ;
    logger << " ; target : " ;
    printClocks (device, NVML_CLOCK_ID_APP_CLOCK_TARGET) ;
    logger << " ; default: " ;
    printClocks (device, NVML_CLOCK_ID_APP_CLOCK_DEFAULT) ;

    logger << "\n" ;
}


and calling showParams() before each kernel run gives

Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
Clocks current: SM = 1911 Mem = 5508 Graphics = 1911 ; target : SM = 1911 Mem = 5705 Graphics = 1911 ; default: SM = 1404 Mem = 5705 Graphics = 1404
...
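
(NVML_CHECK in the snippet above is just a thin error-checking wrapper and logger is an output stream; a rough sketch of the surrounding setup, with simplified stand-ins rather than my exact definitions, looks like this:)

#include <cstdlib>
#include <iostream>
#include <nvml.h>          // link with -lnvidia-ml

// Simplified stand-ins for the helpers used above (not my exact definitions).
#define NVML_CHECK(call)                                                         \
    { nvmlReturn_t st = (call) ;                                                 \
      if (st != NVML_SUCCESS) {                                                  \
          std::cerr << "NVML error: " << nvmlErrorString (st) << "\n" ;          \
          std::exit (EXIT_FAILURE) ;                                             \
      } }

std::ostream &logger = std::cout ;

int main ()
{
    NVML_CHECK (nvmlInit ()) ;       // NVML must be initialised before any query
    // ... allocate data, then call showParams() before each kernel launch ...
    NVML_CHECK (nvmlShutdown ()) ;
    return 0 ;
}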



Second, I was not able to change the memory frequency. I tried both

nvidia-smi --application-clocks=810,139


and NVML

NVML_CHECK( nvmlDeviceSetApplicationsClocks (device, 810, 139) ) ;


This seems consistent with your impression that "application clocks are not supported for consumer cards". However, setting the graphics clock works as expected; only the memory clock remains constant at 5508 MHz.

Thus, since all clocks seem to remain the same for all kernel runs, I still do not understand why the performance of my kernels increases up to about the 120th launch (the third column contains the kernel time in ns).

fastKernel : 0 : 3.328e+06
fastKernel : 1 : 3.321e+06
fastKernel : 2 : 3.314e+06
fastKernel : 3 : 3.311e+06
fastKernel : 4 : 3.311e+06
fastKernel : 5 : 3.311e+06
fastKernel : 6 : 3.307e+06
fastKernel : 7 : 3.307e+06
fastKernel : 8 : 3.312e+06
fastKernel : 9 : 3.308e+06
fastKernel : 10 : 3.313e+06
fastKernel : 11 : 3.304e+06
fastKernel : 12 : 3.305e+06
fastKernel : 13 : 3.306e+06
fastKernel : 14 : 3.301e+06
fastKernel : 15 : 3.303e+06
fastKernel : 16 : 3.3e+06
fastKernel : 17 : 3.301e+06
fastKernel : 18 : 3.297e+06
fastKernel : 19 : 3.3e+06
fastKernel : 20 : 3.295e+06
fastKernel : 21 : 3.294e+06
fastKernel : 22 : 3.294e+06
fastKernel : 23 : 3.295e+06
fastKernel : 24 : 3.289e+06
fastKernel : 25 : 3.289e+06
fastKernel : 26 : 3.286e+06
fastKernel : 27 : 3.284e+06
fastKernel : 28 : 3.29e+06
fastKernel : 29 : 3.285e+06
fastKernel : 30 : 3.283e+06
fastKernel : 31 : 3.282e+06
fastKernel : 32 : 3.283e+06
fastKernel : 33 : 3.284e+06
fastKernel : 34 : 3.279e+06
fastKernel : 35 : 3.281e+06
fastKernel : 36 : 3.282e+06
fastKernel : 37 : 3.276e+06
fastKernel : 38 : 3.278e+06
fastKernel : 39 : 3.274e+06
fastKernel : 40 : 3.273e+06
fastKernel : 41 : 3.275e+06
fastKernel : 42 : 3.275e+06
fastKernel : 43 : 3.269e+06
fastKernel : 44 : 3.274e+06
fastKernel : 45 : 3.27e+06
fastKernel : 46 : 3.268e+06
fastKernel : 47 : 3.269e+06
fastKernel : 48 : 3.265e+06
fastKernel : 49 : 3.263e+06
fastKernel : 50 : 3.263e+06
fastKernel : 51 : 3.663e+06
fastKernel : 52 : 3.258e+06
fastKernel : 53 : 3.258e+06
fastKernel : 54 : 3.257e+06
fastKernel : 55 : 3.258e+06
fastKernel : 56 : 3.251e+06
fastKernel : 57 : 3.251e+06
fastKernel : 58 : 3.251e+06
fastKernel : 59 : 3.248e+06
fastKernel : 60 : 3.248e+06
fastKernel : 61 : 3.243e+06
fastKernel : 62 : 3.239e+06
fastKernel : 63 : 3.239e+06
fastKernel : 64 : 3.237e+06
fastKernel : 65 : 3.242e+06
fastKernel : 66 : 3.233e+06
fastKernel : 67 : 3.235e+06
fastKernel : 68 : 3.235e+06
fastKernel : 69 : 3.231e+06
fastKernel : 70 : 3.23e+06
fastKernel : 71 : 3.232e+06
fastKernel : 72 : 3.227e+06
fastKernel : 73 : 3.223e+06
fastKernel : 74 : 3.227e+06
fastKernel : 75 : 3.224e+06
fastKernel : 76 : 3.225e+06
fastKernel : 77 : 3.219e+06
fastKernel : 78 : 3.216e+06
fastKernel : 79 : 3.216e+06
fastKernel : 80 : 3.217e+06
fastKernel : 81 : 3.212e+06
fastKernel : 82 : 3.213e+06
fastKernel : 83 : 3.213e+06
fastKernel : 84 : 3.212e+06
fastKernel : 85 : 3.21e+06
fastKernel : 86 : 3.211e+06
fastKernel : 87 : 3.205e+06
fastKernel : 88 : 3.205e+06
fastKernel : 89 : 3.205e+06
fastKernel : 90 : 3.202e+06
fastKernel : 91 : 3.204e+06
fastKernel : 92 : 3.198e+06
fastKernel : 93 : 3.202e+06
fastKernel : 94 : 3.196e+06
fastKernel : 95 : 3.196e+06
fastKernel : 96 : 3.197e+06
fastKernel : 97 : 3.588e+06
fastKernel : 98 : 3.229e+06
fastKernel : 99 : 3.194e+06
fastKernel : 100 : 3.191e+06
fastKernel : 101 : 3.188e+06
fastKernel : 102 : 3.19e+06
fastKernel : 103 : 3.191e+06
fastKernel : 104 : 3.193e+06
fastKernel : 105 : 3.194e+06
fastKernel : 106 : 3.194e+06
fastKernel : 107 : 3.194e+06
fastKernel : 108 : 3.185e+06
fastKernel : 109 : 3.191e+06
fastKernel : 110 : 3.183e+06
fastKernel : 111 : 3.455e+06
fastKernel : 112 : 3.221e+06
fastKernel : 113 : 3.182e+06
fastKernel : 114 : 3.183e+06
fastKernel : 115 : 3.183e+06
fastKernel : 116 : 3.178e+06
fastKernel : 117 : 3.182e+06
fastKernel : 118 : 3.18e+06
fastKernel : 119 : 3.179e+06
fastKernel : 120 : 3.179e+06
fastKernel : 121 : 3.179e+06
fastKernel : 122 : 3.179e+06
fastKernel : 123 : 3.178e+06
fastKernel : 124 : 3.177e+06
fastKernel : 125 : 3.178e+06
fastKernel : 126 : 3.178e+06
fastKernel : 127 : 3.174e+06
fastKernel : 128 : 3.174e+06
fastKernel : 129 : 3.176e+06
fastKernel : 130 : 3.176e+06
fastKernel : 131 : 3.174e+06
fastKernel : 132 : 3.175e+06
fastKernel : 133 : 3.178e+06
fastKernel : 134 : 3.176e+06
fastKernel : 135 : 3.173e+06
fastKernel : 136 : 3.174e+06
fastKernel : 137 : 3.179e+06
fastKernel : 138 : 3.174e+06
fastKernel : 139 : 3.172e+06
fastKernel : 140 : 3.172e+06
fastKernel : 141 : 3.174e+06
fastKernel : 142 : 3.173e+06
fastKernel : 143 : 3.177e+06
fastKernel : 144 : 3.176e+06
fastKernel : 145 : 3.172e+06
fastKernel : 146 : 3.173e+06
fastKernel : 147 : 3.175e+06
fastKernel : 148 : 3.177e+06
fastKernel : 149 : 3.176e+06
fastKernel : 150 : 3.21e+06
fastKernel : 151 : 3.572e+06
fastKernel : 152 : 3.21e+06
fastKernel : 153 : 3.178e+06
fastKernel : 154 : 3.173e+06
fastKernel : 155 : 3.177e+06
fastKernel : 156 : 3.174e+06
fastKernel : 157 : 3.175e+06
fastKernel : 158 : 3.173e+06
fastKernel : 159 : 3.173e+06
fastKernel : 160 : 3.175e+06
fastKernel : 161 : 3.175e+06
fastKernel : 162 : 3.174e+06
fastKernel : 163 : 3.173e+06
fastKernel : 164 : 3.173e+06
fastKernel : 165 : 3.175e+06
fastKernel : 166 : 3.173e+06
fastKernel : 167 : 3.176e+06
fastKernel : 168 : 3.175e+06
fastKernel : 169 : 3.173e+06
fastKernel : 170 : 3.172e+06
fastKernel : 171 : 3.175e+06
fastKernel : 172 : 3.173e+06
fastKernel : 173 : 3.172e+06
fastKernel : 174 : 3.176e+06
fastKernel : 175 : 3.174e+06
fastKernel : 176 : 3.173e+06
fastKernel : 177 : 3.174e+06
fastKernel : 178 : 3.172e+06
fastKernel : 179 : 3.173e+06
fastKernel : 180 : 3.172e+06
fastKernel : 181 : 3.175e+06
fastKernel : 182 : 3.171e+06
fastKernel : 183 : 3.176e+06
fastKernel : 184 : 3.174e+06
fastKernel : 185 : 3.172e+06
fastKernel : 186 : 3.173e+06
fastKernel : 187 : 3.177e+06
fastKernel : 188 : 3.171e+06
fastKernel : 189 : 3.176e+06
fastKernel : 190 : 3.175e+06
fastKernel : 191 : 3.174e+06
fastKernel : 192 : 3.173e+06
fastKernel : 193 : 3.172e+06
fastKernel : 194 : 3.173e+06
fastKernel : 195 : 3.176e+06
fastKernel : 196 : 3.177e+06
fastKernel : 197 : 3.177e+06
fastKernel : 198 : 3.174e+06
fastKernel : 199 : 3.173e+06


Unless my clock measurements with NVML are wrong, the clocks seem to stay the same for each kernel run. The performance change must then be caused by other factors.


> I have no idea what the "power mizer" settings do (e.g. what is the difference between 'auto' and 'adaptive'?).

In fact, I have not analysed this in detail. I have only observed that in 'auto' and 'adaptive' modes the GPU clocks slow down at idle. Setting 'prefer maximum performance' keeps the clocks high regardless of GPU utilisation.


> Some "warmup effect" will exist on any complex processor

Yes, thank you, I am aware of that. However, on CPUs usually only the first few runs are affected by this phenomenon.
In my case, the performance changes over the first 100-150 kernel runs, so some other factor is probably at play.


> You appear to counteract that by running the fan at 100% and dialing in the highest supported power limit.

Yes, the clock frequencies seem to remain stable in the long term during my performance measurements.
However, I previously missed that during kernel runs the memory clock stays at a value different from the one requested.


Once again, thank you for the advice :)

#3
Posted 11/15/2017 03:48 PM   
I don't have a ready explanation for the minimal speedup (4.9% between the slowest and fastest instance, from what I can see) over the first 120 kernel invocations. I wouldn't consider the observed differences practically relevant.

(1) Are all those kernel instances doing the exact same amount of processing (same amount of data, same sequence of operation)?

(2) Does the code contain data-dependent branches that may impact branch efficiency (mispredicts, thread divergence)?

(3) Does the code contain calls to standard math functions (many of which contain data dependent branches internally)?

For the exact same processing on the exact same data, we would expect to reach steady-state performance after half a dozen invocations or so. If the data differs between invocations, all bets are off, and one would have to look at the details of how this potentially impacts code performance. Maybe your code contains some data-dependent control flow, and as you iterate over the data, the data itself converges to a steady state?

#4
Posted 11/15/2017 05:00 PM   
The L2 cache is not flushed between kernel calls, so there could be some caching influence, although I acknowledge it does not fully describe the behavior.
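
One way to test for that is to evict the L2 between timed launches, for example by overwriting a scratch buffer that is several times larger than the L2 (a rough sketch; the 16 MiB size is an assumption chosen to exceed the L2 comfortably):

#include <cstddef>
#include <cuda_runtime.h>

// Rough sketch: overwrite a scratch buffer larger than the L2 cache between
// timed launches, so each measured kernel starts from a (mostly) cold L2.
struct L2Flusher
{
    static constexpr size_t kBytes = 16u * 1024u * 1024u ;   // assumed to exceed the L2
    void *scratch = nullptr ;

    L2Flusher ()  { cudaMalloc (&scratch, kBytes) ; }   // error checking omitted for brevity
    ~L2Flusher () { cudaFree (scratch) ; }

    void flush ()
    {
        cudaMemset (scratch, 0, kBytes) ;   // streams through the L2, evicting prior data
        cudaDeviceSynchronize () ;          // make sure the eviction has completed
    }
};

Calling flush() right before each timed launch should remove most of the carry-over from the previous kernel, at the cost of some extra wall-clock time per iteration.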

#5
Posted 11/15/2017 06:31 PM   
But reaching steady state on any sort of cache shouldn't take 120 iterations, provided every kernel instance does pretty much the same kind of processing.

Generally speaking, CPUs have more complex control structures than GPUs, which take longer to train and reach a steady state, but even there half a dozen repetitions usually get you to steady state. (I have read that AI-type learning is intended to replace the simple state machines in CPUs, in which case I would expect it to take longer to reach steady state; but as far as I know, that approach is not in any shipping parts.)

#6
Posted 11/15/2017 06:39 PM   
Thank you once again for all hints.

It is possible that this behaviour is caused by some of the points below:

> (1) Are all those kernel instances doing the exact same amount of processing (same amount of data, same sequence of operation)?
>
> (2) Does the code contain data-dependent branches that may impact branch efficiency (mispredicts, thread divergence)?
>
> (3) Does the code contain calls to standard math functions (many of which contain data dependent branches internally)?

In fact, I have not looked into this code for months and have completely forgotten how to properly run the performance tests.
Apologies for the confusion; I reported that after the hardware change (I have had the Titan Xp for only a few days) the results are different than expected.
I assumed that nothing had changed since the last tests, but this may not be true.

I will check this next week and report the results.

Once again, my apologies, and thank you for your effort.


PS

I agree that in practice a few percent difference in performance is not especially important, but I am trying to generate results for a scientific paper (available at https://arxiv.org/abs/1611.02445) and was rather expecting results similar to those shown in Fig. 13.

#7
Posted 11/16/2017 08:26 AM   
Well, you can have the same kind of graph, but the little plus-signs depicting data points are going to be distributed somewhat differently with the Titan Xp :-)

I wonder whether the GDDR5X memory on the Titan Xp might have something to do with your observations. Almost all other GPUs use GDDR5 (without the 'X').

I am not a hardware guy, but I seem to recall that with these modern memories the memory controller needs to tune its receivers (reception of a high frequency signal!), and might have to retune them on occasion if signal quality declines due to changing temperatures etc. No data can be received during the re-tuning process. This is a very speculative thought: less frequent re-tuning might be necessary after the program has been hammering the GPU memory for a little while.

#8
Posted 11/16/2017 08:40 AM   