Thank you for your help — the problem has been detected and solved.
The reason is, as njuffa suggested: when I iterate over the data, the solution converges to a steady state.
The problem may be simply avoided by starting computations from the previously computed data.
When I simply use the data produced by kernels without divisions as an input data to kernels with divisions, the performance stays almost constant.
Unfortunately, saving and restoring the data was time-consuming and was removed at some point, which caused the behaviour reported in the first post :(
Once again, sorry for the confusion.
Some details:
The division procedure works slower when a large number of the divisions divide 0.0 by a value close to 1.0 (previously I drew the wrong conclusions), because at the beginning of the division procedure nvvp reports 6% inactive threads (lines 2-28). When the majority of the divisions have numerators different from 0.0, with divisors still close to 1.0 (usually between 0.999995 and 1.00005), no inactive threads are reported and the kernel works faster.
//---------------------------------------------------------------------------
// __cuda_sm20_div_f64_slowpath_v2 -- NVIDIA SASS disassembly of the
// compiler's double-precision division slow path (sm_20-era naming).
// NOTE(review): this is disassembler output, not hand-written assembly.
// Register roles are inferred from the instruction stream and should be
// confirmed against the full kernel dump:
//   R42:R43 = divisor b   (doubles live in 32-bit register pairs)
//   R44:R45 = dividend a  (residual DFMAs below compute b*q - a)
//   R56:R57 = quotient candidate; copied to R42:R43 at .L_135 before RET
// PBK `(.L_135) arms a "break" target: every BRK below exits to the
// common epilogue.  Constant-bank values c[0x2][...] look like exponent
// thresholds and scale factors for over/underflow handling -- TODO confirm.
//---------------------------------------------------------------------------
__cuda_sm20_div_f64_slowpath_v2:
{ LOP32I.AND R47, R43, 0x40000000;          // isolate bit 30 of divisor high word (exponent MSB region)
PBK `(.L_135); }                            // arm break target: BRK anywhere below jumps to .L_135
ISETP.LT.U32.AND P0, PT, R47, c[0x2][0x38], PT;   // classify divisor exponent against a threshold -- confirm
MOV32I R66, 0x1ff00000;                     // high word of a power-of-two scale constant (exponent field 0x1ff)
MOV R56, RZ;                                // scale factor low word = 0
SEL R57, R66, c[0x2][0x3c], !P0;            // choose scale-factor high word by exponent class
DMUL R64, R42, R56;                         // R64:R65 = b * scale (rescaled divisor)
{ LOP32I.AND R47, R44, 0x7f800000;          // mask an exponent-like field of R44 (note: 0x7f800000 is the f32 exponent mask)
MUFU.RCP64H R58, R65; }                     // seed reciprocal approximation from high word of scaled divisor
ISETP.LT.U32.AND P0, PT, R47, c[0x2][0x40], PT;   // classify dividend magnitude against a threshold -- confirm
LOP.XOR R45, R45, R44;                      // XOR-swap step 1 of 3: swap R44 and R45 in place
MOV R59, R58;
MOV32I R58, 0x1;
LOP.XOR R44, R45, R44;                      // XOR-swap step 2 of 3
DFMA R60, R64, -R58, c[0x2][0x0];           // Newton-Raphson error: e = C - b_scaled*r (C presumably 1.0 -- confirm)
SEL R62, R66, c[0x2][0x3c], !P0;            // dividend scale-factor high word
LOP.XOR R45, R45, R44;                      // XOR-swap step 3 of 3
MOV R63, R62;
DFMA R60, R60, R60, R60;                    // e = e*e + e (fuses two refinement terms)
MOV R62, RZ;                                // dividend scale-factor low word = 0
DMUL R62, R44, R62;                         // R62:R63 = a * scale (rescaled dividend)
DFMA R58, R58, R60, R58;                    // r = r + r*e (refined reciprocal)
DMUL R60, R62, R58;                         // q = a_scaled * r (first quotient estimate)
DFMA R62, R64, -R60, R62;                   // residual = a_scaled - b_scaled*q
DFMA R58, R58, R62, R60;                    // q = q + r*residual (refined quotient)
DSETP.LEU.AND P0, PT, |R58|, RZ, PT;        // P0 = (|q| <= 0) or unordered, i.e. q is 0 or NaN
@P0 BRA `(.L_136);                          // take the special-case path for zero/NaN quotient
ISETP.GT.U32.AND P0, PT, R47, c[0x2][0x44], PT;   // further magnitude classification -- confirm
DMUL R62, R56, R58;                         // candidate A: q scaled back by one factor
SEL R60, R66, c[0x2][0x3c], !P0;
MOV R61, R60;
MOV R60, RZ;
DMUL R58, R58, R60;                         // candidate B: q scaled by the alternative factor
DMUL R60, R60, R62;
DMUL R62, R56, R58;
DFMA R58, R42.reuse, R60, -R44.reuse;       // residual of candidate A: b*qA - a
DFMA R56, R42, R62, -R44;                   // residual of candidate B: b*qB - a
DSETP.GT.AND P0, PT, |R58|, |R56|, PT;      // keep the candidate with the smaller residual
SEL R47, R63, R61, P0;
FSETP.GTU.AND P1, PT, |R47|, 1.469367938527859385e-39, PT;  // underflow test (threshold ~2^-129) -- subnormal fixup below if it fails
SEL R56, R62, R60, P0;
{ MOV R57, R47;
@P1 BRK; }                                  // result not subnormal: break out to .L_135 with q in R56:R57
// --- subnormal / near-underflow fixup: evaluate last-bit neighbours of q
// --- and keep the one with the smallest residual (rounding correction)
FSETP.LT.AND P0, PT, |R45|, 1.5046327690525280102e-36, PT;
MOV32I R58, 0x3ff00000;                     // high word of 1.0
LOP32I.AND R62, R56, 0xfffffffe;            // q with low bit cleared (even neighbour)
SEL R58, R58, c[0x2][0x48], !P0;            // rescale factor: 1.0 or constant-bank alternative
MOV R59, R58;
MOV R58, RZ;
LOP32I.OR R60, R56, 0x1;                    // q with low bit set (odd neighbour)
MOV R56, R62;
MOV R57, R47.reuse;
DMUL R42, R42, R58.reuse;                   // rescale divisor
DMUL R44, R44, R58;                         // rescale dividend
MOV R61, R47;
DFMA R58, R56, R42.reuse, -R44.reuse;       // residual of even-neighbour quotient
DFMA R56, R60, R42, -R44;                   // residual of odd-neighbour quotient
DSETP.GT.AND P0, PT, |R58|, |R56|, PT;
SEL R58, R60, R62, P0;                      // keep neighbour with the smaller residual
LOP32I.AND R56, R58, 0x1;                   // parity of the chosen mantissa word
IADD32I R61.CC, R58, 0x1;                   // q+1 (64-bit increment via carry chain)
ISETP.EQ.U32.AND P0, PT, R56, 0x1, PT;      // P0 = low bit of chosen q is 1
IADD.X R60, RZ, R47;                        // propagate carry into high word
IADD32I R56.CC, R58, -0x1;                  // q-1 (64-bit decrement via carry chain)
IADD32I.X R57, R47.reuse, -0x1;
SEL R61, R58, R61, !P0;                     // pick q or q+1 depending on parity
SEL R60, R47, R60, !P0;
SEL R47, R57, R47, !P0;                     // pick q or q-1 depending on parity
SEL R58, R56, R58, !P0;
MOV R56, R61;
MOV R57, R60;
MOV R59, R47;
DFMA R56, R42.reuse, R56, -R44.reuse;       // residual of first final candidate
DFMA R42, R42, R58, -R44;                   // residual of second final candidate
DSETP.GT.AND P0, PT, |R56|, |R42|, PT;
SEL R56, R58, R61, P0;                      // final rounding selection, low word
{ SEL R57, R47, R60, P0;                    // final rounding selection, high word
BRK; }                                      // exit to .L_135 with result in R56:R57
// q came out 0 or NaN: handle zero dividend / inf / NaN operands explicitly
.L_136:
DSETP.EQ.AND P0, PT, R58, RZ, PT;           // quotient exactly zero?
@P0 BRA `(.L_137);
{ MOV R56, RZ;
MUFU.RCP64H R57, R43; }                     // reciprocal of the raw divisor high word (inf/zero-divisor cases)
DSETP.GT.AND P1, PT, |R56|, RZ, PT;         // is the reciprocal pair nonzero? -- confirm intent
@!P1 DSETP.NEU.AND P0, PT, |R42|, +INF , PT;  // divisor finite? (NaN-aware compare)
@!P1 SEL R42, R42, R56, P0;                 // propagate NaN / select reciprocal as appropriate
@!P1 SEL R47, R43, R57, P0;
@!P1 MOV R56, R42;
@!P1 MOV R57, R47;
DMUL R56, R44, R56;                         // result = a * chosen reciprocal (signs/specials fall out)
BRK;                                        // exit to .L_135
.L_137:
DMUL R56, R44, R42;                         // zero-quotient case: result = a * b-derived value
BRK;                                        // exit to .L_135
// common epilogue (PBK target): move result into the R42:R43 return pair
.L_135:
MOV R42, R56;
{ MOV R43, R57;
RET; }
.L_138:
BRA `(.L_138);                              // self-loop after RET: unreachable padding / trap
NOP;
NOP;
NOP;
I have not analysed this code in detail; I have only observed that it is different from the code generated by CUDA 7.5 for a Kepler device.