I have just taken over a CUDA project here at work. I have not done CUDA programming other than a hello world, or even much in the way of C++ before, so I am expecting a big learning curve. However, even without touching a line of code, I have found that our application runs about 30% slower on a brand-new system with a Tesla K20 than on my own beat-up developer workstation with a GTX 580. Also, most of the samples I’ve tried that are included with the 5.0 toolkit run slower on the Tesla machine. Both computers are running Windows 7 64-bit.
It was expected that we might have to tweak our code to take advantage of some of the new CUDA features in 5.0 (and the Kepler architecture), but I was not expecting this type of performance disparity. I guess my question is two-fold:
How can I make sure the Tesla is not damaged, and is functioning correctly?
Does anyone have a “well, duh” explanation for this behavior?
When Kepler first came out, quite a few apps ran slower than Fermi GPUs. It took months for actual testing and development to show that in many, if not most, of those cases, the problem was simply tuning. The radically different SP per SM counts, the new ratios of compute to shared memory size/bandwidth, the different cache behavior, the different register counts… all of these were a much bigger change than the Tesla->Fermi transition was.
But experience has shown that Kepler really is just as good as Fermi for CUDA apps. Most apps which had performance drops need retuning and sometimes reorganization. It’s not that Kepler is slower, it’s just different. A Kepler-tuned kernel will run slower on Fermi as well.
It’s not ideal… we’d all love to have every code automatically retune itself for every GPU… in fact there’s been research on that http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5709485 . Now after 6 months of playing with Kepler, I prefer Kepler’s enhancements, even though I had to rebalance a lot of Tesla/Fermi code to get my performance back.
One thing to keep in mind when comparing Tesla cards to consumer cards in general is that Tesla cards come with ECC. When ECC is enabled, the memory bandwidth available to applications is reduced. So if your application is bandwidth-bound, you may want to temporarily turn off ECC on the Tesla card for a like-to-like comparison with the GTX 580. Specifications for the theoretical memory bandwidth of these two GPUs can be found here:
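For reference, ECC can be toggled from the command line with nvidia-smi (the `-e` flag; run with admin rights, and note the change only takes effect after a reboot):

```shell
# Disable ECC on GPU 0; re-enable later with "nvidia-smi -i 0 -e 1".
nvidia-smi -i 0 -e 0

# After rebooting, confirm the change: look for
# "Ecc Mode / Current : Disabled" in the query output.
nvidia-smi -i 0 -q
```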
As SPWorley points out, there are various architectural differences (many driven by requirements for increased energy efficiency) between the Fermi and Kepler families. Some retuning of existing code may be necessary. This is particularly likely for code that has been very tightly tuned to the Fermi architecture. One possible scenario is that the existing code simply does not expose sufficient parallelism to fully utilize the much “wider” Kepler architecture, which provides many more functional units (running at lower clock speeds) than previous GPU architectures.
Thanks everyone for the thoughtful replies. I have some of the requested information below -
Is ‘number’ the number of cores per SM? I guess this would be (SMs × cores/SM × clock in MHz, figures from deviceQuery):
Tesla K20: 13 x 192 x 706 = 1,762,176
GTX 580: 16 x 32 x 1544 = 790,528
Yikes, can we expect no significant performance gains? I sense an uncomfortable conversation with my boss in the near future, who expected a 10-fold speed increase with the new card and assigned me to make it happen…
I believe it is a K20c. The results of nvidia-smi -q are below:
Timestamp : Thu Jan 10 09:08:37 2013
Driver Version : 307.45
Attached GPUs : 2
GPU 0000:02:00.0
Product Name : Tesla K20c
Display Mode : Disabled
Persistence Mode : N/A
Driver Model
Current : TCC
Pending : TCC
Serial Number : 0324712003072
GPU UUID : GPU-e82f0cdb-0e58-1387-ee59-6cc969b6610b
VBIOS Version : 80.10.14.00.02
Inforom Version
Image Version : 2081.0204.00.07
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x02
Device : 0x00
Domain : 0x0000
Device Id : 0x102210DE
Bus Id : 0000:02:00.0
Sub System Id : 0x098210DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 16x
Fan Speed : 30 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
User Defined Clocks : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
Memory Usage
Total : 4799 MB
Used : 13 MB
Free : 4786 MB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : 0
Total : 0
Temperature
Gpu : 38 C
Power Readings
Power Management : Supported
Power Draw : 16.61 W
Power Limit : 225.00 W
Default Power Limit : 225.00 W
Min Power Limit : 150.00 W
Max Power Limit : 225.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 324 MHz
Applications Clocks
Graphics : 705 MHz
Memory : 2600 MHz
Max Clocks
Graphics : 758 MHz
SM : 758 MHz
Memory : 2600 MHz
Compute Processes : None
Nothing in the nvidia-smi output indicates a problem to me. The output clearly identifies this K20 as a K20c, i.e. an actively cooled workstation-class part, and otherwise looks very much like the output from my K20c at work (except that you seem to be running under Windows). I don’t know what performance gains you expected compared to your previous Tesla solution (a C2050 or C2075 ?). In general you will observe larger performance improvements for compute-bound tasks than tasks bound by memory throughput.
Thanks for taking a look at it. This is our first Tesla - our application was developed on and currently runs on Fermi-based cards - but I did expect it to be faster on the Tesla. I disabled ECC per your earlier post and saw no significant gains, so I am guessing the processing is not bandwidth-bound.
It sounds like restructuring the app is the thing to do now. I appreciate everyone’s time and comments.
You are probably aware of it at this point, but I figured I should point it out just in case: A change to the ECC settings requires a reboot to take effect.
It seems like the best thing to do now is to find out where the bottlenecks are in the application with the help of the profiler, then drill down on those.
In the best-case scenario you could expect an ~8x speedup if it scales perfectly to the new architecture.
Even if you fulfill all 3 requirements, your application could still end up going from compute-bound to bandwidth-bound, given that the K20 offers only a marginal increase in memory bandwidth over, for example, the GTX 580, while radically increasing SP and DP throughput.
I have access to a server with two Tesla K20X cards installed (the most powerful Kepler solution Nvidia provides) and am taking it for a test drive right now.
My conclusion is very frustrating: even the Tesla K20X is approximately 10-15% SLOWER than the GTX 580 for tasks that use single-precision math and employ lots of random-access reads from global memory. When it comes to cache efficiency, Kepler looks very weak.
I have also done a number of tests with the GTX 680; that card is 2.5 times slower on my tasks.
You need to re-tweak the program, though don’t expect a big speedup compared with the GTX 580. Those chips are nearly the same die size AFAIK; Kepler runs at a lower frequency, hence lower power consumption and easier production.
A Kepler SMX has quite a different “shape” than a Fermi SM. A block can have twice as many 63-register warps but only half as much shared memory per warp. That right there has a huge impact on preexisting Fermi kernels.
In a decent-sized project I finished last fall, each Kepler thread block was achieving 2.2x the throughput on 2x the workload of a maxed out Fermi block. The advantage decreases on larger problems as the device becomes memory bound.
At the least, any highly tuned Kepler kernel should be focused on fully utilizing the SMX since each one is quite a chunk of silicon by itself.
I would classify my Kepler kernel design style as Volkovian with an extreme focus on abusing SHFL and minimizing shared and device memory transactions. Baroque Neo-Volkovian? :)
Summary: Kepler is a monster if you design your kernels to fully utilize what it offers.
Unfortunately, “Volkovian” style can’t be used for all kinds of problems … It can’t for mine. Vasily Volkov was very kind to check one of my kernels out to see whether I’m missing something serious that impacts the performance - his verdict is simple: my sort of calculations is just not too Kepler-friendly.
The first part of your uncomfortable conversation with your boss should be about setting reasonable expectations for hardware improvements. No generation of GPU (or CPU for that matter) has ever been 10x faster than the previous generation. You’re lucky if you see that kind of improvement after 3 or 4 generations. :)
Now, that said, Kepler’s changes relative to Fermi go in both directions. Comparing a GTX 680 (which is, ignoring double precision, about 13% slower than a K20 assuming naive clock*CUDA cores scaling) to a GTX 580:
Atomic operations are much faster
Single precision floating point throughput is 2x greater
Hardware special function throughput is 2.5x greater
Memory bandwidth is about the same
Integer operations are a little slower
Shared memory, registers, and L1 and L2 cache per thread are going to be lower due to the need for much larger blocks to maximize throughput on Kepler
When I first ran my programs on the GTX 680, they ran a lot slower than the GTX 580 because my block sizes were way too small for the GTX 680. Once I fixed that, I found that my programs ran anywhere from 15% slower to 2x faster on the GTX 680, depending on the mix of operations I was using.