2080 Ti vs Titan V

Does anyone know why the RTX cards are not using their tensor cores properly? See: Compute & Synthetics - The NVIDIA GeForce RTX 2070 Founders Edition Review: Mid-Range Turing, High-End Price

HGEMM performance, RTX 2080 Ti: ~48K GFLOPS
HGEMM performance, Titan V: ~97K GFLOPS

The 2080 Ti should be similar to, or a bit slower than, a Titan V given the number of tensor cores (rough arithmetic below).

Is it drivers? It's almost exactly half, which is suspicious! Did Nvidia quietly disable them? People should know if that's the case.
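Rough arithmetic behind that expectation, assuming the published tensor core counts and reference boost clocks are right (640 tensor cores at about 1455 MHz for the Titan V, 544 at about 1545 MHz for the 2080 Ti):

$$\frac{544 \times 1545\,\text{MHz}}{640 \times 1455\,\text{MHz}} \approx 0.90$$

So at equal per-core throughput the 2080 Ti should land around 90% of a Titan V, nowhere near 50%.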

Has Nvidia specified anywhere that Turing tensor cores have the same throughput as Volta ones?
I wouldn’t be surprised if Nvidia spent a bit less silicon on tensor cores in consumer graphics cards than in specialised (AI) compute cards - that would make a lot of sense actually.

What evidence do you have one way or the other? Why should we have to speculate when Nvidia could just be transparent and document it SOMEWHERE? Since they have not, on face value a Turing tensor core should be comparable to a Volta tensor core, given the same feature name.

Historically, NVIDIA has been secretive with regard to details of their GPUs’ microarchitecture. I see nothing that would incentivize them to be more transparent at this time.

In practical terms, it would be best to file a performance bug, as it is possible that the software simply has not been sufficiently optimized for the new architecture. Experience indicates that NVIDIA operates the compute business driven by customer demand. So the more bugs are filed for a particular performance issue, the more likely a fix will materialize.

NVIDIA’s business is selling hardware; providing lots of performance software is just a means to that end. If new expensive parts lack application level performance, it will be in NVIDIA’s best interest to address the underlying issues so hardware sales remain brisk.

For what it’s worth, at least one review has made similar observations:

At reference specifications, peak theoretical tensor throughput is around 107.6 TFLOPS for the RTX 2080 Ti, 80.5 TFLOPS for the RTX 2080, and 59.7 TFLOPS for the RTX 2070. Unlike the 89% efficiency with the Titan V’s 97.5 TFLOPS, the RTX cards are essentially at half that level, with around 47%, 48%, and 45% efficiency for the RTX 2080 Ti, 2080, and 2070 respectively. A Turing-optimized binary should bring that up, though it is possible that the GeForce RTX cards may not be designed for efficient tensor FP16 operations as opposed to the INT dot-product acceleration. After all, the GeForce RTX cards are for consumers and ostensibly intended for inferencing rather than training, which is the reasoning for the new INT support in Turing tensor cores.

You can benchmark the CUDA 10 cublasTensorCore examples, and the 2080 Ti comes in at half the Titan V.
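For anyone who wants to reproduce this independently of the shipped samples, here is a minimal timing sketch (my own code, not the CUDA sample; the matrix size and iteration count are arbitrary choices, and the device buffers are deliberately left uninitialized since only throughput is being measured):

#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main()
{
    const int n = 8192;     // square matrices; a multiple of 8 so the tensor core path can engage
    const int iters = 50;

    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * n * n);
    cudaMalloc(&B, sizeof(__half) * n * n);
    cudaMalloc(&C, sizeof(__half) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow tensor cores (CUDA 10 API)

    const __half alpha = __float2half(1.0f);
    const __half beta  = __float2half(0.0f);

    // warm-up call so one-time initialization cost is not timed
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * n * (double)n * iters / (ms * 1e-3) / 1e12;
    printf("HGEMM %d^3: %.1f TFLOPS\n", n, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}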

So Nvidia has really screwed up.

They should be called TenCores, not TensorCores, because they are half missing.

@LukeCuda, you missed the most obvious pun. Call them “Sores” for what they are.

As far as I know, the AnandTech benchmarks have so far been run with code built for sm_70 (Volta). Can anyone confirm the same bad performance with sm_75-optimized code?
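(For reference, building for Turing rather than running a Volta binary means recompiling with something like the line below; sm_75 requires CUDA 10, and the source file name here is just a placeholder.)

nvcc -O3 -gencode arch=compute_75,code=sm_75 -lcublas hgemm_bench.cu -o hgemm_bench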

The throughput of Turing's tensor cores is actually documented more explicitly than most GPU architecture throughput details usually are.

Table 4 on page 59 of the Turing GPU whitepaper specifies the tensor core peak FP16 throughput of the RTX 2080 as 80.5 TFLOPS with FP16 accumulate or 40.2 TFLOPS with FP32 accumulate.
If I remember correctly, Volta tensor core performance numbers were always given for accumulation in FP32. So it seems like you want to look out for benchmarks run with sm_75 code.

NVIDIA has advertised 113.8 TFLOPS of FP16, which is comparable to the Titan V. That's all that should matter for making the 2080 Ti as fast as a Titan V when doing HGEMM.

Above, Tera said the Turing Tesla T4 is advertised at 60 TFLOPS, which would match what a 2080 Ti is currently doing. But then Nvidia would have advertised 60 TFLOPS, not 113.8 TFLOPS.

Could the CUDA drivers be misidentifying the 2080 Ti as a T4?

Or did Nvidia pull a swifty and base the 2080 Ti off an inference chip without telling anyone?!

Apologies, I edited my post after you cited it, as I noticed the more relevant RTX 2080 specs in the whitepaper.
And now I see the even more relevant RTX 2080 Ti specs in Table 1 on page 9: 107.6 TFLOPS with FP16 accumulate and 53.8 TFLOPS accumulating in FP32. So you really want benchmarks built for sm_75.
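As a sanity check on those figures, assuming 544 tensor cores, 64 FP16 FMAs per tensor core per clock, and the 1545 MHz reference / 1635 MHz Founders Edition boost clocks:

$$544 \times 64\,\tfrac{\text{FMA}}{\text{clk}} \times 2\,\tfrac{\text{FLOP}}{\text{FMA}} \times 1.545\,\text{GHz} \approx 107.6\ \text{TFLOPS}$$
$$544 \times 128 \times 1.635\,\text{GHz} \approx 113.8\ \text{TFLOPS}$$

If those clocks are right, that would also explain the 113.8 TFLOPS figure quoted above: it is the same FP16-accumulate rate, just stated at the Founders Edition boost clock.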

I don’t see how idle speculation provides benefits to anyone (other than helping pass the time for retired folks like me :-)

From long experience I can say that marketing people will latch on to the highest number they see. That is usually some theoretical throughput number, or some “up to” peak performance number, neither of which is sustained in real-life scenarios. That doesn’t mean these numbers are wrong, just not useful for practical decision making. Decisions are best based on benchmarking one’s actual use case(s).

I repeat: The best course of action for perceived CUDA-related performance shortfalls is to notify NVIDIA in the form of bug reports (after performing due diligence), accompanied by sufficient amounts of supporting data. This course of action does not guarantee positive change, but it gives the best odds of such change.

Huffing and puffing and jumping up and down in forums (these or others) is unlikely to have any effect.

There is some discussion that a GeForce RTX 2080 Ti tensor core is not the same as a Quadro RTX tensor core, and that this is why the 2080 Ti is not performing as advertised in CUDA.

Anyone have information in this regard?

@LukeCuda: I doubt this, as I think the Quadro RTX and GeForce RTX lines are based on the same dies.

(EDIT: notable exceptions being the most expensive Quadro GP100/GV100 models with HBM2 memory, which use the P100 and V100 chips.)

I was trying very hard to find out why the 2080 Ti's tensor cores were half as fast as the Titan V's.

The reason is that they can only do FP32 accumulation at half rate. Titan V tensor cores and, in fact, Quadro RTX tensor cores (!!) run it at full rate.

So they did gimp the tensor cores for the consumer models of RTX.
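For concreteness, a sketch of the two code paths being compared, written against cublasGemmEx as it exists in CUDA 10 (where the compute type is passed as a cudaDataType_t); the function names are just placeholders. The only difference between the two calls is the accumulator width, which per the whitepaper figures above runs at half rate in FP32 on the GeForce RTX cards:

#include <cublas_v2.h>
#include <cuda_fp16.h>

// A and B are FP16 and C is FP16 in both cases; only the accumulator width changes.
void gemm_fp16_accumulate(cublasHandle_t h, int n,
                          const __half *A, const __half *B, __half *C)
{
    const __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_16F, n,
                 CUDA_R_16F,                      // accumulate in FP16: full rate per the whitepaper
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}

void gemm_fp32_accumulate(cublasHandle_t h, int n,
                          const __half *A, const __half *B, __half *C)
{
    const float alpha = 1.0f, beta = 0.0f;        // alpha/beta must match the compute type
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, A, CUDA_R_16F, n, B, CUDA_R_16F, n,
                 &beta,  C, CUDA_R_16F, n,
                 CUDA_R_32F,                      // accumulate in FP32: half rate on GeForce RTX
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}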

If you check the reference in posts #9 / #11 above, you will find this is documented behaviour.

Yes, you are absolutely correct. I did not see that earlier. I think this is the end of the mystery. It is documented, so I shouldn’t be too hard on Nvidia.