Performance question for Tesla V100

I was intrigued by this from the Tesla V100 product page https://www.nvidia.com/en-us/data-center/tesla-v100/

Under performance it says that the V100 achieves 7.5 TeraFLOPS for double precision, 15 TeraFLOPS for single precision, and 120 TeraFLOPS (!!!) for deep learning.

How exactly would one achieve the 8x increase when doing deep learning? Would this only be when using specific packages (e.g. cuDNN)? Can you get this with TensorFlow? Would it be possible to achieve this with a hand-written DNN algorithm in CUDA? And why is it only with deep learning that one can achieve 120 TFLOPS?

The 120 TFLOPS is achieved using a special 4x4x4 matrix multiplication instruction. At GTC, there were some presentations by NVIDIA’s Olivier Giroux, Luke Durant and Mark Harris that gave more details on this. Short story is that this will be exposed in CUDA C++ for anyone to use and it won’t be confined to CUDNN.

It’s based on the use of TensorCore, which is a new computation engine in the Volta V100 GPU.

The TensorCore is not a general purpose arithmetic unit like an FP ALU; it performs a specific 4x4 matrix operation with hybrid data types. If your algorithm (whatever it may be) can take advantage of that, then you may see a perf improvement. It has to be coded for, and this operation does not trivially map onto a C or C++ operator (like multiply), so the exposure will probably be primarily through libraries, and the library in question for deep learning would be cuDNN.
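To make the shape of that operation concrete: from what has been described so far, a single TensorCore op is a fused 4x4 matrix multiply-accumulate, D = A*B + C, with FP16 multiplicands and FP32 accumulation. The snippet below is only a scalar reference of that arithmetic (plain floats stand in for the FP16 operands, and the function name is my own); it is not the hardware instruction or its eventual CUDA exposure.

```
// Scalar reference for what a single TensorCore op is described to compute:
//   D = A * B + C, where A, B, C, D are 4x4 matrices.
// In hardware, A and B would be FP16 and the products would be accumulated
// in FP32 along with C; plain floats are used here only to show the arithmetic.
void tensorcore_op_reference(const float A[4][4],  // conceptually FP16
                             const float B[4][4],  // conceptually FP16
                             const float C[4][4],  // FP32 addend
                             float       D[4][4])  // FP32 result
{
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];               // start from the FP32 addend
            for (int k = 0; k < 4; ++k) {
                acc += A[i][k] * B[k][j];      // FP16 products, FP32 accumulate
            }
            D[i][j] = acc;
        }
    }
}
```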

It’s likely that future versions of cuDNN will use the TensorCores on V100 (when V100 becomes available, in the future) and to the extent that these then become “available” to operations from e.g. TensorFlow that use the GPU, it should be possible (theoretically) to achieve a speed up for certain operations in Tensorflow.

In the future you should be able to use the TensorCore in much the same way that novel compute modes like INT8 and FP16 are currently exposed via cuDNN. You will have to specify the right settings and format your data correctly, and after that it should “just work” for a particular cuDNN library call.
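For comparison, this is roughly how the existing reduced-precision paths are selected in cuDNN today: you declare the element type on the descriptors and supply data in that format. The helper below is only an illustrative sketch of that pattern (the function name is mine); whatever setting eventually enables the TensorCores is not shown, because it has not been announced.

```
#include <cudnn.h>

// Illustrative only: how an FP16 path is currently requested in cuDNN, by
// declaring descriptors with CUDNN_DATA_HALF and supplying data in that
// layout. The helper name is mine. The setting that would eventually enable
// TensorCores is not shown here, because it has not been announced.
cudnnStatus_t make_fp16_tensor_desc(cudnnTensorDescriptor_t *desc,
                                    int n, int c, int h, int w)
{
    cudnnStatus_t st = cudnnCreateTensorDescriptor(desc);
    if (st != CUDNN_STATUS_SUCCESS) return st;
    // NCHW layout with FP16 elements; the caller must format its data to match.
    return cudnnSetTensor4dDescriptor(*desc, CUDNN_TENSOR_NCHW,
                                      CUDNN_DATA_HALF, n, c, h, w);
}
```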

Using it as a standalone operation in pure CUDA C/C++ should theoretically be possible, but it remains to be seen exactly how it will be exposed (if at all) in future versions of CUDA.

I don’t think it is accurate to say that the 120 TFLOPS can only be achieved with deep learning. The high throughput is a function of specialized operation(s) in conjunction with reduced precision. If you can find a good use for those specialized operations in use cases other than deep learning, you can get that performance for those other applications as well.

Caveats:

(1) The specialized operation(s) may not be exposed in a way that is conducive to use by anyone other than ninja programmers (e.g. they may require the use of inline assembly).

(2) Independent of precision and available operations, in many cases compiled code cannot achieve more than 75% to 85% of theoretical peak throughput, due to various other limiting factors such as register bank usage or decoder and scheduler limitations.

Is there a link to this presentation?

Content will be available to everyone after June 8, 2017, according to:
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php

If you happen to know someone who registered for GTC 2017, you can get the content earlier via the GTC registration portal linked from the page above.

Fair enough :)

Meanwhile (and this is a more theoretical question) what exactly is it about deep learning that makes 4x4x4 matrix multiplies a calculation of interest?

Deep Learning usually employs layered artificial neural nets.

Each neuron in each layer has a set of connection weights (multiplicative factors) which are used to compute the neuron output, based on the outputs from the previous layer.

yi = F(w1·x1 + w2·x2 + w3·x3 + …), where F is the neuron’s activation function.

Therefore, each neuron in layer y has a corresponding w vector of weights, which are used to multiply the outputs of the previous layer (x vector).

Taken together, layer y computation is a matrix-vector multiply. With some additional magic using a method such as batching or convolutional neural nets, we can convert this matrix-vector multiply into a matrix-matrix multiply.
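As a toy illustration of that conversion (names and layout are my own choices, and nothing here is optimized): stack one input vector per column of a matrix X, and the whole layer for a batch becomes a single matrix-matrix product Y = W * X.

```
#include <vector>

// Toy sketch: W is [num_outputs x num_inputs] (one row of weights per neuron),
// X stacks one input vector per column [num_inputs x batch], so the whole
// layer for the batch is the matrix-matrix product Y = W * X
// ([num_outputs x batch]). Bias and activation are omitted.
void layer_forward_batched(const std::vector<float> &W, int num_outputs, int num_inputs,
                           const std::vector<float> &X, int batch,
                           std::vector<float> &Y)
{
    Y.assign(static_cast<size_t>(num_outputs) * batch, 0.0f);
    for (int i = 0; i < num_outputs; ++i)       // one row of W per neuron
        for (int b = 0; b < batch; ++b)         // one column of X per sample
            for (int k = 0; k < num_inputs; ++k)
                Y[i * batch + b] += W[i * num_inputs + k] * X[k * batch + b];
}
```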

The TensorCore accelerates 4x4 chunks of these matrix-matrix multiply operations that are used in DNNs.

For neural networks, it may be sufficient to express the weight vector (matrix) as FP16 quantities, and likewise to express individual neuron outputs as FP16 quantities, but computing the matrix-matrix product may work better if the accumulation operation is done against an FP32 reduction variable.

This is a hand-waving description of the motivation for this type of operation with hybrid (mixed FP16/FP32) data.

This description tends to apply more to the training operation, which is not exactly what I described above, but similar; it will also use matrix-matrix multiplies to adjust the weights. For inference operations, it may also be interesting to use even further reduced precision datatypes such as INT8.

Gotcha. So could the new tensor module be used in any large-matrix multiplication of FP16/INT8 values?

Sounds like you want a full functional spec and all the details now. I don’t think all that information has been disclosed yet.

I think what I’ve heard so far does not map into INT8 at all, and does not map directly into an ordinary FP16 matrix-matrix multiply, because there the output datatype would be FP16.

If you had some imaginary BLAS operation that did an FP16xFP16 matrix-matrix multiply and produced an FP32 result matrix, you could probably use this feature to good effect there. You might want to look at some of the existing exposed capability in the SgemmEx function in cuBLAS.
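For reference, that “imaginary” operation is roughly what cublasSgemmEx already allows today: FP16 storage for A and B with FP32 compute and an FP32 result. A sketch follows (the wrapper name and the assumption of column-major, non-transposed operands are mine); this is the existing mixed-precision path, not a TensorCore path.

```
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: A and B are device pointers to FP16 matrices, C is a device pointer
// to an FP32 matrix, all column-major (as cuBLAS expects), non-transposed.
// cublasSgemmEx computes in FP32 and produces an FP32 result even though the
// inputs are stored as FP16. The wrapper name is mine.
cublasStatus_t fp16_in_fp32_out_gemm(cublasHandle_t handle,
                                     int m, int n, int k,
                                     const __half *A, const __half *B, float *C)
{
    const float alpha = 1.0f, beta = 0.0f;
    return cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                         m, n, k,
                         &alpha,
                         A, CUDA_R_16F, m,    // A: m x k, FP16, lda = m
                         B, CUDA_R_16F, k,    // B: k x n, FP16, ldb = k
                         &beta,
                         C, CUDA_R_32F, m);   // C: m x n, FP32, ldc = m
}
```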

So just to clarify, (at least some) existing cuBLAS functions should be able to take advantage of the new tensor module?

To me SgemmEx is a grab-bag of various gemm-like functions that don’t easily map to standard BLAS functionality. I could imagine another item being added to that grab-bag. Beyond that, I don’t want to speculate what the final library implementations may look like. I have no doubt that cuDNN will take advantage of it somehow, in the future. It seems likely to me also that it would be exposed, somehow, via CUBLAS, but I am not certain of that and this is all basically just speculation anyway. Specific implementation announcements have not been made yet AFAIK.

(later:)
I can probably safely be slightly more definitive: there should be C++ as well as cuDNN and cuBLAS “exposures” of the TensorCore functionality. From the CUDA 9 blog:

“During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores. CUDA 9 includes a CUDA C++ API for warp-level matrix-multiply and accumulate as a preview feature. These C++ interfaces provide specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently utilize Tensor Cores in CUDA C++ programs.

In addition to CUDA C++ interfaces to program Tensor Cores directly, CUDA 9 cuBLAS and cuDNN libraries include new library interfaces to make use of Tensor Cores for deep learning applications and frameworks. NVIDIA has worked with many popular deep learning frameworks such as Caffe2 and MXNet to enable the use of Tensor Cores for deep learning research on Volta GPU based systems. NVIDIA continues to work with other framework developers to enable broad access to Tensor Cores for the entire deep learning ecosystem.”
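Based on that description, the preview CUDA C++ exposure would look something like the warp-level kernel below. The wmma namespace, fragment types, and 16x16x16 shape come from the CUDA 9 preview material; details could change before release, and it requires Volta hardware.

```
#include <mma.h>
using namespace nvcuda;

// Sketch of the warp-level matrix multiply-accumulate preview described above:
// one warp cooperatively computes a 16x16x16 tile, D = A * B + C, with FP16
// operands and FP32 accumulation. Preview API; details could change.
// Requires a Volta (sm_70) GPU and compilation with CUDA 9.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c)
{
    // Per-warp operand tiles ("fragments") distributed across the warp's registers.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);              // zero the FP32 accumulator
    wmma::load_matrix_sync(a_frag, a, 16);          // load a 16x16 FP16 tile of A
    wmma::load_matrix_sync(b_frag, b, 16);          // load a 16x16 FP16 tile of B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // TensorCore multiply-accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major); // write FP32 tile
}
```

A single warp (32 threads) handles one 16x16 output tile, so a launch like wmma_16x16x16<<<1, 32>>>(d_a, d_b, d_c) would compute one tile.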