Hi Community member,
I have a question I would like to ask.
Could you tell me how to confirm whether the Tensor Cores are working or not?
Best regards.
Kaka
Hi Kaka, please refer to this topic: [url]https://devtalk.nvidia.com/default/topic/1047176/jetson-agx-xavier/tenson-core-sample[/url]
Hi Dusty,
Thank you for your response.
But I could not find the answer to my question there.
Again, how can we confirm whether the Tensor Cores are working or not?
Is there a tool or status indicator we can check?
Best regards.
Kaka
Hi Kaka, that sample runs using the Tensor Core HMMA operations, it wouldn’t run if Tensor Cores weren’t working properly for some reason (not that it should occur).
Hi Dusty,
Thank you for your support. I understand that the Tensor Cores will be used if I run that sample code.
But I would like to confirm whether the Tensor Cores are being used when we run our own code.
Do you know a way to confirm that?
Best regards.
Kaka
Thanks Kaka, I understand now. You can use the nvprof CUDA profiler tool to capture Tensor Core usage while your application runs. nvprof supports two metrics for Tensor Core utilization: tensor_precision_fu_utilization (FP16 HMMA) and tensor_int_fu_utilization (INT8 IMMA).
Here is an example output of running it on the HMMA cudaTensorCoreGemm sample:
$ sudo /usr/local/cuda/bin/nvprof --kernels compute_gemm --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization ./cudaTensorCoreGemm
Initializing...
==24384== NVPROF is profiling process 24384, command: ./cudaTensorCoreGemm
GPU Device 0: "Xavier" with compute capability 7.2
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
==24384== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "compute_gemm(__half const *, __half const *, float const *, float*, float, float)" (done)
Time: 1086.695679 ms
TFLOPS: 0.13
==24384== Profiling application: ./cudaTensorCoreGemm
==24384== Profiling result:
==24384== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Xavier (0)"
Kernel: compute_gemm(__half const *, __half const *, float const *, float*, float, float)
1 tensor_precision_fu_utilization Tensor-Precision Function Unit Utilization Mid (5) Mid (5) Mid (5)
1 tensor_int_fu_utilization Tensor-Int Function Unit Utilization Idle (0) Idle (0) Idle (0)
Note that in this example, the tensor_int_fu_utilization metric is shown as Idle because the sample uses HMMA FP16 operations, not IMMA INT8.
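If you want to see what those metrics are detecting in your own code, here is a minimal sketch (my own illustration, not part of the cudaTensorCoreGemm sample) of a kernel that issues HMMA operations through the CUDA wmma API. Profiling it with the same --metrics flags as above should show tensor_precision_fu_utilization as non-idle, while a plain FP32 kernel would report Idle (0):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16x16 tile of D = A * B using Tensor Cores
// (FP16 inputs, FP32 accumulation). Launch as tiny_hmma<<<1, 32>>>(a, b, d).
__global__ void tiny_hmma(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(acc, fa, fb, acc);    // the HMMA Tensor Core operation
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```

Compile with nvcc -arch=sm_72 for Xavier (the wmma API needs sm_70 or newer), then run the binary under nvprof as shown above.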
Hi Dusty,
Thank you for your response and information. I got it!
Also, we would like to measure the Tensor Core performance. Do you know of any sample code where the Tensor Cores can be enabled or disabled, so that we can compare the two?
Best regards.
Kaka
If you are using TensorRT, TensorRT will automatically enable Tensor Cores, so they can’t be disabled other than by staying in FP32 instead of using FP16 / INT8.
If you are using cuDNN directly, you can choose not to have layers executed on the Tensor Cores by not specifying CUDNN_TENSOR_OP_MATH: [url]http://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#tensor_ops[/url]
If you are using TensorFlow, it looks like there is a setting to disable Tensor Core usage in cuDNN: [url]https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tf_disable_tensor_op_math[/url]
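In practice that TensorFlow setting is an environment variable; the variable name below is my reading of the linked guide's anchor, so double-check it against the page before relying on it:

```shell
# Disable Tensor Core math in TensorFlow's cuDNN calls (variable name
# assumed from the linked NVIDIA TensorFlow user guide):
export TF_DISABLE_TENSOR_OP_MATH=1
echo "$TF_DISABLE_TENSOR_OP_MATH"

# Unset it again to return to the default (Tensor Cores enabled):
unset TF_DISABLE_TENSOR_OP_MATH
```

Set the variable before launching your TensorFlow process so it is picked up at initialization.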
For cuBLAS, Tensor Cores are used through cublasGemmEx(), so use the regular cublasSgemm()-style functions if you don’t want to use the Tensor Cores.