I am just wondering if there is a sample for using Tensor Cores with Xavier, as the one below doesn't seem to support aarch64.
https://github.com/NVIDIA/cuda-samples/tree/master/Samples/cudaTensorCoreGemm
Thanks
Hi Andrey, with a couple minor modifications to the Makefile, this WMMA sample builds and runs on Jetson AGX Xavier.
Comment out lines 250-253 of the Makefile:
#ifeq ($(TARGET_ARCH),aarch64)
# $(info >>> WARNING - cudaTensorCoreGemm is not supported on aarch64 - waiving sample <<<)
# SAMPLE_ENABLED := 0
#endif
Change line 267 of the Makefile so the gencode list includes compute_72 / sm_72:
# Gencode arguments
SMS ?= 70 72 75
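For context, the tensor-core path this sample exercises is the `nvcuda::wmma` API, which is only compiled in for architectures >= sm_70 (hence the gencode change above). A minimal sketch of that API, not the sample's actual kernel, assuming half-precision inputs and a single 16x16x16 tile:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B on the tensor cores.
// Compile with: nvcc -arch=sm_72 (or any sm >= 70).
__global__ void wmma_tile(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);           // C := 0
    wmma::load_matrix_sync(a_frag, a, 16);       // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

The full sample tiles a 4096x4096 GEMM over many warps and stages fragments through shared memory, which is where the 64 KB shared-memory requirement in the output below comes from.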
Then build and run it:
$ make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common -m64 -maxrregcount=255 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o cudaTensorCoreGemm.o -c cudaTensorCoreGemm.cu
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o cudaTensorCoreGemm cudaTensorCoreGemm.o
mkdir -p ../../bin/aarch64/linux/release
cp cudaTensorCoreGemm ../../bin/aarch64/linux/release
$ ./cudaTensorCoreGemm
Initializing...
GPU Device 0: "Xavier" with compute capability 7.2
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 54.112225 ms
TFLOPS: 2.54
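If you want to confirm tensor-core support before patching the Makefile on another board, a quick runtime check of the compute capability (>= 7.0 is required for WMMA) looks like this — a small standalone sketch, not part of the sample:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print device 0's compute capability and whether WMMA is available.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "No CUDA device found\n");
        return 1;
    }
    printf("GPU: \"%s\", compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    if (prop.major >= 7)
        printf("Tensor cores (WMMA) supported\n");
    return 0;
}
```

On Xavier this should report compute capability 7.2, matching the sample output above.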
Dustin, thank you.
After executing these two commands sequentially (specifically in that order, not otherwise):
sudo nvpmodel -m 0
sudo ./jetson_clocks.sh
the performance noticeably increases:
$ ./cudaTensorCoreGemm
Initializing...
GPU Device 0: "Xavier" with compute capability 7.2
M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm
Time: 37.658943 ms
TFLOPS: 3.65