tensor core sample

I am just wondering if there is a sample for using tensor cores on Xavier, as the one below doesn't seem to support aarch64.
https://github.com/NVIDIA/cuda-samples/tree/master/Samples/cudaTensorCoreGemm
Thanks

Hi Andrey, with a couple of minor modifications to the Makefile, this WMMA sample builds and runs on Jetson AGX Xavier.

Comment out lines 250-253 of Makefile:

#ifeq ($(TARGET_ARCH),aarch64)
#  $(info >>> WARNING - cudaTensorCoreGemm is not supported on aarch64 - waiving sample <<<)
#  SAMPLE_ENABLED := 0
#endif

Change line 267 of Makefile to include support for compute_72 / sm_72:

# Gencode arguments
SMS ?= 70 72 75
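
For reference, here is a rough Python sketch of how an SMS list like the one above appears to be expanded into nvcc flags. It assumes the Makefile emits one SASS target per listed architecture plus a PTX target for the highest one, which is consistent with the nvcc command line in the build log; the function name gencode_flags is made up for illustration and is not part of the sample.

```python
def gencode_flags(sms: str) -> str:
    """Expand a space-separated SM list into nvcc -gencode flags.

    Assumption: one SASS target (code=sm_XX) per listed architecture,
    plus one PTX target (code=compute_XX) for the highest architecture.
    """
    archs = [int(s) for s in sms.split()]
    flags = [f"-gencode arch=compute_{sm},code=sm_{sm}" for sm in archs]
    highest = max(archs)
    flags.append(f"-gencode arch=compute_{highest},code=compute_{highest}")
    return " ".join(flags)


if __name__ == "__main__":
    # Same list as the modified Makefile line
    print(gencode_flags("70 72 75"))
```

Since Xavier reports compute capability 7.2, the sm_72 entry is the one that matters on this board.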

Then build and run it:

$ make
/usr/local/cuda/bin/nvcc -ccbin g++ -I../../Common  -m64    -maxrregcount=255 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o cudaTensorCoreGemm.o -c cudaTensorCoreGemm.cu
/usr/local/cuda/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o cudaTensorCoreGemm cudaTensorCoreGemm.o 
mkdir -p ../../bin/aarch64/linux/release
cp cudaTensorCoreGemm ../../bin/aarch64/linux/release
$ ./cudaTensorCoreGemm 
Initializing...
GPU Device 0: "Xavier" with compute capability 7.2

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
Time: 54.112225 ms
TFLOPS: 2.54
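
As a sanity check on the reported figure, the TFLOPS number can be reproduced from the matrix sizes and the measured time. This is a minimal sketch that assumes the usual GEMM operation count of 2*M*N*K (one multiply and one add per inner-product term); gemm_tflops is a name chosen here for illustration, not something taken from the sample.

```python
def gemm_tflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput of an M x N x K GEMM in TFLOPS.

    Assumption: 2*M*N*K floating-point operations in total.
    """
    flop = 2 * m * n * k
    seconds = time_ms * 1e-3
    return flop / seconds / 1e12


if __name__ == "__main__":
    # Sizes and time taken from the run above
    print(f"TFLOPS: {gemm_tflops(4096, 4096, 4096, 54.112225):.2f}")
```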

Dustin, thank you.
After executing the following sequentially (specifically in that order, not the other way around):

sudo nvpmodel -m 0
sudo ./jetson_clocks.sh

the performance noticeably increases:

./cudaTensorCoreGemm 
Initializing...
GPU Device 0: "Xavier" with compute capability 7.2

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
Time: 37.658943 ms
TFLOPS: 3.65
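
To put a number on the difference, here is a small Python sketch comparing the two timings reported in this thread (54.112225 ms at default settings, 37.658943 ms after nvpmodel and jetson_clocks). The helper name speedup is made up for illustration.

```python
def speedup(baseline_ms: float, tuned_ms: float) -> float:
    """Ratio of the baseline time to the tuned time (higher is better)."""
    return baseline_ms / tuned_ms


if __name__ == "__main__":
    ratio = speedup(54.112225, 37.658943)
    print(f"Speedup: {ratio:.2f}x")
```

That works out to roughly 1.44x, in line with the TFLOPS figure going from 2.54 to 3.65.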