Unexpected low fp16 performance on P3
I'm getting highly non-uniform performance for float16 matmul on P3 using the recommended NVIDIA container. I was told by Tom Reed at GTC that this is not expected, so maybe someone could redirect this to the proper channel.

To reproduce, run the following on a Volta machine:

wget https://raw.githubusercontent.com/yaroslavvb/stuff/master/matmul_benchmark_seq.py
export TF_CPP_MIN_LOG_LEVEL=1
python matmul_benchmark_seq.py --dtype=float16


You'll see something like this.

7512,76.0847702634
8192,87.2323633474
8933,15.2443599021
9741,15.0255254543

This means it got about 87 Tops/s for the 8192x8192 matmul, but only about 15 Tops/s for the 8933x8933 one.
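
For context, the second column is Tops/s derived from the wall-clock time of the matmul, roughly like this (a simplified sketch of what the benchmark computes, not a copy of the script):

import time
import tensorflow as tf

n = 8192
a = tf.Variable(tf.ones((n, n), dtype=tf.float16))
b = tf.Variable(tf.ones((n, n), dtype=tf.float16))
c = tf.matmul(a, b)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(c.op)                      # warm-up
    start = time.time()
    sess.run(c.op)
    elapsed = time.time() - start
    ops = 2 * n ** 3                    # multiply-adds counted as 2 ops
    print(n, ops / elapsed / 1e12)      # size, Tops/s -- the two columns above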

For more graphs, see https://medium.com/@yaroslavvb/peak-performance-of-amazon-p3-instances-f2bc48f9ef71


For more details: I used the Amazon Ubuntu CUDA 9 AMI -- https://aws.amazon.com/marketplace/pp/B076TGJHY1?qid=1509675887754&sr=0-4&ref_=srh_res_product_title

Then I followed the AWS instructions to optimize GPU settings:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/optimize_gpu.html

Then I used nvidia-docker with the official TensorFlow container:

sudo nvidia-persistenced          # start the persistence daemon
sudo nvidia-smi -ac 877,1530      # set application clocks for p3 (V100)

sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable"
sudo apt-get update
apt-cache search docker-ce
sudo apt-get install -y docker-ce
wget https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i nvidia-docker_1.0.1-1_amd64.deb

sudo docker login nvcr.io
sudo docker pull nvcr.io/nvidia/tensorflow:17.10

sudo nvidia-docker run -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v /home/ubuntu/docker:/data/mnist nvcr.io/nvidia/tensorflow:17.10

wget https://raw.githubusercontent.com/yaroslavvb/stuff/master/matmul_benchmark_seq.py
export TF_CPP_MIN_LOG_LEVEL=1
export CUDA_VISIBLE_DEVICES=0
python matmul_benchmark_seq.py --dtype=float16
Attachment: img.png

#1
Posted 11/03/2017 02:30 AM   
Our engineering team states that all of k, lda, ldb, and ldc must be a multiple of eight; m must be a multiple of four. The Tensor Core math routines stride through input data in steps of eight values, so the dimensions of the matrices must be multiples of eight.

For more details see the post at https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/
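
In TensorFlow terms, a minimal sketch of one workaround -- padding an awkward size like 8933 up to the next multiple of eight before the fp16 matmul (the pad_to8 helper and the zero-filled operands are just illustrative):

import tensorflow as tf

def pad_to8(n):
    # round a dimension up to the next multiple of 8 so cuBLAS can take the
    # Tensor Core GEMM path
    return ((n + 7) // 8) * 8

n = 8933                 # one of the slow sizes from the numbers above
m = pad_to8(n)           # 8936

a = tf.zeros((m, m), dtype=tf.float16)
b = tf.zeros((m, m), dtype=tf.float16)
c = tf.matmul(a, b)      # padded matmul; slice the [:n, :n] block afterwards if needed

with tf.Session() as sess:
    sess.run(c.op)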

#2
Posted 11/07/2017 11:05 PM   
Testing on my Titan V I see spikes too.

I assume the spikes only become visible starting with 512x512 matrices because, for smaller matmuls, too much time is spent moving data (a rough check of that crossover follows the numbers below):

430,1.0276874923
469,1.2882271302
512,2.2436777223
558,1.8147125514
608,3.6640702490
663,2.8910543966
724,3.6435559331
789,3.5648453471
861,4.4266646591
939,5.2201968200
1024,14.8427175163
1116,6.2453918178
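
One rough way to check that crossover, treating "copying data" as on-GPU memory traffic and plugging in approximate published Titan V figures (~110 Tflop/s fp16 Tensor Core peak, ~650 GB/s memory bandwidth; these are assumptions, not measurements from this test):

# An n x n fp16 matmul does 2*n^3 flops while touching roughly 3 matrices of
# 2*n^2 bytes each, so its arithmetic intensity is about n/3 flops per byte.
peak_flops = 110e12    # assumed Titan V fp16 Tensor Core peak, flop/s
peak_bw = 650e9        # assumed memory bandwidth, bytes/s

ridge = peak_flops / peak_bw    # flops/byte needed before compute dominates (~170)
n_crossover = 3 * ridge         # size where n/3 reaches the ridge
print(round(n_crossover))       # ~510, roughly where the spikes start showing up

Host-to-device copies and launch overhead also hurt the small sizes, so this is only a rough bound.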


I am still trying to get a significant performance increase in more realistic DL tasks with the V100 (Titan V). So far, even with very matmul/conv-heavy architectures (Transformer), I only see a 25% performance increase when switching to FP16 -- nothing like the spikes in this synthetic test.

Also, I don't see the "doubling" of available memory: I can only increase the batch size by ~10% when switching all my variables from FP32 to FP16 before hitting out-of-memory. I guess my TensorFlow implementation still has FP32 tensors somewhere, and lots of them.
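
For what it's worth, a minimal sketch (assuming the TF 1.x variable-scope API; the fp16_getter name and the shapes are illustrative, not from this thread) of one common way to keep variable storage in fp32 while feeding fp16 to the matmuls:

import tensorflow as tf

def fp16_getter(getter, name, dtype=None, *args, **kwargs):
    # Keep the master copy of each variable in fp32 (for stable updates),
    # but hand an fp16 cast to the graph so matmul/conv can hit Tensor Cores.
    var = getter(name, *args, dtype=tf.float32, **kwargs)
    return tf.cast(var, tf.float16) if dtype == tf.float16 else var

with tf.variable_scope('model', custom_getter=fp16_getter, dtype=tf.float16):
    x = tf.placeholder(tf.float16, [None, 1024])
    w = tf.get_variable('w', [1024, 4096])   # fp32 storage, fp16 in the graph
    y = tf.matmul(x, w)                      # fp16 GEMM; dims are multiples of 8

Since the master copies stay in fp32, variable memory does not shrink with this pattern; most of the memory savings from fp16 typically come from activations rather than variables, which would fit the observation that changing only the variables barely moves the batch size.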

#3
Posted 12/28/2017 03:48 AM   
Answer Accepted by Forum Admin
From the above observations, let's assume that for the 8192x8192 matmul the V100 becomes compute-bound:

  • it needs to transfer ~256 MiB of input data, plus whatever is moved within the GPU
  • it performs ~2*8192**3 ~ 1.1 Tflop of arithmetic


For a 1x1 2d convolution with N input channels and N output channels (a small numeric check of these estimates follows the list):
  • we need a ~floor(sqrt(134217728/N))-sized input to apply the conv to, to move the same amount of data
  • it will perform (N**2 + (N-1)*N)*(134217728/N) ~ 268435456*N flops ~ 0.27*N Gflop (we need N ~ 4096 to get the same amount of computation as the 8192x8192 matmul)
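
As a quick sanity check, here is the same arithmetic worked out in code (nothing measured; it just reproduces the estimates above):

n = 8192
bytes_in = 2 * (n * n) * 2              # two fp16 input matrices, 2 bytes per element
flops_matmul = 2 * n ** 3               # one multiply + one add per inner-product step

print(bytes_in / 2 ** 20, "MiB")        # 256.0 MiB
print(flops_matmul / 1e12, "Tflop")     # ~1.1 Tflop

# 1x1 conv with N input and N output channels over H*W = 134217728/N positions,
# so the input holds 134217728 fp16 elements (~256 MiB), same as both matmul inputs:
def conv_flops(N):
    return (N ** 2 + (N - 1) * N) * (134217728 // N)   # ~0.27*N Gflop

print(conv_flops(1024) / 1e9, "Gflop")                 # ~275 Gflop at N=1024
print(flops_matmul / (2 * 134217728))                  # ~4096 channels to match the matmul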


In practice I get an error when trying to create a matrix that big; I had to scale the size down by 100x. The error looks like:
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=134212225, n=1, k=1
[[Node: Conv2D = Conv2D[T=DT_HALF, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Variable/read, Variable_1/read)]]



Testing (expanding Yaroslav's code: https://gist.github.com/dimitry12/d8eb165eb9ecd474d6a017156bec3466#file-conv-py-L76-L79; a simplified sketch of the timing loop follows the numbers):
38,0.6277977837
41,0.7149783805
45,0.8152390789
49,0.9075997542
53,1.0192397357
58,1.0858789485
64,1.5005017117
69,1.3305306010
76,1.4670993722
82,1.4340319776
90,1.5266188081
98,1.6651400846
107,1.9491080286
117,2.1242829384
128,2.9893740279
139,1.9558330697
152,2.9484651222
165,2.3725707415
181,2.5822941898
197,2.9695160932
215,3.1080774282
234,3.2204323179
256,5.3002793901
279,3.8160422806
304,4.7311686984
331,3.7756175425
362,4.2303946644
394,3.9343579247
430,3.8086561374
469,5.0551012201
512,6.7722605583
558,4.7593597729
608,7.7169478252
663,4.5043897079
724,4.3966474121
789,4.4033521839
861,4.3245146979
939,3.8774351501
1024,4.4114756494
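
For reference, the timing loop behind these numbers is roughly of this shape (an assumed, simplified sketch in the spirit of the gist above, not the exact code; the channel count and sizes are illustrative):

import time
import numpy as np
import tensorflow as tf

N = 256                                     # channels; kept a multiple of 8 for Tensor Cores
HW = int(np.sqrt(134217728 // N))           # spatial size giving ~256 MiB of fp16 input

x = tf.Variable(tf.zeros([1, HW, HW, N], dtype=tf.float16))
k = tf.Variable(tf.zeros([1, 1, N, N], dtype=tf.float16))
y = tf.nn.conv2d(x, k, strides=[1, 1, 1, 1], padding='VALID', data_format='NHWC')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y.op)                          # warm-up
    start = time.time()
    sess.run(y.op)
    elapsed = time.time() - start
    flops = (N ** 2 + (N - 1) * N) * HW * HW
    print(N, flops / elapsed / 1e12)        # channels, Tops/s -- same format as above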



Spikes are present, but they are less pronounced and the overall flops are much lower (my math is certainly wrong somewhere). At least it confirms for me that conv2d in nvcr.io/nvidia/tensorflow:17.12 does use Tensor Cores.

Interestingly, performance degrades as the number of channels increases (and HxW correspondingly decreases).

Really, with a 1x1 kernel this conv2d is not even a matrix-matrix multiply but a vector-matrix multiply - I am surprised the spikes (as a tell-tale of Tensor Cores) even show up.

#4
Posted 12/28/2017 04:50 AM   