Tensorflow batch_to_space_nd() not working for large channel sizes on TX2

Hi,

I am moving a GraphDef containing a model trained on desktop/server-class GPUs over to the Jetson TX2. The model output on the TX2 is very bad, so I started tracing through the layers and found that the first output to differ from a GTX 1080 came after one of my convolution layers. Drilling down, I found that the BatchToSpaceND operation is not producing correct output: dimensions that I expect to be zero-padded are not, and none of the input tensor values seem to be preserved by the reshape.
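
For reference, the per-layer comparison I did looks roughly like this (a minimal sketch; the protobuf path, tensor names, and input shape are illustrative placeholders, not the real names from my graph):

import numpy as np
import tensorflow as tf

def run_tensor(device, feed):
    # Load the frozen GraphDef and pin the imported ops to one device.
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with open('model.pb', 'rb') as f:  # placeholder path
            graph_def.ParseFromString(f.read())
        with tf.device(device):
            tf.import_graph_def(graph_def, name='')
    with tf.Session(graph=graph) as sess:
        # placeholder tensor name for the op under suspicion
        return sess.run('layer/BatchToSpaceND:0', feed_dict=feed)

feed = {'input:0': np.ones((1, 65, 65, 3), dtype=np.float32)}  # placeholder input
cpu_out = run_tensor('/cpu:0', feed)
gpu_out = run_tensor('/gpu:0', feed)
print('max abs diff:', np.abs(cpu_out - gpu_out).max())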

Upon searching I found https://devtalk.nvidia.com/default/topic/1036144/jetson-tx2/tensorflow-operation-tf-batch_to_space_nd-function-not-working-as-expected-on-jetson-tx2/ and ran a similar test. Rather than random data, I insert ones so that I know when corruption is occurring. Here is the result of a run with the GPU visible:

In [3]: import os
   ...: os.environ['CUDA_VISIBLE_DEVICES'] = '0'
   ...: import tensorflow as tf
   ...: import numpy as np
   ...: mat=np.ones((1,65,65, 543))
   ...: in1=tf.constant(mat,tf.float32)
   ...: block_shape=tf.constant([2,2],tf.int32)
   ...: paddings=tf.constant([[2,3],[2,3]],tf.int32)
   ...: op=tf.space_to_batch_nd(in1,block_shape,paddings)
   ...: print(in1)
   ...: print(op)
   ...: with tf.Session() as sess:
   ...:     out=sess.run(op)
   ...:     print('sum of elements in out:',np.sum(out))
   ...:
Tensor("Const_3:0", shape=(1, 65, 65, 543), dtype=float32)
Tensor("SpaceToBatchND_1:0", shape=(4, 35, 35, 543), dtype=float32)
2018-07-26 21:39:28.232793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-26 21:39:28.232885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-26 21:39:28.232915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0
2018-07-26 21:39:28.232937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N
2018-07-26 21:39:28.233029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 973 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
('sum of elements in out:', 0.0)

As you can see, the sum is 0, which is incorrect. If I restart Python and hide the GPU, I get the correct result.

In [1]: import os
   ...: os.environ['CUDA_VISIBLE_DEVICES'] = ''
   ...: import tensorflow as tf
   ...: import numpy as np
   ...: mat=np.ones((1,65,65, 543))
   ...: in1=tf.constant(mat,tf.float32)
   ...: block_shape=tf.constant([2,2],tf.int32)
   ...: paddings=tf.constant([[2,3],[2,3]],tf.int32)
   ...: op=tf.space_to_batch_nd(in1,block_shape,paddings)
   ...: print(in1)
   ...: print(op)
   ...: with tf.Session() as sess:
   ...:     out=sess.run(op)
   ...:     print('sum of elements in out:',np.sum(out))
   ...:
Tensor("Const:0", shape=(1, 65, 65, 543), dtype=float32)
Tensor("SpaceToBatchND:0", shape=(4, 35, 35, 543), dtype=float32)
2018-07-26 21:38:02.678283: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2018-07-26 21:38:02.678361: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (tegra-ubuntu): /proc/driver/nvidia/version does not exist
('sum of elements in out:', 2294175.0)

I have noticed that 543 channels seems to be the tipping point: at 543 and above I always get corruption, while below 543 the results seem fine. If I use more than 543 channels but swap the dimension ordering, I also seem to get sane results. On a first run I will get a sum of zeros, but if I have been doing a lot of computation (or, for example, I load my protobuf file first) then I get seemingly random numbers.
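
For reference, this is roughly the sweep I used to find the threshold (a quick sketch; the loop bounds are just illustrative):

import numpy as np
import tensorflow as tf

# Sweep the channel count to find where SpaceToBatchND starts corrupting.
for channels in range(540, 547):
    tf.reset_default_graph()
    mat = np.ones((1, 65, 65, channels), dtype=np.float32)
    op = tf.space_to_batch_nd(tf.constant(mat), [2, 2], [[2, 3], [2, 3]])
    with tf.Session() as sess:
        out = sess.run(op)
    # SpaceToBatchND only rearranges and zero-pads, so the sum must be preserved.
    expected = 65 * 65 * channels
    print(channels, 'ok' if np.sum(out) == expected else 'CORRUPT (sum=%s)' % np.sum(out))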

I was using Pete Lee's TensorFlow build originally, but I just reproduced the same results with the NVIDIA-released TensorFlow 1.8 from https://devtalk.nvidia.com/default/topic/1031300/jetson-tx2/tensorflow-1-9-rc-wheel-with-jetpack-3-2-/.

Hi,

We can reproduce this issue on our side as well. We are working on it.

From the TensorFlow log:
2018-07-27 09:37:35.185330: E tensorflow/core/common_runtime/direct_session.cc:154] Internal: Failed to get memory allocator for TF GPU 0 with 5587259392 bytes of memory.

Do you need that much memory for your application?
Or were you just trying it for debugging?

Thanks.

Hi,

That is just the default TensorFlow behavior; see https://www.tensorflow.org/guide/using_gpu#allowing_gpu_memory_growth for the details and reasoning. This particular example doesn't actually need anywhere near that much memory. Also, I mostly took the dimensions from the previously linked issue; my graph displays similar behavior but has different dimensions.
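
For reference, this is the standard TF 1.x snippet to allocate GPU memory on demand instead of grabbing it all at session creation:

import tensorflow as tf

# Grow the GPU memory pool on demand rather than claiming everything up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    pass  # run the graph as usual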

I'm not intimately familiar with how CUDA/TF maps memory; if memory is zeroed out upon allocation, that would help explain the behavior I mentioned:

"On a first run I will get a sum of zeros, but if I have been doing a lot of computation (or, for example, I load my protobuf file first) then I get seemingly random numbers."

If it needs a contiguous memory region, I believe (AastaLLL, please correct me if I'm wrong) that around 4 GB is the maximum amount of contiguous memory that CUDA can allocate on Jetson at the moment. That's quite ironic, since the TX2 was released a while ago and already features 8 GB of RAM, and the upcoming Xavier will have 16 GB. I hope NVIDIA will release a fix before Xavier is out.

-albertr

Hi, nwest

Sorry for the stupid question; I forgot that TensorFlow allocates all the available memory by default.
Checking this use case with cuda-memcheck, I found that the error comes from cudaErrorLaunchOutOfResources:

GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
========= Program hit cudaErrorLaunchOutOfResources (error 7) due to "too many resources requested for launch" on CUDA API call to cudaLaunch. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1 [0x2e6928]
=========     Host Frame:/usr/local/cuda-9.0/lib64/libcudart.so.9.0 (cudaLaunch + 0x128) [0x414cc]
=========
('sum of elements in out:', 0.0)
========= ERROR SUMMARY: 1 error

You can find more explanation in our documentation.
The three possibilities for a resource error are:

  • Too large a block or grid size.
  • Too many registers
  • Too much shared memory

As a result, this error may come from an incorrect shared memory size or an incorrect thread count.
We are still discussing this internally and will update you with more information later.
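
In the meantime, one possible interim workaround (a sketch only, not verified against your full graph) is to pin the affected op to the CPU, where the kernel produces the correct result:

import numpy as np
import tensorflow as tf

in1 = tf.constant(np.ones((1, 65, 65, 543)), tf.float32)

# Force SpaceToBatchND onto the CPU while the GPU kernel misbehaves.
with tf.device('/cpu:0'):
    op = tf.space_to_batch_nd(in1, [2, 2], [[2, 3], [2, 3]])

with tf.Session() as sess:
    print(np.sum(sess.run(op)))  # should print 2294175.0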

Thanks.

Hi,

We found that this is a regression and does not occur with CUDA 8.0.
Please use CUDA 8.0 as a temporary workaround.

For example:
Jetpack3.1/Cuda8/Cudnn6/TF1.3
Jetpack3.1/Cuda8/Cudnn6/TF1.5

We are working on fixing this issue with CUDA 9.0 and will update you with more information later.
Thanks.

Has JetPack 4.2 fixed this issue?

Hi,

Sorry, the answer is no.

This issue is fixed on the Xavier platform.
The workaround for the TX2 is still in progress, but at a lower priority.

Thanks.

With the latest TensorFlow (tensorflow_gpu-1.13.1+nv19.4-cp36-cp36m-linux_aarch64.whl) and R32.1 on Xavier, the issue is still there:

failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

2019-04-25 13:55:45.949648: W tensorflow/core/platform/profile_utils/cpu_utils.cc:98] Failed to find bogomips in /proc/cpuinfo; cannot determine CPU frequency
2019-04-25 13:55:45.951973: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x27e1e180 executing computations on platform Host. Devices:
2019-04-25 13:55:45.952520: I tensorflow/compiler/xla/service/service.cc:168] StreamExecutor device (0): ,
2019-04-25 13:55:46.000949: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-04-25 13:55:46.001140: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:148] kernel driver does not appear to be running on this host (nvidia-xavier): /proc/driver/nvidia/version does not exist

Hi,

You are hitting a different issue; it looks like TensorFlow cannot recognize your device.

failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected

Please note that there are dependencies between the JetPack version and the TensorFlow library.
You will need to install the TensorFlow package that was built against the same JetPack version as yours.

You can find our official release here:
https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/
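
After installing the matching package, a quick way to confirm TensorFlow can see the GPU (TF 1.x API):

import tensorflow as tf

# Prints True and logs the Tegra GPU if the wheel matches your JetPack version.
print(tf.test.is_gpu_available())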

Thanks.

I am using JetPack 4.3 on a TX2 with CUDA 10.0, TensorFlow 2.0, and cuDNN 7.6.3. Below is the error I am facing; any help would be appreciated.

2020-07-13 15:36:39.592939: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-13 15:36:39.597257: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-07-13 15:36:39.597353: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (civilmapstx2-desktop): /proc/driver/nvidia/version does not exist

Hi harendra,

Please open a new topic for your issue. Thanks.