However, when I try to run the following trivial example:
import tensorflow as tf
# Creates a graph.
print('a')
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
print('b')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
print('c')
c = tf.matmul(a, b)
print('d')
# Creates a session with log_device_placement set to True.
print('e')
sess = tf.Session()  # config=tf.ConfigProto(log_device_placement=True)
# Run the op.
print('f')
print(sess.run(c))
I get the following error over and over again:
2016-05-06 05:43:48.865368: E tensorflow/stream_executor/cuda/cuda_driver.cc:967] failed to alloc 1048576 bytes on host: CUDA_ERROR_UNKNOWN
2016-05-06 05:43:48.865480: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 1048576
2016-05-06 05:43:48.865508: E tensorflow/stream_executor/cuda/cuda_driver.cc:967] failed to alloc 943872 bytes on host: CUDA_ERROR_UNKNOWN
2016-05-06 05:43:48.865532: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 943872
2016-05-06 05:43:48.865568: E tensorflow/stream_executor/cuda/cuda_driver.cc:967] failed to alloc 849664 bytes on host: CUDA_ERROR_UNKNOWN
2016-05-06 05:43:48.865619: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 849664
2016-05-06 05:43:48.865644: E tensorflow/stream_executor/cuda/cuda_driver.cc:967] failed to alloc 764928 bytes on host: CUDA_ERROR_UNKNOWN
2016-05-06 05:43:48.865667: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 764928
Thank you @AastaLLL - it will be interesting to see whether you are able to reproduce the problem internally.
If you are able to get TF 1.6 to run with JetPack 3.2, can you provide detailed instructions on how you did it? Also, if possible, could you provide a Python wheel? I am using Python 3.
Dear @AastaLLL, I am afraid I agree with @Hallon - the script (tf1.6_install_wheel.sh) does not work correctly. When I first ran the script it worked. However, after a reboot, running even a very simple operation such as the following:
>>> import tensorflow as tf
>>> a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
>>> b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
>>> c = tf.matmul(a, b)
>>> sess = tf.Session()
2016-05-06 05:45:36.431471: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:859] ARM64 does not support NUMA - returning NUMA node zero
2016-05-06 05:45:36.431593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1208] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.67GiB freeMemory: 5.56GiB
2016-05-06 05:45:36.431662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1308] Adding visible gpu devices: 0
2016-05-06 05:45:37.126531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5208 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
>>> sess.run(c)
I get errors which look like this:
2016-05-06 05:46:07.460227: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 2304
2016-05-06 05:46:07.460244: E tensorflow/stream_executor/cuda/cuda_driver.cc:967] failed to alloc 2304 bytes on host: CUDA_ERROR_UNKNOWN
2016-05-06 05:46:07.460261: W ./tensorflow/core/common_runtime/gpu/pool_allocator.h:195] could not allocate pinned host memory of size: 2304
2016-05-06 05:46:07.460279: E tensorflow/stream_executor/cuda/cuda_driver.cc:967] failed to alloc 2304 bytes on host: CUDA_ERROR_UNKNOWN
(the same pair of W/E lines repeats many more times)
Can you please reboot the Jetson where you ran your install script and try my code, to verify that you observe the same issue as I do?
@AastaLLL, thank you for confirming that you are able to reproduce the error. It's interesting that it is CUDA 9.0 related. I look forward to a resolution.
@AastaLLL, this appears to fix the problem! My understanding is that the allow_growth option grows the GPU memory allocation dynamically, as needed, rather than mapping all available GPU memory to the TensorFlow process up front. What was the original cause of the problem, and why does this solve it?
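For anyone landing here later, a minimal sketch of the workaround as I understand it, using the TF 1.x session API:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of
# reserving all free memory up front. On a shared-memory iGPU
# such as the TX2, the large up-front reservation is what fails.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

sess = tf.Session(config=config)
```

With this config the earlier matmul example runs for me without the CUDA_ERROR_UNKNOWN spam.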
Per the TensorRT documentation:
------
by default it will try to allocate all the available GPU memory.
------
On a fresh boot the amount of free memory is very high (about 6.2 GB).
In an iGPU environment such a large allocation will generally fail, since the host and GPU share the same physical memory.
The workaround restricts the amount of memory allocated, so the allocation succeeds.
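An alternative along the same lines, assuming the TF 1.x API, is to cap the fraction of GPU memory the process may claim instead of growing on demand (the 0.4 value here is illustrative, not tuned):

```python
import tensorflow as tf

# Cap the process at ~40% of total GPU memory so the initial
# allocation stays within what the shared host/GPU memory can
# actually provide on the TX2.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)
```

Either option avoids the default allocate-everything behavior that triggers the failure on boards where CPU and GPU share one memory pool.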