TensorFlow on Jetson TX1

I have been following the TK1 discussion where people tried to compile TensorFlow for the TK1, but since the TX1 is out with CUDA 7.0 (see JetPack 2.0) I was wondering how to compile it for the TX1.

I started with the TensorFlow setup guide.
One of the first steps is to install bazel.
I get stuck at that very first step. I was able to install all the dependencies:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

Then I downloaded the sources and tried to compile them:

wget https://github.com/bazelbuild/bazel/archive/0.1.1.tar.gz
tar -xf 0.1.1.tar.gz
cd bazel-0.1.1
./compile.sh

All I got was:

INFO: You can skip this first step by providing a path to the bazel binary as second argument:
INFO:    ./compile.sh build /path/to/bazel
Building Bazel from scratch

I don't end up with the binary:

bazel-bin/src/bazel

and I did not receive any error message. Am I doing something obviously wrong?

Is there another way to get TensorFlow (with GPU support) up and running on the TX1?

You cannot use Oracle Java; it gets very confused by the kernel being 64-bit while the user space is 32-bit.
You will need to install OpenJDK and then modify the scripts/bootstrap/buildenv.sh file so that it identifies the Jetson TX1 as 32-bit ARM.
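For reference, a minimal sketch of the kind of buildenv.sh change meant here, assuming that bazel release detects the machine with uname -m (the exact variable names may differ between bazel versions): make the aarch64 that the 64-bit kernel reports count as ARM.

# scripts/bootstrap/buildenv.sh (sketch; names may differ per bazel release)
MACHINE_TYPE="$(uname -m)"    # the TX1's 64-bit kernel reports "aarch64"
MACHINE_IS_ARM='no'
if [ "${MACHINE_TYPE}" = 'arm' ] || [ "${MACHINE_TYPE}" = 'armv7l' ] \
   || [ "${MACHINE_TYPE}" = 'aarch64' ]; then
  MACHINE_IS_ARM='yes'        # added aarch64 so the TX1 counts as ARM
fi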

I wrote down the instructions for the TK1 (CUDA Musing: Building TensorFlow for Jetson TK1);
they are a better starting point than the official ones for a Jetson TX1 port.

I got protobuf and bazel compiled pretty easily.

mfatica's guide helped a lot with TensorFlow, but there were still a lot of problems. I think I have worked through all the arm64 problems, but now I am running into out-of-memory errors, and I need to add some swap to the board to finish the bazel build. If you have thoughts on this, I have a thread:
(TX1 -- swapon failed: Function not implemented - Jetson TX1 - NVIDIA Developer Forums)
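For anyone else stuck at this point, a minimal sketch of adding file-backed swap, assuming a kernel built with swap support (the stock L4T kernel's lack of it is exactly what the swapon error in that thread is about) and enough free space on the target filesystem:

sudo dd if=/dev/zero of=/swapfile bs=1M count=4096   # 4 GB swap file
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile    # fails with "Function not implemented" if CONFIG_SWAP is off
free -m                  # verify the new swap shows up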

I have the changes I made up here. I plan on cleaning stuff up and doing a better post with a binary when I get done.

Oh and maybe I should answer the question too.

I installed OpenJDK 8. I also had to modify a BUILD file in bazel to get it to compile. You will see that you also have to build protobuf from source on arm64 and swap out a jar and a binary file.

I have the compiled src folders zipped on my google drive. They have the compiled bazel and protobuf binaries in there. You are welcome to download it and use that. It won’t get you anywhere unless you can figure out how to mount a swap though.

After I get the thing compiled fully I’ll also post a cleaned up github fork with the changes for the TX1.

Here is the google drive folder
https://drive.google.com/folderview?id=0B2Lts5RHkvG9dkR2U0VHd0Jvbmc&usp=sharing

Cheers,
Nathan

The Oracle JDK 8.0.72 (early access) works on the TX1; it seems they noticed the error they made there.

@nburn42 Any progress on getting enough (swap) memory to compile TensorFlow?
@seeky15 Did you install the 64-bit or 32-bit early access JDK?

Hi everybody,

There has been no activity here for some time, so please allow me to ask a blunt question. What is the preferred way to install TensorFlow on the JTX1 today? Does GPU support work?

Best,

Siniša

I think this posting summarizes the issue pretty clearly. Most likely we'll have to wait for NVIDIA to release a new nvcc compiler (the variadic-templates problem) before we can build GPU-accelerated TensorFlow on the JTX1.

https://github.com/tensorflow/tensorflow/issues/851

The new L4T 24.1/CUDA release might help: https://devtalk.nvidia.com/default/topic/941913/jetson-tx1/jetpack-2-2-and-l4t-r24-1-for-jetson-tx1-released/

I tried recently with the latest (as of this writing) L4T 24.2 and JetPack 2.3, and even with the advice on the CUDA Musing blog post referenced earlier it was still a no-go. Has anyone had success with any of the recent configurations under JetPack 2.3? Having access to TensorFlow on my TX1 would be the achievement of my heart's desire, as far as the TX1 goes.

Author: makeSomeThingWt
Link: Building TensorFlow on an NVIDIA Jetson TX1 to play Flappy Bird - Zhihu
Source: Zhihu
The copyright belongs to the author; please contact the author for permission before reposting.

There are already plenty of tutorials online for setting up TensorFlow, but almost none for NVIDIA's TX1 chip. This tutorial is essentially a record of the pitfalls I hit over several weeks of setting up the environment; I could only tinker after work, so it took quite a while.

First get the basic environment in place: CUDA 8.0 and cuDNN, which TensorFlow requires. These can be installed with NVIDIA's JetPack 2.3 L4T toolkit. Briefly, here is how to flash the Jetson TX1 devkit: cut the power and unplug it, connect it to your PC with a USB cable, then power it back on. Press the power button, immediately hold the REC key, tap the RST key once in between, and release REC after two seconds; the board is now in flashing (recovery) mode.
Once the basic environment is set up: because the TX1 is an ARM platform, TensorFlow does not support it yet and has to be built from source. Before compiling TensorFlow you need to install two tools, bazel and protobuf; bazel is what builds TensorFlow. Details can be found on GitHub, so I won't go into them here.
Install protobuf

install deps

cd ~
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install git zip unzip autoconf automake libtool curl zlib1g-dev maven swig bzip2

# build protobuf 3.0.0-beta-2 jar
git clone https://github.com/google/protobuf.git
cd protobuf
# autogen.sh downloads broken gmock.zip in d5fb408d
git checkout master
./autogen.sh
git checkout d5fb408d
./configure --prefix=/usr
make -j 4
sudo make install
cd java

# if the downloads are very slow, look up how to switch mvn to a mirror (e.g. the oschina repository)
mvn package
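A quick sanity check before moving on (my addition, not in the original post): confirm the freshly built protoc and jar are where the next steps expect them.

protoc --version                            # should report libprotoc 3.0.x
ls target/protobuf-java-3.0.0-beta-2.jar    # the jar that gets copied into bazel below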

# note: switch to the 0.2.1 release branch, which does not have the grpc bug
git clone https://github.com/bazelbuild/bazel.git
cd bazel
git checkout 0.2.1
cp /usr/bin/protoc third_party/protobuf/protoc-linux-arm32.exe
cp ../protobuf/java/target/protobuf-java-3.0.0-beta-2.jar third_party/protobuf/protobuf-java-3.0.0-beta-1.jar

Because GitHub is throttled by the GFW, the whole process goes down better if you stay behind a proxy throughout. One thing to note: bazel does not support the ARM architecture out of the box and needs a change in its source.
Modify the bazel source: open src/main/java/com/google/devtools/build/lib/util/CPU.java. In the snippet below, lines starting with + are additions and lines starting with - are deletions. Anyone who uses git will recognize the format; just remember to drop the leading +/- signs.
@@ -25,7 +25,7 @@ import java.util.Set;
public enum CPU {
X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
- ARM("arm", ImmutableSet.of("arm", "armv7l")),
+ ARM("arm", ImmutableSet.of("arm", "armv7l", "aarch64")),
UNKNOWN("unknown", ImmutableSet.of());
Then run the following from the bazel directory:

./compile.sh
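If the compile succeeds, compile.sh prints where the binary landed; a quick check (my addition; for this release the binary should be in output/):

output/bazel version    # should report build label 0.2.1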
Building TensorFlow
Clone the tensorflow project:

git clone -b r0.9 https://github.com/tensorflow/tensorflow.git
./configure
# --jobs 3 limits the number of concurrent build jobs; any more and the CPU can't cope
# --local_resources 2048,.5,1.0 caps memory use so the build doesn't run out
# copy the bazel binary you built (in the output directory of the bazel tree) into the tensorflow directory
./bazel build -c opt --config=cuda --jobs 3 --verbose_failures --local_resources 2048,.5,1.0 //tensorflow/tools/pip_package:build_pip_package

This run will end with an error; we need to replace two files under .cache:

cd ~
wget -O config.guess 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD'
wget -O config.sub 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD'

# NOTE!!! cd into the .cache/bazel directory and find the subdirectory that corresponds to your own system
cp config.guess ./.cache/bazel/_bazel_socialh/742c01ff0765b098544431b60b1eed9f/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260/config.guess
cp config.sub ./.cache/bazel/_bazel_socialh/742c01ff0765b098544431b60b1eed9f/external/farmhash_archive/farmhash-34c13ddfab0e35422f4c3979f360635a8c050260/config.sub
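Note that the _bazel_socialh/742c01… path above is specific to the author's machine; the hash directory will differ on yours. One way to locate the right destination (a standard find invocation, not from the original post):

find ~/.cache/bazel -path '*farmhash*' -name 'config.guess'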

That is still not everything; you also need to modify some TensorFlow code before the build will go through:

--- a/tensorflow/core/kernels/BUILD
+++ b/tensorflow/core/kernels/BUILD
@@ -985,7 +985,7 @@ tf_kernel_libraries(
"reduction_ops",
"segment_reduction_ops",
"sequence_ops",
- "sparse_matmul_op",
+ #DC "sparse_matmul_op",
],
deps = [
":bounds_check",
--- a/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
+++ b/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc
@@ -888,6 +888,9 @@ CudaContext* CUDAExecutor::cuda_context() { return context_; }
// For anything more complicated/prod-focused than this, you'll likely want to
// turn to gsys' topology modeling.
static int TryToReadNumaNode(const string &pci_bus_id, int device_ordinal) {
+ // DC - make this clever later. ARM has no NUMA node, just return 0
+ LOG(INFO) << "ARM has no NUMA node, hardcoding to return zero";
+ return 0;
#if defined(__APPLE__)
LOG(INFO) << "OS X does not support NUMA - returning NUMA node zero";
return 0;

--- a/tensorflow/stream_executor/cuda/cuda_blas.cc
+++ b/tensorflow/stream_executor/cuda/cuda_blas.cc
@@ -25,6 +25,12 @@ limitations under the License.
#define EIGEN_HAS_CUDA_FP16
#endif

+#if CUDA_VERSION >= 8000
+#define SE_CUDA_DATA_HALF CUDA_R_16F
+#else
+#define SE_CUDA_DATA_HALF CUBLAS_DATA_HALF
+#endif
+
 #include "tensorflow/stream_executor/cuda/cuda_blas.h"

 #include 
@@ -1680,10 +1686,10 @@ bool CUDABlas::DoBlasGemm(
   return DoBlasInternal(
       dynload::cublasSgemmEx, stream, true /* = pointer_mode_host */,
       CUDABlasTranspose(transa), CUDABlasTranspose(transb), m, n, k, &alpha,
-      CUDAMemory(a), CUBLAS_DATA_HALF, lda,
-      CUDAMemory(b), CUBLAS_DATA_HALF, ldb,
+      CUDAMemory(a), SE_CUDA_DATA_HALF, lda,
+      CUDAMemory(b), SE_CUDA_DATA_HALF, ldb,
       &beta,
-      CUDAMemoryMutable(c), CUBLAS_DATA_HALF, ldc);
+      CUDAMemoryMutable(c), SE_CUDA_DATA_HALF, ldc);
 #else
   LOG(ERROR) << "fp16 sgemm is not implemented in this cuBLAS version "
              << "(need at least CUDA 7.5)";

--- a/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
+++ b/tensorflow/core/kernels/sparse_tensor_dense_matmul_op_gpu.cu.cc
@@ -104,9 +104,17 @@ struct SparseTensorDenseMatMulFunctor {
     int n = (ADJ_B) ? b.dimension(0) : b.dimension(1);

 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
-    Eigen::array<int, 2> n_by_1{{ n, 1 }};
-    Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    //DC Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz{{ 1, nnz }};
+    Eigen::Tensor<int, 2>::Dimensions matrix_1_by_nnz;
+    matrix_1_by_nnz[0] = 1;
+    matrix_1_by_nnz[1] = nnz;
+    //DC Eigen::array<int, 2> n_by_1{{ n, 1 }};
+    Eigen::array<int, 2> n_by_1;
+    n_by_1[0] = n;
+    n_by_1[1] = 1;
+    //DC Eigen::array<int, 1> reduce_on_rows{{ 0 }};
+    Eigen::array<int, 1> reduce_on_rows;
+    reduce_on_rows[0] = 0;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> matrix_1_by_nnz;
     matrix_1_by_nnz.set(1, nnz);

--- a/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
+++ b/tensorflow/core/kernels/cwise_op_gpu_select.cu.cc
@@ -43,8 +43,14 @@ struct BatchSelectFunctor {
const int all_but_batch = then_flat_outer_dims.dimension(1);

 #if !defined(EIGEN_HAS_INDEX_LIST)
-    Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
-    Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    //DC Eigen::array<int, 2> broadcast_dims{{ 1, all_but_batch }};
+    Eigen::array<int, 2> broadcast_dims;
+    broadcast_dims[0] = 1;
+    broadcast_dims[1] = all_but_batch;
+    //DC Eigen::Tensor<int, 2>::Dimensions reshape_dims{{ batch, 1 }};
+    Eigen::Tensor<int, 2>::Dimensions reshape_dims;
+    reshape_dims[0] = batch;
+    reshape_dims[1] = 1;
 #else
     Eigen::IndexList<Eigen::type2index<1>, int> broadcast_dims;
     broadcast_dims.set(1, all_but_batch);

--- a/tensorflow/python/BUILD
+++ b/tensorflow/python/BUILD
@@ -1110,7 +1110,7 @@ medium_kernel_test_list = glob([
"kernel_tests/seq2seq_test.py",
"kernel_tests/slice_op_test.py",
"kernel_tests/sparse_ops_test.py",
- "kernel_tests/sparse_matmul_op_test.py",
+ #DC "kernel_tests/sparse_matmul_op_test.py",
"kernel_tests/sparse_tensor_dense_matmul_op_test.py",
])

Once all the code changes are made, run ./bazel build -c opt --config=cuda --jobs 3 --verbose_failures --local_resources 2048,.5,1.0 //tensorflow/tools/pip_package:build_pip_package again. I strongly recommend doing this behind a proxy; it works without one, but it is miserable. When that finishes, build the pip package with bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg, then cd /tmp/tensorflow_pkg and run sudo pip install tensorflow-0.9.0-py2-none-any.whl. At this point TensorFlow is installed.
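A quick way to confirm the install took (my addition, not in the original): importing TensorFlow and opening a session should, with GPU support built in, log the TX1's GPU device being created.

python -c "import tensorflow as tf; print(tf.__version__); tf.Session()"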

To use TensorFlow for deep learning and play Flappy Bird, we need to install pygame and opencv.

Install opencv

sudo apt-get install python-opencv
Install pygame

wget http://www.pygame.org/ftp/pygame-1.9.1release.tar.gz   # download pygame
tar -xf pygame-1.9.1release.tar.gz

sudo apt-get install libsdl1.2-dev   # install SDL
sudo pip install numpy
cd pygame-1.9.1release

python config.py
python setup.py install
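A quick check that the pygame build is importable (my addition, not in the original):

python -c "import pygame; print(pygame.ver)"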
Clone Flappy Bird

git clone --recursive https://github.com/yenchenlin/DeepLearningFlappyBird.git

cd DeepLearningFlappyBird
python deep_q_network.py
# if you get the error: linux/videodev.h: No such file or directory
sudo apt-get install libv4l-dev
cd /usr/include/linux
sudo ln -s ../libv4l1-videodev.h videodev.h

I tried to compile TensorFlow 0.12.1 on the TX1 with JetPack 2.3.1 (the latest as of this writing).

# install necessary requirements for building everything
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install protobuf-compiler curl libprotobuf-dev python-numpy python-dev

# install oracle java dev kit - bazel has issues compiling with openjdk
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer


mkdir src
cd src

# bazel compiles from "dist" zip only, since it includes required binaries for bootstrap.
# zip in release 0.4.3 includes working arm32 binaries for bootstrapping
wget https://github.com/bazelbuild/bazel/releases/download/0.4.3/bazel-0.4.3-dist.zip --no-check-certificate

# yet bazel needs to be patched to compile (and run correctly) on aarch64
echo 'diff --git a/src/main/java/com/google/devtools/build/lib/util/CPU.java b/src/main/java/com/google/devtools/build/lib/util/CPU.java
index 7a85c29..ccd95bd 100644
--- a/src/main/java/com/google/devtools/build/lib/util/CPU.java
+++ b/src/main/java/com/google/devtools/build/lib/util/CPU.java
@@ -25,7 +25,7 @@ public enum CPU {
   X86_32("x86_32", ImmutableSet.of("i386", "i486", "i586", "i686", "i786", "x86")),
   X86_64("x86_64", ImmutableSet.of("amd64", "x86_64", "x64")),
   PPC("ppc", ImmutableSet.of("ppc", "ppc64", "ppc64le")),
-  ARM("arm", ImmutableSet.of("arm", "armv7l")),
+  ARM("arm", ImmutableSet.of("arm", "armv7l","aarch64")),
   S390X("s390x", ImmutableSet.of("s390x", "s390")),
   UNKNOWN("unknown", ImmutableSet.<String>of());
 
diff --git a/tools/cpp/cc_configure.bzl b/tools/cpp/cc_configure.bzl
index 330a068..793c2a2 100644
--- a/tools/cpp/cc_configure.bzl
+++ b/tools/cpp/cc_configure.bzl
@@ -140,6 +140,8 @@ def _get_cpu_value(repository_ctx):
     return "x64_windows"
   # Use uname to figure out whether we are on x86_32 or x86_64
   result = repository_ctx.execute(["uname", "-m"])
+  if result.stdout.strip() in ["aarch64"]:
+    return "arm"
   return "k8" if result.stdout.strip() in ["amd64", "x86_64", "x64"] else "piii"
 
 
' >bazel-0.4.3-arm64.patch


mkdir bazel-0.4.3
cd bazel-0.4.3/
unzip ../bazel-0.4.3-dist.zip
patch -p1 <../bazel-0.4.3-arm64.patch
./compile.sh
# ... Build successful! Binary is here: /home/ubuntu/src/bazel-0.4.3/output/bazel

sudo cp -a output/bazel /usr/local/bin/
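# quick check (my addition): the patched bazel should now run on aarch64
bazel version    # should report build label 0.4.3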


cd ..
# tensorflow is compiled from git source; the latest release is 0.12.1
git clone -b 0.12.1 https://github.com/tensorflow/tensorflow.git tensorflow-0.12.1
# however it tries to download build time requirements, some of which have dead links

echo 'diff --git a/tensorflow/workspace.bzl b/tensorflow/workspace.bzl
index 06e16cd..702b6be 100644
--- a/tensorflow/workspace.bzl
+++ b/tensorflow/workspace.bzl
@@ -64,7 +64,7 @@ def tf_workspace(path_prefix = "", tf_repo_name = ""):
 
   native.new_http_archive(
     name = "nasm",
-    url = "http://www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
+    url = "http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.12.02.tar.bz2/d15843c3fb7db39af80571ee27ec6fad/nasm-2.12.02.tar.bz2",
     sha256 = "00b0891c678c065446ca59bcee64719d0096d54d6886e6e472aeee2e170ae324",
     strip_prefix = "nasm-2.12.02",
     build_file = str(Label("//third_party:nasm.BUILD")),
@@ -228,7 +228,7 @@ def tf_workspace(path_prefix = "", tf_repo_name = ""):
 
   native.new_http_archive(
     name = "zlib_archive",
-    url = "http://zlib.net/zlib-1.2.8.tar.gz",
+    url = "http://zlib.net/fossils/zlib-1.2.8.tar.gz",
     sha256 = "36658cb768a54c1d4dec43c3116c27ed893e88b02ecfcb44f2166f9c0b7f2a0d",
     strip_prefix = "zlib-1.2.8",
     build_file = str(Label("//:zlib.BUILD")),
' >tensorflow-0.12.1-tx1.patch
cd tensorflow-0.12.1
patch -p1 <../tensorflow-0.12.1-tx1.patch

./configure # inputs according to http://www.nvidia.com/object/gpu-accelerated-applications-tensorflow-installation.html
# set GPU usage to yes; set the cuDNN install location (which defaults to /usr/local/cuda) to /usr (cudnn.h lives in /usr/include and won't be found otherwise)
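# (my addition, untested sketch) configure can also be pre-seeded through
# environment variables instead of interactive answers; note the TX1's GPU is
# compute capability 5.3, not the 5.2/6.1 defaults visible in the nvcc
# command further down:
# TF_NEED_CUDA=1 CUDA_TOOLKIT_PATH=/usr/local/cuda CUDNN_INSTALL_PATH=/usr \
#   TF_CUDA_COMPUTE_CAPABILITIES=5.3 ./configure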

bazel build -c opt --config=cuda --jobs 1 --verbose_failures --local_resources 2048,.5,1.0 //tensorflow/tools/pip_package:build_pip_package

However, the compilation fails with an internal error in the CUDA compiler. It seems an issue with CUDA 8 is blocking TensorFlow compilation right now:

bazel build -c opt --config=cuda --jobs 1 --verbose_failures --local_resources 2048,.5,1.0 //tensorflow/tools/pip_package:build_pip_package
WARNING: Sandboxed execution is not supported on your system and thus hermeticity of actions cannot be guaranteed. See http://bazel.build/docs/bazel-user-manual.html#sandboxing for more information. You can turn off this warning via --ignore_unsupported_sandboxing.
INFO: Found 1 target...
INFO: From Compiling tensorflow/core/kernels/sparse_xent_op_gpu.cu.cc:
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorBroadcasting.h(275): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r361/r361_00/drivers/compiler/edg/EDG_4.10/src/folding.c", line 9819


1 catastrophic error detected in the compilation of "/tmp/tmpxft_000027e8_00000000-9_sparse_xent_op_gpu.cu.compute_61.cpp1.ii".
Compilation aborted.
Aborted
ERROR: /home/ubuntu/src/tensorflow-0.12.1/tensorflow/core/kernels/BUILD:2018:1: output 'tensorflow/core/kernels/_objs/sparse_xent_op_gpu/tensorflow/core/kernels/sparse_xent_op_gpu.cu.pic.o' was not created.
ERROR: /home/ubuntu/src/tensorflow-0.12.1/tensorflow/core/kernels/BUILD:2018:1: not all outputs were created or valid.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 6.765s, Critical Path: 5.98s

(This is from a repeated run of bazel; the original run took hours, but the error is always the same, on sparse_xent_op_gpu.)

The same error also happens when trying tensorflow’s master branch and/or 1.0 pre-releases.

A similar issue seems to happen with CUDA 8 on Darwin, blocking TensorFlow compilation with the same error.

https://github.com/tensorflow/tensorflow/issues/3845

I investigated that a bit further. What bazel is executing (in the bazel cache directory) is the following command:

nvcc -D_FORCE_INLINES -gencode=arch=compute_52,"code=sm_52,compute_52" -gencode=arch=compute_61,"code=sm_61,compute_61" --expt-relaxed-constexpr --ftz=true -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1 -DNDEBUG -DEIGEN_MPL2_ONLY -DGOOGLE_CUDA=1 -DEIGEN_AVOID_STL_ARRAY -DGOOGLE_CUDA=1 -std=c++11 --compiler-options " -isystem external/bazel_tools/tools/cpp/gcc3 -isystem external/eigen_archive -isystem bazel-out/local_linux-opt/genfiles/external/eigen_archive -isystem external/protobuf/src -isystem bazel-out/local_linux-opt/genfiles/external/protobuf/src -isystem external/gif_archive -isystem bazel-out/local_linux-opt/genfiles/external/gif_archive -isystem external/farmhash_archive -isystem bazel-out/local_linux-opt/genfiles/external/farmhash_archive -isystem external/highwayhash -isystem bazel-out/local_linux-opt/genfiles/external/highwayhash -isystem external/png_archive -isystem bazel-out/local_linux-opt/genfiles/external/png_archive -isystem external/zlib_archive -isystem bazel-out/local_linux-opt/genfiles/external/zlib_archive -isystem external/local_config_cuda/cuda/include -isystem bazel-out/local_linux-opt/genfiles/external/local_config_cuda/cuda/include -isystem external/local_config_cuda/cuda -isystem bazel-out/local_linux-opt/genfiles/external/local_config_cuda/cuda -iquote . -iquote bazel-out/local_linux-opt/genfiles -iquote external/bazel_tools -iquote bazel-out/local_linux-opt/genfiles/external/bazel_tools -iquote external/eigen_archive -iquote bazel-out/local_linux-opt/genfiles/external/eigen_archive -iquote external/local_config_sycl -iquote bazel-out/local_linux-opt/genfiles/external/local_config_sycl -iquote external/protobuf -iquote bazel-out/local_linux-opt/genfiles/external/protobuf -iquote external/gif_archive -iquote bazel-out/local_linux-opt/genfiles/external/gif_archive -iquote external/jpeg -iquote bazel-out/local_linux-opt/genfiles/external/jpeg -iquote external/com_googlesource_code_re2 -iquote bazel-out/local_linux-opt/genfiles/external/com_googlesource_code_re2 -iquote external/farmhash_archive -iquote bazel-out/local_linux-opt/genfiles/external/farmhash_archive -iquote external/highwayhash -iquote bazel-out/local_linux-opt/genfiles/external/highwayhash -iquote external/png_archive -iquote bazel-out/local_linux-opt/genfiles/external/png_archive -iquote external/zlib_archive -iquote bazel-out/local_linux-opt/genfiles/external/zlib_archive -iquote external/local_config_cuda -iquote bazel-out/local_linux-opt/genfiles/external/local_config_cuda -g0 -fno-canonical-system-headers -fPIC" --compiler-bindir=/usr/bin/gcc -I . -x cu -O2 -I external/gemmlowp -c tensorflow/core/kernels/sparse_xent_op_gpu.cu.cc -o bazel-out/local_linux-opt/bin/tensorflow/core/kernels/_objs/sparse_xent_op_gpu/tensorflow/core/kernels/sparse_xent_op_gpu.cu.pic.o

Executed by itself, this fails with:

external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorBroadcasting.h(275): internal error: assertion failed at: "/dvs/p4/build/sw/rel/gpu_drv/r361/r361_00/drivers/compiler/edg/EDG_4.10/src/folding.c", line 9819


1 catastrophic error detected in the compilation of "/tmp/tmpxft_00002d0e_00000000-7_sparse_xent_op_gpu.cu.cpp1.ii".
Compilation aborted.
Aborted

as before

TensorBroadcasting.h(275) is

if (internal::index_statically_eq<InputDimensions>(0, 1)) {

This has been reported previously. NVIDIA engineers are investigating. See the following link for a workaround, plus an existing wheel file:

https://devtalk.nvidia.com/default/topic/987306/?comment=5059105