Error building Caffe FP16 on Jetson TX1

Jianbin · March 15, 2017, 12:07am

I followed:

dusty-nv/jetson-inference/blob/master/docs/building-nvcaffe.md

<img src="https://github.com/dusty-nv/jetson-inference/raw/master/docs/images/deep-vision-header.jpg" width="100%">

# Building nvcaffe

A special branch of caffe is used on TX1 which includes support for FP16.<br />
The code is released in NVIDIA's caffe repo in the experimental/fp16 branch, located here:
> https://github.com/nvidia/caffe/tree/experimental/fp16

#### 1. Installing Dependencies

``` bash
$ sudo apt-get update -y
$ sudo apt-get install cmake -y

# General dependencies
$ sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev \
libhdf5-serial-dev protobuf-compiler -y
$ sudo apt-get install --no-install-recommends libboost-all-dev -y

# BLAS
```

This file has been truncated.

My Jetson TX1 environment is JetPack 2.3.1 with CUDA 8.0 and cuDNN 5.1. Here is the error I got:

[ 49%] Building CXX object src/caffe/CMakeFiles/caffe.dir/util/db_leveldb.cpp.o
[ 49%] Building CXX object src/caffe/CMakeFiles/caffe.dir/util/db_lmdb.cpp.o
[ 50%] Building CXX object src/caffe/CMakeFiles/caffe.dir/util/float16.cpp.o
[ 50%] Building CXX object src/caffe/CMakeFiles/caffe.dir/util/gpu_memory.cpp.o
[ 52%] Building CXX object src/caffe/CMakeFiles/caffe.dir/util/hdf5.cpp.o
/home/ubuntu/sdcard/tools/nvcaffe/src/caffe/util/gpu_memory.cpp: In static member function ‘static void caffe::gpu_memory::getInfo(size_t*, size_t*)’:
/home/ubuntu/sdcard/tools/nvcaffe/src/caffe/util/gpu_memory.cpp:202:66: error: ‘std::map<int, cub::CachingDeviceAllocator::TotalBytes>::mapped_type {aka class cub::CachingDeviceAllocator::TotalBytes}’ has no member named ‘busy’
       *free_mem = poolsize_ - cubAlloc->cached_bytes[cur_device].busy;
                                                                  ^
src/caffe/CMakeFiles/caffe.dir/build.make:22895: recipe for target 'src/caffe/CMakeFiles/caffe.dir/util/gpu_memory.cpp.o' failed
make[2]: *** [src/caffe/CMakeFiles/caffe.dir/util/gpu_memory.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
CMakeFiles/Makefile2:272: recipe for target 'src/caffe/CMakeFiles/caffe.dir/all' failed
make[1]: *** [src/caffe/CMakeFiles/caffe.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

May I get a hint? Thanks very much!

AastaLLL · March 15, 2017, 4:31am

Hi,

Thanks for the question.

I just tried this branch and it built successfully, without the error you hit.
Could you please try again with the following procedure?

Thanks.

Here is my procedure:

$ sudo apt-get update
$ sudo apt-get install software-properties-common
$ sudo add-apt-repository universe 
$ sudo add-apt-repository multiverse
$ sudo apt-get install libboost-dev libboost-all-dev libgflags-dev libgoogle-glog-dev liblmdb-dev libatlas-base-dev liblmdb-dev libblas-dev libatlas-base-dev libprotobuf-dev libleveldb-dev libsnappy-dev libhdf5-serial-dev protobuf-compiler
$ sudo apt-get install protobuf-compiler libprotobuf-dev cmake git libgflags-dev libgoogle-glog-dev libhdf5-dev libatlas-dev libatlas-base-dev libatlas3-base liblmdb-dev libleveldb-dev

$ mkdir caffe_fp16
$ export CAFFE_ROOT=/media/ubuntu/NVIDIA/caffe_fp16
$ git clone -b experimental/fp16 https://github.com/NVIDIA/caffe $CAFFE_ROOT
$ cd caffe_fp16/
$ cp Makefile.config.example Makefile.config
$ sed -i 's/# NATIVE_FP16/NATIVE_FP16/g' Makefile.config
$ sed -i 's/# USE_CUDNN/USE_CUDNN/g' Makefile.config
$ sed -i 's/-gencode arch=compute_50,code=compute_50/-gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53/g' Makefile.config
$ sed -i 's/\/usr\/local\/include/\/usr\/local\/include \/usr\/include\/hdf5\/serial\//g' Makefile.config
$ sed -i 's/hdf5_hl/hdf5_serial_hl/g' Makefile
$ sed -i 's/hdf5/hdf5_serial/g' Makefile
$ make -j4

Jianbin · March 16, 2017, 5:06am

Thanks, AastaLLL!

I followed your procedure with a minor modification and successfully compiled the FP16 version.

I had to remove this line:
sed -i 's/hdf5_hl/hdf5_serial_hl/g' Makefile (to be removed)

Otherwise the Makefile ends up with this:
LIBRARIES += glog gflags protobuf boost_system m hdf5_serial_serial_hl hdf5_serial
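
The doubled `_serial` comes from running both substitutions in sequence: the second pattern, `hdf5`, also matches inside the `hdf5_serial_hl` that the first one just produced. Here is a small sketch of that, assuming the original Makefile line listed `hdf5_hl hdf5` (inferred from the output above, not copied from the Makefile):

```shell
# Assumed original line, reconstructed from the broken output shown above.
line='LIBRARIES += glog gflags protobuf boost_system m hdf5_hl hdf5'

# Both substitutions in sequence: the second one rewrites the result of the first.
both=$(printf '%s\n' "$line" | sed 's/hdf5_hl/hdf5_serial_hl/g' | sed 's/hdf5/hdf5_serial/g')

# Only the generic substitution: it already covers hdf5_hl as well.
single=$(printf '%s\n' "$line" | sed 's/hdf5/hdf5_serial/g')

echo "$both"
echo "$single"
```

The first `echo` reproduces the `hdf5_serial_serial_hl` entry, while the second gives `hdf5_serial_hl hdf5_serial`, which is why dropping the `hdf5_hl` substitution is enough.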

Here is what I ran:

git clone -b experimental/fp16 https://github.com/NVIDIA/caffe caffe_fp16
cd caffe_fp16/
cp Makefile.config.example Makefile.config
sed -i 's/# NATIVE_FP16/NATIVE_FP16/g' Makefile.config
sed -i 's/# USE_CUDNN/USE_CUDNN/g' Makefile.config
sed -i 's/-gencode arch=compute_50,code=compute_50/-gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53/g' Makefile.config
sed -i 's/\/usr\/local\/include/\/usr\/local\/include \/usr\/include\/hdf5\/serial\//g' Makefile.config
sed -i 's/hdf5/hdf5_serial/g' Makefile
make -j4
make test -j4
make runtest -j4

However, when I ran make runtest, some tests failed:

[  FAILED  ] CuDNNNeuronLayerTest/2.TestTanHGradientCuDNN, where TypeParam = caffe::MultiPrecision<caffe::float16, float> (226 ms)
[ RUN      ] CuDNNNeuronLayerTest/2.TestReLUCuDNN
[       OK ] CuDNNNeuronLayerTest/2.TestReLUCuDNN (0 ms)
[----------] 8 tests from CuDNNNeuronLayerTest/2 (896 ms total)

[----------] 1 test from InfogainLossLayerTest/4, where TypeParam = caffe::GPUDevice<caffe::MultiPrecision<double, double> >
[ RUN      ] InfogainLossLayerTest/4.TestGradient
[       OK ] InfogainLossLayerTest/4.TestGradient (48 ms)
[----------] 1 test from InfogainLossLayerTest/4 (48 ms total)

[----------] 12 tests from DataLayerTest/1, where TypeParam = caffe::CPUDevice<caffe::MultiPrecision<double, double> >
[ RUN      ] DataLayerTest/1.TestReadLMDB
F0316 15:15:48.180516 32693 db_lmdb.hpp:25] Check failed: mdb_status == 0 (12 vs. 0) Cannot allocate memory
*** Check failure stack trace: ***
    @       0x7f9ff29718  google::LogMessage::Fail()
    @       0x7f9ff2b614  google::LogMessage::SendToLog()
    @       0x7f9ff29290  google::LogMessage::Flush()
    @       0x7f9ff2beb4  google::LogMessageFatal::~LogMessageFatal()
    @       0x7f9eb58314  caffe::db::LMDB::Open()
    @           0x651394  caffe::DataLayerTest<>::Fill()
    @           0x65c61c  caffe::DataLayerTest_TestReadLMDB_Test<>::TestBody()
    @           0xa6af44  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @           0xa6324c  testing::Test::Run()
    @           0xa63388  testing::TestInfo::Run()
    @           0xa63448  testing::TestCase::Run()
    @           0xa645a8  testing::internal::UnitTestImpl::RunAllTests()
    @           0xa648bc  testing::UnitTest::Run()
    @           0x56c1d8  main
    @       0x7f9e6058a0  __libc_start_main
Makefile:552: recipe for target 'runtest' failed

AastaLLL · March 22, 2017, 3:27am

Hi,

Sorry for the late reply.

It looks like LMDB_MAP_SIZE is platform dependent.
We changed LMDB_MAP_SIZE to 2147483648, following the suggestion in "caffe on jetson tk1 · Issue #1861 · BVLC/caffe · GitHub".
We tested LMDB functionality with the “DataLayerTest/4.TestReadCropTrainSequenceUnseededLMDB” test and the MNIST training sample.
Both work well.

Could you also give it a try?

diff --git a/include/caffe/util/db_lmdb.hpp b/include/caffe/util/db_lmdb.hpp
index a484a77..f86c736 100644
--- a/include/caffe/util/db_lmdb.hpp
+++ b/include/caffe/util/db_lmdb.hpp
@@ -16,7 +16,9 @@ namespace caffe { namespace db {
 const size_t LMDB_MAP_SIZE = 1073741824;  // 1 GB
 #elif UINTPTR_MAX == 0xffffffffffffffff
 /* 64-bit */
-const size_t LMDB_MAP_SIZE = 1099511627776;  // 1 TB
+//const size_t LMDB_MAP_SIZE = 1099511627776;  // 1 TB
+const size_t LMDB_MAP_SIZE = 2147483648;  // 2 TB
+
 #else
 #  error "Bad stdint.h!"
 #endif

leoncss92 · June 1, 2017, 9:26am

Hi AastaLLL,

I hit the same error when running make runtest:
F0601 09:09:41.345368 4892 db_lmdb.hpp:13] Check failed: mdb_status == 0 (12 vs. 0) Cannot allocate memory

I tried your suggestion in #4. However, I realised that my db_lmdb.hpp file looks quite different: it does not contain “const size_t LMDB_MAP_SIZE” at all. I tried adding the suggested lines to the file, but that just caused more errors during runtest.

Do you know whether I have the right db_lmdb.hpp file?
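
One way to check is to search the whole checkout for the constant, since different Caffe branches may define it in a different file. This is a minimal sketch; the helper name and the checkout path in the usage line are hypothetical:

```shell
# Hypothetical helper: list every line under a Caffe checkout's include/ and src/
# directories that mentions LMDB_MAP_SIZE, with file name and line number.
find_lmdb_map_size() {
    grep -rn "LMDB_MAP_SIZE" "$1/include" "$1/src" 2>/dev/null
}

# Usage (the path is an assumption, adjust it to your checkout):
#   find_lmdb_map_size "$HOME/caffe_fp16"
```

If nothing is printed, the branch in use probably sizes the LMDB map some other way, and the diff from #4 would not apply to it as-is.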

Thank you.

leoncss92 · June 1, 2017, 9:27am

AastaLLL · June 2, 2017, 6:35am

Hi,

Please rebuild Caffe after changing LMDB_MAP_SIZE:

make clean
make -j4

NipunaVega · June 2, 2017, 7:14am

+const size_t LMDB_MAP_SIZE = 2147483648; // 2 TB

Is this 2TB?

Looks a lot less than that.
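
For reference, the arithmetic can be checked directly in the shell. This sketch only uses the two 64-bit values from the diff in #4; the variable names are made up for illustration:

```shell
# Values copied from the db_lmdb.hpp diff; variable names are illustrative.
old_map_size=1099511627776   # original 64-bit value
new_map_size=2147483648      # value introduced by the patch

gib=$((1024 * 1024 * 1024))  # bytes in one GiB

old_gib=$((old_map_size / gib))
new_gib=$((new_map_size / gib))

echo "old: ${old_gib} GiB"
echo "new: ${new_gib} GiB"
```

In binary units the original value is 1024 GiB (1 TiB) and the patched value is 2 GiB, so the patch shrinks the map size by a factor of 512. The "2 TB" comment in the diff looks like a typo for 2 GB.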

AastaLLL · June 5, 2017, 2:49am

Hi,

This value is not related to the disk space.

As mentioned in #4, this suggestion is based on a comment posted on the Caffe GitHub:
https://github.com/BVLC/caffe/issues/1861
It looks like setting LMDB_MAP_SIZE to the suggested value helps.

For the definition of LMDB_MAP_SIZE, please see the LMDB documentation:
http://lmdb.readthedocs.io/en/release/

Thanks.