GPU does not work when running SSD

Hi, I have a strange problem with TX2.
The GPU utilization was 0% when running the SSD framework, as checked with tegrastats.

sudo ~/tegrastats
RAM 827/7854MB (lfb 1593*4MB) cpu [4%@346, 0%@358, 0%@353, 0%@348, 3%@347, 1%@348] EMC 5%@665 APE 150 VDE 1203 GR3D 0%@140
nvpmodel -q
NV Power Mode: MAXN
0
NVPM ERROR: Error opening /sys/kernel/nvpmodel_emc_cap/emc_iso_cap: 13

Is there any solution?

Hi Tomic,

The utilization rate of GPU was 0 when running framework SSD, with using tegrastats to check.
What is “running framework SSD”? Why do you expect GPU utilization when running it?

nvpmodel -q
You need to run this command with superuser privileges.

Hi Vickyy,

SSD: Single Shot MultiBox Detector.

I have several TX2 development boards. When I run the same program on another board, GPU utilization reaches up to 99%.
sudo nvpmodel -q
NV Power Mode: MAXN
0

Hi,

Just want to confirm first.
Could you check whether you built Caffe with GPU enabled?
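For reference, a minimal standalone check (just a sketch built on the standard Caffe API; the logging and setup here are illustrative) that the binary was built with GPU support and can see the TX2 GPU:

#include <caffe/caffe.hpp>

int main() {
#ifdef CPU_ONLY
  // If this branch is compiled in, GR3D will always stay at 0%.
  LOG(INFO) << "Caffe was built with CPU_ONLY; the GPU will never be used.";
#else
  caffe::Caffe::SetDevice(0);                 // the TX2 has a single integrated GPU
  caffe::Caffe::set_mode(caffe::Caffe::GPU);  // select GPU mode before loading any net
  caffe::Caffe::DeviceQuery();                // logs device name, compute capability, memory
#endif
  return 0;
}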

Hi Vickyy
I have built Caffe with GPU=1, CUDNN=1.

Hi,

Please try our cuDNN sample:

$ cd ~/cudnn_samples_v5/RNN/
$ ./RNN 100 100 100 64 2

Please share the ./tegrastats results
Thanks.

Hi,

nvidia@tegra-ubuntu:~$ cd /usr/src/cudnn_samples_v5/RNN 
nvidia@tegra-ubuntu:/usr/src/cudnn_samples_v5/RNN$ ./RNN 100 100 100 64 2
Forward:  74 GFLOPS
Backward: 145 GFLOPS, (100 GFLOPS), (266 GFLOPS)
Segmentation fault (core dumped)

And

nvidia@tegra-ubuntu:~$ sudo ~/tegrastats
[sudo] password for nvidia: 
RAM 964/7854MB (lfb 1572x4MB) cpu [0%@1498,0%@356,0%@347,0%@1497,0%@1497,0%@1502] EMC 5%@1866 APE 150 VDE 1203 GR3D 21%@140
RAM 2845/7854MB (lfb 1093x4MB) cpu [6%@345,35%@2054,44%@2057,25%@347,11%@347,18%@347] EMC 10%@1866 APE 150 VDE 1203 GR3D 0%@140
RAM 2892/7854MB (lfb 1078x4MB) cpu [4%@1987,93%@2082,13%@2046,29%@1993,11%@1992,12%@1996] EMC 20%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 2896/7854MB (lfb 1078x4MB) cpu [19%@499,40%@2029,66%@2036,0%@501,6%@502,14%@497] EMC 24%@1866 APE 150 VDE 1203 GR3D 99%@943
RAM 2899/7854MB (lfb 1077x4MB) cpu [36%@2081,48%@2051,58%@2046,20%@2036,14%@2037,8%@2037] EMC 31%@1866 APE 150 VDE 1203 GR3D 0%@586
RAM 1805/7854MB (lfb 1264x4MB) cpu [12%@346,41%@362,8%@355,38%@347,13%@347,15%@348] EMC 26%@1331 APE 150 VDE 1203 GR3D 0%@140
RAM 1793/7854MB (lfb 1266x4MB) cpu [17%@499,0%@356,4%@355,3%@501,11%@502,5%@501] EMC 27%@665 APE 150 VDE 1203 GR3D 12%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [3%@345,0%@359,4%@354,12%@347,19%@348,3%@348] EMC 16%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [17%@348,0%@358,6%@355,3%@347,19%@347,5%@347] EMC 11%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1795/7854MB (lfb 1266x4MB) cpu [15%@806,1%@353,5%@352,7%@810,1%@777,15%@806] EMC 9%@665 APE 150 VDE 1203 GR3D 11%@140
RAM 1791/7854MB (lfb 1266x4MB) cpu [16%@351,2%@360,4%@354,15%@348,3%@347,2%@347] EMC 8%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1791/7854MB (lfb 1266x4MB) cpu [7%@346,0%@359,5%@355,11%@347,13%@348,13%@348] EMC 8%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [14%@345,0%@356,4%@355,4%@347,10%@347,7%@348] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [10%@652,0%@357,4%@355,22%@655,9%@653,3%@655] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1794/7854MB (lfb 1266x4MB) cpu [15%@345,0%@360,4%@355,11%@347,5%@347,4%@347] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1791/7854MB (lfb 1266x4MB) cpu [13%@351,0%@358,5%@355,5%@348,10%@348,7%@347] EMC 6%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [8%@961,0%@354,6%@353,19%@963,6%@962,14%@962] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [6%@345,2%@357,3%@354,12%@348,10%@348,3%@348] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [8%@345,0%@356,3%@355,18%@348,3%@348,3%@348] EMC 6%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1792/7854MB (lfb 1266x4MB) cpu [15%@959,0%@355,4%@354,6%@962,6%@962,20%@962] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1791/7854MB (lfb 1266x4MB) cpu [3%@351,0%@355,3%@355,6%@347,4%@348,18%@347] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
^C

Hi,

It looks like GPU utilization can reach 99%:

RAM 2892/7854MB (lfb 1078x4MB) cpu [4%@1987,93%@2082,13%@2046,29%@1993,11%@1992,12%@1996] EMC 20%@1866 APE 150 VDE 1203 GR3D 99%@1300

Could you fix GPU frequency to the max and try it again?

sudo ./jetson_clocks.sh

But the segmentation fault you met is abnormal.
Are you running another GPU program at the same time?

Hi,

nvidia@tegra-ubuntu:~$ sudo ./jetson_clocks.sh
nvidia@tegra-ubuntu:~$ sudo ./jetson_clocks.sh --show
SOC family:tegra186  Machine:quill
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu2: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu3: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=1300500000 MaxFreq=1300500000 CurrentFreq=1300500000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1
Fan: speed=255
nvidia@tegra-ubuntu:~$

And while running the program:

nvidia@tegra-ubuntu:~$ sudo ~/tegrastats
[sudo] password for nvidia: 
RAM 1808/7854MB (lfb 1283x4MB) cpu [0%@2035,0%@2047,0%@2046,0%@2037,0%@2035,0%@2035] EMC 15%@1866 APE 150 VDE 1203 GR3D 4%@1300
RAM 1938/7854MB (lfb 1216x4MB) cpu [22%@2032,60%@2048,52%@2044,13%@2033,33%@2039,25%@2034] EMC 14%@1866 APE 150 VDE 1203 GR3D 7%@1300
RAM 2150/7854MB (lfb 1173x4MB) cpu [1%@2035,97%@2047,8%@2050,5%@2036,0%@2037,21%@2038] EMC 9%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2160/7854MB (lfb 1169x4MB) cpu [18%@2035,100%@2044,66%@2046,35%@2035,24%@2036,13%@2034] EMC 10%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2228/7854MB (lfb 1152x4MB) cpu [14%@2036,100%@2047,36%@2045,6%@2037,14%@2038,2%@2041] EMC 7%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2345/7854MB (lfb 1124x4MB) cpu [12%@2035,100%@2047,20%@2046,3%@2033,12%@2035,24%@2036] EMC 8%@1866 APE 150 VDE 1203 GR3D 11%@1300
RAM 2339/7854MB (lfb 1123x4MB) cpu [18%@2002,100%@2065,42%@2049,18%@2005,13%@2006,8%@2007] EMC 8%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2377/7854MB (lfb 1114x4MB) cpu [17%@2035,100%@2050,1%@2051,0%@2037,5%@2042,7%@2036] EMC 5%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2431/7854MB (lfb 1103x4MB) cpu [0%@1998,100%@2051,1%@2048,1%@2003,2%@2004,16%@2006] EMC 4%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2483/7854MB (lfb 1092x4MB) cpu [5%@2035,100%@2049,2%@2049,3%@2037,10%@2038,10%@2038] EMC 4%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2485/7854MB (lfb 1089x4MB) cpu [1%@2034,100%@2047,2%@2048,0%@2038,11%@2037,16%@2037] EMC 3%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2504/7854MB (lfb 1086x4MB) cpu [14%@2001,100%@2048,10%@2049,0%@2004,0%@2005,6%@2004] EMC 3%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2531/7854MB (lfb 1078x4MB) cpu [2%@2000,100%@2048,8%@2048,8%@2002,13%@2003,11%@2005] EMC 4%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2534/7854MB (lfb 1077x4MB) cpu [2%@2035,100%@2049,4%@2047,4%@2037,5%@2043,25%@2037] EMC 4%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2560/7854MB (lfb 1072x4MB) cpu [0%@2003,100%@2047,5%@2049,0%@2045,3%@2047,8%@2038] EMC 3%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2568/7854MB (lfb 1069x4MB) cpu [12%@2035,100%@2049,6%@2051,0%@2037,0%@2036,4%@2037] EMC 3%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2582/7854MB (lfb 1066x4MB) cpu [13%@2036,100%@2050,5%@2050,1%@2038,3%@2038,0%@2038] EMC 3%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2591/7854MB (lfb 1064x4MB) cpu [6%@2035,100%@2049,2%@2051,11%@2037,1%@2037,3%@2037] EMC 3%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2676/7854MB (lfb 1044x4MB) cpu [7%@1992,100%@2048,3%@2049,10%@1994,0%@1997,2%@1987] EMC 3%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 1916/7854MB (lfb 1200x4MB) cpu [13%@2034,87%@2050,54%@2047,11%@2037,27%@2038,16%@2037] EMC 7%@1866 APE 150 VDE 1203 GR3D 7%@1300
RAM 2125/7854MB (lfb 1179x4MB) cpu [11%@2035,100%@2052,63%@2047,15%@2037,8%@2037,11%@2037] EMC 9%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2160/7854MB (lfb 1169x4MB) cpu [15%@2035,100%@2050,9%@2047,8%@2038,7%@2038,0%@2038] EMC 6%@1866 APE 150 VDE 1203 GR3D 0%@1300
RAM 2342/7854MB (lfb 1124x4MB) cpu [7%@2035,100%@2053,18%@2051,22%@2036,7%@2037,12%@2038] EMC 6%@1866 APE 150 VDE 1203 GR3D 0%@1300
^C

– Do you run another GPU program at the same time?
I just ran the cuDNN example.
I have tried rebooting the TX2 and running the example again, and it gives the same error.

Hi,

May I know which program you executed?
Could you share the results of our RNN sample?

Thanks.

Hi,
Here is the program I executed:

// This is a demo code for using a SSD model to do detection.
// The code is modified from examples/cpp_classification/classification.cpp.
// Usage:
//    ssd_detect [FLAGS] model_file weights_file list_file
//
// where model_file is the .prototxt file defining the network architecture, and
// weights_file is the .caffemodel file containing the network parameters, and
// list_file contains a list of image files with the format as follows:
//    folder/img1.JPEG
//    folder/img2.JPEG
// list_file can also contain a list of video files with the format as follows:
//    folder/video1.mp4
//    folder/video2.mp4
//
#define USE_OPENCV

#include <caffe/caffe.hpp>
#ifdef USE_OPENCV
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#endif  // USE_OPENCV
#include <algorithm>
#include <iomanip>
#include <iosfwd>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include <stdio.h>
//#include <iostream>
//#include <sstream>


#ifdef USE_OPENCV
using namespace caffe;  // NOLINT(build/namespaces)

class Detector {
 public:
  Detector(const string& model_file,
           const string& weights_file,
           const string& mean_file,
           const string& mean_value);

  std::vector<vector<float> > Detect(const cv::Mat& img);

 private:
  void SetMean(const string& mean_file, const string& mean_value);

  void WrapInputLayer(std::vector<cv::Mat>* input_channels);

  void Preprocess(const cv::Mat& img,
                  std::vector<cv::Mat>* input_channels);

 private:
  shared_ptr<Net<float> > net_;
  cv::Size input_geometry_;
  int num_channels_;
  cv::Mat mean_;
};

Detector::Detector(const string& model_file,
                   const string& weights_file,
                   const string& mean_file,
                   const string& mean_value) {
#ifdef CPU_ONLY
  Caffe::set_mode(Caffe::CPU);
#else
  Caffe::set_mode(Caffe::GPU);
#endif

  /* Load the network. */
  net_.reset(new Net<float>(model_file, TEST));
  net_->CopyTrainedLayersFrom(weights_file);

  CHECK_EQ(net_->num_inputs(), 1) << "Network should have exactly one input.";
  CHECK_EQ(net_->num_outputs(), 1) << "Network should have exactly one output.";

  Blob<float>* input_layer = net_->input_blobs()[0];
  num_channels_ = input_layer->channels();
  CHECK(num_channels_ == 3 || num_channels_ == 1)
    << "Input layer should have 1 or 3 channels.";
  input_geometry_ = cv::Size(input_layer->width(), input_layer->height());

  /* Load the binaryproto mean file. */
  SetMean(mean_file, mean_value);
}

std::vector<vector<float> > Detector::Detect(const cv::Mat& img) {
  Blob<float>* input_layer = net_->input_blobs()[0];
  input_layer->Reshape(1, num_channels_,
                       input_geometry_.height, input_geometry_.width);
  /* Forward dimension change to all layers. */
  net_->Reshape();

  std::vector<cv::Mat> input_channels;
  WrapInputLayer(&input_channels);

  Preprocess(img, &input_channels);

  LOG(INFO)<<"net_->Forward() s1!";
  net_->Forward();

  LOG(INFO)<<"net_->Forward() s2!";

  /* Copy the output layer to a std::vector */
  Blob<float>* result_blob = net_->output_blobs()[0];
  const float* result = result_blob->cpu_data();
  const int num_det = result_blob->height();
  vector<vector<float> > detections;
  for (int k = 0; k < num_det; ++k) {
    if (result[0] == -1) {
      // Skip invalid detection.
      result += 7;
      continue;
    }
    vector<float> detection(result, result + 7);
    detections.push_back(detection);
    result += 7;
  }
  return detections;
}

/* Load the mean file in binaryproto format. */
void Detector::SetMean(const string& mean_file, const string& mean_value) {
  cv::Scalar channel_mean;
  if (!mean_file.empty()) {
    CHECK(mean_value.empty()) <<
      "Cannot specify mean_file and mean_value at the same time";
    BlobProto blob_proto;
    ReadProtoFromBinaryFileOrDie(mean_file.c_str(), &blob_proto);

    /* Convert from BlobProto to Blob<float> */
    Blob<float> mean_blob;
    mean_blob.FromProto(blob_proto);
    CHECK_EQ(mean_blob.channels(), num_channels_)
      << "Number of channels of mean file doesn't match input layer.";

    /* The format of the mean file is planar 32-bit float BGR or grayscale. */
    std::vector<cv::Mat> channels;
    float* data = mean_blob.mutable_cpu_data();
    for (int i = 0; i < num_channels_; ++i) {
      /* Extract an individual channel. */
      cv::Mat channel(mean_blob.height(), mean_blob.width(), CV_32FC1, data);
      channels.push_back(channel);
      data += mean_blob.height() * mean_blob.width();
    }

    /* Merge the separate channels into a single image. */
    cv::Mat mean;
    cv::merge(channels, mean);

    /* Compute the global mean pixel value and create a mean image
     * filled with this value. */
    channel_mean = cv::mean(mean);
    mean_ = cv::Mat(input_geometry_, mean.type(), channel_mean);
  }
  if (!mean_value.empty()) {
    CHECK(mean_file.empty()) <<
      "Cannot specify mean_file and mean_value at the same time";
    stringstream ss(mean_value);
    vector<float> values;
    string item;
    while (getline(ss, item, ',')) {
      float value = std::atof(item.c_str());
      values.push_back(value);
    }
    CHECK(values.size() == 1 || values.size() == num_channels_) <<
      "Specify either 1 mean_value or as many as channels: " << num_channels_;

    std::vector<cv::Mat> channels;
    for (int i = 0; i < num_channels_; ++i) {
      /* Extract an individual channel. */
      cv::Mat channel(input_geometry_.height, input_geometry_.width, CV_32FC1,
          cv::Scalar(values[i]));
      channels.push_back(channel);
    }
    cv::merge(channels, mean_);
  }
}

/* Wrap the input layer of the network in separate cv::Mat objects
 * (one per channel). This way we save one memcpy operation and we
 * don't need to rely on cudaMemcpy2D. The last preprocessing
 * operation will write the separate channels directly to the input
 * layer. */
void Detector::WrapInputLayer(std::vector<cv::Mat>* input_channels) {
  Blob<float>* input_layer = net_->input_blobs()[0];

  int width = input_layer->width();
  int height = input_layer->height();
  float* input_data = input_layer->mutable_cpu_data();
  for (int i = 0; i < input_layer->channels(); ++i) {
    cv::Mat channel(height, width, CV_32FC1, input_data);
    input_channels->push_back(channel);
    input_data += width * height;
  }
}

void Detector::Preprocess(const cv::Mat& img,
                            std::vector<cv::Mat>* input_channels) {
  /* Convert the input image to the input image format of the network. */
  cv::Mat sample;
  if (img.channels() == 3 && num_channels_ == 1)
    cv::cvtColor(img, sample, cv::COLOR_BGR2GRAY);
  else if (img.channels() == 4 && num_channels_ == 1)
    cv::cvtColor(img, sample, cv::COLOR_BGRA2GRAY);
  else if (img.channels() == 4 && num_channels_ == 3)
    cv::cvtColor(img, sample, cv::COLOR_BGRA2BGR);
  else if (img.channels() == 1 && num_channels_ == 3)
    cv::cvtColor(img, sample, cv::COLOR_GRAY2BGR);
  else
    sample = img;

  cv::Mat sample_resized;
  if (sample.size() != input_geometry_)
    cv::resize(sample, sample_resized, input_geometry_);
  else
    sample_resized = sample;

  cv::Mat sample_float;
  if (num_channels_ == 3)
    sample_resized.convertTo(sample_float, CV_32FC3);
  else
    sample_resized.convertTo(sample_float, CV_32FC1);

  cv::Mat sample_normalized;
  cv::subtract(sample_float, mean_, sample_normalized);

  /* This operation will write the separate BGR planes directly to the
   * input layer of the network because it is wrapped by the cv::Mat
   * objects in input_channels. */
  cv::split(sample_normalized, *input_channels);

  CHECK(reinterpret_cast<float*>(input_channels->at(0).data)
        == net_->input_blobs()[0]->cpu_data())
    << "Input channels are not wrapping the input layer of the network.";
}

DEFINE_string(mean_file, "",
    "The mean file used to subtract from the input image.");
DEFINE_string(mean_value, "104,117,123",
    "If specified, can be one value or can be same as image channels"
    " - would subtract from the corresponding channel). Separated by ','."
    "Either mean_file or mean_value should be provided, not both.");
DEFINE_string(file_type, "image",
    "The file type in the list_file. Currently support image and video.");
DEFINE_string(out_file, "",
    "If provided, store the detringtection results in the out_file.");
DEFINE_double(confidence_threshold, 0.01,
    "Only store detections with score higher than the threshold.");

int main() {
  //::google::InitGoogleLogging(argv[0]);
  // Print output to stderr (while still logging)
  FLAGS_alsologtostderr = 1;

#ifndef GFLAGS_GFLAGS_H_
  namespace gflags = google;
#endif

  gflags::SetUsageMessage("Do detection using SSD mode.\n"
        "Usage:\n"
        "    ssd_detect [FLAGS] model_file weights_file list_file\n");
  //gflags::ParseCommandLineFlags(&argc, &argv, true);

  //  if (argc < 4) {
  //    gflags::ShowUsageWithFlagsRestrict(argv[0], "examples/ssd/ssd_detect");
  //    return 1;
  //  }

   LOG(INFO)<<"SSD Time Test 1!";
   int i = 0;

  while(i < 100)
  {
    const string& model_file = "/home/nvidia/tom/ssd-test/deploy.prototxt";
    const string& weights_file = "/home/nvidia/tom/ssd-test/VGG_CSTO_SSD_300x300_iter_166000.caffemodel";
    const string& mean_file = FLAGS_mean_file;
    const string& mean_value = FLAGS_mean_value;
    const string& file_type = FLAGS_file_type;
    const string& out_file = FLAGS_out_file;
    const float confidence_threshold = FLAGS_confidence_threshold;


    // Initialize the network.
    Detector detector(model_file, weights_file, mean_file, mean_value);
    LOG(INFO)<<"SSD Time Test 2!";
    // LOG(INFO)<<"SSD Test!";
    printf("good!");

    // Set the output mode.
    std::streambuf* buf = std::cout.rdbuf();
    std::ofstream outfile;
    if (!out_file.empty()) {
      outfile.open(out_file.c_str());
      if (outfile.good()) {
        buf = outfile.rdbuf();
      }
    }
    std::ostream out(buf);

    // Process image one by one.
    // std::ifstream infile(argv[3]);
    std::string file;
    {
      {
        LOG(INFO)<<"SSD Time Test 3!";
        const char* img_path = "/home/nvidia/tom/ssd-test/12.jpg";


        LOG(INFO)<<img_path;

        cv::Mat img = cv::imread(img_path);
        LOG(INFO)<<"SSD Time Test 4!";
        CHECK(!img.empty()) << "Unable to decode image " << file;
        std::vector<vector<float> > detections = detector.Detect(img);
        LOG(INFO)<<"SSD Time Test 5!";

        /* Print the detection results. */
        for (int i = 0; i < detections.size(); ++i)
        {
          const vector<float>& d = detections[i];
          // Detection format: [image_id, label, score, xmin, ymin, xmax, ymax].
          CHECK_EQ(d.size(), 7);
          const float score = d[2];
          if (score >= confidence_threshold)
          {
            out << file << " ";
            out << static_cast<int>(d[1]) << " ";
            out << score << " ";
            out << static_cast<int>(d[3] * img.cols) << " ";
            out << static_cast<int>(d[4] * img.rows) << " ";
            out << static_cast<int>(d[5] * img.cols) << " ";
            out << static_cast<int>(d[6] * img.rows) << std::endl;
          }
        }
        LOG(INFO)<<"SSD Time Test 6!";
      }
    }
   out << std::endl;
   i++;
  }
  LOG(INFO)<<"end!";
  return 0;
}
#else
int main() {
  ;//LOG(FATAL) << "This example requires OpenCV; compile with USE_OPENCV.";
}
#endif  // USE_OPENCV

And the results of the RNN sample:

nvidia@tegra-ubuntu:~$ sudo ./jetson_clocks.sh
[sudo] password for nvidia: 
nvidia@tegra-ubuntu:~$ sudo ./jetson_clocks.sh --show
SOC family:tegra186  Machine:quill
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu2: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu3: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=2035200 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=1300500000 MaxFreq=1300500000 CurrentFreq=1300500000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=1
Fan: speed=255
nvidia@tegra-ubuntu:~$ cd /usr/src/cudnn_samples_v5/RNN 
nvidia@tegra-ubuntu:/usr/src/cudnn_samples_v5/RNN$ ./RNN 100 100 100 64 2
Forward:  80 GFLOPS
Backward: 179 GFLOPS, (131 GFLOPS), (279 GFLOPS)
Segmentation fault (core dumped)
nvidia@tegra-ubuntu:/usr/src/cudnn_samples_v5/RNN$

Hi,

Is there any progress?

Hi,

Not sure why your comment was flagged as spam by the system automatically.
Just got your feedback. Will check soon.

Thanks.

Hi,

Could you try this sample? We want to confirm whether it is a HW or SW issue first.
Please share the ./tegrastats results.
Thanks.

#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) {
    while(true)  y[i] = a*x[i] + y[i];  // intentional endless loop so the GPU load stays visible in tegrastats
  }
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
$ nvcc topic_1019020.cu -o test
$ ./test
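
If the SAXPY test also shows 0% GPU, an even smaller sanity check (only a sketch, using nothing beyond the standard CUDA runtime API) can rule out a driver or toolkit problem before looking at the application code:

#include <stdio.h>

__global__ void noop() {}

int main(void)
{
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess || count == 0) {
    // No usable device: points at the driver/toolkit, not at SSD or Caffe.
    printf("No CUDA device found: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Launch a trivial kernel and report any configuration or launch error.
  noop<<<1, 1>>>();
  err = cudaGetLastError();
  printf("launch: %s\n", cudaGetErrorString(err));
  err = cudaDeviceSynchronize();
  printf("sync:   %s\n", cudaGetErrorString(err));
  return err == cudaSuccess ? 0 : 1;
}

Build and run it the same way as the sample above.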

Hi,

The sample

nvidia@tegra-ubuntu:~/tom$ nvcc topic_1019020.cu  -o test
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvidia@tegra-ubuntu:~/tom$ ./test

And the ./tegrastats results

nvidia@tegra-ubuntu:~$ sudo ./tegrastats
RAM 1460/7854MB (lfb 1352x4MB) cpu [0%@1269,0%@1115,0%@1111,0%@1079,0%@1103,0%@1083] EMC 6%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1460/7854MB (lfb 1352x4MB) cpu [5%@345,0%@362,4%@356,16%@347,4%@347,13%@352] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1461/7854MB (lfb 1352x4MB) cpu [24%@344,0%@355,6%@355,9%@344,7%@345,8%@345] EMC 7%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1461/7854MB (lfb 1352x4MB) cpu [13%@345,8%@360,6%@355,11%@345,16%@345,22%@345] EMC 9%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1462/7854MB (lfb 1352x4MB) cpu [8%@346,0%@361,7%@354,11%@345,4%@346,18%@345] EMC 8%@665 APE 150 VDE 1203 GR3D 0%@140
RAM 1491/7854MB (lfb 1352x4MB) cpu [15%@1266,64%@2052,13%@2047,13%@1269,21%@1269,5%@1269] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1032
RAM 1476/7854MB (lfb 1352x4MB) cpu [5%@349,100%@2061,3%@2058,6%@347,9%@347,6%@346] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1476/7854MB (lfb 1352x4MB) cpu [6%@349,100%@2042,3%@2038,4%@345,16%@347,10%@347] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1491/7854MB (lfb 1352x4MB) cpu [7%@1420,100%@2002,2%@1997,10%@1422,7%@1424,16%@1422] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1477/7854MB (lfb 1352x4MB) cpu [10%@351,100%@2059,3%@2058,1%@347,4%@348,11%@347] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1476/7854MB (lfb 1352x4MB) cpu [13%@348,100%@2059,4%@2057,1%@347,14%@347,4%@348] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1491/7854MB (lfb 1352x4MB) cpu [16%@807,100%@2035,3%@2034,4%@809,11%@808,10%@808] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1477/7854MB (lfb 1352x4MB) cpu [22%@349,100%@2057,4%@2059,0%@347,6%@346,3%@347] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1477/7854MB (lfb 1352x4MB) cpu [6%@347,100%@2041,12%@2041,15%@348,14%@347,4%@347] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1493/7854MB (lfb 1351x4MB) cpu [3%@1574,100%@2030,3%@2031,16%@1575,17%@1575,11%@1576] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1477/7854MB (lfb 1351x4MB) cpu [3%@345,100%@2040,3%@2040,8%@348,5%@345,9%@347] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1477/7854MB (lfb 1351x4MB) cpu [3%@350,100%@2041,2%@2039,2%@349,5%@347,14%@348] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1477/7854MB (lfb 1351x4MB) cpu [14%@349,100%@2061,4%@2058,4%@345,2%@347,1%@348] EMC 2%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1476/7854MB (lfb 1351x4MB) cpu [3%@350,100%@2086,2%@2040,14%@345,2%@347,4%@348] EMC 2%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1476/7854MB (lfb 1351x4MB) cpu [4%@349,100%@2060,3%@2057,14%@348,2%@347,4%@347] EMC 2%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1477/7854MB (lfb 1351x4MB) cpu [14%@351,100%@2060,5%@2059,2%@347,2%@347,3%@347] EMC 2%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1476/7854MB (lfb 1351x4MB) cpu [3%@345,100%@2058,3%@2056,2%@348,16%@347,2%@348] EMC 2%@1866 APE 150 VDE 1203 GR3D 99%@1300
RAM 1476/7854MB (lfb 1352x4MB) cpu [10%@345,100%@2056,5%@2058,9%@347,12%@347,30%@347] EMC 3%@1866 APE 150 VDE 1203 GR3D 99%@1300

And jetson_clocks.sh --show

nvidia@tegra-ubuntu:~$ sudo ~/jetson_clocks.sh --show
SOC family:tegra186  Machine:quill
Online CPUs: 0-5
CPU Cluster Switching: Disabled
cpu0: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu1: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu2: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu3: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu4: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
cpu5: Gonvernor=schedutil MinFreq=345600 MaxFreq=2035200 CurrentFreq=2035200
GPU MinFreq=140250000 MaxFreq=1300500000 CurrentFreq=1300500000
EMC MinFreq=40800000 MaxFreq=1866000000 CurrentFreq=1866000000 FreqOverride=0
Fan: speed=0

Thanks.

Will check with the corresponding team and get back to you soon.

Hi,

Looks like your GPU works correctly (99% GPU utilization).
Could you check the SSD source code again?
For example: the GPU flag, memory usage, …
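
As an illustration only (a sketch based on the standard Caffe and CUDA runtime APIs; adapt it to your own build), a few lines like these at program start confirm which mode Caffe ends up in and how much GPU memory the process can see:

#include <caffe/caffe.hpp>
#include <cuda_runtime.h>

int main() {
#ifdef CPU_ONLY
  LOG(ERROR) << "Built with CPU_ONLY; the GPU will never be used.";
#else
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
  // Confirm the process can actually talk to the GPU and see its memory.
  size_t free_b = 0, total_b = 0;
  CHECK_EQ(cudaMemGetInfo(&free_b, &total_b), cudaSuccess) << "Cannot query the GPU";
  LOG(INFO) << "Caffe mode: " << (caffe::Caffe::mode() == caffe::Caffe::GPU ? "GPU" : "CPU")
            << ", GPU memory free/total: " << (free_b >> 20) << "/" << (total_b >> 20) << " MB";
#endif
  return 0;
}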

Hi, I have checked the Makefile.config.
The source is provided by the author, and I have only made simple changes.
Besides, the same code can work normally on another TX2.
This is my Makefile.config:

## Refer to http://caffe.berkeleyvision.org/installation.html
# Contributions simplifying and improving our build system are welcome!

# cuDNN acceleration switch (uncomment to build with cuDNN).
USE_CUDNN := 1

# CPU-only switch (uncomment to build without GPU support).
# CPU_ONLY := 1

# uncomment to disable IO dependencies and corresponding data layers
# USE_OPENCV := 0
# USE_LEVELDB := 0
# USE_LMDB := 0

# uncomment to allow MDB_NOLOCK when reading LMDB files (only if necessary)
#	You should not set this flag if you will be reading LMDBs with any
#	possibility of simultaneous read and write
# ALLOW_LMDB_NOLOCK := 1

# Uncomment if you're using OpenCV 3
# OPENCV_VERSION := 3

# To customize your choice of compiler, uncomment and set the following.
# N.B. the default for Linux is g++ and the default for OSX is clang++
# CUSTOM_CXX := g++

# CUDA directory contains bin/ and lib/ directories that we need.
CUDA_DIR := /usr/local/cuda
# On Ubuntu 14.04, if cuda tools are installed via
# "sudo apt-get install nvidia-cuda-toolkit" then use this instead:
# CUDA_DIR := /usr

# CUDA architecture setting: going with all of them.
# For CUDA < 6.0, comment the lines after *_35 for compatibility.
CUDA_ARCH :=  -gencode arch=compute_62,code=sm_62

# BLAS choice:
# atlas for ATLAS (default)
# mkl for MKL
# open for OpenBlas
 BLAS := atlas
#BLAS := open
# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.
# Leave commented to accept the defaults for your choice of BLAS
# (which should work)!
# BLAS_INCLUDE := /path/to/your/blas
# BLAS_LIB := /path/to/your/blas

# Homebrew puts openblas in a directory that is not on the standard search path
# BLAS_INCLUDE := $(shell brew --prefix openblas)/include
# BLAS_LIB := $(shell brew --prefix openblas)/lib

# This is required only if you will compile the matlab interface.
# MATLAB directory should contain the mex binary in /bin.
# MATLAB_DIR := /usr/local
# MATLAB_DIR := /Applications/MATLAB_R2012b.app

# NOTE: this is required only if you will compile the python interface.
# We need to be able to find Python.h and numpy/arrayobject.h.
PYTHON_INCLUDE := /usr/include/python2.7 \
		/usr/lib/python2.7/dist-packages/numpy/core/include
# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
# ANACONDA_HOME := $(HOME)/anaconda2
# PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
		$(ANACONDA_HOME)/include/python2.7 \
		$(ANACONDA_HOME)/lib/python2.7/site-packages/numpy/core/include \

# Uncomment to use Python 3 (default is Python 2)
# PYTHON_LIBRARIES := boost_python3 python3.5m
# PYTHON_INCLUDE := /usr/include/python3.5m \
#                 /usr/lib/python3.5/dist-packages/numpy/core/include

# We need to be able to find libpythonX.X.so or .dylib.
PYTHON_LIB := /usr/lib
# PYTHON_LIB := $(ANACONDA_HOME)/lib

# Homebrew installs numpy in a non standard path (keg only)
# PYTHON_INCLUDE += $(dir $(shell python -c 'import numpy.core; print(numpy.core.__file__)'))/include
# PYTHON_LIB += $(shell brew --prefix numpy)/lib

# Uncomment to support layers written in Python (will link against Python libs)
# WITH_PYTHON_LAYER := 1

# Whatever else you find you need goes here.
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/aarch64-linux-gnu/hdf5/serial

# If Homebrew is installed at a non standard location (for example your home directory) and you use it for general dependencies
# INCLUDE_DIRS += $(shell brew --prefix)/include
# LIBRARY_DIRS += $(shell brew --prefix)/lib

# Uncomment to use `pkg-config` to specify OpenCV library paths.
# (Usually not necessary -- OpenCV libraries are normally installed in one of the above $LIBRARY_DIRS.)
# USE_PKG_CONFIG := 1

# N.B. both build and distribute dirs are cleared on `make clean`
BUILD_DIR := build
DISTRIBUTE_DIR := distribute

# Uncomment for debugging. Does not work on OSX due to https://github.com/BVLC/caffe/issues/171
# DEBUG := 1

# The ID of the GPU that 'make runtest' will use to run unit tests.
TEST_GPUID := 0

# enable pretty build (comment to see full commands)
Q ?= @
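
For what it is worth, the CUDA_ARCH line above targets compute_62/sm_62. A quick, purely illustrative way to confirm that this matches the GPU the program actually sees (compile it with nvcc like the earlier sample):

#include <stdio.h>

int main(void)
{
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    printf("cudaGetDeviceProperties failed\n");
    return 1;
  }
  // On a TX2 this should report compute capability 6.2, matching compute_62/sm_62.
  printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
  return 0;
}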

Hi,

Thanks for the feedback.
Could you remove the SSD binary and do a clean build again?

We also found this issue strange.
GPU can reach 99% with CUDA code but shows 0% with SSD.
Not sure if there is something wrong in SSD.

Did you find anything abnormal on this platform?
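
One more data point that might help: timing the detection call while tegrastats runs in another terminal would show whether the network is really executing (just not on the GPU) or getting stuck somewhere else. A rough sketch, wrapped around the existing detector.Detect(img) call from the code you posted (it only needs <chrono> in addition to what is already included):

  auto t0 = std::chrono::steady_clock::now();
  std::vector<vector<float> > detections = detector.Detect(img);
  auto t1 = std::chrono::steady_clock::now();
  LOG(INFO) << "Detect() took "
            << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
            << " ms";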

Hi AastaLLL,

I had already run make clean and recompiled SSD before asking for your help, but it didn’t work.

So, any other suggestions?