segfault with TensorRT Uff SSD int8

Hi,

I’m running into a segfault while trying to run the TensorRT sample_uff_ssd app with the --int8 flag on a Jetson AGX Xavier board.

I’ve successfully run simpler examples such as the Uff MNIST sample; this is the first sample I’m trying to run with int8, which requires calibration. Without the --int8 flag it ran fine in FP32 mode and was able to identify the objects in the sample PPM images. (As part of getting FP32 mode to work, I downloaded the model, ran the script to convert the frozen graph to Uff, identified which file was the working ssd.prototxt file, which was non-obvious by the way, etc.)

For the calibration images, I downloaded the COCO 2017 val zip file and unzipped the images into a temporary directory. I then converted from jpg to PPM via ‘mogrify --format ppm *.jpg’ and moved all the resulting ppm files to /workspace/tensorrt/data/ssd. I then created a list.txt file containing the names of all the PPM files (with the ‘.ppm’ extension removed), one file per line.

I loaded the board very recently (about a week ago) using a fresh install of JetPack. Unfortunately I’m not 100% sure how to report the exact version running on the AGX board itself, so if there’s any other helpful info I can collect, let me know.

nvidia@jetson-0423418010368:~/tensorrt/bin$ ./sample_uff_ssd --int8
../data/ssd/sample_ssd_relu6.uff
Begin parsing model...
End parsing model...
Begin building engine...


Batch #0
Calibrating with file 000000000139.ppm
Calibrating with file 000000000285.ppm
Calibrating with file 000000000632.ppm
Calibrating with file 000000000724.ppm
Calibrating with file 000000000776.ppm
Calibrating with file 000000000785.ppm
Calibrating with file 000000000802.ppm
Calibrating with file 000000000872.ppm
Calibrating with file 000000000885.ppm
Calibrating with file 000000001000.ppm
Calibrating with file 000000001268.ppm
Calibrating with file 000000001296.ppm
Calibrating with file 000000001353.ppm
Calibrating with file 000000001425.ppm
Calibrating with file 000000001490.ppm
Calibrating with file 000000001503.ppm
Calibrating with file 000000001532.ppm
Calibrating with file 000000001584.ppm
Calibrating with file 000000001675.ppm
Calibrating with file 000000001761.ppm
Calibrating with file 000000001818.ppm
Calibrating with file 000000001993.ppm
Calibrating with file 000000002006.ppm
Calibrating with file 000000002149.ppm
Calibrating with file 000000002153.ppm
Calibrating with file 000000002157.ppm
Calibrating with file 000000002261.ppm
Calibrating with file 000000002299.ppm
Calibrating with file 000000002431.ppm
Calibrating with file 000000002473.ppm
Calibrating with file 000000002532.ppm
Calibrating with file 000000002587.ppm
Calibrating with file 000000002592.ppm
Calibrating with file 000000002685.ppm
Calibrating with file 000000002923.ppm
Calibrating with file 000000003156.ppm
Calibrating with file 000000003255.ppm
Calibrating with file 000000003501.ppm
Calibrating with file 000000003553.ppm
Calibrating with file 000000003661.ppm
Calibrating with file 000000003845.ppm
Calibrating with file 000000003934.ppm
Calibrating with file 000000004134.ppm
Calibrating with file 000000004395.ppm
Calibrating with file 000000004495.ppm
Calibrating with file 000000004765.ppm
Calibrating with file 000000004795.ppm
Calibrating with file 000000005001.ppm
Calibrating with file 000000005037.ppm
Calibrating with file 000000005060.ppm
Segmentation fault (core dumped)

Rerunning the debug version with gdb, I see the following stack trace:

Calibrating with file 000000005060.ppm

Thread 1 "sample_uff_ssd_" received signal SIGSEGV, Segmentation fault.
__memcpy_generic () at ../sysdeps/aarch64/multiarch/../memcpy.S:108
108	../sysdeps/aarch64/multiarch/../memcpy.S: No such file or directory.
(gdb) bt
#0  __memcpy_generic () at ../sysdeps/aarch64/multiarch/../memcpy.S:108
#1  0x0000007fab726ae4 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_assign(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#2  0x0000007fab726e3c in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator=(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#3  0x00000055555627ec in samplesCommon::readPPMFile<3, 300, 300> (filename="../data/ssd/000000000285.ppm", ppm=...) at ../common/common.h:447
#4  0x000000555555ec8c in BatchStream::update (this=0x7fffffe198) at BatchStreamPPM.h:110
#5  0x000000555555e6f4 in BatchStream::next (this=0x7fffffe198) at BatchStreamPPM.h:51
#6  0x000000555555f478 in Int8EntropyCalibrator::getBatch (this=0x7fffffe190, bindings=0x55b1b26300, names=0x55b1d76a40, nbBindings=1)
    at BatchStreamPPM.h:170
#7  0x0000007fb0974890 in nvinfer1::builder::calibrateEngine(nvinfer1::IInt8Calibrator&, nvinfer1::ICudaEngine&, std::unordered_map<std::string, float, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, float> > >&, bool) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#8  0x0000007fb0946250 in nvinfer1::builder::buildEngine(nvinfer1::CudaEngineBuildConfig&, nvinfer1::rt::HardwareContext const&, nvinfer1::Network const&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#9  0x0000007fb09b02ec in nvinfer1::builder::Builder::buildCudaEngine(nvinfer1::INetworkDefinition&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#10 0x000000555555ac7c in loadModelAndCreateEngine (uffFile=0x55558422c0 "../data/ssd/sample_ssd_relu6.uff", maxBatchSize=2, parser=0x5555824730, 
    calibrator=0x7fffffe190, trtModelStream=@0x7fffffdf50: 0x0) at sampleUffSSD.cpp:162
#11 0x000000555555b5dc in main (argc=2, argv=0x7fffffef48) at sampleUffSSD.cpp:539
(gdb)

Let me know if there’s anything else I can provide that might help. I saw a similar topic on the forum from someone who hit this inside a Docker container (linked below), but otherwise searching the forum didn’t turn up a matching issue. Apologies if I missed it!

Thanks,

  • Josh.

Link to related topic:
https://devtalk.nvidia.com/default/topic/1039120/tensorrt/sampleuffssd-int8-calibration-segmentation-fault/?offset=3#5324295

Seems pretty clear from the stack trace that it’s an issue with the PPM parser.

I played around with it a little more today by removing files from list.txt that were causing the failure. The crash seems to happen on the second file regardless of which file it is.

I put in some debug statements and figured out what’s going wrong. The ImageMagick mogrify command I was using retained the original image size, but the code appears to support only a fixed 300x300 image size. I’m testing whether adding -resize 300x300 will work.

A question for NVIDIA engineers: when you are calibrating, how do you normally resize images? I’m concerned that resizing without regard to aspect ratio may introduce artifacts or warping due to the scaling. I’ve seen other frameworks letterbox instead: resize while maintaining the aspect ratio, then fill the remaining portion of the image with some known value (e.g., black).
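
For illustration, here’s a rough sketch of what letterboxed placement into the sample’s fixed 300x300 buffer could look like, assuming the source image has already been resized (preserving aspect ratio) to fit within 300x300. The function name, centering, and fill value are my own choices for the sketch, not anything taken from the sample:

// Hypothetical helper, not part of the sample: place an aspect-ratio-preserving
// resized image (srcW x srcH, 3 interleaved channels) into the center of a
// fixed 300x300x3 buffer, filling the unused border with a constant value.
#include <cassert>
#include <cstdint>
#include <cstring>

void letterboxInto300x300(const uint8_t* src, int srcW, int srcH, uint8_t* dst, uint8_t fill = 0)
{
    constexpr int W = 300, H = 300, C = 3;
    assert(srcW > 0 && srcH > 0 && srcW <= W && srcH <= H);
    std::memset(dst, fill, W * H * C);   // initialize the whole buffer first
    const int xOff = (W - srcW) / 2;     // center horizontally
    const int yOff = (H - srcH) / 2;     // center vertically
    for (int y = 0; y < srcH; ++y)       // copy one source row at a time
    {
        std::memcpy(dst + ((y + yOff) * W + xOff) * C, src + y * srcW * C, srcW * C);
    }
}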

The main issue behind the segfault is that the buffer is sized from the template parameters C, H, and W (3, 300, 300), giving a fixed-size buffer.

template <int C, int H, int W>
struct PPM
{
    std::string magic, fileName;
    int h, w, max;
    uint8_t buffer[C * H * W]; // fixed at compile time: 3 * 300 * 300 bytes for this sample
};

In the readPPMFile function, the read call reads a number of bytes computed from the PPM header’s w and h fields (width and height), which means that if the PPM file is larger than 300x300 it will write past the end of the buffer, potentially leading to a segfault.

template <int C, int H, int W>
inline void readPPMFile(const std::string& filename, samplesCommon::PPM<C, H, W>& ppm)
{
    ppm.fileName = filename;
    std::ifstream infile(filename, std::ifstream::binary);
    assert(infile.is_open() && "Attempting to read from a file that is not open.");
    infile >> ppm.magic >> ppm.w >> ppm.h >> ppm.max;
    infile.seekg(1, infile.cur);
    printf( "PPM C=%d, H=%d, W=%d, buffer size: %d, c=%d, h=%d, w=%d, copy size: %d\n", C, H, W, sizeof(ppm.buffer), 3, ppm.w, ppm.h, ppm.w * ppm.h * 3 );
    std::cout << "flush..." << std::endl;
    assert( sizeof(ppm.buffer) >= ppm.w * ppm.h * 3 && "Is buffer large enough?");
    infile.read(reinterpret_cast<char*>(ppm.buffer), ppm.w * ppm.h * 3);
}

Is it intended that the framework can only handle 300x300 images (for example because of the network’s fixed input size chosen when the architecture was defined and trained), or is there some embedded resizing code somewhere that should allow arbitrarily sized PPM files?

Either way, could the instructions be clarified on how to generate a set of images that will reproduce your results without any ambiguity? (i.e., download the COCO 2017 validation files, run this set of commands to convert them to a known-good format, run this other set of commands to generate list.txt, etc.)

I’ve found that the instructions leave out enough details that it’s difficult to sort out whether the problem is something “we” are doing or a bug in the software itself. Unambiguous reproduction instructions would, I think, reduce wasted time for people trying out the framework.

  • J.

And, while I’m at it, I just noticed that one of the PPM files is showing up with a width and height of 0 (unknown why; presumably something went wrong for that file during the conversion), so we might want to put in some basic dimension-checking assertions.

And a bigger problem I see: it would be good to understand and define what happens when the image is not 300x300. As you can see in the output below, only part of the buffer is copied, with the rest left uninitialized when one of the dimensions is less than 300. I suspect this means you’re feeding the calibration process essentially random data in the uninitialized parts of the buffer, which I’d expect to introduce undesired variance into the calibration results. (I’ve sketched a possible fix after the output below.)

This also highlights the fact that ImageMagick actually maintains the aspect ratio when you request a resize, rather than stretching the image to fit the specified dimensions.

PPM C=3, H=300, W=300, buffer size: 270000, c=3, h=300, w=225, copy size: 202500
flush...
PPM C=3, H=300, W=300, buffer size: 270000, c=3, h=0, w=0, copy size: 0
flush...
PPM C=3, H=300, W=300, buffer size: 270000, c=3, h=300, w=199, copy size: 179100
flush...
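
As a minimal sketch of the possible fix mentioned above (my own modification, not NVIDIA’s code), readPPMFile could validate the header dimensions and zero-fill the buffer before reading, assuming the same surrounding includes and PPM struct as common.h (plus <cstring> for std::memset):

template <int C, int H, int W>
inline void readPPMFileChecked(const std::string& filename, samplesCommon::PPM<C, H, W>& ppm)
{
    ppm.fileName = filename;
    std::ifstream infile(filename, std::ifstream::binary);
    assert(infile.is_open() && "Attempting to read from a file that is not open.");
    infile >> ppm.magic >> ppm.w >> ppm.h >> ppm.max;
    infile.seekg(1, infile.cur);
    // Reject malformed or oversized headers before touching the buffer.
    assert(ppm.magic == "P6" && "Expected a binary RGB PPM file.");
    assert(ppm.w > 0 && ppm.h > 0 && "PPM header reports a zero-sized image.");
    assert(ppm.w <= W && ppm.h <= H && "Image is larger than the fixed-size buffer.");
    // Zero-fill so smaller images don't leave stale/uninitialized bytes behind.
    std::memset(ppm.buffer, 0, sizeof(ppm.buffer));
    // Note: this still packs the smaller image's rows contiguously at the start of
    // the buffer; the letterbox sketch earlier handles per-row placement instead.
    infile.read(reinterpret_cast<char*>(ppm.buffer), ppm.w * ppm.h * C);
}

Whether the right behavior is to assert, skip the file, or resize on the fly is exactly the question I’m asking above.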

Also, I’m noticing this code in update(): it assumes that file names are 7 characters long and uses that to seek to where the previous batch left off. Unfortunately, the COCO file names from the 2017 validation set are not 7 characters. It’s an easy “hack” to fix this for my one case, but you may want to update this code to deal with variable-length file names. That seems straightforward if the names were loaded into RAM in a list / vector / whatever and batches were taken as slices of it; then you wouldn’t have to reopen the file and seek each time, either (see the sketch after the snippet below).

bool update()
    {
        std::vector<std::string> fNames;

        std::ifstream file(locateFile("list.txt"));
        if (file)
        {
            std::cout << "Batch #" << mFileCount << "\n";
            file.seekg(((mBatchCount * mBatchSize)) * 7); // assumes every entry in list.txt occupies exactly 7 bytes
        }
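
For illustration, here is a rough sketch of the approach I’m describing: read list.txt once into a vector member and index into it per batch, so the entry length no longer matters. The mFileNames member is my own addition; the other names come from the sample:

// Hypothetical rework of update(): assumes a new member
// std::vector<std::string> mFileNames; added to BatchStream.
bool update()
{
    if (mFileNames.empty())                                   // load the list only once
    {
        std::ifstream file(locateFile("list.txt"));
        std::string line;
        while (std::getline(file, line))
            if (!line.empty())
                mFileNames.push_back(line);
    }

    std::cout << "Batch #" << mFileCount << "\n";
    std::vector<std::string> fNames;
    size_t start = static_cast<size_t>(mBatchCount) * mBatchSize;
    for (size_t i = start; i < mFileNames.size() && fNames.size() < static_cast<size_t>(mBatchSize); ++i)
    {
        std::cout << "Calibrating with file " << mFileNames[i] << ".ppm" << std::endl;
        fNames.push_back(mFileNames[i]);
    }
    if (fNames.size() < static_cast<size_t>(mBatchSize))
        return false;                                         // not enough files left for a full batch
    // ... read the PPM files named in fNames into the batch buffer as before ...
    return true;
}

This also avoids reopening and seeking the file on every call, as noted above.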

Well, just as an update: I was able to get the sample int8 calibration to complete. It would be nice to get answers from NVIDIA to the questions I raised above, though.

  • J.