PiecewiseLinear level 3 translation fault with nvcamera socket read error and daemon stopped functioning

For processing NVMM frames with OpenCV on CUDA, I've been using nvcamerasrc with the sample I posted at https://devtalk.nvidia.com/default/topic/1022543/jetson-tx2/gstreamer-nvmm-lt-gt-opencv-gpumat/post/5208232/#5208232. I'm using standard R28.1.

It works fine by launching:

GST_DEBUG=nvivafilter:4 gst-launch-1.0 -ev nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' ! nvivafilter customer-lib-name=./lib-gst-custom-opencv_cudaprocess.so cuda-process=true ! 'video/x-raw(memory:NVMM),format=RGBA' ! nvegltransform ! nveglglessink

but on the first launch, and only the first one after boot [EDIT: it may also happen on later launches], I get this message:

WARNING: from element /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0: A lot of buffers are being dropped.
Additional debug info:
gstbasesink.c(2854): gst_base_sink_is_too_late (): /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0:
There may be a timestamping problem, or this computer is too slow.

It seems to work anyway, but when I type Ctrl-C in the shell to stop it, I get this from gst-launch-1.0:

^Chandling interrupt.
Interrupt: Stopping pipeline ...
EOS on shutdown enabled -- Forcing EOS on the pipeline
Waiting for EOS...
Got EOS from element "pipeline0".
EOS received - stopping pipeline...
Execution ended after 0:00:08.065689353
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Socket read error. Camera Daemon stopped functioning.....
Setting pipeline to NULL ...
Freeing pipeline ...

At the same time, I see this in dmesg:

[  268.302530] PiecewiseLinear[3609]: unhandled level 3 translation fault (11) at 0x7f78644000, esr 0x92000007
[  268.312364] pgd = ffffffc076caa000
[  268.315758] [7f78644000] *pgd=000000025594b003, *pud=000000025594b003, *pmd=000000025594c003, *pte=0000000000000000

[  268.327886] CPU: 3 PID: 3609 Comm: PiecewiseLinear Not tainted 4.4.38-tegra #1
[  268.335152] Hardware name: quill (DT)
[  268.338856] task: ffffffc1a5583e80 ti: ffffffc19a788000 task.ti: ffffffc19a788000
[  268.346422] PC is at 0x7f76c30130
[  268.346423] LR is at 0x7f76c29a3c
[  268.346425] pc : [<0000007f76c30130>] lr : [<0000007f76c29a3c>] pstate: 60000000
[  268.346426] sp : 0000007f54ffe4a0
[  268.346431] x29: 0000007f54ffe9d0 x28: 0000007f54fff1e0 
[  268.346433] x27: 0000007f54ffe630 x26: 0000000000000000 
[  268.346435] x25: 0000007f764238d0 x24: 0000007f7735f000 
[  268.346437] x23: 0000000000000001 x22: 0000007f70fe7ba8 
[  268.346439] x21: 0000007f70c4bde0 x20: 0000000000000000 
[  268.346441] x19: 0000000000000000 x18: 0000007f71c1ed58 
[  268.346442] x17: 0000007f77c94760 x16: 0000007f77297ce0 
[  268.346444] x15: 0000000000000028 x14: 0000000000000000 
[  268.346446] x13: 0000000000000000 x12: 0000000000000001 
[  268.346447] x11: 0000007f771d3fe0 x10: 0000007f54ffe680 
[  268.346449] x9 : 0000000000000001 x8 : 0000000000000000 
[  268.346451] x7 : 0000000000000020 x6 : 0000000000000000 
[  268.346452] x5 : 0000007f54fff8d0 x4 : 0000000000000000 
[  268.346454] x3 : 0000000000000000 x2 : 0000007f78644000 
[  268.346455] x1 : 0000007f54ffe4b0 x0 : 0000007f70c6d770 

[  268.346462] Library at 0x7f76c30130: 0x7f769be000 /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
[  268.346463] Library at 0x7f76c29a3c: 0x7f769be000 /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
[  268.346464] vdso base = 0x7f78693000

It works anyway, and on subsequent launches this error no longer appears.
Boosting the Jetson makes no difference, as far as I can see.
I've copied my custom lib onto eMMC, so it doesn't seem related to external disk access time.

Any idea what's going wrong?

With LD_PRELOAD=<path_to_my_custom_lib>, it still says it's too slow, but it doesn't fault anymore.
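
For reference, that means prepending it to the same pipeline (the lib path being whatever your custom lib is):

LD_PRELOAD=<path_to_my_custom_lib> gst-launch-1.0 -ev nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' ! nvivafilter customer-lib-name=./lib-gst-custom-opencv_cudaprocess.so cuda-process=true ! 'video/x-raw(memory:NVMM),format=RGBA' ! nvegltransform ! nveglglessink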

[EDIT: Well, this time it faulted on the second launch! It's very reproducible: a few trials should trigger it.]

Furthermore, I'm a bit surprised by the numbering:

/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1

Could someone explain what libcuda 1.1 refers to?

This may be reaching a bit, but perhaps you can get slightly more information by running it under strace to see what the actual system call is at the moment of failure, e.g.:

strace -oTraceLog.txt <command>
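
Since gst-launch-1.0 spawns several threads, it may also help to follow children and split the log per thread (standard strace options; with -ff each thread gets its own TraceLog.<pid> file):

strace -f -ff -o TraceLog <command>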

Attached is the strace log.
TraceLog.txt (323 KB)

It seems not to be related to my Sobel filter; I get the same with @DaneLLL's original gaussianBlur filter.

I should mention that opencv4tegra has moved to /usr/local/opencv4tegra-2.4.13 on my TX2, but I'm building accordingly and setting LD_LIBRARY_PATH as well.

Not much use there; some runtime function is failing. If you run under gdb you could perhaps go to the offending frame and then disassemble ("disassemble /m"), and this might lead to some location in the library from "objdump -D /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1", but I think it is going to require source code to determine whether the fault is in libcuda or in how libcuda was called. You might run "sha1sum -c /etc/nv_tegra_release" just to see if everything is in place.
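
A minimal session along those lines (a sketch; on a stripped library "disassemble /m" won't interleave source, but a raw address range still disassembles):

gdb gst-launch-1.0
(gdb) run -ev nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' ! nvivafilter customer-lib-name=./lib-gst-custom-opencv_cudaprocess.so cuda-process=true ! 'video/x-raw(memory:NVMM),format=RGBA' ! nvegltransform ! nveglglessink
# ...wait for the SIGSEGV, then:
(gdb) bt
(gdb) info symbol $pc
(gdb) disassemble $pc-32,$pc+32   # works even without symbols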

sha1sum says everything is OK.
I don't think I have any debug version of nvcamerasrc to use with gdb… is there one in standard L4T?
I'm still puzzled by the libcuda 1.1 version numbering.

Just for reference, if someone from NVIDIA is interested in this: with opencv-3.3.0 I need more trials to get the fault, but the resulting strace log is attached.
Trace-3.3.0.log.txt (1.61 MB)

I don't know about the numbering, which is why I was thinking of strace and disassembly (the "disassemble /m" command in gdb). To disassemble you don't need source code… and objdump may provide information if you can match the disassembly against its output. This could provide knowledge of which specific function/symbol in the library failed. Knowing this, someone from NVIDIA could figure out which arguments might cause this if it is failing because of data/call arguments rather than a bug in the library.
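
As a sketch of that matching, using the addresses from the dmesg trace above (PC 0x7f76c30130 minus the reported library base 0x7f769be000 gives offset 0x272130 into libcuda; this assumes the library's virtual addresses start at 0, which is typical for a shared object):

objdump -d --start-address=0x272100 --stop-address=0x272160 /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1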

To disassemble with some understanding, I would need a non-stripped library… I'll check what I can get.
But it looks to me like the problem comes from nvcamerasrc and its daemon, in some gstreamer linkage with nvegl…

With opencv-3.3.0, when launching this a few times:

GST_DEBUG=nvcamerasrc:3 gst-launch-1.0 -ev nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' ! nvivafilter customer-lib-name=./lib-gst-custom-opencv_cudaprocess.so cuda-process=true ! 'video/x-raw(memory:NVMM),format=RGBA' ! nvegltransform ! nveglglessink

I get the error with these logs:

...
0:00:01.235446832  3180       0x5b7c50 INFO             nvcamerasrc gstnvcamerasrc.cpp:2181:send_request_to_camera_daemon:<nvcamerasrc0> Passing FD to server

0:00:01.235507120  3180       0x5b7c50 INFO             nvcamerasrc gstnvcamerasrc.cpp:2199:send_request_to_camera_daemon:<nvcamerasrc0> CameraDaemon_REQ METADATA REGISTER BUFFER

0:00:01.235544849  3180       0x5b7c50 INFO             nvcamerasrc gstnvcamerasrc.cpp:2206:send_request_to_camera_daemon:<nvcamerasrc0> Passing METADATA FD to server

0:00:01.235692433  3180       0x5b7e80 INFO             nvcamerasrc gstnvcamerasrc.cpp:2094:send_request_to_camera_daemon:<nvcamerasrc0> CameraDaemon_REQ ENABLE METADATA MODE = 0

^Chandling interrupt.
Interrupt: Stopping pipeline ...
EOS on shutdown enabled -- Forcing EOS on the pipeline
Waiting for EOS...
Got EOS from element "pipeline0".
EOS received - stopping pipeline...
Execution ended after 0:00:04.911179099
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
0:00:06.071833623  3180       0x589490 INFO             nvcamerasrc gstnvcamerasrc.cpp:2232:send_request_to_camera_daemon:<nvcamerasrc0> CameraDaemon_REQ TERMINATE SESSION

Socket read error. Camera Daemon stopped functioning.....
0:00:06.716165231  3180       0x589490 ERROR            nvcamerasrc gstnvcamerasrc.cpp:1924:gst_nvcamera_close:<nvcamerasrc0> NvCameraSrc: terminate session request failed
Setting pipeline to NULL ...
Freeing pipeline ...

gdb didn't help much more… the fault happens when exiting gdb.
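
One more thing worth trying (my assumption being that the daemon is the process that actually segfaults, as the syslog below suggests) would be attaching gdb to the running daemon before stopping the pipeline:

sudo gdb -p $(pidof nvcamera-daemon)
(gdb) continue
# then Ctrl-C the gst-launch-1.0 pipeline in the other terminal;
# if the daemon faults, gdb stops and "bt" shows the faulting thread
(gdb) bt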

Looking at syslog, here are the logs when it faults:

Sep  9 11:53:18 tegra-ubuntu nvcamera-daemon[1243]: nvcamera-daemon started new client thread = 547679015392
Sep  9 11:53:18 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) USB Sensors :  0 CSI Sensors :  1
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) getSource successful
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: Available Sensor modes :
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: W= 2592 H= 1944 FR= 30.000000
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: W= 2592 H= 1458 FR= 30.000000
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: W= 1280 H= 720 FR= 120.000000
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) CreateSession Successful
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) Shared Memory: mmap address= 0x7f86592000
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) getSource successful
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) Sensor Metadata: Available :  0 Sensor Metadata: W :  2592 Sensor Metadata: H :  0
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) New Resolution W = 1280 H = 720
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 99
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 100
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 102
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 103
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 105
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 106
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 108
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 109
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 111
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 112
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 114
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 115
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 117
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 118
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 132
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 133
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 135
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 136
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_BAYER fd = 138
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: RECEIVED REGISTER_BUFFER_META fd = 139
Sep  9 11:53:19 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) REQ_NVCAM_ENABLE_METADATA Enabled=0
< Here the kernel fault trace as in post #1 >
Sep  9 11:54:15 tegra-ubuntu nvcamera-daemon[1243]: (547679015392) Closing Client Connection Thread: nbytes -1
Sep  9 11:54:15 tegra-ubuntu systemd[1]: nvcamera-daemon.service: Main process exited, code=killed, status=11/SEGV
Sep  9 11:54:17 tegra-ubuntu systemd[1]: nvcamera-daemon.service: Unit entered failed state.
Sep  9 11:54:17 tegra-ubuntu systemd[1]: nvcamera-daemon.service: Failed with result 'signal'.
Sep  9 11:54:18 tegra-ubuntu systemd[1]: nvcamera-daemon.service: Service hold-off time over, scheduling restart.
Sep  9 11:54:18 tegra-ubuntu systemd[1]: Stopped nvcamera daemon.
Sep  9 11:54:18 tegra-ubuntu systemd[1]: Started nvcamera daemon.
Sep  9 11:54:18 tegra-ubuntu nvcamera-daemon[6596]: Started nvcamera daemon...
Sep  9 11:54:18 tegra-ubuntu nvcamera-daemon[6596]: nvcamera-daemon listening for clients to connect...

and when it doesn’t fault:

Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: nvcamera-daemon started new client thread = 547913777632
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) USB Sensors :  0 CSI Sensors :  1
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) getSource successful
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: Available Sensor modes :
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: W= 2592 H= 1944 FR= 30.000000
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: W= 2592 H= 1458 FR= 30.000000
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: W= 1280 H= 720 FR= 120.000000
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) CreateSession Successful
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) Shared Memory: mmap address= 0x7f94576000
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) getSource successful
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) Sensor Metadata: Available :  0 Sensor Metadata: W :  2592 Sensor Metadata: H :  0
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) New Resolution W = 1280 H = 720
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 99
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 100
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 102
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 103
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 105
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 106
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 108
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 109
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 111
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 112
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 114
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 115
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 117
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 118
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 132
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 133
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 135
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 136
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_BAYER fd = 138
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: RECEIVED REGISTER_BUFFER_META fd = 139
Sep  9 12:03:14 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) REQ_NVCAM_ENABLE_METADATA Enabled=0
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 137
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 139
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 134
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 136
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 119
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 133
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 116
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 118
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 113
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 115
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 110
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 112
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 107
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 109
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 104
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 106
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 101
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 103
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER fd = 98
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: CLOSING REGISTER_BUFFER_META fd = 100
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) Closing Client Connection Thread: nbytes 820
Sep  9 12:03:17 tegra-ubuntu nvcamera-daemon[6596]: (547913777632) Capture Done: Thread Exiting....

It seems that after REQ_NVCAM_ENABLE_METADATA Enabled=0, in some cases it faults, does not close the registered buffers, and closing the connection gives an error (nbytes = -1).

This is related neither to my custom lib nor to opencv.
It can be reproduced with the standard lib libnvsample_cudaprocess.so.

Just some observations to consolidate, add debug information, and confirm I get an error from the same gst-launch-1.0 command…

Reproduce (currently logged in as user “nvidia” after fresh reboot):

# DISPLAY is ":0"
export GST_DEBUG=nvivafilter:4
gst-launch-1.0 -ev nvcamerasrc \
   ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1'\
   ! nvivafilter customer-lib-name=libnvsample_cudaprocess.so cuda-process=true\
   ! 'video/x-raw(memory:NVMM),format=RGBA'\
   ! nvegltransform ! nveglglessink

On startup all works as expected. Clicking the "close" button in the upper left corner of the display does not close the application. The command line knows the window attempted to close, but it does not honor the close request and continues to display. Text on the command line:

ERROR: from element /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0: Output window was closed
Additional debug info:
/dvs/git/dirty/git-master_linux/external/gstreamer/gst-nveglglessink/ext/eglgles/gsteglglessink.c(832): gst_eglglessink_event_thread (): /GstPipeline:pipeline0/GstEglGlesSink:eglglessink0
EOS on shutdown enabled -- waiting for EOS after Error
Waiting for EOS...

Then it waits indefinitely and never stops the camera. Clicking the "close" button does nothing. The actual close only occurs after hitting Ctrl-C on the command line.

It looks like the code handling detection of the close-button click event is incorrect. Perhaps the signal generated by the click was caught but not properly handled… possibly there is some code to clean up the application on close, after which the event needs to be resent, but it was caught and never resent.

Note that I am naming "libnvsample_cudaprocess.so" since I don't have the custom "lib-gst-custom-opencv_cudaprocess.so". With the original lib I do not get any kind of dmesg kernel error; I only get a kernel dmesg error if I name a file I don't have. The close-button event catch/rethrow seems to be the real issue.

In terms of the original kernel message, I do not get the dmesg error, but there is a good possibility there are minor differences in some of our system files, since updates between the two systems are not an exact match. Your custom lib may also be more sensitive to the ignored close event… the default libnvsample_cudaprocess.so may be doing a better job of handling the broken close event than lib-gst-custom-opencv_cudaprocess.so does when close is not handled correctly by the parent application.

It is normal that closing the window has no effect, as a new window will be created for the next frame.

With gstreamer, since a camera source has no end, I usually hit Ctrl-C in the shell where I launched gst-launch-1.0 to stop it.
That is where the fault happens in this case.

I'm also facing this with the standard libnvsample_cudaprocess.so, as mentioned above.
This lib, installed in /usr/lib/aarch64-linux-gnu/, is used as the default by the nvivafilter plugin if it doesn't find the library passed via the customer-lib-name argument.
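
You can check that property and its default with the stock tooling:

gst-inspect-1.0 nvivafilter | grep -A 2 customer-lib-name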

For reproducing it (a scripted version follows these steps):

  1. Fresh boot and login (in my case as ubuntu, but I don't think this makes a difference).
  2. Launch a terminal with sudo dmesg --follow
  3. In another terminal, launch the gstreamer pipeline: gst-launch-1.0 -ev nvcamerasrc
    ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1'
    ! nvivafilter customer-lib-name=libnvsample_cudaprocess.so cuda-process=true
    ! 'video/x-raw(memory:NVMM),format=RGBA'
    ! nvegltransform ! nveglglessink
  4. Let it run for a few seconds (until you see the video), then Ctrl-C in the terminal where it was launched.
  5. If there is no kernel trace, repeat steps 3-4. After a few trials you should see the fault.
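
For convenience, here is a scripted version of steps 3-5 (a minimal sketch; the run duration and number of trials are my own choice):

#!/bin/bash
set -m   # job control, so the backgrounded pipeline gets default signal handling
# Launch the pipeline, let it run a few seconds, SIGINT it (same as Ctrl-C),
# then check dmesg for the translation fault. Repeat a few times.
for i in 1 2 3 4 5; do
  gst-launch-1.0 -ev nvcamerasrc \
    ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' \
    ! nvivafilter customer-lib-name=libnvsample_cudaprocess.so cuda-process=true \
    ! 'video/x-raw(memory:NVMM),format=RGBA' \
    ! nvegltransform ! nveglglessink &
  pid=$!
  sleep 8
  kill -INT $pid
  wait $pid
  sudo dmesg | grep -q 'translation fault' && echo "Fault seen on trial $i"
done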

Can you reproduce it?

I repeated the gst-launch-1.0 command many times and was never able to get any error in dmesg. It seems there must be some other interaction which differs between our installations. This particular system is R28.1 installed via driver package plus sample rootfs on the command line, and then updated with apt update and apt-get upgrade.

I do see a SIGSEGV on the gst-launch-1.0 command line at exit, but no dmesg/OOPS. Killing with Ctrl-C should not produce a seg fault, and there is no defined behavior for erroneous code with memory errors, so I am not entirely surprised I can't reproduce this in exactly the way your system reacts.

I did find this interesting output from valgrind (running under valgrind will cause failures to change, but some of this is likely still valid):

ARM64 front end: load_store
disInstr(arm64): unhandled instruction 0x69410C45
disInstr(arm64): 0110'1001 0100'0001 0000'1100 0100'0101
==5310== valgrind: Unrecognised instruction at address 0x5d8be34.
==5310==    at 0x5D8BE34: XSetSizeHints (in /usr/lib/aarch64-linux-gnu/libX11.so.6.3.0)
==5310== Your program just tried to execute an instruction that Valgrind
==5310== did not recognise.  There are two possible reasons for this.
==5310== 1. Your program has a bug and erroneously jumped to a non-code
==5310==    location.  If you are running Memcheck and you just saw a
==5310==    warning about a bad jump, it's probably your program's fault.
==5310== 2. The instruction is legitimate but Valgrind doesn't handle it,
==5310==    i.e. it's Valgrind's fault.  If you think this is the case or
==5310==    you are not sure, please let us know and we'll try to fix it.
==5310== Either way, Valgrind will now raise a SIGILL signal which will
==5310== probably kill your program.
==5310== 
==5310== Process terminating with default action of signal 4 (SIGILL)
==5310==  Illegal opcode at address 0x5D8BE34
==5310==    at 0x5D8BE34: XSetSizeHints (in /usr/lib/aarch64-linux-gnu/libX11.so.6.3.0)
==5310== 
==5310== HEAP SUMMARY:
==5310==     in use at exit: 4,108,063 bytes in 12,075 blocks
==5310==   total heap usage: 22,555 allocs, 10,480 frees, 6,444,151 bytes allocated
==5310== 
==5310== 928 bytes in 4 blocks are possibly lost in loss record 6 of 14
==5310==    at 0x4846F14: realloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==5310== 
==5310== 1,253 (681 direct, 572 indirect) bytes in 25 blocks are definitely lost in loss record 7 of 14
==5310==    at 0x4844B88: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==5310== 
==5310== 11,060 bytes in 49 blocks are possibly lost in loss record 9 of 14
==5310==    at 0x4844B88: malloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==5310== 
==5310== 286,107 bytes in 49 blocks are possibly lost in loss record 12 of 14
==5310==    at 0x4846CFC: calloc (in /usr/lib/valgrind/vgpreload_memcheck-arm64-linux.so)
==5310== 
==5310== LEAK SUMMARY:
==5310==    definitely lost: 681 bytes in 25 blocks
==5310==    indirectly lost: 572 bytes in 16 blocks
==5310==      possibly lost: 298,095 bytes in 102 blocks
==5310==    still reachable: 3,711,003 bytes in 11,690 blocks
==5310==                       of which reachable via heuristic:
==5310==                         length64           : 280 bytes in 7 blocks
==5310==                         newarray           : 3,696 bytes in 26 blocks
==5310==                         multipleinheritance: 2,296 bytes in 3 blocks
==5310==         suppressed: 0 bytes in 0 blocks
==5310== Reachable blocks (those to which a pointer was found) are not shown.
==5310== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==5310== 
==5310== For counts of detected and suppressed errors, rerun with: -v
==5310== ERROR SUMMARY: 4 errors from 4 contexts (suppressed: 0 from 0)
Killed

There are obvious memory issues, but this one seems more interesting than most:

Illegal opcode at address 0x5D8BE34

I couldn't tell you exactly what is wrong, but it does seem memory issues are a legitimate problem. When not running under valgrind it doesn't seem to do any harm on my system, since the errors only show up upon exit, but your case shows that there are sometimes issues which don't wait for the application to exit. It would be interesting to run this with debug symbols for all of the NV pipeline libraries! There isn't much more I can do since I can't reproduce the error except for the on-exit case.
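
For anyone wanting to repeat the valgrind run, it was essentially memcheck defaults wrapped around the same pipeline (I don't have the exact command line in front of me, so treat this as approximate):

valgrind gst-launch-1.0 -ev nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' ! nvivafilter customer-lib-name=libnvsample_cudaprocess.so cuda-process=true ! 'video/x-raw(memory:NVMM),format=RGBA' ! nvegltransform ! nveglglessink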

Thanks @Linuxdev for your help and additional testing.

I have confirmed this is related neither to nvivafilter nor to EGL.
I get the same fault on the second try with:

gst-launch-1.0 -ev nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' ! fakesink

[EDIT: When using format I420, I cannot get the fault, even when trying again later with NV12.]
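
That is, this variant never faults here:

gst-launch-1.0 -ev nvcamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=I420, framerate=120/1' ! fakesink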

Updating/upgrading with apt didn’t change anything.

This board was flashed with an image prepared by JetPack 3.1; I flashed from the command line, and post-install was done with the same JetPack.

FYI, I see no error at all from this latter gst-launch-1.0 command. I am a bit puzzled that I can reproduce the on-exit issue but cannot reproduce the dmesg output from the original pipeline problem. It may be that there is more than one bug and the bugs themselves are interacting.

The problem in my case seems related to the nvcamera-daemon started at boot time. After it has crashed and a new one has been relaunched, the fault no longer happens.
It also seems that only NV12 triggers it: once I420 has been used, the fault no longer happens.
If I use GST_DEBUG=nvcamerasrc:(any debug level), it seems I don't get the fault, and once that has been done it no longer happens.
So I'm wondering about some early-init problem of the camera daemon.
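
If that theory is right, a possible workaround (untested beyond the observations above) would be to restart the daemon once after boot, before the first capture:

sudo systemctl restart nvcamera-daemon
systemctl status nvcamera-daemon   # should show it active with a fresh PID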

I've found some debug info in /var/crash; attached.
_usr_sbin_nvcamera-daemon.0.crash.txt (81.8 KB)

I'm not somewhere I can work on it right now, but can you post the sha1sum of "/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1"? If I know it is not only the same version but also compiled with the same compiler, I might be able to match up a symbol to the start of the fault (I'm assuming the disassembly is from the top of the stack frame where this library sits during the fault).
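
That is:

sha1sum /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1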