Argus Daemon Errors - Max Frames Acquired

ben.lemond · May 8, 2019, 7:13pm

While running our custom program, which is based on the Argus sample code, I get Argus errors in the terminal where the program is running. The errors look like this:

(Argus) Error InvalidState:  Max frames acquired (in src/eglstream/FrameConsumerImpl.cpp, function acquireFrame(), line 266)

At the same time these errors start streaming, we get other errors in syslog:

Apr  5 13:05:10 tegra-ubuntu argus_daemon[1501]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)
Apr  5 13:05:10 tegra-ubuntu argus_daemon[1501]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 689)

The errors in syslog come so fast that they caused my disk to run out of space in about 45 minutes to 1 hour (while I was at lunch). Syslog grew by ~15GB!

Has anyone seen these errors who can explain what the messages mean and how to prevent them? Once the errors appear, the daemon stops producing frames.

ben.lemond · May 8, 2019, 8:01pm

I also noticed, scouring back through syslog, that the first few errors I get before the repeating “InvalidState” errors look like the following:

May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error NotSupported: AMR Sample data type is error, requested type is IspRawStats* (in src/components/amr/Sample.cpp, function typeError(), line 65)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error NotSupported:  (in src/components/amr/Sample.cpp, function get(), line 101)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error NotSupported:  (propagating from src/common/Amr.h, function getSampleObject(), line 488)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error NotSupported:  (propagating from src/components/ac_stages/AeAfApplyStage.cpp, function translateIspOutStatsToFrd(), line 267)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error NotSupported:  (propagating from src/components/ac_stages/AeAfApplyStage.cpp, function doHandleRequest(), line 593)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error NotSupported:  (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error NotSupported: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 992)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 689)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: (Argus) Error InvalidState:  (propagating from src/api/ScfCaptureThread.cpp, function run(), line 109)
May  8 18:46:46 tegra-ubuntu argus_daemon[1597]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 689)

ben.lemond · May 8, 2019, 9:57pm

Upon further testing, I can get the samples to fail. If you run the samples and then stress the CPU load, you will get these errors.

Download a program called stress (sudo apt-get install stress).

Run

stress --cpu 4 --timeout 120

That will max out all 4 CPU cores for 120 seconds.

If you go ahead and

tail -f /var/log/syslog

you will end up seeing the exact errors I referenced above, and argus_daemon will go crazy.

ShaneCCC · May 9, 2019, 3:23am

Which BSP version do you work on?

ben.lemond · May 9, 2019, 1:15pm

L4T28.2.1

Linux tegra-ubuntu 4.4.38 #1 SMP PREEMPT Mon Apr 8 20:49:37 IST 2019 aarch64 aarch64 aarch64 GNU/Linux

Custom BSP/device-tree/etc. provided by Nvidia partner for custom board.

ShaneCCC · May 10, 2019, 3:00am

Does the problem also reproduce with the BSP sample code while running stress?

ben.lemond · May 10, 2019, 12:29pm

Yes. The supplied source and programs exhibit the same errors while running the stress program.

ben.lemond · May 10, 2019, 1:30pm

Shane, I recreated the problem with the tegra_multimedia_api/argus/samples/multiSensor program!!!

The key was to run several of those executables at once. On our board we have 6 cameras. I rebuilt that program three times and only changed the DEFAULT_CAMERA_INDEX value each time.

/samples/multiSensor/main.cpp

namespace ArgusSamples
{
// Constants.
static const uint32_t            DEFAULT_CAPTURE_TIME  = 600; // In seconds.
static const Size2D<uint32_t>    PREVIEW_STREAM_SIZE(640, 480);
static const Rectangle<uint32_t> DEFAULT_WINDOW_RECT(0, 0, 640, 480);
static const uint32_t            DEFAULT_CAMERA_INDEX  = 4;

Program 1: DEFAULT_CAMERA_INDEX = 0
Program 2: DEFAULT_CAMERA_INDEX = 2
Program 3: DEFAULT_CAMERA_INDEX = 4

That gives me three programs with 6 total cameras.

I ran all of those simultaneously along with the stress program, and I get the same error result from Argus in syslog.
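
Rebuilding the sample once per index could probably be avoided by reading the index from the command line instead of the DEFAULT_CAMERA_INDEX constant. The following is only a minimal sketch under that assumption: parseCameraIndex is a hypothetical helper, not part of the Argus samples, and deviceCount stands for however many camera devices the program has enumerated.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not part of the Argus samples).
// Parses `text` as a decimal camera index and stores it in *index.
// Returns false, leaving *index untouched, if the text is not a plain
// non-negative number or if the value is not below deviceCount.
bool parseCameraIndex(const char* text, uint32_t deviceCount, uint32_t* index)
{
    assert(index != nullptr);

    // Require a leading digit so that signs and whitespace are rejected.
    if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text)))
        return false;

    errno = 0;
    char* end = nullptr;
    const unsigned long value = std::strtoul(text, &end, 10);

    // Reject overflow, trailing characters, and out-of-range indices.
    if (errno != 0 || *end != '\0' || value >= deviceCount)
        return false;

    *index = static_cast<uint32_t>(value);
    return true;
}
```

A single binary could then be started three times with 0, 2, and 4 as its argument, falling back to DEFAULT_CAMERA_INDEX when no argument is given. Where exactly the sample consumes the constant is not shown in this thread, so the wiring into main() is left out.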

ShaneCCC · May 13, 2019, 7:22am

I think running multiple sensors needs a lot of CPU resources. Why do you need to stress all 4 CPU cores?
Is it possible to reserve one or two CPU cores for the camera app in your case?

ben.lemond · May 13, 2019, 2:41pm

Yes, it does, and the reason we need to is specific to our industry and application. We have a board that supports 6 cameras over GMSL. It is going to be CPU intensive regardless of what cores we put it on. Nvidia needs to provide an updated library or workaround that (1) doesn’t crash under high CPU load and (2) doesn’t keep writing gigabytes of syslog when there is a failure.

david.cecil · May 13, 2019, 3:10pm

Nvidia, are you kidding us? There’s no excuse for putting out software that (a) crashes and (b) fills up the disk partition with error messages, regardless of the CPU load. We showed how your very own demo applications can repeatedly reproduce the failure, without any code changes on our end. Please help us fix it, or give us access to the source code so that we can fix it for you (and everyone else).

This is a longstanding defect in the core software. All one has to do is run a search on these forums to see that there are many others who have been having the same problem, and for several years, too. We really need you guys to make this a priority, please.

ShaneCCC · May 14, 2019, 2:55am

@david
Could you try increasing the nvargus-daemon priority as a short-term solution?

sudo top
Press r, enter the PID of nvargus-daemon, then type -20 to set the nvargus-daemon priority.
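
The same renice step can also be done programmatically with the POSIX setpriority() call, for example from a launcher that runs as root. This is a rough sketch rather than a recommended fix: setNiceValue is a hypothetical helper, and the caller is assumed to have looked up the daemon's PID already (the syslog lines in this thread show the process as argus_daemon, while this reply calls it nvargus-daemon). Lowering the nice value below its current setting normally requires root privileges.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>
#include <sys/resource.h>
#include <sys/types.h>

// Hypothetical helper: sets the nice value of process `pid`
// (pid 0 means the calling process). -20 is the highest priority,
// 19 the lowest. Returns an empty string on success, otherwise a
// description of the error.
std::string setNiceValue(pid_t pid, int niceValue)
{
    assert(pid >= 0);

    if (niceValue < -20 || niceValue > 19)
        return "nice value must be between -20 and 19";

    if (setpriority(PRIO_PROCESS, static_cast<id_t>(pid), niceValue) != 0)
        return std::string("setpriority failed: ") + std::strerror(errno);

    return std::string();
}
```

Calling setNiceValue(daemonPid, -20) corresponds to the top steps above, where daemonPid is whatever PID the daemon has on the system. As noted later in the thread, this only reduces how often the failure occurs.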

ben.lemond · May 14, 2019, 11:25am

@shaneccc,

We can definitely try all short-term solutions, and that will work, short term. We would like a firm response, though, as to whether Nvidia is going to work on a true solution to the core problem.

david.cecil · May 14, 2019, 2:44pm

Yes, nice-ing the priority of the process improves the situation, but it doesn’t fix it. It just crashes less often. And it still fills up the syslog.

ben.lemond · June 11, 2019, 8:22pm

BUMP. Still no help on this. I have noticed that it has something to do with AELock.

Every time the failure happens, the very first errors to appear are these:

Jun 11 20:22:09 tegra-ubuntu argus_daemon[7238]: SCF: Error Timeout: (propagating from src/components/amr/Snapshot.cpp, function waitForNewerSample(), line 92)
Jun 11 20:22:09 tegra-ubuntu argus_daemon[7238]: SCF_AutocontrolACSync failed to wait for an earlier frame to complete.
Jun 11 20:22:09 tegra-ubuntu argus_daemon[7238]: SCF: Error Timeout: (propagating from src/components/ac_stages/ACSynchronizeStage.cpp, function doHandleRequest(), line 126)
Jun 11 20:22:09 tegra-ubuntu argus_daemon[7238]: SCF: Error Timeout: (propagating from src/components/stages/OrderedStage.cpp, function doExecute(), line 137)
Jun 11 20:22:09 tegra-ubuntu argus_daemon[7238]: SCF: Error Timeout: Sending critical error event (in src/api/Session.cpp, function sendErrorEvent(), line 992)
Jun 11 20:22:09 tegra-ubuntu argus_daemon[7238]: SCF: Error InvalidState: Session has suffered a critical failure (in src/api/Session.cpp, function capture(), line 689)

ShaneCCC · June 12, 2019, 2:13am

Hi All
This issue will be fixed in the next release.

ben.lemond · June 12, 2019, 2:18am

ShaneCCC,

Which next release? We are currently on a base package of L4T 28.2.1 plus our custom device tree. Will Nvidia release a library patch for TX2 28.2.1 as well?

Also, is there an estimate on release date?

ShaneCCC · June 12, 2019, 2:38am

@ben.lemond
The fix is in r32.2, which should be released next month.
Because the framework gap between r28 and r32 is so large, a backport to r28.2.1 is not planned.

daniel.lorenzin · July 1, 2019, 1:47am

Hi, I’m after this fix for the Nano too.

When will it be released?

ShaneCCC · August 8, 2019, 3:23am

Hi All
JP 4.2.1 has been released. Please try this release.