CAN Bus Stability Issues

Hakiwen · June 16, 2017, 2:12pm

I have recently been working on a CAN program using SocketCAN on the TX2.
I successfully ran several tests of it with MCP2562 transceivers and a Teensy.

However, after working more with the interface, I ran into a problem: every time I generate traffic on the interface, using either my program or cangen from can-utils, the TX2 freezes and restarts.
Rebuilding the kernel resolved this the last time I encountered it, but it did not work this time, so I suspect it was never an actual solution.

Has anyone run into a similar problem?

Thank you

spatra · June 19, 2017, 4:40am

Hi Hakiwen,

Thank you for the question.

Please share the detailed scenario (including the steps involved in your setup) and the kernel logs so that we can check further.

Hakiwen · June 19, 2017, 2:30pm

I do not currently have the accompanying CAN bus attached. In the past when I generated CAN traffic with no bus attached, the interface went into a bus-off state, but did not crash the system.

My tegra18_defconfig file has the following entries set:
CONFIG_CAN=y
CONFIG_MTTCAN=y

The entire file can be found here:
https://pastebin.com/RpfgYEx2

Currently I am just running the following:

I first log the message “testing logger” to mark the start of execution in syslog.

I then run the following script:

#! /bin/bash

logger "Setting up CAN interface"
sudo ip link set can0 type can bitrate 250000 restart-ms 100

sudo ip link set up can0

logger "Generating CAN messages"

cangen can0

The syslog output is:

Jun 19 13:52:36 tegra-ubuntu nvidia: testing logger
Jun 19 13:52:36 tegra-ubuntu rsyslogd-2007: action ‘action 9’ suspended, next retry is Mon Jun 19 13:53:06 2017 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Jun 19 13:52:44 tegra-ubuntu nvidia: Setting up CAN interface
Jun 19 13:53:40 tegra-ubuntu rsyslogd: [origin software=“rsyslogd” swVersion=“8.16.0” x-pid=“729” x-info=“http://www.rsyslog.com”] start
Jun 19 13:53:40 tegra-ubuntu rsyslogd-2222: command ‘KLogPermitNonKernelFacility’ is currently not permitted - did you already set it via a RainerScript command (v6+ config)? [v8.16.0 try http://www.rsyslog.com/e/2222 ]
Jun 19 13:53:40 tegra-ubuntu rsyslogd: rsyslogd’s groupid changed to 115
Jun 19 13:53:40 tegra-ubuntu rsyslogd: rsyslogd’s userid changed to 108
Jun 19 13:53:40 tegra-ubuntu systemd-modules-load[232]: Inserted module ‘bluedroid_pm’
Jun 19 13:53:40 tegra-ubuntu loadkeys[216]: Loading /etc/console-setup/cached.kmap.gz
Jun 19 13:53:40 tegra-ubuntu systemd-modules-load[232]: Module ‘nvhost_vi’ is builtin
Jun 19 13:53:40 tegra-ubuntu kernel: [ 0.000000] Booting Linux on physical CPU 0x100

However, as you can see, the “Generating CAN messages” entry is not present, so I ran tail -f syslog while this happened, and it displayed more. I suppose the log buffer is displayed but not written to disk before the system restarts. I took a picture (it was the only reliable way to record the output).

You can find it here:
http://imgur.com/A0nYMp2

The kernel logs for the same time period are:
Jun 19 13:48:35 tegra-ubuntu gnome-session-binary[1533]: Entering running state
Jun 19 13:53:40 tegra-ubuntu kernel: [ 0.000000] Booting Linux on physical CPU 0x100

Is there anything else that I could share?

spatra · June 27, 2017, 6:10am

I think the above information is enough for us to reproduce the issue on our local setup.

We are checking the issue now.

In the meantime, can you also check what output you get if the restart-ms option is dropped during ip link setup, i.e.:

logger "Setting up CAN interface"
sudo ip link set can0 type can bitrate 250000
sudo ip link set up can0
logger "Generating CAN messages"
cangen can0

This should work, and you should see the corresponding logger entries.
Please check with this sequence and let us know.

Hakiwen · June 27, 2017, 9:50am

This worked, although I will have to find a workaround for excluding that parameter during normal operation.

Here is the relevant syslog section:

Jun 27 09:32:30 tegra-ubuntu nvidia: testing logger
Jun 27 09:32:39 tegra-ubuntu nvidia: Setting up CAN interface
Jun 27 09:32:39 tegra-ubuntu rsyslogd-2007: action ‘action 9’ suspended, next retry is Tue Jun 27 09:34:09 2017 [v8.16.0 try http://www.rsyslog.com/e/2007 ]
Jun 27 09:32:41 tegra-ubuntu kernel: mttcan c310000.mttcan can0: Bitrate set
Jun 27 09:32:41 tegra-ubuntu kernel: mttcan_controller_config: ctrlmode 0
Jun 27 09:32:41 tegra-ubuntu kernel: mttcan c310000.mttcan can0: Bitrate set
Jun 27 09:32:41 tegra-ubuntu nvidia: Generating CAN messages
Jun 27 09:32:41 tegra-ubuntu kernel: mttcan c310000.mttcan can0: entered error warning state
Jun 27 09:32:41 tegra-ubuntu kernel: mttcan c310000.mttcan can0: entered error passive state
Jun 27 09:32:41 tegra-ubuntu kernel: mttcan c310000.mttcan can0: bus-off
Jun 27 09:32:41 tegra-ubuntu kernel: mttcan c310000.mttcan can0: entered bus off state
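
As a rough sketch of how the bus-off state seen in the log above could be detected from user space without restart-ms, the controller state can be read from the output of ip -details link show. The function names below are made up for illustration, and the parsing assumes the usual iproute2 line of the form "can state BUS-OFF (berr-counter ...)". Whether any recovery action taken afterwards avoids the crash on the TX2 is not established in this thread.

```shell
# Extract the controller state (e.g. ERROR-ACTIVE, BUS-OFF) from the text
# printed by `ip -details link show <iface>`, read on stdin.
# Assumption: the state appears on a line that begins with "can" and
# contains "state <NAME>", optionally with control-mode flags in between.
can_state_from_ip_output() {
    sed -n 's/^[[:space:]]*can .*state \([A-Z-]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: report the live state of an interface (default can0).
can_state() {
    ip -details link show "${1:-can0}" | can_state_from_ip_output
}
```

For example, a monitoring script could call can_state can0 periodically and log a message when the result is BUS-OFF.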

Thank you.

Abby_21 · July 3, 2017, 10:52pm

I have the same issue and need a solution. Please help…

spatra · July 4, 2017, 3:45am

Hi Abby,

Your setup should work without the restart-ms argument.
Please check and let us know.

Thanks & Regards,
Sandipan

Hakiwen · September 14, 2017, 8:49pm

I was unable to find a workaround with my setup. Is setting restart-ms fundamentally incompatible? If there is a solution that would allow me to set restart-ms, or if the issue will be fixed, I would greatly appreciate it.

Thank you.

Edit
Disregard this, I’ve found another issue.

asawrup · July 11, 2018, 2:27pm

spatra, we are encountering this issue as well. Is Nvidia planning to fix this issue?

spatra · July 11, 2018, 4:00pm

Hi,

Are you getting the issue even without the restart-ms parameter?

Or please share your complete steps and logs.

Thanks & Regards,
Sandipan

asawrup · July 11, 2018, 6:20pm

spatra, no we don’t see this issue after removing restart-ms. However, we’d like to re-add restart-ms and were wondering if NVIDIA is going to fix this issue given it’s reproducible.

spatra · July 12, 2018, 3:45am

Can you please describe the steps you followed to set up the CAN controller and transceiver?
Sharing logs would also be helpful.

cmackenzie · July 12, 2018, 1:56pm

To reproduce this all you need to do is configure any TX2 CAN controller using the restart-ms property. The restart-ms property is a standard parameter in the Linux CAN interface configuration options (see https://www.kernel.org/doc/Documentation/networking/can.txt for further details).

What restart-ms does is allow the CAN controller to recover from the bus-off condition, a fault isolation mode that a CAN controller may enter when an excessive number of errors is detected. The bus-off condition is defined by the Bosch CAN standard and is not specific to NVIDIA. Every CAN controller I’ve worked with gives you two options for dealing with it: 1) staying in the bus-off state permanently, which prevents all further transmissions from the affected controller until it is power cycled, or 2) automatically recovering back to normal operation after enough error-free bus operation is observed. We want our design to use option 2 because our CAN buses are regularly exposed to the environment via connectors, and spurious errors are a realistic possibility in the field.
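
As a rough illustration of the fault confinement states involved, the sketch below maps a node's transmit and receive error counters to a state. The thresholds (96 for the warning level, above 127 for error passive, transmit counter above 255 for bus-off) are the commonly documented CAN values and are stated here from general knowledge, not from this thread. The function name is made up for the example.

```shell
# Classify a CAN node's fault confinement state from its error counters.
# $1 = transmit error counter (TEC), $2 = receive error counter (REC).
# Assumed thresholds: warning at 96, error passive above 127,
# bus-off once the transmit counter exceeds 255.
can_state_from_counters() {
    local tec=$1 rec=$2
    if [ "$tec" -gt 255 ]; then
        echo "bus-off"
    elif [ "$tec" -gt 127 ] || [ "$rec" -gt 127 ]; then
        echo "error-passive"
    elif [ "$tec" -ge 96 ] || [ "$rec" -ge 96 ]; then
        echo "error-warning"
    else
        echo "error-active"
    fi
}
```

This matches the order of the kernel messages in this thread, which report error warning, then error passive, then bus-off.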

The problem here is that when a TX2 CAN controller enters the bus-off state (which, to be clear, is usually an exceptional event) and the restart-ms property is set (to define the time after which the CAN controller should be restarted to recover), the recovery process generates a kernel error that requires a power cycle to recover from. So entering the bus-off state is effectively always a permanent failure, when that does not need to be the case.

To test this, all that is required is to configure a TX2 CAN controller with the restart-ms property set and then get it to enter the bus-off state. For example, you could configure a restart-ms delay of 100 ms on can0 using commands like:

ip link set can0 type can bitrate 1000000 restart-ms 100
ip link set can0 up

To enter the bus-off state after configuring the controller, the easiest thing to do is transmit a CAN frame on a disconnected bus (any frame, transmitted using any convenient method such as a SocketCAN socket or the cansend tool from linux-can/can-utils). Here I would define a disconnected bus as a bus that contains only the TX2 CAN controller, a single external CAN transceiver connected to the TX2 CAN controller, and a single termination resistor (maybe even without any termination resistors or CAN transceiver to make things even worse electrically). This is not an electrically valid CAN bus, but that is intentional because it’s the easiest way to force the bus-off recovery logic to occur (other options would be using termination resistances that are too large, shorting CANH and CANL, etc.).
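
The steps above can be sketched as a small script. This is only an illustration: the names repro_steps, run and DRY_RUN are hypothetical, and by default the commands are printed rather than executed, since running them on an affected TX2 kernel is reported in this thread to crash the system.

```shell
# Print a command, or execute it when DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Reproduction steps for one interface:
# $1 = interface (default can0), $2 = bitrate, $3 = restart-ms delay.
repro_steps() {
    local iface=${1:-can0} bitrate=${2:-1000000} restart_ms=${3:-100}
    # Configure the controller with automatic bus-off recovery enabled.
    run ip link set "$iface" type can bitrate "$bitrate" restart-ms "$restart_ms"
    run ip link set "$iface" up
    # Send one frame on the disconnected bus to provoke bus-off.
    run cansend "$iface" 5A1#11.22.33.44.55.66.77.88
}
```

Calling repro_steps can0 prints the three commands for review; running with DRY_RUN=0 as root would execute them.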

Hopefully that is enough information for you to reproduce the problem, since you should not require any externally provided code. You just need to explicitly test one of the documented features of the Linux CAN interface and the CAN controller itself.

For reference, when we experience the problem typical output would be as follows (where pld_can is just a convenient alias we have assigned to the c320000.mttcan device using udev).

[root@MKXXXXXXXXXXXXX ~]# cansend pld_can 5A1#11.22.33.44.55.66.77.88
[ 2160.290698] mttcan c320000.mttcan pld_can: entered error warning state
[ 2160.297399] mttcan c320000.mttcan pld_can: entered error passive state
[ 2160.304082] mttcan c320000.mttcan pld_can: entered bus off state
[root@MKXXXXXXXXXXXXX ~]# [ 2160.355457] mttcan_controller_config: ctrlmode 0
[ 2160.360194] mttcan c320000.mttcan pld_can: Bitrate set
[ 2160.365446] IPv6: ADDRCONF(NETDEV_CHANGE): pld_can: link becomes ready
[ 2160.415453] ------------[ cut here ]------------
[ 2160.420066] Kernel BUG at ffffffbffc0f66d0 [verbose debug info unavailable]
[ 2160.427016] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 2160.432490] Modules linked in: mttcan can_dev rffc5071mixer(O) bcmdhd ath_pktlog(PO) umac(O) can_raw ath_ds
[ 2160.455544] CPU: 0 PID: 12134 Comm: kworker/0:0 Tainted: P        W  O    4.4.38-aeryon #22
[ 2160.463879] Hardware name: aeryon-tx2-flyer (DT)
[ 2160.468495] Workqueue: events can_restart_work [can_dev]
[ 2160.473806] task: ffffffc1e5746400 ti: ffffffc1e9ea8000 task.ti: ffffffc1e9ea8000
[ 2160.481275] PC is at can_restart+0xc8/0xe8 [can_dev]
[ 2160.486230] LR is at can_restart_work+0x10/0x18 [can_dev]
[ 2160.491615] pc : [<ffffffbffc0f66d0>] lr : [<ffffffbffc0f6700>] pstate: 60000045
[ 2160.498994] sp : ffffffc1e9eabd30
[ 2160.502301] x29: ffffffc1e9eabd30 x28: 0000000000000000 
[ 2160.507619] x27: 0000000000000000 x26: ffffffc001390000 
[ 2160.512939] x25: 0000000000000000 x24: 0000000000000000 
[ 2160.518257] x23: ffffffc1f5cd2400 x22: ffffffc0702f4830 
[ 2160.523578] x21: ffffffc1f5cccc00 x20: ffffffc1e1c71908 
[ 2160.528898] x19: ffffffc1e1c71000 x18: 0000000000000013 
[ 2160.534219] x17: 0000007f79709490 x16: ffffffc0001e3240 
[ 2160.539539] x15: 0019b52994000000 x14: 0000000000000000 
[ 2160.544859] x13: 00000001f4000000 x12: 0000000000000017 
[ 2160.550179] x11: 00000000000d298e x10: 00000000000008a0 
[ 2160.555497] x9 : ffffffc1e9eabd20 x8 : ffffffc1e5746d00 
[ 2160.560817] x7 : 00000000000003b2 x6 : 000000000059d3ba 
[ 2160.566136] x5 : 0000000000000000 x4 : ffffffc1f5ccd000 
[ 2160.571456] x3 : ffffffc1e5746400 x2 : ffffffc1f5cd2405 
[ 2160.576774] x1 : 0000000000000003 x0 : ffffffc1e1c71000 
[ 2160.582093] 
[ 2160.583580] Process kworker/0:0 (pid: 12134, stack limit = 0xffffffc1e9ea8020)
[ 2160.590784] Call trace:
[ 2160.593228] [<ffffffbffc0f66d0>] can_restart+0xc8/0xe8 [can_dev]
[ 2160.599224] [<ffffffbffc0f6700>] can_restart_work+0x10/0x18 [can_dev]
[ 2160.605654] [<ffffffc0000bc1dc>] process_one_work+0x150/0x448
[ 2160.611388] [<ffffffc0000bc608>] worker_thread+0x134/0x40c
[ 2160.616862] [<ffffffc0000c1ea4>] kthread+0xe0/0xf4
[ 2160.621644] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[ 2160.626946] ---[ end trace 4491671ec513f65d ]---
[ 2160.632926] ------------[ cut here ]------------
[ 2160.637533] WARNING: at ffffffc0000a91c4 [verbose debug info unavailable]
[ 2160.644304] Modules linked in: mttcan can_dev rffc5071mixer(O) bcmdhd ath_pktlog(PO) umac(O) can_raw ath_ds
[ 2160.667343] 
[ 2160.668831] CPU: 0 PID: 12134 Comm: kworker/0:0 Tainted: P      D W  O    4.4.38-aeryon #22
[ 2160.677163] Hardware name: aeryon-tx2-flyer (DT)
[ 2160.681776] task: ffffffc1e5746400 ti: ffffffc1e9ea8000 task.ti: ffffffc1e9ea8000
[ 2160.689245] PC is at __local_bh_enable_ip+0x68/0xb8
[ 2160.694115] LR is at _raw_spin_unlock_bh+0x20/0x28
[ 2160.698896] pc : [<ffffffc0000a91c4>] lr : [<ffffffc000b32b38>] pstate: 400003c5
[ 2160.706274] sp : ffffffc1e9eab9d0
[ 2160.709579] x29: ffffffc1e9eab9d0 x28: ffffffc1e9ea8000 
[ 2160.714899] x27: 0000000000000000 x26: ffffffc1e5746400 
[ 2160.720219] x25: ffffffc1e5746400 x24: 00000000000003c0 
[ 2160.725537] x23: 0000000000000001 x22: 0000000000000000 
[ 2160.730854] x21: ffffffc000f21a90 x20: ffffffc1e5746400 
[ 2160.736174] x19: ffffffc001412530 x18: 0000000000000013 
[ 2160.741491] x17: 0000007f79709490 x16: ffffffc0001e3240 
[ 2160.746809] x15: 0019b52994000000 x14: 3534303030303036 
[ 2160.752128] x13: 203a657461747370 x12: 0000000000000030 
[ 2160.757445] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
[ 2160.762763] x9 : fefefefefefefeff x8 : ffffffc1e5746c20 
[ 2160.768081] x7 : feff09432c313232 x6 : ffffffc00146e000 
[ 2160.773401] x5 : ffffffc0012309c0 x4 : ffffffc001215048 
[ 2160.778720] x3 : ffffffc001230920 x2 : 0000000000000000 
[ 2160.784038] x1 : 0000000000000201 x0 : ffffffc00138f000 
[ 2160.789355] 
[ 2160.790842] ---[ end trace 4491671ec513f65e ]---
[ 2160.795448] Call trace:
[ 2160.797887] [<ffffffc0000a91c4>] __local_bh_enable_ip+0x68/0xb8
[ 2160.803793] [<ffffffc000b32b38>] _raw_spin_unlock_bh+0x20/0x28
[ 2160.809616] [<ffffffc00012c7f4>] cgroup_exit+0x58/0xe4
[ 2160.814743] [<ffffffc0000a6d8c>] do_exit+0x29c/0x9a0
[ 2160.819698] [<ffffffc000089c08>] bug_handler.part.3+0x0/0x7c
[ 2160.825344] [<ffffffc000089c48>] bug_handler.part.3+0x40/0x7c
[ 2160.831079] [<ffffffc000089ca0>] bug_handler+0x1c/0x2c
[ 2160.836206] [<ffffffc0000829b8>] brk_handler+0x8c/0xc8
[ 2160.841332] [<ffffffc000081518>] do_debug_exception+0x3c/0xa8
[ 2160.847066] [<ffffffc000084630>] el1_dbg+0x18/0x74
[ 2160.851849] [<ffffffbffc0f6700>] can_restart_work+0x10/0x18 [can_dev]
[ 2160.858275] [<ffffffc0000bc1dc>] process_one_work+0x150/0x448
[ 2160.864007] [<ffffffc0000bc608>] worker_thread+0x134/0x40c
[ 2160.869482] [<ffffffc0000c1ea4>] kthread+0xe0/0xf4
[ 2160.874262] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[ 2160.879863] Unable to handle kernel paging request at virtual address ffffffffffffffd8
[ 2160.887766] pgd = ffffffc1e7d70000
[ 2160.891160] [ffffffffffffffd8] *pgd=0000000267d76003, *pud=0000000267d76003, *pmd=0000000000000000
[ 2160.900134] Internal error: Oops: 96000005 [#2] PREEMPT SMP
[ 2160.905694] Modules linked in: mttcan can_dev rffc5071mixer(O) bcmdhd ath_pktlog(PO) umac(O) can_raw ath_ds
[ 2160.928741] CPU: 0 PID: 12134 Comm: kworker/0:0 Tainted: P      D W  O    4.4.38-aeryon #22
[ 2160.937075] Hardware name: aeryon-tx2-flyer (DT)
[ 2160.941685] task: ffffffc1e5746400 ti: ffffffc1e9ea8000 task.ti: ffffffc1e9ea8000
[ 2160.949156] PC is at kthread_data+0x4/0xc
[ 2160.953159] LR is at wq_worker_sleeping+0x10/0xc4
[ 2160.957852] pc : [<ffffffc0000c2574>] lr : [<ffffffc0000bd0b4>] pstate: 600002c5
[ 2160.965231] sp : ffffffc1e9eab9a0
[ 2160.968537] x29: ffffffc1e9eab9a0 x28: ffffffc1e9ea8000 
[ 2160.973855] x27: 0000000000000000 x26: ffffffc001215000 
[ 2160.979174] x25: 0000000000000000 x24: ffffffc000b2f178 
[ 2160.984493] x23: 0000000000000000 x22: ffffffc1e5746990 
[ 2160.989813] x21: ffffffc0011e9000 x20: ffffffc1e5746400 
[ 2160.995133] x19: ffffffc1f5ccd500 x18: ffffffc000bb0038 
[ 2161.000450] x17: 000000000000000e x16: 0000000000000007 
[ 2161.005771] x15: ffffffc000b3da60 x14: 00000000fa83b2da 
[ 2161.011091] x13: 0000000000000001 x12: 0000000001f9dbec 
[ 2161.016410] x11: 0000000000000000 x10: 0000000000392d90 
[ 2161.021729] x9 : 0000000000392d90 x8 : 00000000000003b2 
[ 2161.027049] x7 : 0000000000000000 x6 : 0000000001ff6ab8 
[ 2161.032367] x5 : ffffffc1f5ccd500 x4 : ffffffc1f5ccdee0 
[ 2161.037687] x3 : 000000000001af3b x2 : ffffffc1ecc03000 
[ 2161.043006] x1 : 0000000000000000 x0 : 0000000000000000 
[ 2161.048324] 
[ 2161.049811] Process kworker/0:0 (pid: 12134, stack limit = 0xffffffc1e9ea8020)
[ 2161.057016] Call trace:
[ 2161.059458] [<ffffffc0000c2574>] kthread_data+0x4/0xc
[ 2161.064502] [<ffffffc000b2eda0>] __schedule+0x348/0x6dc
[ 2161.069715] [<ffffffc000b2f178>] schedule+0x44/0xa8
[ 2161.074583] [<ffffffc0000a70a0>] do_exit+0x5b0/0x9a0
[ 2161.079538] [<ffffffc000089c08>] bug_handler.part.3+0x0/0x7c
[ 2161.085185] [<ffffffc000089c48>] bug_handler.part.3+0x40/0x7c
[ 2161.090918] [<ffffffc000089ca0>] bug_handler+0x1c/0x2c
[ 2161.096046] [<ffffffc0000829b8>] brk_handler+0x8c/0xc8
[ 2161.101173] [<ffffffc000081518>] do_debug_exception+0x3c/0xa8
[ 2161.106906] [<ffffffc000084630>] el1_dbg+0x18/0x74
[ 2161.111693] [<ffffffbffc0f6700>] can_restart_work+0x10/0x18 [can_dev]
[ 2161.118120] [<ffffffc0000bc1dc>] process_one_work+0x150/0x448
[ 2161.123853] [<ffffffc0000bc608>] worker_thread+0x134/0x40c
[ 2161.129328] [<ffffffc0000c1ea4>] kthread+0xe0/0xf4
[ 2161.134108] [<ffffffc000084f90>] ret_from_fork+0x10/0x40
[ 2161.139410] ---[ end trace 4491671ec513f65f ]---
[ 2161.145427] Fixing recursive fault but reboot is needed!

eb0l2q1 · July 26, 2018, 11:41pm

Spatra / Nvidia - any update? We are facing this same issue, as are many others. I would expect a (very easily reproducible) kernel crash in a basic feature of a supported module to be a priority, and many users of your products have very clearly described this issue with “restart-ms” on your forums over many months, but no fix exists. The issue is that if the restart-ms option is removed, CAN periodically will not work without a system restart.

Can you please escalate this issue to management and have it prioritized? CAN simply doesn’t work in the current state of development and this is seriously tarnishing Nvidia’s hard-earned brand.

Ltro · August 3, 2018, 3:37pm

Hi,

We temporarily worked around this by using MCP2515 CAN modules, but that shouldn’t be necessary when the TX2 has two CAN controllers built in. Please provide a fix.