Kernel crash running IPSec tunnel via Strongswan on Jetson TX2

We are running Strongswan 5.3.5 with an IPSec tunnel between 2 Jetson nodes running R28.2. We added the necessary kernel modules as outlined in the Strongswan install instructions and the tunnel comes up fine. However, after some amount of time the tunnel becomes unstable and we see kernel errors in kern.log.

From what I can gather from the trace below, tegra_se_aes_queue_req calls mutex_lock, which can sleep, while running in softirq context (here from the TCP delayed-ACK timer), where scheduling is not allowed. Has anyone else encountered this issue, or is this a legit bug? We will be upgrading to Strongswan 5.6.2 to test whether that helps, but it appears to be a problem in Tegra-specific code.

[ 2168.298684] BUG: scheduling while atomic: swapper/5/0/0x00000103
[ 2168.304694] Modules linked in: xfrm6_mode_tunnel xfrm4_mode_tunnel xt_policy nfnetlink_queue nfnetlink_log nfnetlink bluetooth xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 pci_manager(O) dmadriver(PO) fuse ip6table_filter bcmdhd xt_conntrack iptable_filter pci_tegra ip_tables bluedroid_pm
[ 2168.330840] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P           O    4.4.38-tegra #2
[ 2168.338482] Hardware name: quill (DT)
[ 2168.342135] Call trace:
[ 2168.344582] [<ffffffc000089388>] dump_backtrace+0x0/0xe8
[ 2168.349882] [<ffffffc000089484>] show_stack+0x14/0x20
[ 2168.354926] [<ffffffc000379b18>] dump_stack+0xa0/0xc8
[ 2168.359970] [<ffffffc0000c9bd0>] __schedule_bug+0x48/0x60
[ 2168.365358] [<ffffffc000bd0a4c>] __schedule+0x614/0x750
[ 2168.370572] [<ffffffc000bd0bcc>] schedule+0x44/0xb8
[ 2168.375439] [<ffffffc000bd1098>] schedule_preempt_disabled+0x20/0x40
[ 2168.381781] [<ffffffc0000eb234>] mutex_optimistic_spin+0x1a4/0x1e8
[ 2168.387948] [<ffffffc000bd269c>] __mutex_lock_slowpath+0x3c/0x158
[ 2168.394027] [<ffffffc000bd2804>] mutex_lock+0x4c/0x68
[ 2168.399069] [<ffffffc00098316c>] tegra_se_aes_queue_req+0x34/0xa8
[ 2168.405150] [<ffffffc00098338c>] tegra_se_aes_cbc_encrypt+0x2c/0x38
[ 2168.411404] [<ffffffc0003415ec>] crypto_authenc_encrypt+0x114/0x148
[ 2168.417659] [<ffffffc000307ecc>] echainiv_encrypt+0x124/0x148
[ 2168.423396] [<ffffffbffcef3e28>] esp_output+0x320/0x490 [esp4]
[ 2168.429217] [<ffffffc000af35a0>] xfrm_output_resume+0x160/0x3a8
[ 2168.435124] [<ffffffc000af38d4>] xfrm_output+0x44/0xf8
[ 2168.440251] [<ffffffc000ae7858>] xfrm4_output_finish+0x20/0x28
[ 2168.446072] [<ffffffc000ae76ec>] __xfrm4_output+0x34/0x60
[ 2168.451458] [<ffffffc000ae78f0>] xfrm4_output+0x90/0xa0
[ 2168.456674] [<ffffffc000a9502c>] ip_local_out+0x44/0x58
[ 2168.461887] [<ffffffc000a95304>] ip_queue_xmit+0x124/0x388
[ 2168.467362] [<ffffffc000aac93c>] tcp_transmit_skb+0x424/0x920
[ 2168.473095] [<ffffffc000aae908>] tcp_send_ack+0x110/0x170
[ 2168.478483] [<ffffffc000ab0d84>] tcp_delack_timer_handler+0x104/0x210
[ 2168.484910] [<ffffffc000ab0ec4>] tcp_delack_timer+0x34/0xc0
[ 2168.490472] [<ffffffc000107e2c>] call_timer_fn+0x54/0x1d8
[ 2168.495860] [<ffffffc0001081ec>] run_timer_softirq+0x224/0x2a8
[ 2168.501682] [<ffffffc0000a837c>] __do_softirq+0x124/0x350
[ 2168.507069] [<ffffffc0000a8828>] irq_exit+0x88/0xe0
[ 2168.511937] [<ffffffc0000f6450>] __handle_domain_irq+0x60/0xb8
[ 2168.517756] [<ffffffc000081774>] gic_handle_irq+0x64/0xc0
[ 2168.523144] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 2168.527925] [<ffffffc000864b08>] cpuidle_enter+0x18/0x20
[ 2168.533227] [<ffffffc0000e907c>] call_cpuidle+0x24/0x50
[ 2168.538440] [<ffffffc0000e9318>] cpu_startup_entry+0x270/0x340
[ 2168.544262] [<ffffffc00008e10c>] secondary_start_kernel+0x12c/0x168
[ 2168.550514] [<0000000080081adc>] 0x80081adc
[ 2168.554807] timer: tcp_delack_timer+0x0/0xc0 preempt leak: 00000101 -> ffffffff
[ 2168.562168] ------------[ cut here ]------------
[ 2168.566778] WARNING: at ffffffc000107f94 [verbose debug info unavailable]
[ 2168.573550] Modules linked in: xfrm6_mode_tunnel xfrm4_mode_tunnel xt_policy nfnetlink_queue nfnetlink_log nfnetlink bluetooth xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 pci_manager(O) dmadriver(PO) fuse ip6table_filter bcmdhd xt_conntrack iptable_filter pci_tegra ip_tables bluedroid_pm
[ 2168.599637]
[ 2168.601125] CPU: 5 PID: 0 Comm: swapper/5 Tainted: P        W  O    4.4.38-tegra #2
[ 2168.608765] Hardware name: quill (DT)
[ 2168.612418] task: ffffffc1ece83e80 ti: ffffffc1ecea0000 task.ti: ffffffc1ecea0000
[ 2168.619889] PC is at call_timer_fn+0x1bc/0x1d8
[ 2168.624322] LR is at call_timer_fn+0x1bc/0x1d8
[ 2168.628755] pc : [<ffffffc000107f94>] lr : [<ffffffc000107f94>] pstate: 00000045
[ 2168.636135] sp : ffffffc1ecea3be0
[ 2168.639440] x29: ffffffc1ecea3be0 x28: ffffffc1dfd4e5a8
[ 2168.644758] x27: ffffffc001465060 x26: ffffffc1f5fefc38
[ 2168.650074] x25: ffffffc000bde000 x24: ffffffc1dfd4e180
[ 2168.655394] x23: ffffffc000ab0e90 x22: 0000000000000101
[ 2168.660711] x21: ffffffc1dfd4e5a8 x20: ffffffc0014656a0
[ 2168.666029] x19: ffffffc001464000 x18: 0000000000000000
[ 2168.671347] x17: 0000000000000004 x16: 00000000210b0001
[ 2168.676664] x15: 0000000000000010 x14: 3030203a6b61656c
[ 2168.681983] x13: 2074706d65657270 x12: 20306378302f3078
[ 2168.687301] x11: 302b72656d69745f x10: 6b63616c65645f70
[ 2168.692620] x9 : 000000000001abb2 x8 : ffffffc0002e2c00
[ 2168.697940] x7 : ffffffc00131fd08 x6 : 0000000000000053
[ 2168.703258] x5 : 0000000000000000 x4 : 0000000000000000
[ 2168.708575] x3 : 0000000000000000 x2 : ffffffc1ecea0000
[ 2168.713895] x1 : 00000000ffffffff x0 : 0000000000000043
[ 2168.719213]
[ 2168.720995] ---[ end trace c836e4164d6e79ad ]---
[ 2168.725603] Call trace:
[ 2168.728046] [<ffffffc000107f94>] call_timer_fn+0x1bc/0x1d8
[ 2168.733520] [<ffffffc0001081ec>] run_timer_softirq+0x224/0x2a8
[ 2168.739342] [<ffffffc0000a837c>] __do_softirq+0x124/0x350
[ 2168.744728] [<ffffffc0000a8828>] irq_exit+0x88/0xe0
[ 2168.749596] [<ffffffc0000f6450>] __handle_domain_irq+0x60/0xb8
[ 2168.755418] [<ffffffc000081774>] gic_handle_irq+0x64/0xc0
[ 2168.760806] [<ffffffc000084740>] el1_irq+0x80/0xf8
[ 2168.765589] [<ffffffc000864b08>] cpuidle_enter+0x18/0x20
[ 2168.770891] [<ffffffc0000e907c>] call_cpuidle+0x24/0x50
[ 2168.776104] [<ffffffc0000e9318>] cpu_startup_entry+0x270/0x340
[ 2168.781926] [<ffffffc00008e10c>] secondary_start_kernel+0x12c/0x168
[ 2168.788180] [<0000000080081adc>] 0x80081adc
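
To make the trace above easier to read, here is a minimal sketch in Python that walks a pasted call trace and reports the first call that can sleep, its caller, and the atomic entry point further down the stack. The regular expression, the two name sets, and the helper names parse_frames and find_sleep_in_atomic are assumptions made for illustration, not part of any kernel tooling. SAMPLE is a shortened copy of the frames above.

```python
import re

# One frame of a 4.4-style arm64 call trace, e.g.
# "[ 2168.394027] [<ffffffc000bd2804>] mutex_lock+0x4c/0x68"
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Assumed, non-exhaustive lists chosen for this sketch.
MAY_SLEEP = {"mutex_lock", "down", "wait_for_completion", "msleep"}
ATOMIC_ENTRY = {"__do_softirq", "tasklet_action"}

SAMPLE = """\
[ 2168.342135] Call trace:
[ 2168.344582] [<ffffffc000089388>] dump_backtrace+0x0/0xe8
[ 2168.365358] [<ffffffc000bd0a4c>] __schedule+0x614/0x750
[ 2168.394027] [<ffffffc000bd2804>] mutex_lock+0x4c/0x68
[ 2168.399069] [<ffffffc00098316c>] tegra_se_aes_queue_req+0x34/0xa8
[ 2168.405150] [<ffffffc00098338c>] tegra_se_aes_cbc_encrypt+0x2c/0x38
[ 2168.423396] [<ffffffbffcef3e28>] esp_output+0x320/0x490 [esp4]
[ 2168.490472] [<ffffffc000107e2c>] call_timer_fn+0x54/0x1d8
[ 2168.501682] [<ffffffc0000a837c>] __do_softirq+0x124/0x350
"""


def parse_frames(log_text):
    """Return the function names of the first call trace, innermost first."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if not in_trace:
            in_trace = "Call trace:" in line
            continue
        match = FRAME_RE.search(line)
        if match is None:
            break  # the first line that is not a frame ends the trace
        frames.append(match.group(1))
    return frames


def find_sleep_in_atomic(frames):
    """Return (caller, sleeping_call, atomic_entry), or None if not found."""
    for i, name in enumerate(frames[:-1]):
        if name not in MAY_SLEEP:
            continue
        # Frames after index i are the callers, so they are further out.
        outer = [f for f in frames[i + 1:] if f in ATOMIC_ENTRY]
        if outer:
            return frames[i + 1], name, outer[0]
    return None


if __name__ == "__main__":
    print(find_sleep_in_atomic(parse_frames(SAMPLE)))
```

On the sample this points at tegra_se_aes_queue_req as the caller of mutex_lock underneath __do_softirq, which matches how I read the full trace.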

Anyone have any insight? An IPSec tunnel is important for our system and we are currently unable to move forward with our networking scheme.

We updated to Strongswan 5.6.2, but that did not resolve the issue. We asked for an opinion in the Strongswan IRC channel, and the consensus there is that this is a kernel issue, since IPsec is handled directly in the kernel. It should also be noted that we can successfully run these tunnels on x86_64 machines running the same version of the kernel. This looks to be a Jetson-specific kernel issue.
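
One way to see why the same kernel version can behave differently on x86_64 and on the Jetson is to check which driver the kernel crypto API would pick for cbc(aes). /proc/crypto lists every registered implementation with a priority, and the highest priority wins for a given algorithm name. The following is a rough sketch, not a verified diagnosis: the helper names are made up for illustration, and the SAMPLE text (including its driver names) is hypothetical rather than captured from a TX2.

```python
import os

# Hypothetical /proc/crypto excerpt; driver names are placeholders.
SAMPLE = """\
name         : cbc(aes)
driver       : cbc-aes-hw-example
module       : kernel
priority     : 300
type         : ablkcipher

name         : cbc(aes)
driver       : cbc(aes-generic)
module       : kernel
priority     : 100
type         : blkcipher
"""


def parse_proc_crypto(text):
    """Split /proc/crypto text into a list of dicts, one per entry."""
    entries = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line separates two algorithm entries.
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def preferred_driver(entries, name):
    """Return the highest-priority driver for algorithm `name`, or None."""
    candidates = [e for e in entries if e.get("name") == name]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: int(e.get("priority", 0)))
    return best.get("driver")


if __name__ == "__main__":
    path = "/proc/crypto"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE  # fall back to the illustrative sample
    print(preferred_driver(parse_proc_crypto(text), "cbc(aes)"))
```

If this reports a Tegra hardware driver on the Jetson and a software or AES-NI driver on the x86_64 machines, that would be consistent with the problem living in the Tegra driver rather than in the generic IPsec code.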

Is there a way to get this issue more attention from Nvidia? We are currently blocked and would really like some help resolving it.

assertadev,

Sorry for the late response. I’ll help check this issue.

Could you share how you installed those software packages, with the detailed steps?
Please note that you are running a third-party module, so we cannot guarantee it.

Thank you for the response. Here is how we are installing and configuring the Strongswan software.

  1. Rebuild the kernel using the instructions found here: https://github.com/jetsonhacks/buildJetsonTX2Kernel and enable the modules outlined by Strongswan here: https://wiki.strongswan.org/projects/strongswan/wiki/KernelModules
  2. Install the Ubuntu repository version of Strongswan using 'sudo apt install strongswan'
  3. Use the default Strongswan configurations in /etc/strongswan.conf and /etc/strongswan.d/*
  4. Update the IPsec configurations:

    Node 1

    /etc/ipsec.conf

    config setup
    
    conn %default
            ikelifetime=60m
            keylife=20m
            rekeymargin=3m
            keyingtries=1
            keyexchange=ikev2
            authby=pubkey
            mobike=no
    
    conn node2
            left=192.168.50.50
            leftid=192.168.50.50
            leftsubnet=192.168.50.0/24
            leftcert=node1_cert.pem
            leftsendcert=always
            rightsourceip=192.168.50.51
            right=192.168.50.51
            rightid=192.168.50.51
            rightsubnet=192.168.50.0/24
    
            type=transport
            auto=add
    

    /etc/ipsec.secrets

    192.168.50.50 : RSA node1_key.pem
    

    Node 2

    /etc/ipsec.conf

    config setup
    
    conn %default
            ikelifetime=60m
            keylife=20m
            rekeymargin=3m
            keyingtries=1
            keyexchange=ikev2
            authby=pubkey
            mobike=no
    
    conn node1
            left=192.168.50.51
            leftid=192.168.50.51
            leftsubnet=192.168.50.0/24
            leftcert=node2_cert.pem
            leftsendcert=always
            leftsourceip=192.168.50.51
            right=192.168.50.50
            rightid=192.168.50.50
            rightsubnet=192.168.50.0/24
    
            type=transport
            auto=add
    

    /etc/ipsec.secrets

    192.168.50.51 : RSA node2_key.pem
    
  5. Update the iptables rules

    /etc/iptables/rules.v4

    *filter
    :INPUT DROP [0:0]
    :FORWARD DROP [0:0]
    :OUTPUT ACCEPT [0:0]
    -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    -A INPUT -p tcp --dport 22 -j ACCEPT
    -A INPUT -i lo -j ACCEPT
    -A INPUT -p udp --dport 500 -j ACCEPT
    -A INPUT -p udp --dport 4500 -j ACCEPT
    -A INPUT -s 192.168.50.0/24 -d 192.168.50.0/24 -m policy --dir in --pol ipsec --proto esp -j ACCEPT
    -A INPUT -p esp -j ACCEPT
    -A INPUT -j LOG --log-prefix "INPUT:DROP: " --log-level 6
    -A FORWARD -s 192.168.50.0/24 -d 192.168.50.0/24 -m policy --dir in --pol ipsec --proto esp -j ACCEPT
    -A FORWARD -p esp -j ACCEPT
    -A FORWARD -j LOG --log-prefix "FORWARD:DROP: " --log-level 6
    COMMIT
    

Sorry, I just checked internally and found that it is not supported by any L4T project.

I notice that some forum users have experience with Strongswan.

https://devtalk.nvidia.com/default/topic/1027366/

Could you check whether that thread sheds some light?

Thank you for the reply. I understand that Strongswan itself may not be officially supported. However, as I understand it, ipsec is handled directly in the kernel and not by Strongswan. Strongswan is used for configuring the tunnels but not actually implementing the network protocols. That is left to the kernel. Are you saying that ipsec networking is not supported for the Jetson products?

On a related note, we downgraded our Jetson nodes from JetPack 28.2 to JetPack 28.1 and our ipsec tunnels have remained active for 3 days with no issues. While this allows us to move forward with our development, we would ultimately like to be able to stay up-to-date with new JetPack releases. Based on our testing, it would appear that changes in 28.2 cause issues for ipsec networking. Is it possible to determine what those changes are and how they may be fixed so we can stay up-to-date with future JetPack releases?

assertadev,

How can we reproduce your issue without Strongswan, using only ipsec?
It does look like an issue if rel-28.1 works but rel-28.2 does not.
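
One possible way to take Strongswan out of the picture is to key the kernel manually with `ip xfrm` from iproute2. The sketch below only prints the commands for both nodes and does not run anything. It is untested on the Jetson and rests on several assumptions: host-to-host transport mode between the two addresses from the configuration above, cbc(aes) for encryption (matching tegra_se_aes_cbc_encrypt in the trace), hmac(sha1) for authentication (the hash actually negotiated is not shown in this thread), arbitrary SPI values, and one shared key pair for both directions to keep things short. The function name build_commands is made up for illustration.

```python
import secrets

AUTH_ALGO = "hmac(sha1)"  # assumption: negotiated hash is not shown in the thread
ENC_ALGO = "cbc(aes)"     # matches tegra_se_aes_cbc_encrypt in the trace


def build_commands(node1, node2, auth_key=None, enc_key=None):
    """Return {"node1": [...], "node2": [...]} lists of ip xfrm commands."""
    auth_key = auth_key or secrets.token_hex(20)  # 160-bit HMAC-SHA1 key
    enc_key = enc_key or secrets.token_hex(16)    # 128-bit AES key

    def state(src, dst, spi):
        return (
            f"ip xfrm state add src {src} dst {dst} proto esp spi {spi:#x} "
            f"mode transport auth '{AUTH_ALGO}' 0x{auth_key} "
            f"enc '{ENC_ALGO}' 0x{enc_key}"
        )

    def policy(src, dst, direction):
        return (
            f"ip xfrm policy add src {src} dst {dst} dir {direction} "
            f"tmpl src {src} dst {dst} proto esp mode transport"
        )

    # Both nodes install the same two states, one per traffic direction.
    states = [state(node1, node2, 0x1001), state(node2, node1, 0x1002)]
    return {
        "node1": states + [policy(node1, node2, "out"), policy(node2, node1, "in")],
        "node2": states + [policy(node2, node1, "out"), policy(node1, node2, "in")],
    }


if __name__ == "__main__":
    for node, commands in build_commands("192.168.50.50", "192.168.50.51").items():
        print(f"# run as root on {node}")
        for command in commands:
            print(command)
```

After running the printed commands as root on each node, TCP traffic between the two addresses should go through ESP, which is the path that crashed in the trace. `ip xfrm state flush` and `ip xfrm policy flush` remove the test setup again.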